19

I’m using the RepBase libraries in conjunction with RepeatMasker to get genome-wide repeat element annotations, in particular for transposable elements.

This works well enough, and seems to be the de facto standard in the field.

However, there are two issues with the use of RepBase, which is why I (and others) have been looking for alternatives (so far without success):

  1. RepBase isn’t open data. Their academic license agreement includes a clause that explicitly forbids dissemination of data derived from RepBase. It’s unclear to what extent this is binding/enforceable, but it effectively prevents publishing at least some of the data I’m using and generating. This is unacceptable for open science.

    • Subordinate to this, the subscription model of RepBase also makes it impossible to integrate RepBase into fully automated pipelines, because user interaction is required to subscribe to RepBase, and to provide the login credentials.
  2. RepBase is heavily manually curated. This is both good and bad. Good, because manual curation of sequence data is often the most reliable form of curation. On the flip side, manual curation is inherently biased; and worse, it’s hard to quantify this bias — this is acknowledged by the RepBase maintainers.

Konrad Rudolph
  • Were you only asking about defined repeat libraries? I interpreted it slightly more broadly as about tools used to build the libraries also (which becomes relevant when genomes from new taxa are sequenced) – Chris_Rands Jun 01 '17 at 12:29
  • @Chris_Rands Both (libraries and tools). Your answer is spot-on. – Konrad Rudolph Jun 01 '17 at 12:35
  • Is the goal to build an annotated library of repeats, or to mask the repetitive parts of a genome? – Kamil S Jaron Jun 01 '17 at 14:19
  • @KamilSJaron I’m working with TEs, so I need the annotated library, not (merely) a repeat masked sequence. – Konrad Rudolph Jun 01 '17 at 14:24
  • Ouch. But TEs are just a subset of repetitive regions, and there are specialised tools to annotate them (like DNApipeTE and REPET). Maybe you could specify this in the question. – Kamil S Jaron Jun 01 '17 at 14:32
  • @KamilSJaron Nice, and this might be worth an answer. I’ll also update the question. That said, I am also asking for repetitive elements beyond TEs. – Konrad Rudolph Jun 01 '17 at 14:34
  • Also, are you sure that RepBase is against publications that derive information from their data? Could it be that they just don't want you sharing the raw data and files they provide? They do have a good number of citations in different fields: https://scholar.google.de/scholar?um=1&ie=UTF-8&lr&cites=11574259945967474319 – story Jun 02 '17 at 14:47
  • @story They literally say so in the academic user agreement that I link to. Here’s the relevant quote: “You agree NOT to make the Repbase (or any part thereof, including Repbase Reports, Repeat Maps and other derived materials, modified or not) available to anyone outside your research group.” Emphasis mine. In fact, another clause in the agreement technically even forbids me from signing it because my institute requires public data deposition, so I’m probably not allowed to sign such agreements. – Konrad Rudolph Jun 02 '17 at 14:49
  • Ya, that seems to agree with my previous statement. I guess my point is: what exactly do you need to share (based on your original post) that would be considered to come from their database? I feel like this wouldn't include counts of features, but sequences might be an issue. – story Jun 02 '17 at 14:53
  • @story I potentially need to share all data that was used/generated in my analysis. This particularly includes the specific repeat annotations I used, which are derived from RepBase, as well as potentially sequence data from these repeats. – Konrad Rudolph Jun 02 '17 at 14:55
  • This might be an old question, but someone is trying to set up a new, open alternative to RepBase (which is now going fully commercial), or at least that is how I perceive it: https://twitter.com/TransposableMan/status/1060519887897067521 – fridaymeetssunday Nov 08 '18 at 21:20

6 Answers

14

Dfam has recently launched a sister resource, Dfam_consensus, whose stated aim is to replace RepBase. From the announcement:

Dfam_consensus provides an open framework for the community to store both seed alignments (multiple alignments of instances for a given family) and the corresponding consensus sequence model.

Both RepeatMasker and RepeatModeler have been updated to support Dfam_consensus.

I haven’t tried it yet but it looks promising.

Konrad Rudolph
7

Pre-existing, reliable TE libraries are a bit of a mess, because not everybody deposits their species-specific TE libraries in a database like RepBase. And as far as I know, Dfam contains only human resources, or am I wrong?

As for de novo generation of species-specific TE libraries (which should be done for any species not already present in e.g. RepBase): there is no "gold standard" for how best to tackle this. In principle, there are two main parts to think about:

  • repeat detection
  • annotation

For repeat detection I would recommend combining two approaches. This is necessary because TE copies might be missing from the assemblies: repetitive regions tend to be difficult to assemble and are often thrown away in the final assembly.

  1. Repeat detection from raw reads (with e.g. DNApipeTE, tedna or RepeatExplorer). For me, DNApipeTE worked quite nicely, but everything has pros and cons.
  2. Repeat detection from assemblies (with e.g. REPET or, as mentioned before, RepeatModeler).

The annotation of these repeats is tricky too, because most methods rely on homology between the de novo TEs and the TEs from some (probably distantly) related species. But some programs also take structure into account (like REPCLASS). REPET can do both detection and annotation, but is a pain to get running.

I would recommend using several programs to do de novo repeat detection for your species of interest on both the raw reads and the assembly, clustering the resulting libraries together (with e.g. uclust at 95% identity), and then running an annotation that combines homology and structural identification.
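As a rough illustration of the step just before clustering, here is a minimal Python sketch that pools several de novo libraries into one FASTA file. It only merges and removes exact duplicate sequences; the 95% identity clustering itself would still be done by an external tool such as uclust. The function names and the source labels (`reads`, `assembly`) are made up for this example and do not come from any of the tools mentioned.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(libraries: Dict[str, Iterable[str]]) -> str:
    """Combine several repeat libraries into one FASTA string.

    Each record ID is prefixed with a (hypothetical) source label so that
    identically named consensus sequences from different tools stay distinct.
    Sequences that are exactly identical (ignoring case) are written once.
    """
    seen = set()
    records = []
    for source, lines in libraries.items():
        for header, seq in parse_fasta(lines):
            key = seq.upper()
            if key in seen:
                continue
            seen.add(key)
            records.append(f">{source}|{header}\n{seq}")
    return "\n".join(records) + "\n"
```

The merged output could then be handed to whichever clustering program you prefer; prefixing the IDs keeps track of whether a consensus came from the reads or from the assembly.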

The programs will probably not give you complete, full-length TEs but rather consensus sequences of several copies from TE families. If you want, you could search for all copies of one family, extract them from the contigs together with their flanking regions, align them, and curate the boundaries manually. Then extend the boundaries if they do not yet reach the surrounding (non-alignable) regions or TE landmarks such as LTRs or TIRs. But this is very time consuming. If you only want to compare TE abundance between species, for example, I would not do this and would rather compare the abundance using read coverage (as in Bast et al. 2016). It all depends on the questions you want to ask.

Jens Bast
5

You could use RepeatScout, which has defined repeat libraries for a limited number of species (including human, mouse, and rat). If your taxon is not represented, you can also do de novo repeat prediction with RepeatScout to build your own library to feed to RepeatMasker. The RepeatScout publication includes some comparisons with RepBase. Another related tool is RepeatModeler, which wraps RepeatScout with RECON and some other programs, and shares authors with the RepeatMasker team.

On the plus side, RepeatScout/RepeatModeler are open source and do not use manual curation, meeting your criteria. On the negative side, I'm not sure exactly how RepeatModeler and the component tools are maintained. The RepeatScout web and GitHub pages have not been updated for several years, although the RepeatModeler page shows its latest release was in 2017. Anyway, I know that some combination of RepeatScout/RepeatModeler has been used to annotate repeats for some fairly recent newly sequenced genomes, e.g. for cichlids, coelacanth, and Darwin's finch, so I think it's fair to say this kind of approach is accepted in the field, at least for vertebrate genome projects.

Chris_Rands
4

AFAIK Dfam and Repbase are currently the two best sources of (a variety of) TE sequences.

In my genome annotations I have used RepeatModeler+RepeatMasker to find repeats, and then Repbase+tblastx and Dfam+nhmmer to classify them.

The classification process in my pipeline PhyLTR (https://github.com/mcsimenc/PhyLTR) is based on Dfam and Repbase. The process I used for LTR identification is:

  1. Putative ID with LTRHarvest (based on structural sequence characteristics)
  2. Classification-by-homology to Repbase and Dfam
  3. Removal of elements without homology to sequences in Repbase or Dfam.

This results in a set of LTR-Rs which are full-length and have evidence that they are LTR-Rs.
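Steps 2 and 3 boil down to keeping only the candidates that have a homology hit in at least one reference database. Here is a minimal Python sketch of that filter. It is not PhyLTR's actual code: the function name and the input layout (a set of candidate IDs with hits per database, which you would first parse from your tblastx or nhmmer output) are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


def classify_and_filter(candidates: Iterable[str],
                        hits_by_db: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each supported candidate to the databases where it has a hit.

    candidates: IDs of putative elements (e.g. from a structural search).
    hits_by_db: for each database name, the set of candidate IDs with at
                least one homology hit (hypothetical, pre-parsed input).
    Candidates without a hit in any database are dropped.
    """
    kept: Dict[str, List[str]] = {}
    for element in candidates:
        sources = sorted(db for db, ids in hits_by_db.items() if element in ids)
        if sources:
            kept[element] = sources
    return kept
```

For example, with candidates `ltr1`, `ltr2`, `ltr3` and hits for `ltr1` and `ltr3` only, `ltr2` is removed and the result records which database supports each remaining element.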

matt
3

+1 for taking issue with RepBase.

I use the annotation GTFs that the Hammell Lab puts out with TEtoolkit. They are similar to what you describe using, so this may be a redundant answer, but from the digging I’ve done they do seem to be comprehensive and well curated (for Drosophila, at least).

2

I know this question is a bit old, but not being able to access RepBase is still an issue for a lot of researchers. It now seems that the most recent version of RepeatMasker depends on RepBase for full functionality when masking anything other than human (currently DFAM only has human models). I recently discovered a de novo repeat masking approach called REpeat Detector (Red). This might be a solution for some looking to mask repeats on a genome assembly for annotation. The paper is here. I also wrote a wrapper around Red to make it a bit easier to soft-mask a genome, which you can find here.

One of the limitations with Red is that the repeats are not classified, so they are only identified. You would have to use some of the other tools mentioned above to try to classify them.
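For readers unfamiliar with the term, soft-masking just means writing the repeat regions in lowercase while leaving the rest of the sequence uppercase. A minimal Python sketch of the idea follows. It says nothing about Red's own output format or about the wrapper: the function name and the interval convention (0-based, half-open) are assumptions for illustration only.

```python
from typing import Iterable, Tuple


def soft_mask(sequence: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Return the sequence with repeat intervals in lowercase.

    intervals are assumed to be 0-based, half-open (start, end) pairs.
    The input is uppercased first, so any existing soft-masking is
    discarded. Coordinates outside the sequence are clipped.
    """
    masked = list(sequence.upper())
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = masked[i].lower()
    return "".join(masked)
```

Because the bases themselves are preserved, downstream tools that understand soft-masking can still use the underlying sequence, which is usually why it is preferred over hard-masking with `N` for annotation.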

llrs
jpalmer