Finding transposable elements using RepeatMasker

Question

I'm using RepeatMasker to detect and classify transposable elements (TEs). My input is a non-reference eukaryotic genome. I have run RepeatMasker many times to mask the TEs, but the annotation tables report 0 for every TE class. I also tried a different -species option on each run, but the annotation results contain only short, simple repeats.

Do I have to create a unique library to detect them?

The command I used:

./RepeatMasker -no_is -noint -species mammal Genome.fna

Results:

sequences:         32573
total length: 2004063690 bp  (1981588036 bp excl N/X-runs)
GC level:         41.28 %
bases masked:   33282890 bp ( 1.66 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:                0            0 bp    0.00 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:                0            0 bp    0.00 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:         0            0 bp    0.00 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:         0            0 bp    0.00 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:         0            0 bp    0.00 %

Total interspersed repeats:        0 bp    0.00 %


Small RNA:            0            0 bp    0.00 %

Satellites:           1          267 bp    0.00 %
Simple repeats:  663679     27953718 bp    1.39 %
Low complexity:  123705      6154714 bp    0.31 %
==================================================

Is there an alternative tool that can do this task, i.e. scan the sequence and detect these elements? – BioInfo, Sep 09 '18 at 05:51
Welcome to Bioinformatics. Could you edit the question to add your comment to the body of the question? Also, could you clarify what you mean by different species and options? Are you studying a mammalian organism or not? – llrs, Sep 10 '18 at 08:31
Maybe use something like RepeatScout to build a library for RepeatMasker; this is relevant: https://bioinformatics.stackexchange.com/a/347/104 – Chris_Rands, Sep 10 '18 at 11:04
This looks like it could be a solution for your problem: https://www.biostars.org/p/154290/ – Twilie2012, Sep 11 '18 at 11:08
What I meant by different species: it is an option in RepeatMasker to search within a specific range of species. I'm studying a eukaryotic sequence, so I thought mammal could be the closest. – BioInfo, Sep 17 '18 at 12:30
Are transposable elements the same as the repeats reported by RepeatMasker, which can also be downloaded from the UCSC Table Browser? – idewdewi, Mar 15 '21 at 20:25

story · Answer 1 · 2018-09-11T14:07:44.673

You need to model the repeats in your de novo genome.

See: http://www.repeatmasker.org/RepeatModeler/ and https://www.biostars.org/p/154290/ (from previous answer)

First build a database named "name_of_your_database" (insert whatever you want to call it) from the FASTA file of your genome (e.g. your_genome.fasta), then run RepeatModeler on that database.

Make sure you use the right paths for the software.

## build the database
BuildDatabase -engine ncbi -name name_of_your_database your_genome.fasta

## run the modeler (in this case using ncbi)
RepeatModeler -database name_of_your_database -engine ncbi
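
The two commands above only produce a repeat library; it still has to be handed to RepeatMasker with -lib before anything is masked. Here is a minimal Python sketch of the three-step sequence. It only assembles the command lines as argument lists and runs nothing. The helper name and the library file name (your_library.fa) are placeholders, since the name of the consensus file depends on the RepeatModeler version.

```python
import shlex


def denovo_repeat_commands(genome_fasta, db_name, library_fasta):
    """Return the three command lines as argument lists (nothing is executed).

    library_fasta is a placeholder for the consensus library that
    RepeatModeler writes; check your run directory for the real file name.
    """
    build = ["BuildDatabase", "-engine", "ncbi", "-name", db_name, genome_fasta]
    model = ["RepeatModeler", "-database", db_name, "-engine", "ncbi"]
    # Mask with the custom library; note that -noint is deliberately absent.
    mask = ["RepeatMasker", "-lib", library_fasta, genome_fasta]
    return [build, model, mask]


commands = denovo_repeat_commands(
    "your_genome.fasta", "name_of_your_database", "your_library.fa"
)
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to subprocess.run once the paths to the programs are correct on your system.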

matt · Accepted Answer · 2019-02-27T16:00:26.193

That's one answer ^, but I'm not sure you NEED to do that. It wouldn't actually classify your elements as LTRs; it would just find repetitive sequences in your new genome (as you discovered).

To get them classified, one standard approach is to use evidence of homology to LTR element sequences that are already classified, which you can get from Repbase. To download Repbase you have to sign up for an account with GIRI, but I think they give access to anyone who asks. Then you can download the section of Repbase you're interested in (e.g. the whole thing, LTR, DNA-TE, ...) and provide the path to the downloaded library (in FASTA format, I think) using -lib, like

RepeatMasker -lib Repbase_LTR.fa genome.fa

Then your RepeatMasker output table should have info about other kinds of TEs.

If you want to check out my approach to classification by homology you might find some ideas you'd like to try. I used Dfam+nhmmer and Repbase+tblastx to identify evidence for LTR retrotransposon classification.

My software pipeline is available as open-source software. If you want detailed, high-quality annotations of LTR retrotransposons, PhyLTR will automatically annotate putative protein-coding sequence regions, classify them, and remove false positives such as tandem arrays of DNA TEs that look like LTR-Rs (good diagram of LTR-R false positives here: https://github.com/oushujun/LTR_retriever/blob/master/Manual.pdf).

Phylogenetic Analysis of LTR retrotransposons https://github.com/mcsimenc/PhyLTR

edited Feb 27 '19 at 16:00

answered Feb 27 '19 at 15:53


This is indeed an interesting approach. Do you have any comments on the actual homology or sequence conservation of different repetitive element "entities" across multiple species? Also, wouldn't the approach you mention miss non-LTR TEs? LINE-1 elements are the most abundant TEs in humans but do not have LTRs. – story Feb 28 '19 at 00:47
I'm not sure what you're thinking of exactly, but I can say that in all the plant genomes I have annotated, this approach resulted in clusters of "Gypsy" and "Copia" elements where domain content and organization independently affirmed their classification as such. Also, some lower-level "taxa" like Ogre elements were recognizable by domain architecture. This work is meant for LTR elements only. It would work fine for metazoan endogenous retroviruses and LTR retroviruses, but not for LINEs. I think the author of the post mentioned LTRs in the original post on stackoverflow. – matt Feb 28 '19 at 02:38
Dear all: I appreciate this knowledge; it helps guide my project. I have gone far in my analysis using LTRharvest + LTRdigest with 314 protein profiles from the GyDB database and a tRNA library obtained using tRNAscan, then filtered so that only candidates with protein hits remain. I now have the coordinates and sequences of the class 1 TEs. I am trying to classify them into subfamilies (e.g. Copia, Gypsy, etc.) but have failed. Likewise, I obtained class 2 DNA, MITE and Helitron results with a different pipeline but failed to classify them into subfamilies. @matt – BioInfo Mar 05 '19 at 10:11
@story My project is to report all the transposable elements in my draft genome: class 1 (LTR and non-LTR) and class 2 (DNA transposons). The LTR structure can easily be detected by de novo tools, but for non-LTR elements most of the tools, like MegaScan non-ltr, need a Windows operating system. Any suggestions? – BioInfo Mar 05 '19 at 10:24
@matt I did register to download the database but could not find a way to select only LTR or DNA elements; it all comes together. I tried /// tblastn -query LTR-filtered_digest.fasta -subject mem.ref -out results /// (vrtb.ref is one of the many files downloaded from Repbase and refers to vertebrates), but no hits were found. – BioInfo Mar 05 '19 at 10:34
@emaz, sounds good, nice work. The GIRI Browse page, https://www.girinst.org/repbase/update/browse.php, is the place I was thinking of for downloading the kinds of repeats separately. Other than that you might need to do some manual steps or scripting to mold those files how you want, e.g. cat mem.ref vrtb.ref >> repbase.fasta to join two text files. If I had the compute power I might look at all of them together. If I remember correctly, RepeatMasker gets taxon info/descriptors from the FASTA sequence header/ID by default. MGEscan-nonLTR - https://mgescan.readthedocs.io/en/latest/nonltr.html. – matt Mar 06 '19 at 13:33
@emaz, tblastx is the one to use, I believe. tblastn searches a translated nucleotide database (=protein seqs) using a protein query. If you give it a nucleotide FASTA as query input it will still treat it like a protein, i.e. ACG = Alanine-Cysteine-Glycine. tblastx will ultimately do a protein-protein search just like tblastn, but it treats the query input as a DNA sequence and translates it for you. I'm curious whether you will get some hits using tblastx. tblastx is one of the slowest BLASTs because it translates both the query and subject (database) sequences in all six reading frames. – matt Mar 06 '19 at 13:38
^so tblastx is one of the most sensitive BLASTs to the kinds of haphazard sequence mutations that tend to happen in loci undergoing neutral evolution, like probably many TEs are, mostly; i.e. there will probably be a lot of frameshifts and the nucleotide seqs will change much faster than their encoded protein seqs. tblastx can find fragments of similar encoding sequences, which makes it helpful for comparing fragmented TE sequences, which most are. – matt Mar 06 '19 at 13:47
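
To make the tblastn/tblastx distinction concrete, here is a small Python sketch of a six-frame translation, the step tblastx applies to both query and subject. It assumes the standard genetic code and is only an illustration, not how BLAST is implemented; the function names are made up for this example.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame (+1..+3, -1..-3) to its protein string."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Read as DNA, translate("ACG") gives a single threonine, whereas a protein search reads the same three letters as Ala-Cys-Gly, which is the misreading described above.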
I'm so glad to have you here @matt. So, as I understand it, I will run // tblastx -query ltr-sequences.fasta -subject mem_vert_combined.ref // to classify them into subfamilies? – BioInfo Mar 06 '19 at 13:52
Great, I'm glad to help. What your ^ command will do is run tblastx, which, if you add -out OUTPUTFILENAME and -outfmt 7, will create a potentially very large tab-delimited file with information about which parts of each sequence have matches to (high-scoring sequence alignments with) your input database sequences. You will probably need to do some manual scripting at this point to decide on your classifications and add them to your TE file(s). I have some Python scripts I have made for doing this kind of thing; if you get there and want some more help, get hold of me and I can send you the scripts. – matt Mar 06 '19 at 14:01
Let us continue this discussion in chat. – matt Mar 06 '19 at 14:05
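
The manual scripting step mentioned in the comments could start from something like the Python sketch below. It reads tblastx output written with -outfmt 7 and keeps the highest-bit-score subject for each query. It assumes the default 12 tabular columns (query, subject, % identity, ..., e-value, bit score), and the e-value cutoff is an arbitrary example value. Turning the best subject ID into a superfamily label depends on how your library's FASTA headers are written, so that part is left out. This is not matt's script, just an illustration.

```python
def best_hits(lines, max_evalue=1e-5):
    """Map each query ID to (subject ID, e-value, bit score) of its best hit.

    lines: iterable of text lines from a BLAST -outfmt 7 (or 6) report.
    max_evalue is an arbitrary example cutoff, not a recommended value.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the comment lines that -outfmt 7 adds
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a default-format hit line
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Because the function accepts any iterable of lines, an open file handle for OUTPUTFILENAME can be passed in directly, so a very large report is never loaded into memory at once.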

score 1 · Answer 3 · answered May 30 '23 at 21:47

In case someone finds this years later: OP's problem was the use of the "-noint" flag. It stands for "no interspersed", so the OP explicitly asked RepeatMasker NOT to look for transposable elements. Had they omitted the flag, RepeatMasker would have searched the mammal TE sequences in its installed repeat library (nowadays it uses Dfam). Even with a custom library as suggested above, they would still need to delete "-noint". Check your commands!
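
One way to follow the "check your commands" advice is to scan the argument list before launching a long run. The Python sketch below flags -noint and its synonym -int, both of which restrict RepeatMasker to simple repeats and low-complexity regions. The helper name is invented for this example.

```python
import shlex

# Options that tell RepeatMasker to skip interspersed repeats entirely.
NO_INTERSPERSED_FLAGS = {"-noint", "-int"}


def interspersed_masking_disabled(command):
    """Return the offending flags found in a RepeatMasker command string."""
    return [arg for arg in shlex.split(command) if arg in NO_INTERSPERSED_FLAGS]


original = "./RepeatMasker -no_is -noint -species mammal Genome.fna"
corrected = "./RepeatMasker -no_is -species mammal Genome.fna"
print(interspersed_masking_disabled(original))   # ['-noint']
print(interspersed_masking_disabled(corrected))  # []
```

An empty list means nothing in the command suppresses the search for transposable elements.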
