
I need to get and analyse the isomiR (isoform of miRNA) profile from RNA-seq data in FASTA format. The best tool for this is "DeAnnIso". But there is a problem with this tool: it only allows you to upload FASTA files with a maximum size of 50MB, while my FASTA file is more than 1000MB. I tried compressing the file in different ways before uploading, but it doesn't work. Does anybody know a way to solve this problem?

Or does anybody know another easy way to get an isomiR profile from RNA-seq data in FASTA format?

A sample of the RNA-seq data looks like this:

>RR42357.95441.1 I-D00105:274:H7MDXX:1:112:8124:865 length=51
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCTTGTAATCTCGTATGCC
>RR42357.95442.1 I-D00105:274:H7MDXX:1:112:8133:866 length=51
AAAGTGCTACTACTTTTGAGTCTTGGAATTCTCGGGTGCCAAGGAACTCCA
>RR42357.95443.1 I-D00105:274:H7MDXX:1:112:8144:8637 length=51

Thank you very very much!

user4091
  • Have you tried compressing your file into .tar.gz? What about creating a "tag" file like the one recommended here? – finswimmer Feb 08 '19 at 19:23
  • Have you tried cutting the sequence names down to integers instead of all the embedded information and then compressing them? – Bioathlete Feb 08 '19 at 22:32
  • In fact, it looks like the header lines are longer than the sequence lines, so I would expect a significant reduction in file size. – Bioathlete Feb 08 '19 at 23:01
  • ^ this is funny... but it's true. microRNA isn't huge in length. I'm not exactly sure how 50 bp of sequence can take up 100 MB of data. – M__ Feb 08 '19 at 23:14

2 Answers


The obvious answer is to break the file up and upload each piece separately. First I'd check the number of lines in the file:

wc infile.fa

Output

XXXXX YYYY ZZZZZ infile.fa

The number you want is XXXXX - the number of lines in the file.

You'd then decide how many files you want to break it up into ... in your case, divide XXXXX by 3 and round so that each range starts on a header line (an odd line number if every record is exactly two lines). Then, for example:

sed -n '1,5000000p' infile.fa > outfile1.fa
sed -n '5000001,10000000p' infile.fa > outfile2.fa
sed -n '10000001,15000000p' infile.fa > outfile3.fa

I'm not certain this is the best approach, however; an miRNA expert could answer this question very quickly. I think you should check your data for sequence duplication. Simply piping the FASTA file through a simple distance matrix and screening for "0" values would confirm this.

Creating a frequency table of duplicate RNA sequences must be a feature of Bioconductor, but in any case screening for duplication (technically duplicate records) is a standard SQL database operation.

It can be done in Perl/Python, but it's a bit clunky for a data set this size. I'd do it by setting the sequence as the key within a hash; the inefficiency has been solved (but it's a weird way to do it).

You would keep a database of "headers" for each sequence, which would take up space, but I would assess the potential level of duplication first.
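
To illustrate, here is a minimal Python sketch of that hash-based approach (the file names are placeholders, and it assumes two-line FASTA records, i.e. each header line is followed by a single sequence line); it keeps the first header seen for every unique sequence and counts how often each sequence occurs:

#!/usr/bin/env python3
# Minimal sketch of hash-based de-duplication: key the records by sequence.
# Assumes two-line FASTA records and placeholder file names.
counts = {}  # sequence -> [first header seen, occurrence count]

with open("infile.fa") as fasta:
    for header in fasta:
        seq = next(fasta, "").strip()
        header = header.strip()
        if seq in counts:
            counts[seq][1] += 1           # duplicate: just bump the count
        else:
            counts[seq] = [header, 1]     # new sequence: remember its first header

# Write one record per unique sequence, keeping the frequency in the header
with open("infile.uniq.fa", "w") as out:
    for seq, (header, n) in counts.items():
        out.write(f"{header} count={n}\n{seq}\n")

print(len(counts), "unique sequences out of",
      sum(n for _, n in counts.values()), "reads")

The count printed at the end also gives you a quick way to assess the level of duplication before deciding how to upload.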

M__

The tool you want to use accepts compressed files, so the first thing you should do is compress your file. To demonstrate, I created a 111M test file of random sequences, each with a random 52-character header and a random 51 nt sequence:

$ ls -lh file.fa
-rw-r--r-- 1 terdon terdon 111M Feb  9 12:21 file.fa

$ head file.fa
>YZF8SGLVBPCKMY18GJ94XTRCS61BB4OZKUZ61D 1J6R7H3 EJQA5 length=51
ACGAAGGGCATTTGGCAACGAACCTATACTATATCACGAGCAATTAGGCAG
>SPR0INGWKMILAXZZPF73BKRV73ELL1ROZSHYO1E9JX6P5X6MUZ20 length=51
TTAAGATTCCGAGAGAGTCTGTGGTTGGGTCCCTCTCCATAATGCTTTCAT
>U CB82L2M3BWJ9ZZZG3V6DL0Y0FTW8BXIFTU0TK7SL VN5QE42J6 length=51
TGGCTGGGAGGGATAGTAGCGAGATTATCCAGTCCGGTGTTCACAAGCCGG
>TGH62IM2TJK2QJTIVWGZR IVKOW94X7D52KGM7DJP61KKF81KPYA length=51
GCCGCAATACCAAAAAGATGGTGTGATTAGGAAATAAATAGGATCCAACAA
>L5VS9AVEF3UTEO3341STET092YEZ43ESVH32W7GI6Q27OIR NTXF length=51
CTATAGACGACTACGAACGACTGCAGGTGTGCCAGTTCTCGGTGAAACGGA

If we now compress that file:

$ tar cvzf file.fa.tar.gz file.fa
file.fa
$ ls -lh file.fa.tar.gz 
-rw-r--r-- 1 terdon terdon 56M Feb  9 12:22 file.fa.tar.gz

You will likely get a significantly better compression ratio since your headers won't be random and will likely be very repetitive.

If that isn't enough, you should try simplifying your headers. For instance, just use an increasing number:

$ perl -pe 's/^>.*/">" . ++$i/e' file.fa > file.short.fa
$ ls -lh file.short.fa 
-rw-r--r-- 1 terdon terdon 58M Feb  9 12:23 file.short.fa

See what a huge difference that one change made? And if we now compress the file:

$ tar cvzf file.short.tar.gz file.short.fa
file.short.fa
$ ls -lh file.short.tar.gz
-rw-r--r-- 1 terdon terdon 19M Feb  9 12:24 file.short.tar.gz

Well below the threshold!


If compression isn't enough, you should make sure you have no duplicate sequences in the file. Use the FastaToTbl and TblToFasta scripts I have posted here and:

FastaToTbl file.fa | sort | uniq | TblToFasta > file.uniq.fa

Finally, if none of these helps enough, split your file into multiple smaller ones:

split -l 100000 file.fa

That will create multiple files of 100000 lines each (so 50000 sequences, assuming each sequence is on a single line).
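
If your sequences are ever wrapped over several lines, a fixed split -l count could cut a record in half. Here is a minimal Python sketch of a record-aware split (the file names and the 50MB limit are assumptions based on the question) that only starts a new output file at a record boundary:

#!/usr/bin/env python3
# Record-aware FASTA splitting: never break a record across two output files.
# Placeholder file names; 50MB is the upload limit stated in the question.
MAX_BYTES = 50 * 1024 * 1024

def write_record(record, state):
    """Append one record, opening a new part file if this one would overflow."""
    chunk = "".join(record)
    if state["out"] is None or state["written"] + len(chunk) > MAX_BYTES:
        if state["out"]:
            state["out"].close()
        state["part"] += 1
        state["written"] = 0
        state["out"] = open("part{:03d}.fa".format(state["part"]), "w")
    state["out"].write(chunk)
    state["written"] += len(chunk)

state = {"out": None, "written": 0, "part": 0}
record = []
with open("file.fa") as fasta:
    for line in fasta:
        if line.startswith(">") and record:   # a new header closes the previous record
            write_record(record, state)
            record = []
        record.append(line)
if record:                                    # flush the final record
    write_record(record, state)
if state["out"]:
    state["out"].close()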

terdon