The tool you want to use accepts compressed files, so the first thing you should do is compress your file. As a test, I created a 111M file of random sequences, each with a random 52-character header and a random 51 nt sequence:
$ ls -lh file.fa
-rw-r--r-- 1 terdon terdon 111M Feb 9 12:21 file.fa
$ head file.fa
>YZF8SGLVBPCKMY18GJ94XTRCS61BB4OZKUZ61D 1J6R7H3 EJQA5 length=51
ACGAAGGGCATTTGGCAACGAACCTATACTATATCACGAGCAATTAGGCAG
>SPR0INGWKMILAXZZPF73BKRV73ELL1ROZSHYO1E9JX6P5X6MUZ20 length=51
TTAAGATTCCGAGAGAGTCTGTGGTTGGGTCCCTCTCCATAATGCTTTCAT
>U CB82L2M3BWJ9ZZZG3V6DL0Y0FTW8BXIFTU0TK7SL VN5QE42J6 length=51
TGGCTGGGAGGGATAGTAGCGAGATTATCCAGTCCGGTGTTCACAAGCCGG
>TGH62IM2TJK2QJTIVWGZR IVKOW94X7D52KGM7DJP61KKF81KPYA length=51
GCCGCAATACCAAAAAGATGGTGTGATTAGGAAATAAATAGGATCCAACAA
>L5VS9AVEF3UTEO3341STET092YEZ43ESVH32W7GI6Q27OIR NTXF length=51
CTATAGACGACTACGAACGACTGCAGGTGTGCCAGTTCTCGGTGAAACGGA
If we now compress that file:
$ tar cvzf file.fa.tar.gz file.fa
file.fa
$ ls -lh file.fa.tar.gz
-rw-r--r-- 1 terdon terdon 56M Feb 9 12:22 file.fa.tar.gz
With real data you will likely get a significantly better compression ratio than this, since your headers won't be random and will probably be very repetitive.
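To see why repetitive headers matter, here is a rough sketch that builds two toy files with identical sequences, one with random headers and one with repetitive ones, and compresses both. The file names, the `read_N` header pattern and the record count are made up for illustration, and I use plain `gzip` here only to keep the toy comparison simple.

```shell
# Build two toy FASTA files with identical sequences: one with random
# 20-character headers, one with repetitive "read_N" style headers.
awk 'BEGIN {
    srand(1)
    for (i = 1; i <= 2000; i++) {
        seq = ""; hdr = ""
        for (j = 0; j < 51; j++)
            seq = seq substr("ACGT", int(rand() * 4) + 1, 1)
        for (j = 0; j < 20; j++)
            hdr = hdr substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", int(rand() * 36) + 1, 1)
        print ">" hdr "\n" seq > "random_hdr.fa"
        print ">read_" i " length=51\n" seq > "repeat_hdr.fa"
    }
}'

# Compress each file, keeping the originals.
gzip -c random_hdr.fa > random_hdr.fa.gz
gzip -c repeat_hdr.fa > repeat_hdr.fa.gz

# Compare original and compressed sizes in bytes.
wc -c random_hdr.fa random_hdr.fa.gz repeat_hdr.fa repeat_hdr.fa.gz
```

The compressed file with repetitive headers should come out clearly smaller than the one with random headers, even though the sequences are the same.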
If that isn't enough, you should try simplifying your headers. For instance, just use an increasing number:
$ perl -pe 's/^>.*/">" . ++$i/e' file.fa > file.short.fa
$ ls -lh file.short.fa
-rw-r--r-- 1 terdon terdon 58M Feb 9 12:23 file.short.fa
See what a huge difference that one change made? And if we now compress the file:
$ tar cvzf file.short.tar.gz file.short.fa
file.short.fa
$ ls -lh file.short.tar.gz
-rw-r--r-- 1 terdon terdon 19M Feb 9 12:24 file.short.tar.gz
Well below the threshold!
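One caveat with renumbering is that the original headers are gone from the shortened file. A minimal sketch, assuming you want to be able to look them up again later: write a "number, original header" table next to the renumbered file. The names `demo.fa`, `demo.short.fa` and `demo.map.tsv` are hypothetical, and this uses awk instead of the perl one-liner above.

```shell
# Tiny stand-in for file.fa (made-up data, two records).
printf '>seqA some description length=4\nACGT\n>seqB other description length=4\nTTGA\n' > demo.fa

# Replace each header with a running number, and record
# "number<TAB>original header" in a separate mapping file.
awk -v map=demo.map.tsv '
    /^>/ { n++; print n "\t" substr($0, 2) > map; print ">" n; next }
         { print }
' demo.fa > demo.short.fa

cat demo.short.fa
cat demo.map.tsv
```

Only `demo.short.fa` needs to be uploaded; the mapping file stays with you so results can be matched back to the original names.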
If compression isn't enough, you should make sure you have no duplicate sequences in the file. Use the FastaToTbl and TblToFasta scripts I have posted here and run:
FastaToTbl file.fa | sort | uniq | TblToFasta > file.uniq.fa
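If you don't have those two scripts at hand, the same idea can be sketched with only awk and sort. This is not the FastaToTbl/TblToFasta code, just a stand-in that assumes headers contain no tab characters; the file names are made up. In this sketch two records count as duplicates only when both the header and the sequence are identical.

```shell
# Tiny stand-in for file.fa: the first and third records are identical.
printf '>r1\nACGT\n>r2\nTTGA\n>r1\nACGT\n' > dup.fa

# FASTA -> "header<TAB>sequence" (joining wrapped sequence lines),
# drop duplicate rows, then convert back to two-line FASTA.
awk '
    /^>/ { if (h != "") print h "\t" s; h = $0; s = ""; next }
         { s = s $0 }
    END  { if (h != "") print h "\t" s }
' dup.fa | sort -u | awk -F'\t' '{ print $1; print $2 }' > dup.uniq.fa

cat dup.uniq.fa
```

Note that the output is sorted by header, so the original record order is not preserved.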
Finally, if none of these helps enough, split your file into multiple smaller ones:
split -l 100000 file.fa
That will create multiple files of 100000 lines (so 50000 sequences, assuming all your sequences are one line) each.
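The line count matters: with one-line sequences each record is two lines, so any even number keeps a header and its sequence in the same piece. A small sketch of that check on toy data (the file name `small.fa` and the `part_` prefix are arbitrary choices for illustration):

```shell
# Tiny stand-in for file.fa: 5 records, one sequence line each.
for i in 1 2 3 4 5; do printf '>s%s\nACGT\n' "$i"; done > small.fa

# Use an even line count so a header and its sequence never end up
# in different pieces; "part_" is the output file prefix.
split -l 4 small.fa part_

# Every piece should start with a header line.
for f in part_*; do head -n 1 "$f"; done
```

If your sequences are wrapped over several lines, put each one on a single line first, otherwise a record can be cut in the middle.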
.tar.gz? What about the way to create a "tag" file like recommended here? – finswimmer Feb 08 '19 at 19:23