I have 2 .csv files that contain lists of accession codes. For example, for this experiment a .csv file will contain:


Then I run the following:

for i in $(cat *.csv); do echo ${i}; /usr/bin/sratoolkit/bin/fastq-dump --gzip ${i} & done
  1. Some of my downloads take a long time, sometimes days. Is there a faster way to get data from NCBI?
  2. Are there mirrors of NCBI to accelerate the process?
  3. Why is it so slow?
Answers to this question can be found on Biostars. To give you a short summary:

  1. A faster way might be to use parallel-fastq-dump, as suggested in this answer. I have never tested that tool, though; in my own experience, prefetch is more stable than the fastq-dump command.

  2. In that same post, the Japanese mirror is reported to be faster; I don't know whether that is still the case.

  3. The reason this has never been optimized might be that downloading from SRA is not a task you have to do repeatedly, as described in this post.
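If you stay with the toolkit, one thing worth trying is to cap how many downloads run at once rather than backgrounding every accession. Here is a minimal sketch, assuming the .csv files hold whitespace-separated accession codes (as the loop in the question implies) and an xargs that supports -P (GNU and BSD xargs do; it is not in POSIX). run_limited is a made-up helper name, and whether this helps at all depends on where the bottleneck is:

```shell
# Hypothetical helper: read accession codes from stdin and run the given
# command once per code, with at most $1 invocations in flight at a time.
run_limited() {
    max_jobs=$1
    shift
    xargs -n 1 -P "$max_jobs" "$@"
}

# Example usage (commented out because it needs the SRA toolkit and network):
# cat *.csv | run_limited 4 prefetch
```

Once prefetch has finished, the fastq-dump --gzip step from the question could be fed through the same helper.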

You can try wget to download SRA files from the NCBI server. In my test it used about half the CPU time of prefetch, and it has the advantage of not clogging your ~/ncbi/public/sra/ directory if your home directory has limited space. A tutorial for the wget download method is available at https://github.com/johanzi/ncbi_tutorial.
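The FTP path for a run appears to follow a fixed pattern: the first three characters of the accession, then the first six, then the full accession. This is inferred only from the single URL timed in this answer, so treat it as an assumption; the server layout may also have changed since. A small sketch that builds the URL (sra_url is a made-up function name):

```shell
# Hypothetical helper: build the FTP URL for a run accession, following the
# pattern of the one example URL in this answer (an assumption, not an API).
sra_url() {
    acc=$1
    prefix3=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRR
    prefix6=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR330
    printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix3" "$prefix6" "$acc" "$acc"
}

# Print the URL for the run used in the timings.
sra_url SRR3300113

# Example download into a directory of your choice (needs network):
# wget -P /path/to/sra_files "$(sra_url SRR3300113)"
```

The -P option of wget sets the directory the file is saved to, so the download does not have to land in your home directory.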

Here is my timing of the prefetch method:

time prefetch SRR3300113
real    3m5.609s
user    0m6.332s
sys     0m2.076s

Versus the wget method:

time wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR330/SRR3300113/SRR3300113.sra
real    3m4.696s
user    0m0.456s
sys     0m3.332s

That makes a user+sys time of 8.408s for prefetch and 3.788s for wget. There is not much difference in real time, though (about a second).
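The user+sys figures can be recomputed from the time output. A small sketch, assuming awk is available and the values are in the 0m6.332s format that bash's time prints (to_seconds and cpu_seconds are names I made up):

```shell
# Convert a bash `time` value such as 3m5.609s into seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Add the user and sys values to get the total CPU time in seconds.
cpu_seconds() {
    awk -v u="$(to_seconds "$1")" -v s="$(to_seconds "$2")" \
        'BEGIN { printf "%.3f\n", u + s }'
}

cpu_seconds 0m6.332s 0m2.076s   # prefetch: user, sys
cpu_seconds 0m0.456s 0m3.332s   # wget: user, sys
```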

Johan Zicola
  • Hello Johan, welcome to Bioinformatics.SE. Is there a reason to think that wget would be faster? Intuitively I would guess that it's the server that is clogged, so using a different program to query it does not sound like something that would help (by the way, my inability to see how it could help does not mean that it does not :-)). – Kamil S Jaron Apr 29 '19 at 15:56
  • Hi Kamil, the wget process is about 2x faster in terms of CPU time than prefetch method based on my test: – Johan Zicola Apr 30 '19 at 14:55
  • user + sys roughly tells you how long the CPUs were active during the real time of execution (i.e. for the user, real is what matters; you can check this). So what I take from your nice test is that wget is the more optimized tool, but both fetch the data from exactly the same place, and the limiting factor is the speed of their FTP. – Kamil S Jaron May 01 '19 at 07:47
  • Yes. I would say the main advantage of wget is that you can choose where to put your downloaded SRA files, whereas prefetch puts them by default into your home directory, /home/user/ncbi/public/sra. That can be a problem when you work on a server and have limited disk space in your home directory. – Johan Zicola May 01 '19 at 11:52
  • You can choose where to put the cache, at least as of a few versions ago, via vdb-config. I agree it's a total pain that it defaults to the home directory instead of the current one, but it's a fixable problem. – GenesRus Mar 20 '21 at 02:53