I'm trying to download three WGS datasets from the SRA that are each between 60 and 100GB in size. So far I've tried:

  • Fetching the .sra files directly from NCBI's ftp site
  • Fetching the .sra files directly using the aspera command line (ascp)
  • Using the SRA toolkit's fastq-dump and sam-dump tools

It's excruciatingly slow. I've had three fastq-dump processes running in parallel now for approximately 18 hours. They're running on a large AWS instance in the US East (Virginia) region, which I figure is about as close to NCBI as I can get. In 18 hours they've downloaded a total of 33GB of data. By my calculation that's ~500 kB/s. They do appear to still be running: the fastq files continue to grow and their timestamps continue to update.
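The rate quoted above can be checked with a few lines of shell arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes) and treats the 33GB as the combined total of all three processes; the variable names are illustrative only.

```shell
# Figures from the question: 33 GB downloaded in 18 hours (all three processes combined).
# Assumption: decimal units, 1 GB = 10^9 bytes and 1 kB = 10^3 bytes.
bytes=$((33 * 1000 * 1000 * 1000))
seconds=$((18 * 3600))

# Integer division, so the result is rounded down.
rate_kb_per_s=$((bytes / seconds / 1000))
echo "observed rate: ${rate_kb_per_s} kB/s"

# Whole hours needed to move 100 GB (10^8 kB) at that combined rate.
hours_for_100gb=$((100 * 1000 * 1000 / rate_kb_per_s / 3600))
echo "100 GB at this rate: about ${hours_for_100gb} hours"
```

At roughly 509 kB/s, a single 100GB dataset would need more than two days, which matches the "days or weeks" estimate for all three.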

At this rate it's going to take me days or weeks just to download the datasets. Surely the SRA must be capable of moving data at higher rates than this? I've also looked, and unfortunately the datasets I'm interested in have not been mirrored out to ENA or the Japanese archive, so it looks like I'm stuck working with the SRA.

Is there a better way to fetch this data that wouldn't take multiple days?

Glorfindel
tfenne
  • The SRA is definitely capable of moving this data at higher rates. The throttling happens not at their end but somewhere else in the network connection. In my test just now, using fastq-dump, I get a throughput of ~11.5 MiB/s, more than 20 times the rate you observe. – Konrad Rudolph Jun 03 '17 at 12:39
  • I don't know if this helps answer your question, but providing the commands you used might at least be useful for other readers. – bli Jun 05 '17 at 08:55
  • In general I find that it's the conversion from SRA to fastq that is slow, not the downloading. You can test this by running NCBI's prefetch tool first to download the file (using ascp if possible) and then running fastq-dump on it. Also note that ascp works better when you limit the bandwidth it's allowed. We limit it to 100m/sec. – Ian Sudbery Jun 06 '17 at 09:44
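The two-step approach from the last comment (download first, convert afterwards) might look like the sketch below. It only prints the commands rather than running them. The accession SRR0000000, the output directory, and the 200G size cap are placeholders, and the function name is made up for illustration. The cap is raised because prefetch refuses files above its default size limit, which datasets of 60 to 100GB would likely exceed.

```shell
# Dry-run sketch: prints the two commands instead of executing them.
# SRR0000000 is a hypothetical accession; 200G is an assumed size cap.
sra_two_step() {
    acc="$1"
    outdir="${2:-.}"
    # Step 1: download the .sra file only, with no conversion yet.
    echo "prefetch --max-size 200G ${acc}"
    # Step 2: convert the already-downloaded accession to FASTQ locally.
    echo "fastq-dump --split-files --gzip --outdir ${outdir} ${acc}"
}

sra_two_step SRR0000000 fastq_out
```

Timing the two steps separately shows whether the download or the conversion is the bottleneck.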

6 Answers

Proximity to NCBI may not necessarily give you the fastest transfer speed. AWS may be deliberately throttling the Internet connection to limit the likelihood that people will use it for undesirable things. There's a chance that a home network might be faster, but you're likely to get the fastest connection to NCBI by using an academic system that is linked to NCBI via a research network.

Another possibility is using Aspera for downloads. This is unlikely to help if bandwidth is being throttled, but it might help if there's a bit of congestion through the regular methods:

https://www.ncbi.nlm.nih.gov/public/

NCBI also has an online book about best practises for downloading data from their servers.

On a related note, just in case someone sees this and EBI/ENA is an option, there's a great guide for how to do file transfer using Aspera on the EBI web site:

https://www.ebi.ac.uk/ena/browse/read-download#downloading_files_aspera

Your command should look similar to this on Unix:

ascp -QT -l 300m -i <aspera connect installation directory>/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:<file or files to download> <download location>

In my case, I've just started downloading some files from a MinION sequencing run. The estimated completion time via standard FTP was 12 hours for about 32GB of data; ascp has reduced that estimated download time to about an hour. Here's the command I used for downloading:

ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/ERA932/ERA932268/oxfordnanopore_native/20160804_Mock.tar.gz .
gringer

In general, the best way to download SRA data is: don't download from SRA. However, as ENA has not been synced yet, I would recommend downloading from the SRA ftp site and then converting to fastq locally. You can find files in the SRA format here. Downloading and then converting locally is much faster than direct retrieval from NCBI, for some mysterious reason.

user172818
  • Agree. I always download via ENA; it's reliably fast and you get the fastq files natively. I usually search with SRA, as searching is terrible on ENA. – ithinkiam Jul 13 '17 at 09:00

By far the fastest method in my experience has been to use the SRAdb library in R. For most entries, you can download fastq files directly. Some older experiments don't have them, but I've still found it much faster to download SRA files via getSRAfile() and then convert them using fastq-dump than to use fastq-dump directly.

Sarah Carl

NCBI is slowly moving to the cloud (AWS/Google) for hosting. As such, the traditional FTP URLs do not always work. My solution involves using pysradb.

An entire project can be downloaded by:

pysradb download -p <SRP_ID>

A Google Colab notebook with metadata retrieval commands is here.

rightskewed

Another possibility in 2020 is to use fasterq-dump, which is promoted as being faster than the older fastq-dump.
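A typical invocation might look like the sketch below, which prints the command rather than running it. The accession SRR0000000, the thread count, and the output directory are placeholders, and the function name is made up for illustration.

```shell
# Dry-run sketch: prints a fasterq-dump command line instead of executing it.
# SRR0000000 is a hypothetical accession; fastq_out is an assumed output directory.
fasterq_cmd() {
    acc="$1"
    threads="${2:-6}"
    # --threads sets the number of worker threads; --split-files writes paired reads to separate files.
    echo "fasterq-dump --threads ${threads} --split-files --progress --outdir fastq_out ${acc}"
}

fasterq_cmd SRR0000000 8
```

Unlike fastq-dump, fasterq-dump does not compress its output, so the resulting FASTQ files need to be gzipped separately if space matters.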

tlask

I recommend using fastq-dl: it downloads from either ENA or SRA, is multithreaded, and retries automatically.

bricoletc