
I have a pipeline for generating a BigWig file from a BAM file:

BAM -> BedGraph -> BigWig

It uses bedtools genomecov for the BAM -> BedGraph step and bedGraphToBigWig for the BedGraph -> BigWig step.

The use of bedGraphToBigWig to create the BigWig file requires the BedGraph file to reside on disk in uncompressed form, as it performs seeks. This is problematic for large genomes and BAM files with variable coverage, since these produce more step changes, and therefore more lines, in the BedGraph file. My BedGraph files are on the order of 50 Gbytes in size, and all that IO for 10-20 BAM files seems unnecessary.

Are there any tools capable of generating a BigWig file without having to use an uncompressed BedGraph file on disk? I'd like this conversion to happen as quickly as possible.

I have tried the following tools, but they still create/use an intermediate BedGraph file:

  • deepTools

Some Benchmarks, Ignoring IO

Here are some timings for creating a BigWig file from a BAM file using three different pipelines. All files reside on a tmpfs, i.e. in memory.

BEDTools and Kent Utils

This is the approach taken by most.

time $(bedtools genomecov -bg -ibam test.bam -split -scale 1.0 > test.bedgraph \
  && bedGraphToBigWig test.bedgraph test.fasta.chrom.sizes kent.bw \
  && rm test.bedgraph)

real    1m20.015s
user    0m56.608s
sys     0m27.271s

SAMtools and Kent Utils

Replacing bedtools genomecov with samtools depth and a custom awk script (depth2bedgraph.awk; a sketch is given below) to output BedGraph format yields a significant performance improvement:

time $(samtools depth -Q 1 --reference test.fasta test.bam \
  | mawk -f depth2bedgraph.awk \
  > test.bedgraph \
  && bedGraphToBigWig test.bedgraph test.fasta.chrom.sizes kent.bw \
  && rm test.bedgraph)

real    0m28.765s
user    0m44.999s
sys     0m1.166s

Although it has fewer features than gawk, we used mawk here as it's faster (and we don't need those extra features).
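
For reference, a minimal version of depth2bedgraph.awk might look like the sketch below (the exact script used for the timings above may differ). It collapses the per-base chrom/pos/depth lines emitted by samtools depth into 0-based, half-open BedGraph intervals, merging runs of consecutive positions that share the same depth:

BEGIN { OFS = "\t" }
{
    # Close the current interval when the chromosome or depth changes, or
    # when positions are non-contiguous (samtools depth omits zero-depth
    # bases unless run with -a)
    if ($1 != chrom || $3 != depth || $2 != prev + 1) {
        if (chrom != "") print chrom, start - 1, prev, depth
        chrom = $1; start = $2; depth = $3
    }
    prev = $2
}
END { if (chrom != "") print chrom, start - 1, prev, depth }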

Parallelising with xargs

If you want a BigWig file per chromosome, you can easily parallelise this across chromosomes/reference sequences. We can use xargs to run 5 parallel BAM -> BedGraph -> BigWig pipelines, each using the tmpfs-mounted /dev/shm for the intermediate BedGraph files.

cut -f1 test.fasta.chrom.sizes \
  | xargs -I{} -P 5 bash -c 'mkdir "/dev/shm/${1}" \
  && samtools depth -Q 1 --reference test.fasta -r "${1}" test.bam \
  | mawk -f scripts/depth2bedgraph.awk \
  > "/dev/shm/${1}/test.bam.bedgraph" \
  && mkdir "./${1}" \
  && bedGraphToBigWig \
    "/dev/shm/${1}/test.bam.bedgraph" \
    test.fasta.chrom.sizes \
    "./${1}/test.bam.bw" \
  && rm "/dev/shm/${1}/test.bam.bedgraph"' -- {}

deepTools

Let's see how deepTools performs.

time bamCoverage --numberOfProcessors max \
  --minMappingQuality 1 \
  --bam test.bam --binSize 1 --skipNonCoveredRegions \
  --outFileName deeptools.bw

real    0m40.077s
user    3m56.032s
sys     0m9.276s
Daniel Standage
  • It's been on my to do list to remove the "make a single bedGraph file" intermediate step from deepTools. Given how that's all implemented there's no good way to avoid at least some intermediate file creation (without blowing up the memory at least). – Devon Ryan Jun 07 '17 at 09:41
  • I needed to generate the intermediate bedgraph files when I was creating BigWig files for GBrowse in 2011; it's disappointing to see that it hasn't changed much in the many years since then. The file seeking is a killer -- it excludes most of the shell tricks that can be used to work around programs that don't explicitly support the common UNIX pipeline (e.g. <(), FIFO buffers, /dev/stdin). – gringer Jun 07 '17 at 10:25
@DevonRyan when you say blow up memory, will that be equivalent to the size of the intermediary bedgraph file? I have a 720 Gbyte fat-memory compute node... so for me, speed is a bigger issue than memory. Perhaps an option to state the max memory usage, at which point you start going to disk? – Nathan S. Watson-Haigh Jun 07 '17 at 10:33
  • @NathanS.Watson-Haigh Yeah, it'd be equivalent more or less to the file size. deepTools itself was never engineered to make this easier to implement and the generally crappy nature of multithreading in python doesn't help that :( – Devon Ryan Jun 07 '17 at 11:08
@DevonRyan how about using the system's temp dir as the default location for the intermediary bedgraph file and exposing this as a command line argument? That way I could use my RAIDed SSDs for more convenient and faster IO. I was doing this with my bedtools and bedGraphToBigWig pipeline. – Nathan S. Watson-Haigh Jun 07 '17 at 11:17
  • The system temp dir (or shared memory) is already what's used for all of the tmp files. Honestly, the current implementation is a hold-over from before I wrote pyBigWig (or even joined the group) and deepTools was using bedGraphToBigwig in a system call. It's mostly a matter of me finding time to change how bamCoverage does this final step. – Devon Ryan Jun 07 '17 at 11:28
  • @DevonRyan is there any reason a named pipe as suggested by my answer wouldn't work for this? I have never used these tools so I can't be sure. – terdon Jun 07 '17 at 11:48
  • @terdon: Sorry, I was referring specifically to deepTools, which was mentioned in the post. One can certainly do what you've described, it'll just be slower than what deeptools can do (but better with IO, so maybe it's all a wash at that point!). – Devon Ryan Jun 07 '17 at 12:07
  • @terdon I don't see your named pipe based answer – bli Jun 07 '17 at 15:12
  • @bli no I deleted it since it won't work here because the question requires the ability to seek the file. FIFOs won't work with seeking AFAIK, so the answer wasn't useful. I hadn't realized the OP required a seekable file. – terdon Jun 07 '17 at 15:14

3 Answers


This can be done in R very easily from an indexed .bam file.

Given a single-end BAM file for sample1:

library(GenomicAlignments)
library(rtracklayer)

## read in BAM file (use readGAlignmentPairs for paired-end files)
gr <- readGAlignments('sample1.bam')

## convert to coverages
gr.cov <- coverage(gr)

## export as bigWig
export.bw(gr.cov,'sample1.bigwig')

Be aware that this method doesn't include normalization steps (such as normalizing to total coverage). Most of these additional steps can be added if necessary.
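
For example, a rough reads-per-million scaling could be applied before export; a sketch (RleList coverages support simple arithmetic):

## sketch: scale coverage to reads per million, then export
n.reads <- length(gr)
export.bw(gr.cov * (1e6 / n.reads), 'sample1.rpm.bigwig')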

story
  • I had hoped for a command line tool but I'm not totally against going into R if needs be. I need to test more extensively with regards to speed so I need to have more of a play. If tests show that it's slow for the number and sizes of BAM files I have, or a better command line answer materialises, I will accept this answer. – Nathan S. Watson-Haigh Jun 08 '17 at 04:15

Just for reference, there is a brief but first-hand explanation (from Jim Kent) of why bedGraphToBigWig does not support streaming: http://genome.soe.ucsc.narkive.com/2S1Z3VpG/bedgraphtobigwig-reading-in-from-stdin

Edit: as noted in the comments, wigToBigWig accepts bedGraph input streamed via stdin, but it uses (possibly much) more memory than bedGraphToBigWig does:

cat file.bedgraph | wigToBigWig stdin chrom.sizes output.bigwig
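
Combined with the samtools depth approach benchmarked in the question, this allows the whole pipeline to run without an on-disk BedGraph at all. A sketch, assuming the same depth2bedgraph.awk script:

samtools depth -Q 1 --reference test.fasta test.bam \
  | mawk -f depth2bedgraph.awk \
  | wigToBigWig stdin test.fasta.chrom.sizes test.bw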

Note: you can also use wiggletools (e.g. wiggletools bin 10) to bin rows and downsample.

Colin D
  • Ah! "However if you do have enough memory, you can still use wigToBigWig on bedGraphs, and it will fit in a pipe." – Nathan S. Watson-Haigh Aug 07 '17 at 21:16
  • That's sort of a weird decision. The zoom levels are written after the raw entries, so one can write all of the raw interval entries to disk and then do another pass over them when making the zoom levels. Then you don't need to store everything in memory. That's how libBigWig works, in fact. – Devon Ryan Aug 07 '17 at 21:17

The QuasR package in R has a useful function called qExportWig.

To use it, you first need to create a sample sheet and a qProject object; this can be done starting from BAM files or FASTQ files (in which case it also performs the alignment). You can then create wig or bigWig files without writing a BedGraph file to disk.

An example would look something like this:

library(QuasR)

# Create qProject object using qAlign
# Sample file should contain tab-separated file names (BAM or fastq) and sample names
myproj <- qAlign("samples.txt", genome=mygenome, paired="no")

# Export all samples in project as bigWig, scaling to mean number of alignments by default
qExportWig(myproj, binsize=50, createBigWig=TRUE)

It has a lot of flexible options, including scaling or log-transforming counts, shifting reads and changing the bin size.
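
For example, something like the following (argument names as per the QuasR documentation; treat the exact values as illustrative, not from the original answer):

# sketch: log2-transform counts and shift reads by 100 bp before export
qExportWig(myproj, binsize=50, shift=100, log2p1=TRUE, createBigWig=TRUE)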

Sarah Carl