I have a pipeline for generating a BigWig file from a BAM file:

BAM -> BedGraph -> BigWig

It uses bedtools genomecov for the BAM -> BedGraph step and bedGraphToBigWig for the BedGraph -> BigWig step.
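In outline, the two steps look something like this (the file names sample.bam, chrom.sizes and sample.bw are placeholders for my actual data):

bedtools genomecov -bg -ibam sample.bam -split > sample.bedgraph
bedGraphToBigWig sample.bedgraph chrom.sizes sample.bw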
bedGraphToBigWig requires the BedGraph file to reside on disk in uncompressed form because it performs seeks. This is problematic for large genomes and variable-coverage BAM files, where there are many step changes (i.e. lines) in the BedGraph file. My BedGraph files are on the order of 50 GB each, and all that IO for 10-20 BAM files seems unnecessary.
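The obvious way to avoid the on-disk file would be to stream the BedGraph in via process substitution or a FIFO, but since bedGraphToBigWig needs to seek, a pipe should not work; for example, something like the following is not expected to succeed:

# not expected to work: bedGraphToBigWig seeks, and a pipe isn't seekable
bedGraphToBigWig <(bedtools genomecov -bg -ibam sample.bam -split) chrom.sizes sample.bw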
Are there any tools capable of generating a BigWig file without having to use an uncompressed BedGraph file on disk? I'd like this conversion to happen as quickly as possible.

I have tried the following tools, but they still create/use an intermediary BedGraph file:
deepTools
Some Benchmarks, Ignoring IO
Here are some timings I get for creating a BigWig file from a BAM file using 3 different pipelines. All files reside on a tmpfs, i.e. in memory.
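For reference, the setup amounts to staging the inputs on a tmpfs before running the pipelines; the paths below are purely illustrative:

# stage the inputs on a tmpfs (e.g. /dev/shm) so the timings exclude disk IO
mkdir -p /dev/shm/bench
cp test.bam test.fasta test.fasta.chrom.sizes /dev/shm/bench/   # plus .bai/.fai indexes if needed
cd /dev/shm/bench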
BEDTools and Kent Utils
This is the approach taken by most.
time $(bedtools genomecov -bg -ibam test.bam -split -scale 1.0 > test.bedgraph \
&& bedGraphToBigWig test.bedgraph test.fasta.chrom.sizes kent.bw \
&& rm test.bedgraph)
real 1m20.015s
user 0m56.608s
sys 0m27.271s
SAMtools and Kent Utils
Replacing bedtools genomecov with samtools depth and a custom awk script (depth2bedgraph.awk) to output BedGraph format gives a significant performance improvement:
time $(samtools depth -Q 1 --reference test.fasta test.bam \
| mawk -f depth2bedgraph.awk \
> test.bedgraph \
&& bedGraphToBigWig test.bedgraph test.fasta.chrom.sizes kent.bw \
&& rm test.bedgraph)
real 0m28.765s
user 0m44.999s
sys 0m1.166s
Although it has fewer features, we used mawk here as it's faster than gawk (we don't need those extra features anyway).
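The depth2bedgraph.awk script itself isn't shown here; its core job is a run-length merge of per-base depths into BedGraph intervals. Below is a minimal sketch of that logic, with the awk program inlined into the pipeline rather than kept in a separate file. It assumes the default samtools depth output of chromosome, 1-based position and depth, with zero-coverage positions absent, so a gap in positions also starts a new interval:

samtools depth -Q 1 --reference test.fasta test.bam \
  | mawk 'BEGIN { OFS = "\t" }
          # start a new interval on a new chromosome, a coverage gap, or a depth change
          $1 != chrom || $2 != pos + 1 || $3 != depth {
              if (chrom != "") print chrom, start - 1, pos, depth
              chrom = $1; start = $2; depth = $3
          }
          { pos = $2 }
          # flush the final interval; BedGraph is 0-based, half-open, hence start - 1
          END { if (chrom != "") print chrom, start - 1, pos, depth }' \
  > test.bedgraph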
Parallelising with xargs
If you want a BigWig file per chromosome, you can easily parallelise this across chromosomes/reference sequences. We can use xargs to run 5 parallel BAM -> BedGraph -> BigWig pipelines, each using the tmpfs-mounted /dev/shm for the intermediary BedGraph files.
cut -f1 test.fasta.chrom.sizes \
| xargs -I{} -P 5 bash -c 'mkdir /dev/shm/${1} \
&& samtools depth -Q 1 --reference test.fasta -r "${1}" test.bam \
| mawk -f scripts/depth2bedgraph.awk \
> "/dev/shm/${1}/test.bam.bedgraph" \
&& mkdir "./${1}" \
&& bedGraphToBigWig \
"/dev/shm/${1}/test.bam.bedgraph" \
test.fasta.chrom.sizes \
"./${1}/test.bam.bw" \
&& rm "/dev/shm/${1}/test.bam.bedgraph"' -- {}
deepTools
Let's see how deepTools performs.
time bamCoverage --numberOfProcessors max \
--minMappingQuality 1 \
--bam test.bam --binSize 1 --skipNonCoveredRegions \
--outFileName deeptools.bw
real 0m40.077s
user 3m56.032s
sys 0m9.276s