How do you normalise read coverage in a BAM file?

Question

This is a question from the Oxford Nanopore community, from user Michael Radzieta:

I have sequenced some plasmids using the rapid barcoding kit and have attempted to assemble the data using several programs (unicycler, spades, ra), but nothing has been able to assemble the plasmids I am expecting. I have read that very high coverage can negatively affect the assembly. Based on the approximate size of my plasmids and the number of filtered reads I have, the coverage is estimated at >5000X.

Can anyone recommend a tool to subsample or normalise nanopore reads to a certain depth, such as 100x coverage?

https://bioinformatics.stackexchange.com/questions/402/how-can-i-downsample-a-bam-file-while-keeping-both-reads-in-pairs/5648#5648 (Apr 13 '20 at 09:21)

gringer · Answer 1 · 2023-12-14T09:58:01.337

I just updated one of my SAM-normalisation tools to more accurately normalise mapped reads to a target coverage. It uses reservoir sampling on a SAM-formatted file to select reads so that coverage reaches 100X wherever the input has enough reads to allow it.

https://gitlab.com/gringer/bioinfscripts/blob/master/samNormalise.pl
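For readers unfamiliar with reservoir sampling, here is a minimal generic sketch in Python (the classic "Algorithm R"). It is not a port of samNormalise.pl, and the function name `reservoir_sample` is made up for illustration; it only shows the general idea of picking a fixed number of items uniformly at random from a stream without knowing its length in advance.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Pick up to k items uniformly at random from an iterable of unknown length.

    Generic illustration only (hypothetical helper, not samNormalise.pl's logic).
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A coverage-aware tool has to do more than this (it needs to consider where each read maps), but the "keep a fixed-size random selection while streaming" step is the same principle.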

Example Usage:

samtools view -h input.bam | ~/scripts/samNormalise.pl | samtools view -b - > normalised.bam

The output BAM file can be converted into a fastq file using 'samtools fastq':

samtools fastq normalised.bam > normalised.fastq

Here are some example before and after shots ('samtools mpileup input.bam' and 'samtools mpileup normalised.bam'; images not preserved).

Notice that wherever the input coverage is below 100, the normalised coverage is identical to the input. The normalised coverage also drops below 100 in some areas because some of the selected reads end there, and no additional reads start until later on in the reference sequence.
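To make the point about reads ending before new ones start more concrete, here is a small Python sketch that computes per-position depth from read intervals. The function name `depth_profile` and the toy coordinates are assumptions for illustration; with real data you would compare the output of 'samtools depth' on the two BAM files instead.

```python
from itertools import accumulate


def depth_profile(intervals, ref_len):
    """Per-position depth for 0-based, half-open (start, end) read intervals.

    Hypothetical helper for illustration; assumes 0 <= start < end <= ref_len.
    """
    diff = [0] * (ref_len + 1)
    for start, end in intervals:
        diff[start] += 1  # depth rises where a read starts
        diff[end] -= 1    # depth falls just after a read ends
    # The running sum of the difference array gives the depth at each position.
    return list(accumulate(diff[:-1]))


# Toy example: two reads end at position 5 and only one read extends to 8,
# so depth drops from 3 to 1, then to 0 where no read covers the reference.
print(depth_profile([(0, 5), (0, 5), (3, 8)], 10))
# [2, 2, 2, 3, 3, 1, 1, 1, 0, 0]
```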

The options are coded as variables at the beginning of the script at the moment, but I can easily convert them to command-line options in the future if there's demand for it.

This will work for long reads (i.e. PacBio, Nanopore), bearing in mind that there may be more low-coverage gaps if the reads are sparsely placed.

As implemented, this won't work properly with paired reads and will leave behind singletons.
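If keeping pairs together matters more than an even coverage cap, one common alternative is to decide per read name rather than per alignment record, so both mates share the same fate. The sketch below is an assumption-laden illustration in Python, not part of samNormalise.pl: `keep_read` and `subsample_sam` are hypothetical names, and it performs plain fractional subsampling, not coverage normalisation.

```python
import hashlib


def keep_read(qname, fraction, seed=0):
    """Deterministically keep roughly `fraction` of read names (hypothetical helper)."""
    digest = hashlib.md5(f"{seed}:{qname}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < fraction


def subsample_sam(lines, fraction, seed=0):
    """Yield SAM header lines plus every record whose QNAME is selected.

    Mates share a QNAME, so a pair is either kept whole or dropped whole.
    """
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
        elif keep_read(line.split("\t", 1)[0], fraction, seed):
            yield line
```

For the numbers in the question, going from roughly 5000X to 100X corresponds to a fraction of about 100 / 5000 = 0.02. The '-s' option of 'samtools view' subsamples in a similar pair-preserving way.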

Nice code for quick normalization gringer! Can you tell me how this deals with paired reads? Will it take this into account when sampling or will it leave behind singletons? — rhodo_ryan, Dec 12 '23 at 18:45
Sorry, as implemented, it won't work properly with paired reads and will leave behind singletons. — gringer, Dec 14 '23 at 09:56
