Update - as of January 2021, samtools can now do filtering based on an expression that includes tag variables. In this case, this expression can be used to exclude any reads that have either an XA or SA tag:
samtools view -b mapped.bam -e '!([XA] | [SA])' > unique_mapped.bam
For more details on the samtools expression parameter, see the samtools documentation:
https://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS
Original answer follows....
To exclude all possible multi-mapped reads from a BWA-mapped BAM file, it looks like you need to use grep
on the uncompressed SAM fields:
samtools view -h mapped.bam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b > unique_mapped.bam
Explanation follows...
I'm going to assume a situation in which a bioinformatician is presented with a mapped BAM file produced by BWA, and has no way of getting the original reads. One high-effort solution would be to extract the mapped reads from the BAM file and re-map with a different mapper that uses the MAPQ score to indicate multiple mappings.
... but what if that were not possible?
My understanding of BWA's output is that if a read maps perfectly to multiple genomic locations, it will be given a high mapping quality (MAPQ) score for both locations. Many people expect that a read that maps to at least two locations can have (at best) a 50% probability of mapping to one of those locations (i.e. MAPQ = 3). Because BWA doesn't do this, it makes it difficult to filter out multiply-mapped reads from BWA results using the MAPQ filter that works for other aligners; this is likely to be why the current answer on Biostars [samtools view -bq 1
] probably won't work.
Here is an example line from a BWA mem alignment that I've just made. These are Illumina reads mapped to a parasite genome that has a lot of repetitive sequence:
ERR063640.7 16 tig00019544 79974 21 21M2I56M1I20M * 0 0 TATCACATATCATCCGACTCAGCTCGACGAGTACAATGCTAATTTAACACTTAGAATGCCCGGCAATGAAATTCGTTTTCCGTCAATTCTTGAAAATTTC <AABBEGABFJKKKIM7GHKKJK>JLKLDGMHLIMIHHCGIJKKLJKLNJGLLLKLILKLMFNDLKGHJEKMKKMIJHGLOJLLLKIJLKKJEJLIGG>D NM:i:4 MD:Z:83A13 AS:i:77 XS:i:67 XA:Z:tig00019544,-78808,21M2I56M1I20M,6;tig00019544,-84624,79M1I20M,6;tig00019544,-79312,33M4I42M1I20M,8;
BWA mem has found that this particular read, ERR063460.7, maps to at least three different locations: tig00019544, tig00019544, and tig00019544. Note that the MAPQ for this read is 21, so even though the read maps to multiple locations, MAPQ can't be used to determine that.
However, the alternative locations are shown by the presence of the XA
tag in the custom fields section of the SAM output. Perhaps just filtering on lines that contain the XA
tag will be able to exclude multiply-mapped reads. The samtools view
man page suggests that -x
will filter out a particular tag:
$ samtools view -x XA output.bam | grep '^ERR063640\.7[[:space:]]'
ERR063640.7 16 tig00019544 79974 21 21M2I56M1I20M * 0 0 TATCACATATCATCCGACTCAGCTCGACGAGTACAATGCTAATTTAACACTTAGAATGCCCGGCAATGAAATTCGTTTTCCGTCAATTCTTGAAAATTTC <AABBEGABFJKKKIM7GHKKJK>JLKLDGMHLIMIHHCGIJKKLJKLNJGLLLKLILKLMFNDLKGHJEKMKKMIJHGLOJLLLKIJLKKJEJLIGG>D NM:i:4 MD:Z:83A13 AS:i:77 XS:i:67
... so it filtered out the tag (i.e. the tag no longer exists in the SAM output), but not the read. There are no useful bits in the FLAG field to indicate multiple genomic mappings (which I know can be filtered to exclude the read as well), so I have to resort to other measures.
In this particular case, I can use grep -v
on the uncompressed SAM output to exclude alignment lines that have the XA
tag (and re-compress to BAM afterwards, just to be tidy):
$ samtools view -h output.bam | grep -v 'XA:Z:' | samtools view -b > output_filtered.bam
$ samtools view output_filtered.bam | grep '^ERR063640\.7[[:space:]]'
<no output>
Hurray! reads filtered. As a little aside, this grep
search has a fairly substantial computational load: it's looking for some string with the text XA:Z:
somewhere in the line, and doesn't actually capture every situation. Some masochistic person might come along at a later date and decide that they're going to call all their reads HAXXA:Z:AWESOME!:<readNumber>
, in which case a tweak to this grep search would be needed to make sure that there's a space (or more specifically a tab character) prior to the XA:Z:
.
Now I do a check for any duplicated read names, just to be sure:
$ samtools view output_filtered.bam | awk '{print $1}' | sort | uniq -d
ERR063640.1194
ERR063640.1429
ERR063640.1761
ERR063640.2336
ERR063640.2825
ERR063640.3458
ERR063640.4421
ERR063640.4474
ERR063640.4888
ERR063640.49
ERR063640.4974
ERR063640.5070
ERR063640.5130
ERR063640.5300
ERR063640.5868
ERR063640.6116
ERR063640.6198
ERR063640.6468
ERR063640.6717
ERR063640.6797
ERR063640.7322
ERR063640.750
ERR063640.7570
ERR063640.7900
ERR063640.8115
ERR063640.8405
ERR063640.911
ERR063640.9206
ERR063640.9765
ERR063640.9986
Oh... damn. I wonder what they are:
$ samtools view output_filtered.bam | grep '^ERR063640.3458[[:space:]]'
ERR063640.3458 16 tig00002961 5402 60 58S38M * 0 0 AGGTACCATTCGATAGAGGGAGAAAGGCACTACTAAAGATTTTGCCACATTTGCTATATCCGTATCGCGAAGATCAGGACTTACTCCGCAGAAGAA DD6HFFJBKFH=KDILKLGLJEKLKGFJIH8IKHLLMJEK:L:HBGJIHJKFLLKIHJDHLNKCK;KMKGMFKJILIIIMKI9JLKKHEJFII?CC NM:i:0 MD:Z:38 AS:i:38 XS:i:0 SA:Z:tig00002377,202353,-,14M3I5M1I35M38S,19,5;
ERR063640.3458 2064 tig00002377 202353 19 14M3I5M1I35M38H * 0 0 AGGTACCATTCGATAGAGGGAGAAAGGCACTACTAAAGATTTTGCCACATTTGCTATA DD6HFFJBKFH=KDILKLGLJEKLKGFJIH8IKHLLMJEK:L:HBGJIHJKFLLKIHJ NM:i:5 MD:Z:5G48 AS:i:35 XS:i:27 SA:Z:tig00002961,5402,-,58S38M,60,0;
Aha! Supplemental alignments, which use the official SA
tag [other canonical alignments in a chimeric alignment]. These appear to be situations where a single read has been split up, and maps to multiple locations. Note that in this case, both of the alignments still have MAPQ scores of over 3. It sounds like the questioner would also want to get rid of these situations as well. This time, there are standard flag fields as well to deal with these situations (0x800: secondary alignment). Except it's not enough to just filter the supplemental alignment, because both read mappings should be removed, rather than just the one (or ones) that happened to be tagged as secondary.
Luckily, BWA appears to put the SA
tag into all reads containing supplementary alignments (if this is not the case, I'm sure someone will correct me on that). So, I add in the SA
search as an additional grep
filter:
$ samtools view -h output.bam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b > output_filtered.bam
$ samtools view output_filtered.bam | awk '{print $1}' | sort | uniq -d
<no output>
Done. Easy peasy! </sarcasm>
... that original "high-effort" solution of using a different aligner doesn't look so bad now.
samtools view
, which I tried, but didn't do what I had expected. I can imagine that samtools might implement something like a tag filter in the future. Maybe it already has, and I'm just using the wrong sub-tool. – gringer Jun 07 '17 at 09:26