19

This is based on a question from betsy.s.collins on BioStars. The original post can be found here.

Does anyone have any suggestions for other tags or filtering steps on BWA-generated BAM files that can be used so reads only map to one location? One example application would be to find seeds for the TULIP assembler/scaffolder, which works best for reads that map to unique genomic locations.

The questioner is referring to genomically-unique alignments, which are something different from multiple possible alignments at a single location. If the only multiple alignment is at the same genomic location, those alignments are still unique in the genomic sense. The impossibility of finding the exact, correct, alignment is a well-known problem, and there are a few downstream methods (e.g. left normalisation) to make sure that multiple local alignments and/or sequencing errors have reduced impact.

The concept of "uniquely mapped reads" is a loaded term, and most sources suggest filtering by MAPQ should do the trick. However, this approach doesn't seem to work when using BWA as a read mapper.

Uniquely mapped reads (i.e. a read that maps to a single location in the genome) are sometimes preferred when running analyses that depend on the quantification of reads, rather than just the coverage (e.g. RNASeq). Duplicated reads require additional filtering or normalisation on the analysis side, and most downstream programs won't consider concepts like "a fifth of the read maps somewhere here on chromosome 5, and four-fifths of the read maps somewhere here on chromosome 14."

gringer
  • 14,012
  • 5
  • 23
  • 79

2 Answers2

13

Update - as of January 2021, samtools can now do filtering based on an expression that includes tag variables. In this case, this expression can be used to exclude any reads that have either an XA or SA tag:

samtools view -b mapped.bam -e '!([XA] | [SA])' > unique_mapped.bam

For more details on the samtools expression parameter, see the samtools documentation:

https://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS


Original answer follows....

To exclude all possible multi-mapped reads from a BWA-mapped BAM file, it looks like you need to use grep on the uncompressed SAM fields:

samtools view -h mapped.bam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b > unique_mapped.bam

Explanation follows...

I'm going to assume a situation in which a bioinformatician is presented with a mapped BAM file produced by BWA, and has no way of getting the original reads. One high-effort solution would be to extract the mapped reads from the BAM file and re-map with a different mapper that uses the MAPQ score to indicate multiple mappings.

... but what if that were not possible?

My understanding of BWA's output is that if a read maps perfectly to multiple genomic locations, it will be given a high mapping quality (MAPQ) score for both locations. Many people expect that a read that maps to at least two locations can have (at best) a 50% probability of mapping to one of those locations (i.e. MAPQ = 3). Because BWA doesn't do this, it makes it difficult to filter out multiply-mapped reads from BWA results using the MAPQ filter that works for other aligners; this is likely to be why the current answer on Biostars [samtools view -bq 1] probably won't work.

Here is an example line from a BWA mem alignment that I've just made. These are Illumina reads mapped to a parasite genome that has a lot of repetitive sequence:

ERR063640.7     16      tig00019544     79974   21      21M2I56M1I20M   *       0       0       TATCACATATCATCCGACTCAGCTCGACGAGTACAATGCTAATTTAACACTTAGAATGCCCGGCAATGAAATTCGTTTTCCGTCAATTCTTGAAAATTTC    <AABBEGABFJKKKIM7GHKKJK>JLKLDGMHLIMIHHCGIJKKLJKLNJGLLLKLILKLMFNDLKGHJEKMKKMIJHGLOJLLLKIJLKKJEJLIGG>D    NM:i:4  MD:Z:83A13      AS:i:77 XS:i:67 XA:Z:tig00019544,-78808,21M2I56M1I20M,6;tig00019544,-84624,79M1I20M,6;tig00019544,-79312,33M4I42M1I20M,8;

BWA mem has found that this particular read, ERR063460.7, maps to at least three different locations: tig00019544, tig00019544, and tig00019544. Note that the MAPQ for this read is 21, so even though the read maps to multiple locations, MAPQ can't be used to determine that.

However, the alternative locations are shown by the presence of the XA tag in the custom fields section of the SAM output. Perhaps just filtering on lines that contain the XA tag will be able to exclude multiply-mapped reads. The samtools view man page suggests that -x will filter out a particular tag:

$ samtools view -x XA output.bam | grep '^ERR063640\.7[[:space:]]'
ERR063640.7 16  tig00019544 79974   21  21M2I56M1I20M   *   0   0   TATCACATATCATCCGACTCAGCTCGACGAGTACAATGCTAATTTAACACTTAGAATGCCCGGCAATGAAATTCGTTTTCCGTCAATTCTTGAAAATTTC    <AABBEGABFJKKKIM7GHKKJK>JLKLDGMHLIMIHHCGIJKKLJKLNJGLLLKLILKLMFNDLKGHJEKMKKMIJHGLOJLLLKIJLKKJEJLIGG>D    NM:i:4  MD:Z:83A13  AS:i:77 XS:i:67

... so it filtered out the tag (i.e. the tag no longer exists in the SAM output), but not the read. There are no useful bits in the FLAG field to indicate multiple genomic mappings (which I know can be filtered to exclude the read as well), so I have to resort to other measures.

In this particular case, I can use grep -v on the uncompressed SAM output to exclude alignment lines that have the XA tag (and re-compress to BAM afterwards, just to be tidy):

$ samtools view -h output.bam | grep -v 'XA:Z:' | samtools view -b > output_filtered.bam
$ samtools view output_filtered.bam | grep '^ERR063640\.7[[:space:]]'
<no output>

Hurray! reads filtered. As a little aside, this grep search has a fairly substantial computational load: it's looking for some string with the text XA:Z: somewhere in the line, and doesn't actually capture every situation. Some masochistic person might come along at a later date and decide that they're going to call all their reads HAXXA:Z:AWESOME!:<readNumber>, in which case a tweak to this grep search would be needed to make sure that there's a space (or more specifically a tab character) prior to the XA:Z:.

Now I do a check for any duplicated read names, just to be sure:

$ samtools view output_filtered.bam | awk '{print $1}' | sort | uniq -d
ERR063640.1194
ERR063640.1429
ERR063640.1761
ERR063640.2336
ERR063640.2825
ERR063640.3458
ERR063640.4421
ERR063640.4474
ERR063640.4888
ERR063640.49
ERR063640.4974
ERR063640.5070
ERR063640.5130
ERR063640.5300
ERR063640.5868
ERR063640.6116
ERR063640.6198
ERR063640.6468
ERR063640.6717
ERR063640.6797
ERR063640.7322
ERR063640.750
ERR063640.7570
ERR063640.7900
ERR063640.8115
ERR063640.8405
ERR063640.911
ERR063640.9206
ERR063640.9765
ERR063640.9986

Oh... damn. I wonder what they are:

$ samtools view output_filtered.bam | grep '^ERR063640.3458[[:space:]]'
ERR063640.3458  16  tig00002961 5402    60  58S38M  *   0   0   AGGTACCATTCGATAGAGGGAGAAAGGCACTACTAAAGATTTTGCCACATTTGCTATATCCGTATCGCGAAGATCAGGACTTACTCCGCAGAAGAA    DD6HFFJBKFH=KDILKLGLJEKLKGFJIH8IKHLLMJEK:L:HBGJIHJKFLLKIHJDHLNKCK;KMKGMFKJILIIIMKI9JLKKHEJFII?CC    NM:i:0  MD:Z:38 AS:i:38 XS:i:0  SA:Z:tig00002377,202353,-,14M3I5M1I35M38S,19,5;
ERR063640.3458  2064    tig00002377 202353  19  14M3I5M1I35M38H *   0   0   AGGTACCATTCGATAGAGGGAGAAAGGCACTACTAAAGATTTTGCCACATTTGCTATA  DD6HFFJBKFH=KDILKLGLJEKLKGFJIH8IKHLLMJEK:L:HBGJIHJKFLLKIHJ  NM:i:5  MD:Z:5G48   AS:i:35 XS:i:27 SA:Z:tig00002961,5402,-,58S38M,60,0;

Aha! Supplemental alignments, which use the official SA tag [other canonical alignments in a chimeric alignment]. These appear to be situations where a single read has been split up, and maps to multiple locations. Note that in this case, both of the alignments still have MAPQ scores of over 3. It sounds like the questioner would also want to get rid of these situations as well. This time, there are standard flag fields as well to deal with these situations (0x800: secondary alignment). Except it's not enough to just filter the supplemental alignment, because both read mappings should be removed, rather than just the one (or ones) that happened to be tagged as secondary.

Luckily, BWA appears to put the SA tag into all reads containing supplementary alignments (if this is not the case, I'm sure someone will correct me on that). So, I add in the SA search as an additional grep filter:

$ samtools view -h output.bam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b > output_filtered.bam
$ samtools view output_filtered.bam | awk '{print $1}' | sort | uniq -d
<no output>

Done. Easy peasy! </sarcasm>

... that original "high-effort" solution of using a different aligner doesn't look so bad now.

gringer
  • 14,012
  • 5
  • 23
  • 79
  • 1
    that's a neat solution, and well argued. I don't think it's that high effort if you condense the whole thing to a few lines of shell script either. :) – roblanf Jun 07 '17 at 06:15
  • 2
    The thing with using a MAPQ to filter out multimappers is that it uses the presumption of a multimapper being "equally well mapping", which for even the examples presented here won't be the case. I've set your reply as the accepted answer on biostars but asked op there to clarify a bit on what exactly is wanted. – Devon Ryan Jun 07 '17 at 06:43
  • Yes, it's not hard if you know what to expect, which is why I've tried to run through all the problems that might happen. – gringer Jun 07 '17 at 06:57
  • About the read names possibly containing "XA:Z:", a way to achieve robustness would be to use a proper SAM format parser, such as pysam, if one programs in python, rather than grep. I don't know how much slower this would be. – bli Jun 07 '17 at 08:34
  • Or like samtools view, which I tried, but didn't do what I had expected. I can imagine that samtools might implement something like a tag filter in the future. Maybe it already has, and I'm just using the wrong sub-tool. – gringer Jun 07 '17 at 09:26
  • Okay, so it's not yet available in samtools, but the idea of doing tag filtering has already been raised in at least one issue – gringer Jun 07 '17 at 09:34
  • Worst case, you can write a couple lines of python or likely use Pierre's jvarkit to filter things out. – Devon Ryan Jun 07 '17 at 10:03
4

For much faster and stable$ implementation, sambamba can handle this using following one-liner:

sambamba view -t 12 -h -f bam -F "mapping_quality >= 1 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)" test.bwa-mem.bam -o test-uniq.bam

See manpage and sambamba synatx for filters for details on further usage.

$: Besides grep based search being slow, it may return false positive negative hits (@gringer's reply above). I also do not find awk based approach reliable because fields like XA, SA, etc. are optional fields in SAM format, and not positionally fixed fields.

Samir
  • 151
  • 1
  • 7
  • In what way were the hits false positive? – gringer Jun 22 '18 at 06:05
  • Hmm.. I guess it should be false negative because you were excluding those search patterns with grep -v, right? – Samir Jun 22 '18 at 15:45
  • The only likely way grep could produce the wrong results is if someone modified the read names or a read group to contain one of those strings, which seems quite unlikely. – Devon Ryan Jun 22 '18 at 19:35
  • agree, it is quite unlikely but nonetheless, grep based approach is inherently slow. – Samir Jun 22 '18 at 20:38
  • 1
    The slow part is in the conversion from BAM to SAM by samtools view. The grep program is very quick at what it does. – gringer Jun 22 '18 at 20:58