4

I did a mapping of genomic paired-end reads to a reference assembly using bwa mem. I need to extract the reads that mapped only once to my reference. For that, I have tried to follow the method proposed in another post Obtaining uniquely mapped reads from BWA mem alignment and unfortunately in my case, after using:

$ samtools view -h output.bam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b > output_filtered.bam
$ samtools view output_filtered.bam | awk '{print $1}' | sort | uniq -d

I still obtain some outputs. Here is an example:

NB501461:7:HH3GGAFXX:4:21612:26866:20400    99  ctg7180000256968    441 0   147M4S  =   560 230 AAGCTTTAGTCGCATGGCCATTCACAAGACCAGTGATTTTTTATAGAAGTCGTTCAAGAGATACTGATGAGGAGTACAGCCATTCTTTAAGCTATACAAAAAGCACCACCCACACCACACAATCCTCAAAAAGATCAAGGCAGTACAGATT AAAAAEEAEEAEAAEEEAAEEEAEEEEE/EEEEEEAA/EEEEEEEEEEEEEEAEE/EEEE///E6EAEEEEAEA<EE/EA/</EEAE/EEEEE//EEEEEE6E/<EAE/EEE/EE</E/E//AE<6EA//A<A///<A/6A/6//A///// NM:i:7  MD:Z:34A56T11A4A3A2A22T8    AS:i:112    XS:i:117
NB501461:7:HH3GGAFXX:4:21612:26866:20400    147 ctg7180000256968    560 0   32S111M8S   =   441 -230    CCAGCTCCACCACCCGACCCCCCCACCCACCCCAATCCTCAAACAGATCCATGCAGTACATTTTCATTTGTTCTGGCCCTGCGGCTCTTAGTTTAGGAAATATTCAATGACTTGCTACCTAGTAAGAACATTATTGTGCACAACGAATAGA ///E/////A////A/6A/EE/<//////A//AA/EEE/A/<E///A/A<//</////A/////<///EE////E/E/E/E</E//EA//E//A/</E//E//AE<//A<///EE///////E//E//EA/A//EA/AE/A/E/E//6/A/ NM:i:7  MD:Z:11A5A22A4A4T27A7A24    AS:i:76 XS:i:79
NB501461:7:HH3GGAFXX:4:21612:6399:20400 113 ctg7180000282074    357 8   73S50M28S   ctg7180000432624    28  0   CTTATCCAATTTTTAAACATAGGATTAGGGTACCTTCCAAATGTTATTGAAACTTAATCTGAAGATGACCTCAAATGGAAGATGACCTCCAATGGAAGACAACCTTCCATGGAAGACGACCTCAAAAATGAAGAAGAAGAAAACATCTAAG /A//<//<<//E/66A//<A////A/<</E/EA/EE//<E<E//<AE//A/EEAEEEE/E//A/A/</EEE//E<EAEEEEE/EEEEEE<EEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEAAAAA NM:i:2  MD:Z:32C1A15    AS:i:40 XS:i:35
NB501461:7:HH3GGAFXX:4:21612:6399:20400 177 ctg7180000432624    28  26  76S75M  ctg7180000282074    357 0   CTCCTTACTGTTGTTTCTTCTTTTTTCTTCTCTTTCATTTTTGAGGTTGTCTTCCATGTTTTGTTGTCTTCCATTTGAGGTCGTCTTCATATTAAGTCTCTATAACATTTGGCTTTGTCCCCTAAGCCTAAGTTTCAAATTTGGAAAAGAT /////</////6/////A////EAEE<//////<A////E/E<//A//<///E//<E/////E/A/E//E/EAEEAE/EA///E//E/AE/<EAEE/6E6EEA/EE/AEEAE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:4  MD:Z:5T15T17G29G5   AS:i:55 XS:i:44

As I do not see any XA or SA flags there anymore, I wonder what those can be?

I have tried to continue the steps I want to pursue until the "ValidateSamFile" from Picard tools, and there I do not know if it is linked or not, but I get the message: "Error: mate not found" for a few thousands of reads. Any idea what could be wrong, and how to fix it?

I hope I was clear enough, otherwise please feel free to ask me for specific details. Any help is welcome, Thanks!

Bioathlete
  • 2,574
  • 12
  • 29
Marvin
  • 41
  • 3

2 Answers2

1

The issue is that you are filtering reads without regard to the pairing so you might filter out one of the mates if one aligns uniquely and the other does not. You will need to filter out the un-mated reads from the file to have a valid BAM.

As I do not see any XA or SA flags there anymore, I wonder what those can be?

I am not sure what this means. The grep should remove those flags and only leave records without them.

As a side note if you don't want split alignments I would recommend using a different aligner such as BWA aln or BowTie2

Bioathlete
  • 2,574
  • 12
  • 29
  • my goal with using grep -v was to remove all the lines corresponding to the reads flagged with XA and SA, which mean all the reads that are mapped more than once. I am surprised that I still get duplicates of read names afterwards because there should be none left. – Marvin Aug 29 '17 at 23:41
  • 1
    You have paired reads so one of those is the read 1 and the other is the read2. The 99 maps to read 1, among other things and the 147 maps to read 2. See https://broadinstitute.github.io/picard/explain-flags.html – Bioathlete Aug 30 '17 at 14:20
1

extract the reads that mapped only once to my reference

Every read can be mapped to multiple places if you allow enough mismatches. You heard "unique mapping" from time to time. Strictly speaking, however, this concept is either ill defined or practically useless given 150bp reads. For a proper aligner, the best way to measure the reliability of an alignment is simply mapping quality. All the other approaches, including the answer in the other thread, have problems.

By the way, many tools allow you to set a threshold on mapping quality. You can apply the threshold on the fly. It is a common and good practice to avoid writing filtered alignments to another BAM.

user172818
  • 6,515
  • 2
  • 13
  • 29