8

When counting canonical kmers, ie kmers in which both the forward and reverse complement of a sequence are treated as identical, how do kmer counting programs decide which kmer to use as the canonical sequence? Do they all work the same way?

To investigate I made a string with GAGTGCGGAATACCACTCTT which contains all 16 possible 2mers. I then used kmc to figure out how they determine which kmer is used. Only the kmers in the filtered column below appeared. So, it looks like KMCs' 'canonical' kmers are the ones that first occur alphabetically.

╔════════════════╦═════╦════════════════════╦══════════╗
║ Possible Kmers ║ RCs ║ RC occurs earlier? ║ filtered ║
╠════════════════╬═════╬════════════════════╬══════════╣
║ TT             ║ AA  ║ YES                ║ TA       ║
║ TG             ║ CA  ║ YES                ║ GC       ║
║ TC             ║ GA  ║ YES                ║ GA       ║
║ TA             ║ TA  ║                    ║ CG       ║
║ GT             ║ AC  ║ YES                ║ CC       ║
║ GG             ║ CC  ║ YES                ║ CA       ║
║ GC             ║ GC  ║                    ║ AT       ║
║ GA             ║ TC  ║                    ║ AG       ║
║ CT             ║ AG  ║ YES                ║ AC       ║
║ CG             ║ CG  ║                    ║ AA       ║
║ CC             ║ GG  ║                    ║          ║
║ CA             ║ TG  ║                    ║          ║
║ AT             ║ AT  ║                    ║          ║
║ AG             ║ CT  ║                    ║          ║
║ AC             ║ GT  ║                    ║          ║
║ AA             ║ TT  ║                    ║          ║
╚════════════════╩═════╩════════════════════╩══════════╝

Do all kmer counting programs use the same canonical kmers, and if so do you have documentation explaining this? I wasn't able to find anything in the papers for jellyfish or kmc.

conchoecia
  • 3,141
  • 2
  • 16
  • 40

1 Answers1

9

When a k-mer is identical to its reverse complement, both are canonical. Note that a canonical k-mer is a sequence, irrelevant to its position(s) in the input string. More precisely, give a string $s$, its canonical string is $$ {\rm canonical}(s|h)=\left\{\begin{array}{ll} s & \mbox{if $h(s)<h(\overline{s})$}\\ \overline{s} & \mbox{otherwise}\\ \end{array}\right. $$ where $\overline{s}$ is the Watson-Crick reverse complement of $s$ and $h$ is an arbitrary string hash function. In practice, we most often use 2-bit encoding as $h$. Order under such a hash function is equivalent to the lexicographical order.

user172818
  • 6,515
  • 2
  • 13
  • 29
  • 2
    Stated differently, each program can have different canonical k-mers given the same input. – Devon Ryan Feb 05 '18 at 09:16
  • Thanks for the comment. My less jargon-y translation is: "Kmer counting programs store kmers using a hash, not a string. The hash function produces the same value for a kmer and its Watson-Crick reverse complement. When the kmer counting program prints the counts in human-readable format, it translates the kmer's hash value to a string. Whether one string representation of a kmer or its reverse complement is reported depends on the program-defined "alphabetical order." In the case of KMC, that "alphabetical order" is {ACGT}. This explains why the kmers observed above were reported. – conchoecia Feb 05 '18 at 20:27