When counting canonical kmers, i.e. kmers for which a sequence and its reverse complement are treated as identical, how do kmer counting programs decide which of the two to use as the canonical sequence? Do they all work the same way?
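To investigate, I made the string GAGTGCGGAATACCACTCTT, which contains all 16 possible 2-mers. I then ran kmc on it to figure out how it chooses which kmer of each pair to report. Only the kmers in the filtered column below appeared in the output (that column is simply the list of reported kmers and is not aligned row by row with the other columns). So it looks like KMC's 'canonical' kmer is whichever of a kmer and its reverse complement (RC) comes first alphabetically.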
╔════════════════╦═════╦════════════════════╦══════════╗
║ Possible Kmers ║ RCs ║ RC occurs earlier? ║ filtered ║
╠════════════════╬═════╬════════════════════╬══════════╣
║ TT             ║ AA  ║ YES                ║ TA       ║
║ TG             ║ CA  ║ YES                ║ GC       ║
║ TC             ║ GA  ║ YES                ║ GA       ║
║ TA             ║ TA  ║                    ║ CG       ║
║ GT             ║ AC  ║ YES                ║ CC       ║
║ GG             ║ CC  ║ YES                ║ CA       ║
║ GC             ║ GC  ║                    ║ AT       ║
║ GA             ║ TC  ║                    ║ AG       ║
║ CT             ║ AG  ║ YES                ║ AC       ║
║ CG             ║ CG  ║                    ║ AA       ║
║ CC             ║ GG  ║                    ║          ║
║ CA             ║ TG  ║                    ║          ║
║ AT             ║ AT  ║                    ║          ║
║ AG             ║ CT  ║                    ║          ║
║ AC             ║ GT  ║                    ║          ║
║ AA             ║ TT  ║                    ║          ║
╚════════════════╩═════╩════════════════════╩══════════╝
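For reference, the rule I am guessing at (take the alphabetically smaller of a kmer and its reverse complement) can be sketched in a few lines of Python. This is only my own illustration of the hypothesis, not code taken from KMC or Jellyfish, and the function names are made up for the example.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Hypothesised rule: the lexicographically smaller of kmer and its RC."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(seq, k):
    """Count every k-length window of seq under its canonical form."""
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


counts = count_canonical_kmers("GAGTGCGGAATACCACTCTT", 2)
print(sorted(counts))
# ['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'GA', 'GC', 'TA']
```

The ten kmers it prints are the same ten that appear in the filtered column, which is consistent with the alphabetical rule but does not prove that this is how the tools implement it internally.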
Do all kmer counting programs choose the canonical kmer the same way, and if so, is there documentation explaining this? I wasn't able to find anything in the papers for jellyfish or kmc.