7

This question is a follow-up to How do kmer counters determine which kmer is 'canonical'?.

In that question we learned that kmer-counting programs use a 2-bit-per-base encoding to represent canonical kmers internally while they are being counted.

Now I am wondering, how can we implement such a function in C/C++ or in Python? More generally, how could I calculate the canonical kmer hash value using a mathematical function?

For example, how would we transform the 3-mer GAT or the 15-mer GAATACCATAGGATA to 1s and 0s such that:

hash(GAT) == hash(ATC)
hash(GAATACCATAGGATA) == hash(TATCCTATGGTATTC)
conchoecia

2 Answers

7

You can convert any string hash function to a "canonical" DNA string hash function.

Given a DNA string $s$, let $\overline{s}$ be its Watson-Crick reverse complement. Suppose $h:\Sigma^*\to\mathbb{Z}$ is an arbitrary string hash function. You can define $$ \tilde{h}(s)\triangleq \min\{h(s),h(\overline{s})\} $$ Then for any DNA string $s$ $$ \tilde{h}(s)=\tilde{h}(\overline{s})=\tilde{h}({\rm canonical}(s|h)) $$ where ${\rm canonical}(s|h)$ denotes whichever of $s$ and $\overline{s}$ attains the smaller hash value.
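Here is a minimal sketch of that $\min\{h(s),h(\overline{s})\}$ idea in Python, using Python's built-in `hash()` as the arbitrary $h$. Note that `hash()` is salted per interpreter run, so the values are only stable within a single process; a real kmer counter would plug in a fixed hash function instead.

```python
def revcomp(s):
    """Watson-Crick reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(s))

def canonical_hash(s):
    """h~(s) = min(h(s), h(revcomp(s))): identical for a kmer and its reverse complement."""
    return min(hash(s), hash(revcomp(s)))

# Same value for a kmer and its reverse complement (within one interpreter run).
assert canonical_hash("GAT") == canonical_hash("ATC")
assert canonical_hash("GAATACCATAGGATA") == canonical_hash("TATCCTATGGTATTC")
```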

PS: There is more than one way to define a "canonical" hash function. For example, let $h_0(\cdot)$ be the 2-bit encoding hash function. Given any integer hash function $g(\cdot)$, we can define: $$ h'(s)\triangleq g\big[\min\{h_0(s),h_0(\overline{s})\}\big] $$ Then we still have $$ h'(s)=h'(\overline{s}) $$ You can also replace $\min\{\cdot\}$ with XOR or addition to get new definitions.
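A sketch of this 2-bit variant in Python: $h_0$ packs each base into two bits (A=0, C=1, G=2, T=3, a common but not universal convention), and the mixer `g` below is only an illustrative Murmur3-style finalizer step, not the scheme any particular counter uses.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def h0(s):
    """2-bit encoding: append two bits per base to an integer."""
    x = 0
    for base in s:
        x = (x << 2) | ENCODE[base]
    return x

def h0_revcomp(s):
    """2-bit encoding of the reverse complement (complement code = 3 - code)."""
    x = 0
    for base in reversed(s):
        x = (x << 2) | (3 - ENCODE[base])
    return x

def g(x):
    """Illustrative 64-bit integer mixer (Murmur3-style finalizer step)."""
    x &= 0xFFFFFFFFFFFFFFFF
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 33
    return x

def canonical_hash_2bit(s):
    """h'(s) = g(min(h0(s), h0(revcomp(s))))."""
    return g(min(h0(s), h0_revcomp(s)))

assert canonical_hash_2bit("GAT") == canonical_hash_2bit("ATC")
assert canonical_hash_2bit("GAATACCATAGGATA") == canonical_hash_2bit("TATCCTATGGTATTC")
```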

user172818
5

However you want.

One way to do this is to generate both the forward and reverse complement kmers, then choose the lexicographically smallest kmer as the storage key. Delving further into this requires discussing things like the size of the underlying array, the expected distribution of keys throughout the array, and what type of key clustering is desirable/undesirable. What is actually done doesn't matter, as long as there is a deterministic algorithm that can be applied to every single possible sequence.
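A small Python sketch of that lexicographic rule, assuming plain A/C/G/T strings: the storage key is whichever of the kmer and its reverse complement sorts first alphabetically.

```python
def revcomp(s):
    """Watson-Crick reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(s))

def canonical(s):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    rc = revcomp(s)
    return s if s <= rc else rc

print(canonical("GAT"))  # ATC
print(canonical("ATC"))  # ATC
```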

The nice thing about most modern languages is that they've already got their own hash implementations that map keys to values, which work fine for quick experiments: C++ has std::unordered_map, Perl has hashes, Python has dictionaries, and R has named vectors.
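For instance, a toy sketch of counting canonical 3-mers with an ordinary Python dictionary; the `canonical()` helper is just the lexicographic rule from the previous sketch, repeated here so the snippet is self-contained.

```python
from collections import defaultdict

def revcomp(s):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(s))

def canonical(s):
    rc = revcomp(s)
    return s if s <= rc else rc

def count_kmers(seq, k):
    """Count canonical kmers in a sequence using a dictionary."""
    counts = defaultdict(int)
    for i in range(len(seq) - k + 1):
        counts[canonical(seq[i:i + k])] += 1
    return counts

print(dict(count_kmers("GAATACCATAGGATA", 3)))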

The actual bare-metal implementation of the function that converts a string of text into a number is complicated, particularly if you want the code to be cryptographically secure, or want the hash table's access speed to be very close to the time it takes to retrieve a single value from memory. A useful starting point for exploring this in more depth might be the Wikipedia page on the Java hashCode function.
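As a rough illustration of what such a string-to-number function can look like, here is a Python sketch in the style of Java's String.hashCode() (a simple polynomial hash; truncated to 32 unsigned bits here rather than using Java's signed overflow). Applied to the canonical form of a kmer, it yields one value per kmer/reverse-complement pair.

```python
def java_style_hash(s):
    """Polynomial rolling hash in the style of Java's String.hashCode()."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep within 32 bits
    return h

print(java_style_hash("ATC"))
```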

gringer