23

A common bioinformatics task is to decompose a DNA sequence into its constituent k-mers and compute a hash value for each k-mer. Rolling hash functions are an appealing solution for this task, since they can be computed very quickly. A rolling hash does not compute each hash value from scratch: rather, it updates a running hash value using an update strategy and a sliding window over the data.

It's also very useful for many applications to have a k-mer hash to the same value as its reverse complement. Unless the data were generated using a strand-specific sample prep, it's impossible to distinguish a k-mer from its reverse complement, and they should be treated as the same sequence.

Are there any rolling hashes that will map reverse complements to the same value? If not, how would we develop such an algorithm?

UPDATE: Ideally the hash function would be able to support k > 32, which would be lossy unless using something larger than a 64-bit integer.

ANOTHER UPDATE: I don't think it's necessary to store both the running k-mer and its reverse complement in a single value. If storing two k-mer strings and/or two hash values makes this easier, I'm totally cool with that.

Daniel Standage
  • 5,080
  • 15
  • 50
  • I don't understand. Surely if you have the kmer you also know the reverse complement of that kmer. It would be a matter of mapping the forward hash to its complement. – morgantaschuk May 16 '17 at 19:29
  • 1
    See related question on biostars – ChrisD May 16 '17 at 19:29
  • @morgantaschuk Indeed, it's trivial to compute a k-mer's reverse complement. But you don't want to do this from scratch for each k-mer, since that would negate any performance benefit gained from the rolling hash. – Daniel Standage May 16 '17 at 19:57
  • 1
    Can you please specify in your question whether the rolling-hash function you envisage is a lossless one (trivial) or a lossy function? E.g, for lossy: mapping a 64 kmer (128 bit info) to a 32 bit hash value. – BaCh May 16 '17 at 20:36
  • see also: https://www.biostars.org/p/184993/ – Pierre May 17 '17 at 10:26

4 Answers

8

Let f and r be two integers that always hold the k-mer on the forward and reverse strand, respectively. When a new base c arrives, in a proper 2-bit encoding, we update the two integers as follows:

f = (f<<2|c) & ((1ULL<<2*k) - 1)
r = r>>2 | (3ULL-c)<<2*(k-1)

With this updating rule, f keeps the forward strand k-mer ending at c and r is f's reverse complement. The hash of the k-mer can be min(f,r). If integer operations take constant time, computing the hashes of all k-mers of a sequence of length L takes O(L) time.

The above only works if the k-mer fits in a machine word, which means k<=32 on x86-64 (with the mask written as above, strictly k<=31; 1ULL<<64 is undefined in C). For longer k-mers, you can overload bit operators in C++. For 32<k<=64, a simpler solution is to split each base into its low and high bits and keep two words per strand:

c1 = c&1, c2 = c>>1
f1 = (f1<<1|c1) & mask
f2 = (f2<<1|c2) & mask
r1 = r1>>1 | (1ULL-c1)<<(k-1)
r2 = r2>>1 | (1ULL-c2)<<(k-1)

where mask is (1ULL<<k)-1 for k<64 and ~0ULL for k==64.

This is about twice as slow as the k<=32 version. These lines could be implemented with SSE2 intrinsics, but that would be overkill.

When you want to hash a long k-mer into fewer bits, there are a few options:

xor:      f^r
min:      min(f,r)
min+hash: hash(min(f,r))

where hash() is a generic randomization hash function, which could be Murmur3, Wang's integer hash function, etc.

The following is a microbenchmark. Here, we squeeze all k-mers into 32-bit hash values; collision_rate = 1 - #distinct_hash/#distinct_kmer. This is not the best metric, but it gives a rough measure of randomness, which is better than nothing.

====================================================
 Algorithm    data  max-k  k   %collision  CPU time
----------------------------------------------------
 xor          chr11  32    31     8.5%       0.9s
 xor+Wang     chr11  32    31     9.6%       1.3s
 min+Wang     chr11  32    31     1.4%       1.3s
 min+Wang     chr11  64    31     1.4%       1.9s
 min+murmur3  chr11  inf   31     1.4%       5.7s
 min+Wang     chr11  64    51     1.5%       2.1s
 min+murmur3  chr11  inf   51     1.5%       6.8s
 min+Wang     chr11  64    63     1.5%       2.1s
 min+murmur3  chr11  inf   63     1.5%       7.5s
 radix sort   chr11  -     -      -        5.6-7.3s
----------------------------------------------------
 min+Wang     hg38   64    63    26.2%        37s
 min+murmur3  hg38   inf   63    26.2%       192s
 radix sort   hg38   -     -      -         ~138s
====================================================

Observations:

  • Don't use XOR. In fact, XOR hashes every reverse-complement palindrome to 0 (for such a k-mer, f equals r). This alone is worrying enough.

  • Unless you want to work with very long k-mers, don't use generic string hash functions like FNV and Murmur. Murmur is even slower than radix sorting all hashes.

  • For k<=64, min+Wang is the best here: it is fast, simple to compute, and nearly as random as Murmur.

user172818
  • 6,515
  • 2
  • 13
  • 29
  • No, your solution does not work if the poster wanted, e.g., compute all hashes for a genome sequence. – BaCh May 16 '17 at 20:21
  • Lol, 2-bit encoding is certainly popular but I wouldn't go so far as to call it "proper". But even still, your answer is informative. Add next base to the end of the forward k-mer, add the complement of the next base to the beginning of the reverse k-mer, recompute both hashes. – Daniel Standage May 16 '17 at 21:03
  • 1
    Damn, wish I could upvote this again. Yet it still has fewer votes than the answer that … recommends Murmur?! – Konrad Rudolph May 18 '17 at 18:03
  • Yeah, wish I could +1 again for the follow through! – Daniel Standage May 20 '17 at 00:21
8

A rolling hash function for DNA sequences called ntHash has recently been published in Bioinformatics and the authors dealt with reverse complements:

Using this table, we can easily compute the hash value for the reverse-complement (as well as the canonical form) of a sequence efficiently, without actually reverse-complementing the input sequence, as follows: ...

EDIT (by @user172818): I will add more details about how ntHash works. The notation used in its paper is somewhat uncommon; the source code is more informative.

Let's first define rotation functions for 64-bit integers:

rol(x,k) := x << k | x >> (64-k)
ror(x,k) := x >> k | x << (64-k)

We then define a hash function h() for each base. In the implementation, the authors use:

h(A) = 0x3c8bfbb395c60474
h(C) = 0x3193c18562a02b4c
h(G) = 0x20323ed082572324
h(T) = 0x295549f54be24456
h(N) = 0

The rolling hash function of a forward k-mer s[i,i+k-1] is:

f(s[i,i+k-1]) := rol(h(s[i]),k-1) ^ rol(h(s[i+1]),k-2) ^ ... ^ h(s[i+k-1])

where ^ is the XOR operator. The hash function of its reverse complement is:

r(s[i,i+k-1]) := f(~s[i,i+k-1])
               = rol(h(~s[i+k-1]),k-1) ^ rol(h(~s[i+k-2]),k-2) ^ ... ^ h(~s[i])

where ~ gives the reverse complement of a DNA sequence. Knowing f(s[i,i+k-1]) and r(s[i,i+k-1]), we can compute their values for the next k-mer:

f(s[i+1,i+k]) = rol(f(s[i,i+k-1]),1) ^ rol(h(s[i]),k)  ^ h(s[i+k])
r(s[i+1,i+k]) = ror(r(s[i,i+k-1]),1) ^ ror(h(~s[i]),1) ^ rol(h(~s[i+k]),k-1)

This works because rol, ror and ^ all commute with one another, so terms can be cancelled and added in any order. Finally, for a k-mer s, the hash function considering both strands is the smaller of f(s) and r(s):

h(s) = min(f(s),r(s))

This is a linear algorithm regardless of the k-mer length. It only uses simple arithmetic operations, so should be fairly fast. I have briefly tested its randomness. It seems comparable to murmur. ntHash is probably the best algorithm so far if you want to hash an arbitrarily long k-mer into 64 bits.

Karel Břinda
  • 1,909
  • 9
  • 19
  • 2
    I had a look at the paper and did a small experiment. The idea behind ntHash is interesting, simple, efficient and effective. If @DanielS wants to get a 64-bit hash value for a k-mer of arbitrary length, this should be the accepted answer. This hash function however is not the best for k-mer counting as in theory, it may hash two <=32-mers into the same 64-bit integer by a tiny chance. – user172818 May 31 '17 at 18:12
  • @user172818 Thank you for the update. Do you think that it would be possible to easily modify the code to make it work with __int128? It would be just a quick and dirty hack, not a systematic solution, but it could be sufficient for many use cases. – Karel Břinda May 31 '17 at 19:46
  • Collisions in hash functions are expected, and should not be discouraged (as long as the function is sufficiently random). The counting of kmers of arbitrary length can be done by storing the actual value (and count) in a bucket that is addressed by the hash. – gringer May 31 '17 at 20:00
  • @Karel I believe many 128-bit operations can't be done in one CPU cycle. Using __int128 will be slower. In addition, nthash incurs collisions anyway: using longer words just reduces the chance, but won't solve the root problem. – user172818 May 31 '17 at 20:57
  • @gringer a popular approach to k-mer counting is sorting. We don't need to hash at all. When using a hash table, a better approach is to use invertible hash functions (e.g. jellyfish & bfc). This way, you don't need to compute hashes as often. – user172818 May 31 '17 at 21:04
6

If your goal is to minimise storage by having just one hash per k-mer and its reverse complement, there is a simple solution for non-rolling hashes: for any sequence S, you compute and store the hash of the smaller of S and its reverse complement. A simple lexicographical comparison for "smaller" is enough. In programming terms:

hash = computeHash(min(S, revcomp(S)));

If your goal is to minimise computing time via rolling hashes AND store only one value per k-mer, this will be hard. Hash functions are normally designed to minimise collisions, and you are asking them not only to collide, but to collide in very, very specific circumstances.

The question is: is computing the hash actually a bottleneck for your application? Maybe, maybe not.

This post on StackExchange seems to imply that, in 2012, MurmurHash could compute a hash de novo for a UUID (similar to a standard k-mer in size and complexity) in ~250 ns, i.e., ~4 million hashes per second. That's already quite fast.

My guess is that most applications will use the hash to look things up in memory or on disk. And there, even with in-memory look-ups, real-life data will quickly run into CPU cache-miss penalties, and I expect their impact to be higher than the cost of the hash computation itself.

BaCh
  • 734
  • 4
  • 9
  • Murmur is good but conventional DNA k-mer hashes are much, much faster, since they essentially constitute an identity function (the k-mer is its own hash value). And yes, that performance difference is absolutely noticeable in relevant applications. – Konrad Rudolph May 17 '17 at 10:01
4

At its easiest, you just store both the forward ($F$) and reverse ($R$) hash values.

You update the forward hash value by conventional means, e.g. bit-shifting the base value into its lower bits:

$$ F_{n + 1} = ((F_n \ll B) \mid x) \mathop\& M, $$

where $B$ is the bit size of the base encoding and $M$ is the word mask for a word of length $W$ bits ($M$ can be omitted if the width of the hash is the width of the machine word). You update the reverse hash value by the mathematical inverse, e.g. bit-shifting the complement of the base into the upper bits:

$$ R_{n + 1} = ((R_n \gg B) \mid (\operatorname{compl}x \ll (W - B))) \mathop\& M. $$

And then you compute a combined hash value by xoring the two hashes, $H = F \oplus R$.

Konrad Rudolph
  • 4,845
  • 14
  • 45
  • I was hoping for something that would work more generically (2-bit encoding is limited to k <= 32), but this is very insightful. And for k <= 32, it will likely be much more performant than an adapted generic hash function. – Daniel Standage May 17 '17 at 18:56
  • @DanielStandage, I have answered what to do for k>32: you just use more integers. That will still be fast, many times faster than any alternative. Also, this answer is not perfectly correct: in 2-bit encoding, you need to shift by 2. – user172818 May 17 '17 at 19:57
  • Furthermore, XOR actually doesn't work well. Suppose we only have two bases, 0 and 1, which are complements of each other. If you enumerate all 3-mers and their complements, you will find that there are only two possible hash values: 111 and 010. XOR is more likely to cause hash collisions. – user172818 May 17 '17 at 20:06
  • @DanielStandage You say “2-bit encoding is limited to k <= 32” but that’s not actually true, you just need bigger machine words (std::bitset comes to mind, and SeqAn has its own type, AFAIR). But at any rate my scheme isn’t limited to 2-bit encoding; all it requires is that the rolling hash can be computed by shifting in a value at either end. – Konrad Rudolph May 17 '17 at 20:36
  • In our testing, using std::bitset (or rolling our own large int using int arrays) led to a >10x hit to performance. I'd prefer to stick with native int types. – Daniel Standage May 17 '17 at 22:02
  • @DanielStandage Well of course there’s a performance hit. That said, you can truncate the k-mer bit representation. This is then equivalent to a shorter k-mer (i.e. collisions on the full k-mer), but the pigeonhole principle trivially proves that it’s not possible to preserve all information without collisions, whatever you do … – Konrad Rudolph May 17 '17 at 22:25
  • Don't use bitset. It has a loop inside! The right way is to set a max k-mer length and manually write bit operations. Supporting up to 64bp, which is often sufficient, should only be twice as slow. SSE2 can shift two 64-bit integers in parallel and is likely to give you a faster implementation. – user172818 May 17 '17 at 23:01
  • @user172818 And have you checked that the loop isn’t unrolled? It has a compile-time constant width. I haven’t checked this particular instance but such loops in general can be unrolled. And modern compilers usually also emit the correct SSE intrinsics if you target a supported architecture. – Konrad Rudolph May 17 '17 at 23:12
  • A good catch. No, I haven't. I was deducing from the 10x performance hit: 64bp should only be about twice as slow as 32bp, so the 10x hit wouldn't make sense if the compiler were generating optimal assembly. – user172818 May 17 '17 at 23:17