RSS Feed

Monthly Archives: March 2016

Stupid Hash Functions

I’m getting very tired of reading about people implementing Hashable using the following (this is in Swift):

var hashValue : Int {
    get {
        return x ^ y   // x and y are Int
    }
}

Okay, first, let’s examine why this is wrong. The point of a hash function is to “create random data from the nonrandom data” (Knuth, Sorting and Searching) and since x ^ y is equal to y ^ x, it can hardly be considered a method of creating random data. Let us use some real world numbers (represented as hex)

  1. 0x2A ^ 0x40 = 0x6A
  2. 0x40 ^ 0x2A = 0x6A
  3. 0x00 ^ 0x6A = 0x6A
  4. 0x6A ^ 0x00 = 0x6A

A decent hash function? Hardly.

“Oh, but that is what Equitable is for” people have murmured. No, it’s because you slept during class about hashing. Hashing isn’t fun and yes it is hard to create a good hash function, but you didn’t even bother trying.

I don’t like to write hash functions either, but I have at least a basic fundamental understanding of the problems inherent in hash functions. It didn’t take me but a few seconds to come up with four test cases that produced exactly the same hash. Worse, x and y can be swapped and result in the same hash. Position should influence the result of a hash function.

But let us extend the problem to a buffer of four items, w, x, y, z, if you exclusive-or’ed them, you would get the same result had you done it z, y, x, w (go ahead, try it). The problem is that XOR is a communative operation exactly like addition. In fact, in hardware, XOR used to be called “half-adding” because the carry never influenced the next position.

  1. 0x2B ^ 0x5A ^ 0x11 = 0x60
  2. 0x2B ^ 0x11 ^ 0x5A = 0x60
  3. 0x11 ^ 0x5A ^ 0x2B = 0x60
  4. 0x11 ^ 0x2B ^ 0x5A = 0x60
  5. 0x5A ^ 0x11 ^ 0x2B = 0x60
  6. 0x5A ^ 0x2B ^ 0x11 = 0x60

As you can see, three bytes in different order produce exactly the same value. Exclusive or is obviously a poor way to create a hash value.

So this time, instead of pulling something out of our nether regions, let’s try something that makes a little more sense:

var hashValue : Int {
    get {
        return (x * 0x21) ^ y   // x and y are Int
    }
}

This is better. Why? Because we have now made sure that the values are based on their position. This becomes really important when you have a large buffer that needs to be hashed. So, using our previous values we have:

  1. (0x2A * 0x21) ^ 0x40 = 0x52A
  2. (0x40 * 0x21) ^ 0x2A = 0x86A
  3. (0x00 * 0x21) ^ 0x6A = 0x06A
  4. (0x6A * 0x21) ^ 0x00 = 0xDAA

This is a lot closer to creating random data from nonrandom data. Possibly the only minor irritant is the third case where anything with zero will result in the y value as the hash. We can correct this by using a simplified version of Daniel J. Bernstein’s hashing method:

var hashValue : Int {
    get {
        var hash = 5381        
        hash = (hash * 33) + x
        hash = (hash * 33) + y
        return hash
    }
}

Again, we assured that x and y are positional and that zero now affects the resultant hash.

I have hoped I have shown why hashing is not a simple science and why just exclusive-or’ing will result in a horrible hash.