Permabits and Petabytes

July 18, 2008

What Do Hash Collisions Really Mean?

Filed under: Jered Floyd — jeredfloyd @ 9:06 pm

When considering deduplication technologies, other vendors and some analysts bring up the bogeyman of hash collisions. Jon Toigo touched upon it again in a recent post to his blog, and Alex McDonald from NetApp brought it up in response to a recent post that Mark Twomey made.

So, what is a hash collision, and is it really a concern for data safety in a system like Permabit Enterprise Archive?

For the long explanation of cryptographic hashes and hash collisions, I wrote a column a while back for SNW Online, “What you need to know about cryptographic hashes and enterprise storage”. The short version is that deduplicating systems that use cryptographic hashes use those hashes to generate shorter “fingerprints” that uniquely identify each piece of data and determine whether that data already exists in the system. The trouble is, by a mathematical rule called the “pigeonhole principle”, you can’t uniquely map every possible file or file chunk to a shorter fingerprint. Necessarily, there are multiple possible files that share the same hash.
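
To make the fingerprint idea concrete, here is a minimal sketch (my own toy Python, not Permabit’s implementation) of a content-addressed store that uses SHA-512 fingerprints as deduplication keys:

    import hashlib

    # Toy content-addressed store: chunks are indexed by their
    # SHA-512 fingerprint, so identical chunks are stored only once.
    store = {}

    def put_chunk(data: bytes) -> str:
        fingerprint = hashlib.sha512(data).hexdigest()
        if fingerprint not in store:   # new data: store it
            store[fingerprint] = data
        return fingerprint             # duplicate: just return the key

    # Writing the same chunk twice consumes space only once.
    key1 = put_chunk(b"the same block of data")
    key2 = put_chunk(b"the same block of data")
    assert key1 == key2 and len(store) == 1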

“Oh no!” say the critics… or at least, the critics without scalable deduplication. If multiple files could theoretically have the same hash, that means this will happen, and data will be corrupted! The sky will fall!

Well, no, not really. With older hash technologies, like MD5, which generates 128-bit hashes, this was almost a reasonable concern. With the modern SHA-2 family of hashes, data corruption due to hash collision is, for all intents and purposes, impossible.

We can get into lots of trouble when we’re talking about statistics. Let’s look at how likely one of these dreaded hash collisions is to occur. Due to another principle, the birthday paradox, a hash collision in a pool of documents becomes 50% likely when the pool grows to around the square root of the number of possible hash values. (This is called the birthday paradox because the probability follows the same rule as the chance of two people in a room sharing a birthday.) For SHA-512, a 512-bit hash, that number of documents is just about 2^256. And 2^256 is a truly astronomical number.
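
As a rough illustration (my own back-of-envelope script, nothing from the product), the standard birthday-paradox approximation p ≈ 1 − e^(−n²/2N) makes the scale obvious:

    import math

    def collision_probability(n: float, bits: int) -> float:
        # Birthday-paradox approximation p = 1 - exp(-n^2 / 2N),
        # for n documents hashed into N = 2**bits possible values.
        N = 2.0 ** bits
        return -math.expm1(-(n * n) / (2.0 * N))  # accurate even for tiny p

    # 50% odds require on the order of sqrt(N) = 2**256 documents:
    print(collision_probability(2.0 ** 256, 512))   # ~0.39, near the 50% mark

    # A zettabyte of unique 4KB chunks, by contrast, barely registers:
    print(collision_probability(1e21 / 4096, 512))  # ~2e-120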

How big is 2^256? It’s about 116 thousand trillion trillion trillion trillion trillion trillion, a bit larger than a 1 followed by 77 zeros. To put a scale on it, it’s significantly more than the number of atoms in our galaxy (around 2^226).

Statistically, that means a collision is incredibly unlikely. More unlikely than you and 12 of your colleagues each being independently struck by lightning. More unlikely than your house being struck by a meteor… five times. To reach 2^256 documents, you would have to write a document every millisecond for 2^221 years. That’s nearly 4 million trillion trillion trillion trillion trillion years. More unlikely than just about anything you can imagine… but, I will admit, not exactly zero.
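
If you’d like to check that arithmetic yourself, the back-of-envelope version is simple (exact integer math in Python; the one-document-per-millisecond rate is just the assumption from above):

    # One document per millisecond, forever:
    docs_per_year = 1000 * 60 * 60 * 24 * 365   # ~3.15e10, roughly 2**35

    # Years needed to write 2**256 documents at that rate:
    years = 2 ** 256 // docs_per_year
    print(f"{years:.1e}")            # ~3.7e66 years
    print(years.bit_length() - 1)    # 221 -- on the order of 2**221 years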

Should we panic? Of course not. The problem with focusing on hash collisions as this horrible possibility for data corruption is that it ignores all of the other, far more likely, ways in which data could be corrupted. It’s like buying a carbon fiber bicycle because you’re worried about getting hit by lightning, and then riding without a helmet through downtown Boston traffic. Humans are really bad at gauging risk, so sometimes we worry about the wrong things, or about things we shouldn’t be worried about at all.

In my next post I’ll explain not only why we shouldn’t be worried about hash collisions in systems that use SHA-2 family hashes, but also how far more severe problems can (statistically) strike other systems.

6 Comments »

  1. […] for the bit-by-bit compare to signal a false match, or about once every 2^59 years. By comparison, last week I showed that SHA-512 would take 2^221 years to encounter a hash […]

    Pingback by Statistical Demons Lurk Everywhere « Permabits and Petabytes — July 24, 2008 @ 1:57 am

  2. […] You Can’t Retrofit Dedupe Filed under: Jered Floyd — jeredfloyd @ 9:14 pm As I wrote about last month, hash collisions are not something to be concerned about in a properly designed deduplicating […]

    Pingback by Jet Engine on a Duck: You Can’t Retrofit Dedupe « Permabits and Petabytes — August 18, 2008 @ 9:14 pm

  3. […] there’s been a good amount of FUD about deduplication. I’ve already talked about hash collisions and explained why you’re more likely to spontaneously combust than have a problem there. […]

    Pingback by Deduplication is Not a Crime « Permabits and Petabytes — August 22, 2008 @ 11:09 am

  4. Hash Collisions… why don’t you simply sign a message with two different hashes… This would make it impossible to forge a document.

    Comment by See EYE ay — June 15, 2009 @ 11:51 pm

    • That certainly makes it more difficult, but not impossible! You simply need to find an input text that generates collisions with both hashes, of which there are still infinitely many.

      Comment by jeredfloyd — June 19, 2009 @ 12:05 pm

  5. Read this article again 5000 years from now, when your mother in law gets lost in some strange transporter malfunction 😉

    Comment by Johnny — May 16, 2011 @ 12:46 pm

