When considering deduplication technologies, other vendors and some analysts bring up the bogeyman of hash collisions. Jon Toigo touched upon it again in a recent post to his blog, and Alex McDonald from NetApp brought it up in response to a recent post that Mark Twomey made.
So, what is a hash collision, and is it really a concern for data safety in a system like Permabit Enterprise Archive?
For the long explanation of cryptographic hashes and hash collisions, see a column I wrote a while back for SNW Online, “What you need to know about cryptographic hashes and enterprise storage”. The short version: deduplicating systems that use cryptographic hashes compute a short “fingerprint” to uniquely identify each piece of data and determine whether that data already exists in the system. The trouble is that, by a mathematical rule called the pigeonhole principle, you can’t uniquely map every possible file or file chunk to a shorter fingerprint. Mathematically, there must be multiple possible files that share the same hash.
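As an illustration of the fingerprinting idea (a minimal sketch, not Permabit’s actual implementation), a hash-indexed store can be expressed in a few lines of Python:

```python
import hashlib

# Hypothetical in-memory dedup store: fingerprint -> chunk.
# A chunk is written only if its fingerprint has not been seen before.
store = {}

def write_chunk(chunk: bytes) -> str:
    fingerprint = hashlib.sha512(chunk).hexdigest()
    if fingerprint not in store:      # new data: actually store it
        store[fingerprint] = chunk
    return fingerprint                # callers keep only the fingerprint

def read_chunk(fingerprint: str) -> bytes:
    return store[fingerprint]

a = write_chunk(b"hello world")
b = write_chunk(b"hello world")       # duplicate: nothing new is stored
assert a == b and len(store) == 1
```

The assumption a real system makes here is exactly the one the critics question: two different chunks never produce the same fingerprint.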
“Oh no!” say the critics… or at least, the critics without scalable deduplication. If multiple files could theoretically have the same hash, that means this will happen, and data will be corrupted! The sky will fall!
Well, no, not really. With older hash technologies, like MD5 which generates 128-bit hashes, this was almost a reasonable concern. With the modern SHA-2 family of hashes, data corruption due to hash collision is, for all intents and purposes, impossible.
We can get into lots of trouble when we’re talking about statistics, so let’s look at how likely one of these dreaded hash collisions is to occur. By another statistical principle, the birthday paradox, a hash collision in a pool of documents becomes 50% likely at around the square root of the number of possible hash values. (It’s called the birthday paradox because the probability follows the same rule as the chance of two people in a room sharing a birthday.) For SHA-512, a 512-bit hash, that number of documents is about 2^256, and 2^256 is a truly astronomical number.
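The birthday-paradox math can be sketched with the standard approximation p ≈ 1 − e^(−n(n−1)/2d) for n items drawn from d possible values; the numbers here are mine, not from any particular product:

```python
import math

def collision_probability(n: int, d: int) -> float:
    """Approximate chance of at least one collision among n random
    values drawn uniformly from d possibilities (birthday bound)."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))

# The classic birthday paradox: 23 people, 365 possible birthdays.
print(round(collision_probability(23, 365), 3))   # ~0.5

# For SHA-512 (d = 2**512), even 2**128 documents leave the
# probability indistinguishable from zero in double precision.
print(collision_probability(2**128, 2**512))
```

Only near n = 2^256 documents does the probability climb toward 50%, which is the square-root behavior described above.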
How big is 2^256? It’s about 116 thousand trillion trillion trillion trillion trillion trillion, a bit larger than 1 followed by 77 zeros. To put a scale on it, it’s significantly more than the number of atoms in our galaxy (around 2^226).
Statistically, that means a collision is incredibly unlikely. More unlikely than you and 12 of your colleagues each being independently struck by lightning. More unlikely than your house being struck by a meteor… five times. To write 2^256 documents at a rate of one per millisecond, you would need 2^221 years. That’s more than 3 million trillion trillion trillion trillion trillion years. More unlikely than just about anything you can imagine… but, I will admit, not exactly zero.
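These back-of-envelope figures are easy to check with Python’s arbitrary-precision integers:

```python
# Sanity-checking the magnitudes with exact integer arithmetic.
DOCS = 2**256                        # documents for a ~50% collision chance
assert len(str(DOCS)) == 78          # a bit more than 1 followed by 77 zeros

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365
years = DOCS // MS_PER_YEAR          # writing one document per millisecond
assert years.bit_length() == 222     # i.e. years is on the order of 2**221
print(len(str(years)))               # 67 digits: millions of trillion^5 years
```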
Should we panic? Of course not. Focusing on hash collisions as this horrible possibility for data corruption ignores all of the other, far more likely, ways in which data could be corrupted. It’s like buying a carbon fiber bicycle because you’re worried about being hit by lightning, and then riding without a helmet through downtown Boston traffic. Humans are really bad at gauging risk, so sometimes we worry about the wrong things, or about things we shouldn’t worry about at all.
In my next post I’ll explain not only why we shouldn’t be worried about hash collisions in systems that use SHA-2 family hashes, but also how much more severe problems can (statistically) strike other systems.