Permabits and Petabytes

July 24, 2008

Statistical Demons Lurk Everywhere

Filed under: Jered Floyd — jeredfloyd @ 1:57 am

In my last post I wrote about why hash collisions are fundamentally not something to be concerned about in a deduplicating storage system that uses SHA-2 hashes. Now I’ll use the same logic that other vendors use to attack hash-based systems to demonstrate that their systems may corrupt data even more frequently than hash collisions!

Hash-based systems are criticized for the possibility of an incredibly statistically unlikely event, but such analyses are done only in the abstract, assuming "perfect" hardware. Let's talk about the systems on which data storage is actually done.

NetApp claims that its A-SIS deduplication is more robust because after using a 16-bit hash to find candidate blocks, it does a full binary compare on every possible match. Leaving aside the terrible performance implications of this (more on that later), what do the statistics say here? That is, what are the chances that the routine doing the bit-by-bit compare says the data matches when it actually doesn't? (more…)
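To make the compare-on-match approach concrete, here is a minimal sketch (hypothetical code, not NetApp's implementation) of a deduplicating store that uses a short, weak hash only to find candidate blocks, then confirms every candidate with a full byte-by-byte compare. The class and method names are my own, and CRC32 stands in for whatever short candidate hash a real system might use:

```python
import zlib

# Hypothetical sketch of compare-verified deduplication: a weak hash
# narrows the search to candidate blocks; a full binary compare decides.
class CompareVerifiedStore:
    def __init__(self):
        self._by_weak_hash = {}  # weak hash -> list of stored blocks

    def store(self, block: bytes) -> bool:
        """Store a block; return True if it deduplicated against existing data."""
        weak = zlib.crc32(block)  # stand-in for a short candidate hash
        candidates = self._by_weak_hash.setdefault(weak, [])
        for existing in candidates:
            if existing == block:  # full bit-by-bit compare on each candidate
                return True        # genuine duplicate: reference existing copy
        candidates.append(block)   # weak-hash collision or genuinely new data
        return False
```

Note the performance cost visible even in this toy: every store of duplicate data requires reading the full existing block back to run the comparison, which is exactly the overhead discussed above.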

July 18, 2008

What do Hash Collisions Really Mean?

Filed under: Jered Floyd — jeredfloyd @ 9:06 pm

When considering deduplication technologies, other vendors and some analysts bring up the bogeyman of hash collisions. Jon Toigo touched upon it again in a recent post to his blog, and Alex McDonald from NetApp brought it up in response to a recent post that Mark Twomey made.

So, what is a hash collision, and is it really a concern for data safety in a system like Permabit Enterprise Archive?

For the long explanation on cryptographic hashes and hash collisions, I wrote a column a while back for SNW Online, "What you need to know about cryptographic hashes and enterprise storage". The short version is that deduplicating systems that use cryptographic hashes use those hashes to generate shorter "fingerprints" to uniquely identify each piece of data, and to determine whether that data already exists in the system. The trouble is, by a mathematical rule called the "pigeonhole principle", you can't uniquely map every possible file or file chunk to a shorter fingerprint. Necessarily, there are multiple possible files that share the same hash. (more…)
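The pigeonhole principle is easy to see in miniature. The sketch below (my own illustration, not from the column) truncates SHA-256 to a 16-bit fingerprint and searches for two different inputs that share it; with only 65,536 possible fingerprints, a collision turns up after a few hundred tries. A full 256-bit hash plays the same game on a space so large that no realistic workload will ever find a collision:

```python
import hashlib

def short_fingerprint(data: bytes, bits: int = 16) -> int:
    """Truncate SHA-256 to `bits` bits -- a deliberately tiny fingerprint."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def find_collision(bits: int = 16):
    """Return two distinct inputs whose truncated fingerprints match."""
    seen = {}  # fingerprint -> first input that produced it
    i = 0
    while True:
        data = str(i).encode()
        fp = short_fingerprint(data, bits)
        if fp in seen:
            return seen[fp], data  # two different inputs, same fingerprint
        seen[fp] = data
        i += 1
```

The point is not that collisions happen in practice with SHA-2, but that they must exist in principle whenever the fingerprint is shorter than the data, which is what critics of hash-based deduplication seize on.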
