In my last post I wrote about why hash collisions are fundamentally not something to be concerned about in a deduplicating storage system that uses SHA-2 hashes. Now I’ll use the same logic that other vendors use to attack hash-based systems to demonstrate that their own systems may corrupt data even more frequently than hash collisions would!
Hash-based systems are criticized for the possible occurrence of a statistically very unlikely event, but such analyses are done only in the abstract, assuming “perfect” hardware. Let’s talk about the systems on which data is actually stored.
NetApp claims that their A-SIS deduplication is more robust because after using a 16-bit hash to find candidate blocks, they do a full binary compare on every possible match. Leaving aside the terrible performance implications of this (more on that later), what do the statistics say here? That is, what are the chances that the routine doing the bit-by-bit compare says the data matches when it actually doesn’t?
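To make the scheme in question concrete, here is a minimal sketch of a "weak hash to find candidates, then full compare" write path. It is an illustration only, not NetApp's implementation: the `DedupStore` class, the `weak_hash16` helper, and the choice of the low 16 bits of CRC-32 as the fingerprint are all assumptions made for the example.

```python
import zlib


def weak_hash16(block: bytes) -> int:
    """Hypothetical 16-bit fingerprint: the low 16 bits of CRC-32."""
    return zlib.crc32(block) & 0xFFFF


class DedupStore:
    """Toy in-memory block store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.blocks: list[bytes] = []          # unique blocks, indexed by id
        self.index: dict[int, list[int]] = {}  # fingerprint -> candidate ids

    def write(self, block: bytes) -> int:
        """Store a block and return its id, reusing an identical block."""
        fp = weak_hash16(block)
        # A 16-bit fingerprint has only 65,536 values, so it merely
        # nominates candidates; each one needs a full compare.
        for block_id in self.index.get(fp, []):
            if self.blocks[block_id] == block:  # byte-for-byte compare
                return block_id
        # No candidate matched: store the block as new.
        self.blocks.append(block)
        block_id = len(self.blocks) - 1
        self.index.setdefault(fp, []).append(block_id)
        return block_id
```

In this sketch the correctness of deduplication rests entirely on the `==` comparison, which runs on real CPUs, memory, and buses. That comparison step is the one the question above is about.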