As I wrote about last month, hash collisions are not something to be concerned about in a properly designed deduplicating storage system, despite what some FUD vendors would like you to think. You don’t have to take just my word for it; Curtis Preston wrote about this last year too.
In fact, hash-based systems are the most likely to be capable of handling enormous amounts of data for deduplication; the challenges in building an efficient system for matching hashes (or fingerprints) are entirely different from the sorts of problems storage system builders have had to solve in the past. Permabit set out from day one to build a system capable of efficiently deduplicating petabytes of storage, and this technology is realized in Enterprise Archive. We’ve developed our own file systems and distributed transaction managers to solve just these problems, so for any chunk of data written to our system we can determine in mere milliseconds if we’ve seen that information before.
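At its core, in-line hash-based deduplication boils down to one fingerprint lookup per chunk. Here's a toy Python sketch of the idea (purely illustrative; the class and method names are my own, and this is nothing like a real distributed implementation):

```python
import hashlib


class InlineDedupStore:
    """Toy in-line dedup store: each unique chunk is stored once,
    keyed by its SHA-256 fingerprint."""

    def __init__(self):
        self.index = {}  # fingerprint -> stored chunk

    def write(self, chunk: bytes) -> str:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in self.index:      # first time we've seen this data
            self.index[fp] = chunk
        return fp                     # caller keeps only the reference


store = InlineDedupStore()
a = store.write(b"hello world")
b = store.write(b"hello world")  # duplicate: nothing new is stored
assert a == b and len(store.index) == 1
```

The hard part at petabyte scale isn't this logic, of course; it's making that index lookup fast when the index no longer fits in memory.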
Because we have this in-line processing of incoming data, there’s never a need for spare storage space to cache data for later deduplication, and there’s never a deduplication window where the system has to pause to “catch up” with data that’s been written so far. These are some of the key benefits to in-line deduplication. The story isn’t so good for non-purpose-built systems, unfortunately.
Since the details are out there, let's look at how NetApp has added a dedupe "checkbox" to their appliances. Personally, I think NetApp makes fantastic devices and is a leader in high-performance, easy-to-use NAS, but I also think their dedupe just doesn't cut it.
As Beth Pariseau's article describes, NetApp's A-SIS only dedupes within a single volume, up to 16 TB in size. This severely limits the pool of data that is deduplicated together. In a backup-to-disk case this isn't fatal, because a lot of redundant data gets written frequently; then again, if you're using a Filer for D2D backup, you clearly aren't worried about cost! Such a small dedupe pool is a big problem for deduplicating archive data, where the opportunities for deduplication are fewer and farther between. If you have hundreds of terabytes of data, having to manually separate your files into individual 16 TB silos will significantly hurt your deduplication ratios.
It gets worse from there. The A-SIS deduplication is built on top of the existing 16-bit checksums that NetApp’s WAFL file systems use for additional integrity protection. (We have similar integrity checksums on top of our SHA-256 fingerprinting.) This is where trying to retrofit deduplication into an existing performance file system goes astray.
A-SIS deduplication occurs as a scheduled daily process, and we’ll see why in just a moment. But first, this means that you need to have enough space to store all the data the system hasn’t gotten around to deduplicating yet, potentially many terabytes. When the deduplication process kicks off, it looks at every newly written 4 KB WAFL block and tries to figure out if that block matches any other block written in the same volume. To do this, it uses that 16-bit checksum as a basic hash.
A 16-bit checksum can take only 65536 different values, so unlike with a 256-bit (or larger) hash, collisions aren't just a theoretical worry; they're guaranteed. To make sure they've got the right block, then, they perform a full bit-wise compare against all the candidate blocks. Sounds good, right?
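Here's a toy sketch of that approach (my own hypothetical code, not NetApp's actual implementation). The weak 16-bit checksum only narrows the search to a bucket of candidates; every candidate still needs a full byte-wise compare, which on a real system means a disk read:

```python
from collections import defaultdict


def checksum16(block: bytes) -> int:
    # Simple 16-bit sum, standing in for WAFL's block checksum.
    return sum(block) & 0xFFFF


buckets = defaultdict(list)  # checksum value -> candidate blocks


def dedupe_block(block: bytes) -> bytes:
    """Return a previously stored identical block, or store this one."""
    key = checksum16(block)
    for candidate in buckets[key]:  # each compare is a disk read
        if candidate == block:      # in a real system
            return candidate
    buckets[key].append(block)
    return block
```

With only 65536 bucket values, those candidate lists grow enormous on a large volume, and that's exactly where this scheme falls down.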
This solution isn't going to corrupt your data, but the performance will be terrible! A 16 TB file system will contain up to 4 billion 4 KB WAFL blocks. With 64K possible checksum values, each checksum bucket in their database will hold, on average, 64K candidate blocks. That means that for every newly written block, the A-SIS routine will have to check the full data against an average of 65536 other blocks.
Those blocks that need to be checked will be spread evenly across the drives in the system, which means each comparison will require a disk seek. Even with 15K RPM drives, the average seek time is 3.3 ms, giving about 300 seeks per second. That means that on a full file system, it will take more than three and a half minutes to deduplicate a single block!
To put this in even more perspective: with only 1440 minutes in the day, only around 400 blocks (or 1.6 MB) could be deduplicated per day if the system were running in-line. Post-processing helps them some here; presumably, as part of the nightly dedupe process, A-SIS considers all newly written blocks with the same fingerprint together (all of which can be compared at the same time), but this still only allows 1440 of the 65536 buckets, or about 2% of the possible blocks, to be considered each night.
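The back-of-the-envelope arithmetic above is easy to check in Python, using the same figures (16 TB volume, 4 KB blocks, a 16-bit checksum, and a 3.3 ms average seek):

```python
TB = 2**40
BLOCK = 4 * 2**10                    # 4 KB WAFL block

blocks = (16 * TB) // BLOCK          # blocks in a full 16 TB volume
buckets = 2**16                      # distinct 16-bit checksum values
candidates = blocks // buckets       # avg candidates per checksum bucket

SEEK_S = 0.0033                      # 15K RPM average seek time
secs_per_block = candidates * SEEK_S
blocks_per_day = int(86400 / secs_per_block)

print(blocks)           # 4294967296: 4 billion blocks
print(candidates)       # 65536 compares per new block
print(secs_per_block)   # ~216 s, over three and a half minutes
print(blocks_per_day)   # ~400 blocks, about 1.6 MB per day
print(1440 / buckets)   # ~0.022: only ~2% of buckets per nightly run
```

Every step in the chain checks out; the numbers really are that bad on a full volume.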
If I’ve gotten anything wrong in the above, I’m happy to correct the errors. From field reports, though, A-SIS just doesn’t perform, and it’s clear why.
To deduplicate hundreds of terabytes of data in a single storage pool, it’s necessary to build a system from the ground up to be designed for deduplication. Attempts to retrofit deduplication onto an existing storage system, even a great one, simply won’t scale. Trying to do so is like strapping a jet engine onto a duck, and hoping it will soar.