Permabits and Petabytes

August 22, 2008

Deduplication is Not a Crime

Filed under: Jered Floyd — jeredfloyd @ 10:00 am

We’re starting to get deep into the election season, so the negative ads are coming fast and furious. Shadowy pictures and a scary voice saying things like “John Smith says that he supports healthy meals for school children, but could it really be because he’s fattening them up to be sold as meat to foreign terrorists? A child-eating terrorist supporter? Is that really the sort of person you want as your state representative?” The sort of manipulative FUD that scares people on an issue without actually presenting any evidence.

We get the same sort of thing in storage. It’s not seasonal, though.

For example, there’s been a good amount of FUD about deduplication. I’ve already talked about hash collisions and explained why you’re more likely to spontaneously combust than have a problem there. Another kind of FUD about dedupe targets its apparent complexity, and the very fact that you get space savings. “Jered Floyd says that deduplication saves you time and money while ensuring the integrity of your data, but could it really be that he’s a space alien bent on destroying society as we know it? After all, deduplication deletes data! Is that the sort of technology you want in your data center? Do you really want a space alien telling you what to buy?”

I’m not making this up, although perhaps I am exaggerating a little bit for effect. I really don’t mean to keep picking on Jon Toigo (sorry, Jon!), but he’s brought this up a few times. In one post he says:

Will de-duplicated data […] pass muster with the Fed as “full complete and unaltered data?”

And after I object to the implication that deduplication “deletes” data:

I am not disagreeing with any of your points from a technical standpoint. I don’t think that storage admins should decide what is important from a business perspective, deleting as they see fit electronic files of the company. […] Deduplication has NOT yet been subjected to the acid test of litigation. Nor has lossy compression to my knowledge.

Yeah, and I’m a space alien. Jon’s not the only one who’s brought this issue up; he’s just the most entertaining.

Deduplication doesn’t delete data. Deduplication doesn’t change data. The bits that come out through the standard interfaces — NFS, CIFS and WebDAV in Permabit Enterprise Archive — are the exact same bits you put in. The data is not changed.

Dedupe does not change data any more than compression changes data, or traditional file systems change data. Plain old LZ compression gives you a different output bitstream than what went in, with redundant parts removed, just like deduplication. But when you decompress the file, you get your exact original bitstream back. No information is lost.
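If you want to see that for yourself, here is a minimal sketch using Python's standard `zlib` module (DEFLATE, which is LZ77 plus Huffman coding). The sample record is made up for illustration; any redundant input would do.

```python
import zlib

# Highly redundant input: the same short record repeated many times.
# The record contents are invented purely for this example.
original = b"quarterly-report;rev=7;status=final;owner=finance;\n" * 1000

compressed = zlib.compress(original)    # what would actually be stored
restored = zlib.decompress(compressed)  # what the reader gets back

# The stored bitstream is different from, and smaller than, the input...
print(len(original), "bytes in,", len(compressed), "bytes stored")
print("stored bytes differ from input:", compressed != original)

# ...but the round trip is bit-for-bit exact.
print("identical after decompress:", restored == original)
```

The stored bytes look nothing like the input, yet nobody would say the compressor "deleted" part of the file.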

Conventional file systems break up files into blocks and scatter those blocks across one or more disks, requiring complicated algorithms to retrieve and reassemble the data. Dedupe is no different. Nonrepudiation requirements are satisfied by the reliability and immutability of the system as a whole, deduplicating or not.
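To make the comparison concrete, here is a toy sketch of block-level deduplication. It is not how Permabit Enterprise Archive or any other product is implemented; the `DedupStore` class, the fixed 4096-byte block size, and the use of SHA-256 as the block identifier are all assumptions chosen to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed chunk size, for illustration only


class DedupStore:
    """Toy content-addressed store: each unique block is kept exactly once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes
        self.files = {}   # file name -> ordered list of digests (the "recipe")

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        # Reassemble the file by following its recipe, block by block.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


# Two files that share their first two blocks but end differently.
header = b"H" * BLOCK_SIZE
body = b"B" * BLOCK_SIZE
file_a = header + body + b"unique tail of A"
file_b = header + body + b"unique tail of B, slightly longer"

store = DedupStore()
store.write("a.doc", file_a)
store.write("b.doc", file_b)

print("bytes written:", len(file_a) + len(file_b))
print("bytes stored: ", store.stored_bytes())
print("a.doc intact: ", store.read("a.doc") == file_a)
print("b.doc intact: ", store.read("b.doc") == file_b)
```

The shared blocks are stored once, and both files still read back byte for byte. The "recipe" of digests plays the same role as a conventional file system's block map.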

You even have the same “problem” with hard drives and tape drives. The way magnetic domains get written onto media differs from vendor to vendor and drive to drive. They all use complex PRML read channels and error-correcting codes (ECC), but I wouldn’t accuse them of changing data.

Even scanning in your old paper documents becomes a concern. In fact, that’s far worse than any of the previous examples — when you scan in your documents you choose a resolution, say 300dpi, and the rest of the data really is thrown away! That should scare you more than anything else.

What matters here is the abstraction boundary provided by the user-facing interface. If that interface maintains the integrity of the data, then it’s a reliable data storage solution. The bits you send to a hard drive never get written in that exact form on a disk; they get translated into an entirely different bit pattern by an error-correcting code. The bits that get written to tape with compression turned on aren’t the original source bits. And the bits on the drives in a deduplicating system are not the exact complement of bits sent to it. But in all three cases, when you use the application interface you always, always, always get back exactly what you wrote in the first place.
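The hard drive case can be sketched in a few lines too. Real drives use far more elaborate codes than this; the classic Hamming(7,4) code below is only a stand-in, and the `write` and `read` functions are hypothetical names for the two sides of the abstraction boundary.

```python
def encode_nibble(d1, d2, d3, d4):
    """Hamming(7,4): 4 data bits become 7 stored bits (3 parity bits added)."""
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def decode_codeword(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # 1-based position; 0 means no error
    if error_pos:
        c[error_pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def write(data):
    """Below the interface: each byte becomes two 7-bit codewords."""
    media = []
    for byte in data:
        bits = [(byte >> shift) & 1 for shift in range(7, -1, -1)]
        media.append(encode_nibble(*bits[:4]))
        media.append(encode_nibble(*bits[4:]))
    return media


def read(media):
    """Above the interface: codewords become the original bytes again."""
    out = bytearray()
    for hi, lo in zip(media[0::2], media[1::2]):
        byte = 0
        for bit in decode_codeword(hi) + decode_codeword(lo):
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


payload = b"exactly what you wrote"
media = write(payload)

# Simulate media defects: flip one bit in every stored codeword.
for i, codeword in enumerate(media):
    codeword[i % 7] ^= 1

print("stored bits:", sum(len(c) for c in media), "for", len(payload) * 8, "data bits")
print("read back intact:", read(media) == payload)
```

What sits on the "media" is a different and longer bit pattern, and here it has even been damaged, but what comes back through `read` is exactly what went in through `write`. That is the only boundary at which "unaltered" can sensibly be judged.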

Don’t fall for the FUD. Vote Deduplication in 2008!


