Permabits and Petabytes

August 19, 2008

Deduplication is Not a Feature

Filed under: Jered Floyd — jeredfloyd @ 8:23 pm

I’ve been writing an awful lot about deduplication lately: how it works, how it doesn’t, and how Permabit does it. I’ve been drumming it up a lot, so now I’m going to turn the tables and say something different: Deduplication doesn’t matter.

No, I’m not contradicting myself.

When you set out to buy an archive storage product, there are things that are features and things that are product characteristics. Examples of features are NFS protocol interface, unlimited volume size, low cost, and comes in blue, red or black. Examples of characteristics are Intel processor, SAS drives, number of gigabytes of RAM and, yes, deduplication.

These look like similar lists; what’s the difference?

Anyone in product management will tell you that the hardest thing about defining a product specification is getting the real requirements from the customer. People are very good at correlating characteristics they’ve seen before with qualities that they want in a product, and will mistakenly ask for the characteristic instead of the quality, or feature, that they want. This can result in disasters such as products that meet the specification 100%, but don’t solve the underlying problem.

Looking at the first list again, the requirement for “NFS protocol interface” probably comes from the desire “works with my existing software on my application servers”, and those application servers probably support NFS. NFS support is a feature. Similarly, I might need “works with my existing software without continual configuration changes”, which may lead to the requirement for “unlimited volume size”. These are clear features that support business requirements.

Reviewing the second list, however, what business driver would lead to a requirement for “uses an Intel processor”? Unless I have a business agreement with Intel, it doesn’t really matter if my storage appliance uses an Intel processor, an AMD processor, or a SPARC. The real requirement is probably “meets my performance needs”, and the customer has been conditioned to think that Intel processors, or SAS disks, or 32 GB of RAM are more likely to make that the case. But a system with those characteristics might just as easily fail to perform, while a system with different characteristics might well exceed the requirements. The job of a product manager is to extract the real requirements so that the product fulfills the customer needs.

Which brings us to deduplication. Deduplication is not a feature. It does not satisfy any underlying business requirement. The real requirement is the reduction of cost.

Permabit Enterprise Archive provides the most scalable, sub-file data deduplication because that technology helps reduce the per-gigabyte cost of storage for your business, while maintaining outstanding performance in terms of reliability, availability and scalability. That’s the reason we built in dedupe.
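For readers who haven’t seen sub-file deduplication up close, here is a minimal sketch of the general idea. It assumes fixed-size chunks addressed by their SHA-256 digest, and it is an illustration only, not a description of how Permabit Enterprise Archive works internally. The names `DedupStore` and `CHUNK_SIZE` are hypothetical and exist only for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen only for this example


class DedupStore:
    """Toy content-addressed store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes
        self.files = {}   # file name -> ordered list of chunk digests

    def ingest(self, name, data):
        """Split data into chunks and store only chunks not seen before."""
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def logical_bytes(self):
        """Bytes the clients believe they have stored."""
        return sum(len(self.chunks[d]) for ds in self.files.values() for d in ds)

    def physical_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore()
    # Four distinct chunks, so one 16 KiB "file".
    payload = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
    for n in range(100):
        store.ingest(f"copy-{n}", payload)
    print(store.logical_bytes(), store.physical_bytes())
```

Storing the same payload a hundred times grows the logical size a hundredfold while the physical size stays at one copy. That gap is the only thing the customer is really paying for.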

Data deduplication reduces cost in a number of different ways, and this has driven many vendors to scramble to find ways to integrate it into their products. Capital costs are clearly reduced because there’s less physical storage that must be purchased. Just as important are the operational cost savings: fewer drives to spin and cool for the same amount of data storage. Deduplication is an inherently “green” technology. Additionally, deduplication provides greater effective density, reducing rack space requirements in the data center.

There are other technologies that can reduce the cost of data storage, but only deduplication offers the potential to drive down the effective cost of online archive storage to below that of the hardware itself. That’s the real reason deduplication has so rapidly taken off as a technology in the marketplace, and one reason for the massive success of newer deduplicating VTL systems. We’re convinced that the next market to strongly realize the cost benefits from deduplication technology is the large-scale archive market, and that’s why we’ve built it into the core of our appliances, designed from the ground up to deduplicate data on ingestion.

The other lesson to learn from this is: don’t buy expensive dedupe! Expensive storage with deduplication is still expensive storage, and doesn’t meet the real customer requirement, which is cost reduction. Deduplication should never be considered just a checkbox that needs to be on the data sheet for your next storage system; it’s only useful if it’s saving you money today.
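To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the effective cost per logical gigabyte is simply the hardware cost per physical gigabyte divided by the deduplication ratio, and every dollar figure and ratio below is made up for illustration. The function name `effective_cost_per_gb` is hypothetical.

```python
def effective_cost_per_gb(raw_cost_per_gb, dedupe_ratio):
    """Cost per logical GB stored.

    raw_cost_per_gb: price per physical GB of the system.
    dedupe_ratio: logical bytes / physical bytes (1.0 means no savings).
    """
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    return raw_cost_per_gb / dedupe_ratio


# All figures are hypothetical, not vendor pricing or measured ratios.
systems = [
    ("plain storage, no dedupe", 3.00, 1.0),
    ("expensive storage with dedupe", 20.00, 5.0),
    ("inexpensive storage with dedupe", 4.00, 5.0),
]

for label, raw_cost, ratio in systems:
    cost = effective_cost_per_gb(raw_cost, ratio)
    print(f"{label}: ${cost:.2f} per logical GB")
```

With these invented figures, the expensive deduplicating system works out to $4.00 per logical gigabyte, which is more than the $3.00 plain system. The inexpensive one lands at $0.80, below the cost of its own hardware. The checkbox is the same in both cases; only one of them saves money.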



  1. Hmm, it seems like deduplication changes not just the cost per bit but the cost function — if you have dependable deduplication in the storage system, you may be able to simplify the software that runs on top of that storage system, because it doesn’t have to go to great lengths to avoid storing 100 or 1000 or 10 000 copies of the same gigabyte-sized file, because the copies don’t cost extra.

    So it’s still cost reduction, but it’s cost reduction like switching from bubblesort to mergesort, not cost reduction like switching from heapsort to quicksort.

    All this is hypothetical though; the only thing I’ve used that does automatic dedupe is `git`, and I haven’t built any software on top of it that benefits from it.

    Comment by Kragen Sitaker — September 17, 2008 @ 12:23 am

  2. Kragen,

    Interesting point. Many applications today try to do deduplication within the domain of their own data — mail servers are a good example here. Microsoft Exchange won’t store multiple copies of a message that was sent to multiple recipients on the same server; Cyrus will. Of course, this makes Exchange a more complicated piece of software.

    Given that many kinds of applications want deduplication services, it makes sense that this could eventually get pushed down to the OS level, where it can just be assumed to exist. Bottlenecks to performance (be it FLOPS or dollars) tend to go down this route: hand-coded assembly will rarely do better than the compiler today except in exceptional cases, and file-writing algorithms that try to second-guess the block cache don’t help anymore because the drives have gotten smarter.

    Perhaps in the future we can assume dedupe exists, and reduce some repeated complexity across applications.


    Comment by jeredfloyd — September 17, 2008 @ 2:49 pm
