Permabits and Petabytes

October 14, 2008

Dedupe Ratios: Fact and Fiction

Filed under: Jered Floyd — jeredfloyd @ 5:13 am

Greetings, everyone. I’ve been back from my trip for a while; I apologize for the lack of posts! I’ll get some pictures up soon, although the ones in the official gallery are surely better. I’ll post some of my favorite things I saw later.

Over at his blog, Scott Waterhouse (from EMC) takes to task backup vendors that claim outrageous dedupe ratios. Now, this is just a wee bit entertaining given that EMC Avamar has been claiming deduplication ratios of 300:1 for many years now, but his point is absolutely valid.

Vendors of deduplication in the backup space (primarily VTL, although there is a fair amount of disk-to-disk out there too) are in the business of getting you to buy their storage. These backup appliances aren’t actually all that cheap. In general they run $20/GB and up in terms of raw storage cost, and potentially a lot more if they’re gateway products writing to a pricey SAN. That doesn’t sound that exciting for replacing $1/GB (or less) tape, even though tape has numerous problems.

To avoid that comparison, every backup-to-disk vendor that I’ve seen only talks about “effective costs” and “effective capacity,” using whatever magic ratio they find appropriate. And unlike car mileage, where there are standard EPA guidelines for “city mileage” and “highway mileage,” deduplication ratios seem to be largely based on coasting downhill all the way… and so we end up with vendors selling backup appliances with 30 TB of disk inside, yet all the marketing materials read “stores 1 petabyte!”
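To put rough numbers on that marketing math, here is a small Python sketch. The function names are my own, 1 PB is taken as 1,000 TB, the $20/GB raw price is the figure quoted above, and the 5:1 case is purely a hypothetical "what if your data dedupes less well" scenario, not a measured result.

```python
def implied_dedupe_ratio(raw_tb: float, advertised_tb: float) -> float:
    """Ratio a vendor must assume for the raw disk to hold the advertised capacity."""
    return advertised_tb / raw_tb


def effective_cost_per_gb(raw_cost_per_gb: float, assumed_ratio: float) -> float:
    """'Effective' $/GB: the raw price divided by whatever ratio is assumed."""
    return raw_cost_per_gb / assumed_ratio


# 30 TB of disk sold as "stores 1 petabyte" (assuming 1 PB = 1,000 TB)
ratio = implied_dedupe_ratio(raw_tb=30, advertised_tb=1000)
print(f"implied ratio: {ratio:.1f}:1")

# $20/GB raw, priced at the marketing ratio and at a hypothetical 5:1
print(f"effective cost at the implied ratio: ${effective_cost_per_gb(20, ratio):.2f}/GB")
print(f"effective cost at a hypothetical 5:1: ${effective_cost_per_gb(20, 5):.2f}/GB")
```

The "1 petabyte" claim only holds if your data really dedupes at roughly 33:1. At the hypothetical 5:1, the same box works out to $4.00/GB rather than $0.60/GB, which is why the assumed ratio matters more than the sticker price.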

One part of the solution to this will be coming up with a set of tests for standardized conditions — city data and highway data, as it were. A good spot for this to happen is in the SNIA Data Deduplication and Space Reduction Special Interest Group (DDSR SIG), although right now that group is caught up in agreeing on basic terms. (I love SNIA and am active within the organization, but being so vendor-driven has its share of problems. Every document produced has to be audited by the marketing departments of all member companies, and if any of them think something makes their product look bad… let’s just say it’s going to be a while before we publish a document that says “product X works better than product Y.” How to fix this problem is perhaps a topic for a future day.)

Standardized dedupe ratios will be a long time coming, so we’ve taken a different approach at Permabit. We only talk about the amount of real, usable disk that you get in Permabit Enterprise Archive. If you order 100 TB, you get 100 TB of disk. No dedupe assumed, no compression assumed; all of that is gravy. That will always cost less than $5/GB, often a lot less (contact our sales team to find out how much less!).

An archive storage solution has to be cost effective from the start because, as with backup, deduplication ratios can vary greatly. We have some customers who see 20-30% savings on basic office productivity documents, others that see 10x savings on database backups, and still others that see 300x savings on virtual system images! I wouldn’t necessarily call any of these representative, but all reflect significant additional cost savings beyond the already low TCO that Enterprise Archive delivers. These are all without counting the fact that snapshots are essentially free, even though some other vendors have started including that as “deduplication”.
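Percent savings and "Nx" ratios are two ways of writing the same thing, and mixing them makes comparisons harder than they need to be. Here is a minimal Python sketch of the conversion; the function names are mine, and I am assuming "Nx savings" means the data shrinks to 1/N of its original size.

```python
def savings_fraction(ratio: float) -> float:
    """Fraction of space saved when data shrinks by ratio:1 (assumed meaning of 'Nx')."""
    return 1 - 1 / ratio


def ratio_from_savings(fraction: float) -> float:
    """Reduction ratio equivalent to saving the given fraction of space."""
    return 1 / (1 - fraction)


# The customer examples above, expressed both ways
for pct in (0.20, 0.30):
    print(f"{pct:.0%} savings is about {ratio_from_savings(pct):.2f}:1")
for n in (10, 300):
    print(f"{n}x is about {savings_fraction(n):.1%} of space saved")
```

Put this way, 20-30% savings is only about 1.25:1 to 1.43:1, while 10x already saves 90% of the space. Going from 10x to 300x adds less than ten more percentage points, so the headline ratio overstates the difference in disk you actually buy.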

We think that if you’re looking to save money, waiting until your data gets into the backup stream is too late in the process. That data is occupying expensive primary disk, and on top of that it’s getting backed up repeatedly onto tape or VTL. If the primary storage is costing you $30 and the total cost of backup is $3, cutting that $3 to $1 just isn’t saving you that much.

Instead, much of the primary data can be moved to an archive tier of storage. Not only does that eliminate the need for primary storage growth, but with built-in replication from a product like Enterprise Archive there’s also no longer any need to back it up at all. You’ve just cut that $30 to $5, and the $3 to zero! We recently published a simple ROI calculator that lets you see the savings for yourself, based on your data. Go try it out.
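The arithmetic behind that comparison is simple enough to sketch in a few lines of Python. The $30, $3, $1, and $5 figures are the illustrative ones from the paragraphs above, applied to the same unit of data in each case; the function names are mine, and this is not the ROI calculator itself.

```python
def total_cost(primary: float, backup: float) -> float:
    """Combined cost of storing one unit of data and protecting it."""
    return primary + backup


def percent_saved(before: float, after: float) -> float:
    """Savings relative to the starting cost, as a percentage."""
    return 100 * (before - after) / before


baseline = total_cost(primary=30, backup=3)        # primary disk plus backup
dedupe_backup = total_cost(primary=30, backup=1)   # dedupe only the backup stream
archive_tier = total_cost(primary=5, backup=0)     # move to a replicated archive tier

print(f"dedupe the backup: {percent_saved(baseline, dedupe_backup):.1f}% saved")
print(f"move to archive:   {percent_saved(baseline, archive_tier):.1f}% saved")
```

With these figures, deduplicating the backup stream trims the total by about 6%, while moving the same data to an archive tier cuts it by about 85%. That only applies to the portion of primary data that is suitable for archiving, so your own mix will change the result.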

Dedupe ratios may be confusing, but the hard dollars and cents of raw storage are not. Contact our team if you’d like to find out more — they have a tool that will allow you to determine the deduplication ratio that you’ll see on your own data, so you’ll know in advance exactly how much you’re going to save.

1 Comment »

  1. The 300:1 number is a LAN/WAN transmission reduction number not a storage capacity reduction number.

    It’s derived from the amount of data which doesn’t have to be sent due to source based de-duplication and the fact the Avamar Data Store is global so all unique objects are shared amongst all clients.

    If you have unique objects on a host post de-dup that doesn’t mean they’re not already in the Data Store and if they are it means they don’t have to be transmitted.

    Marketing might choose to hype the number one way; I explain it in context the right way.

    And that 300:1 is an average, I’ve done more than 500:1 in the lab.

    Comment by Storagezilla — October 14, 2008 @ 8:13 am
