Around the Storage Block

Myth Busting: 8 Common Data Deduplication Misconceptions

StorageExperts

 

It’s time to break through the myths and explore some very real truths around deduplication for data storage.

Data deduplication technology has been around for a long time but has undergone a bit of a resurgence lately as more storage vendors have added this feature to their hardware and software products. But simply having a data deduplication feature doesn't mean you'll use it well. Even many experienced storage administrators and architects mistake common misconceptions for truths.

Whether you are a system architect, planning staff, procurement person, or IT operations staff, and whether your data is on primary midrange and enterprise storage, archival storage, or all-flash and hybrid storage, you need to understand the basics—and pitfalls—of deduplication schemes.

Data reduction ratios: keeping it real

While deduplication is available for both primary and secondary storage, the data footprint reduction ratios you can achieve differ greatly. People frequently fall into the trap of assuming that what they can achieve on a deduplication storage system is the same as what they can get on a primary array.

Deduplication is automatic, but the data reduction ratio you actually achieve is not guaranteed. For example, if you need to store 100TB of data, it makes a huge difference whether you assume a 10:1 ratio and buy a 10TB device or assume a 2:1 ratio and buy a 50TB device instead. You must have a good idea of what's achievable before you buy.
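
As a minimal sketch of that sizing arithmetic (the 100TB figure and the candidate ratios are just the assumptions from the example above, not predictions):

```python
# Hypothetical sizing sketch: usable capacity to buy for a given amount of
# logical data, under an assumed (not guaranteed) data reduction ratio.
def required_capacity_tb(logical_tb: float, assumed_ratio: float) -> float:
    return logical_tb / assumed_ratio

logical_tb = 100.0  # data to be stored, from the example above

for ratio in (2, 5, 10):
    needed = required_capacity_tb(logical_tb, ratio)
    print(f"assume {ratio}:1 -> buy roughly {needed:.0f} TB usable")
# Assume too optimistic a ratio and the device fills up early;
# assume too pessimistic a ratio and you overbuy capacity.
```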

Having spent an extensive amount of time designing backup environments and working on deduplication on primary arrays, I have come across many misunderstandings about proper use. If you are using the technology in your environment or are involved in architecture designs and sizing that include deduplication technology, this discussion is for you.

Understanding these eight misconceptions about deduplication can help you more confidently deal with deduplication-related questions and better estimate what a realistic ratio should be for your environment.

1. Higher deduplication ratios yield proportionally larger data reduction benefits.

If one vendor promises a 50:1 deduplication ratio, is that five times better than another vendor's 10:1 claim? Deduplication is ultimately about reducing your capacity requirements, so the real question is: what are the potential capacity savings? A 10:1 ratio yields a 90% reduction in size, while a 50:1 ratio bumps that up to 98%. That's a difference of only 8 percentage points.

In general, the higher the deduplication numbers go, the smaller the incremental data reduction benefit—the law of diminishing returns.

[Figure: data reduction benefit versus deduplication ratio, illustrating diminishing returns]
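
To make the diminishing returns concrete, here's a quick sketch of the plain arithmetic (no product assumptions): the saved fraction is 1 - 1/ratio, so each further jump in ratio frees a smaller slice of the original footprint.

```python
# Space savings as a fraction of the original data: 1 - 1/ratio.
for ratio in (2, 5, 10, 20, 50):
    savings = 1 - 1 / ratio
    stored_tb = 100 / ratio  # stored footprint per 100 TB of logical data
    print(f"{ratio:>2}:1  saves {savings:6.1%}   stores {stored_tb:5.1f} TB per 100 TB")
# 10:1 already saves 90%; moving to 50:1 saves 98%, i.e. only 8 TB more per
# 100 TB of source data, even though the stored copies (10 TB vs. 2 TB)
# still differ by a factor of five.
```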

2. There is a clear definition for the term “deduplication.”

Deduplication is about reducing the amount of data stored by removing duplicate data items from the data store. This can occur on an object/file or physical data block level, or it can be application- or content-aware. Most products combine deduplication with data compression to further reduce the data footprint. While some vendors combine the two, others call them out separately or coin terms such as “compaction,” which is just a fancy way of saying "deduplication with compression." Unfortunately, there is no single, all-encompassing, widely accepted definition of deduplication.
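As a greatly simplified illustration of block-level deduplication combined with compression, here's a minimal content-addressed block store in Python. The fixed 4KB block size, SHA-256 fingerprints, and zlib compression are illustrative choices for this sketch, not a description of any particular product; real systems typically use variable-length chunking and far more sophisticated indexing.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # fixed-size blocks, purely for simplicity

def store_with_dedupe(data: bytes, store: dict) -> list:
    """Split data into blocks; keep one compressed copy per unique block.

    Returns the "recipe" of fingerprints needed to rebuild the data.
    """
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:                   # duplicate blocks are stored only once...
            store[fingerprint] = zlib.compress(block)  # ...and compressed ("compaction")
        recipe.append(fingerprint)
    return recipe

# Tiny demo: two "files" that share most of their content
store = {}
file_a = b"shared operating system image " * 1000 + b"tail unique to A"
file_b = b"shared operating system image " * 1000 + b"a longer tail unique to B"
for f in (file_a, file_b):
    store_with_dedupe(f, store)

logical = len(file_a) + len(file_b)
physical = sum(len(v) for v in store.values())
print(f"logical: {logical} bytes, physical: {physical} bytes, "
      f"ratio ~{logical / physical:.0f}:1 (inflated by the very repetitive demo data)")
```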

3. Deduplication ratios on primary storage are similar to those achievable on backup appliances.

Storage vendors use many different deduplication algorithms. Some are more CPU-intensive and sophisticated than others. It should come as no surprise, then, that deduplication ratios differ widely.

However, the biggest factor affecting the deduplication ratio you'll achieve is how much identical or similar data you store. For that reason, backup devices, which hold multiple copies of the same data in weekly backups, almost always show higher deduplication ratios than primary arrays do. You might keep multiple copies of data on your primary arrays too, but because those copies are typically space-efficient snapshots, the array already avoids storing the duplicate blocks, so they add little to the deduplication ratio. That's why primary storage deduplication ratios of around 5:1 are about as good as it gets, while backup appliances can achieve 20:1 or even 40:1, depending on how many copies you keep.

4. All data types are equal.

As should be clear by now, this is patently false. Data types that contain repetitive patterns within the data stream, for example, lend themselves well to deduplication. The deduplication ratio you can achieve depends on several factors, illustrated by the rough sketch after this list:

  • Data type—Pre-compressed, encrypted, and metadata-rich data types show lower deduplication ratios.
  • Data change rate—The higher the daily change rate, the lower the deduplication ratio. This is especially true for purpose-built backup appliances (PBBAs).
  • Retention period—The longer the retention, the more copies you'll have on your PBBA, raising your deduplication ratio.
  • Backup policy—A daily full backup strategy, as opposed to an incremental or differential one, will yield higher deduplication ratios because much of the data is redundant.
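
To see how change rate and retention pull the ratio in opposite directions, here's a deliberately crude back-of-the-envelope estimate for a PBBA under a daily full backup policy. The 2:1 compression assumption and the example change rates and retention windows are illustrative only:

```python
def rough_pbba_ratio(retained_copies: int, daily_change: float, compression: float = 2.0) -> float:
    """Crude estimate of a PBBA dedupe+compression ratio for daily full backups.

    Assumes the first full backup stores everything and each later full adds
    only the changed fraction; everything that lands on disk is also compressed.
    """
    logical = retained_copies                                   # N full copies, in units of one full
    physical = (1 + (retained_copies - 1) * daily_change) / compression
    return logical / physical

for copies, change in [(7, 0.05), (30, 0.05), (30, 0.20)]:
    ratio = rough_pbba_ratio(copies, change)
    print(f"{copies:>2} copies retained, {change:.0%} daily change -> roughly {ratio:.0f}:1")
# Longer retention pushes the ratio up; a higher change rate drags it back down.
```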

The table below provides a rough overview of the data compaction ratios that can realistically be expected on a PBBA. Remember that ratios on primary storage will be considerably lower.

[Table: typical data compaction ratios on a purpose-built backup appliance]

5. Grouping dissimilar data types increases your deduplication ratios.

In theory, if you mix different data types into one huge deduplication pool, the likelihood of finding identical blocks, or objects, should increase. In practice, however, the probability of matches between dissimilar data types, such as databases and Exchange email, remains low, so widening the pool this way mostly buys you more complex and time-consuming hash comparisons and the like. You're better off separating deduplication pools by data type. Going wide within a given data type, on the other hand, can give you a substantial increase in deduplication ratios.

For example, if you perform deduplication within a single virtual machine (VM) image, you will get one ratio, but if you target multiple copies of the same VM image (e.g., by performing daily backups of that VM to a deduplication store), your ratio will increase. Combine 50 VMs into the same store and, since those VM images are likely to be very similar, you'll improve your ratio even further. The wider you can go with your deduplication pool within a single data type, the better.
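
Here's a toy illustration of why pooling similar data pays off while pooling dissimilar data doesn't. The synthetic byte strings below merely stand in for VM images and a database, and the 4KB fixed-size chunking is an assumption for the sketch:

```python
import hashlib
import random

BLOCK_SIZE = 4096

def fingerprints(data: bytes) -> set:
    """Set of block fingerprints; blocks with equal fingerprints dedupe to one copy."""
    return {hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)}

random.seed(0)
base_os = random.randbytes(1 << 20)             # stand-in for a shared 1 MiB OS image
vm1 = base_os + random.randbytes(64 << 10)      # same OS plus its own data
vm2 = base_os + random.randbytes(64 << 10)      # same OS plus different data
database = random.randbytes(len(vm1))           # unrelated data type, same size

def shared(a: bytes, b: bytes) -> float:
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb)

print(f"VM1 vs VM2:      {shared(vm1, vm2):.0%} of unique blocks shared")
print(f"VM1 vs database: {shared(vm1, database):.0%} of unique blocks shared")
# Pooling the two similar VM images removes most duplicate blocks; pooling a VM
# with unrelated database data removes essentially nothing, while still
# enlarging the fingerprint index that has to be searched.
```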

6. Your first backup will show your predicted deduplication ratio.

This misconception comes up in discussions of relative deduplication ratios on primary storage versus backup appliances. If you hold one copy of the data for a given application, virtual machine, or the like, you'll see some deduplication and compression. But your ratio will only soar when you keep multiple copies of very similar data, such as multiple backups of the same database.

The figure below shows a very typical deduplication curve. This one is for an SAP HANA environment, but most application data follows the same curve. Your initial copy, or backup, shows some deduplication benefits, but most of the savings are due to data compression. As you retain more copies, however, your deduplication ratio for the overall store will increase, as shown by the blue line. The ratio for an individual backup (orange line) skyrockets starting with the second copy.

[Figure: deduplication curve for an SAP HANA environment, showing the overall store ratio (blue) and the per-backup ratio (orange) as more copies are retained]
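
A small simulation reproduces the shape of that curve. The 2:1 first-copy reduction (mostly compression) and the 3% daily change rate below are illustrative assumptions, not SAP HANA measurements:

```python
# Sketch of the dedupe "curve": per-backup ratio vs. overall store ratio as copies accumulate.
FIRST_COPY_REDUCTION = 2.0   # assumption: the initial backup benefits mostly from ~2:1 compression
DAILY_CHANGE = 0.03          # assumption: ~3% of blocks change between backups
COMPRESSION = 2.0            # assumption: changed blocks still compress ~2:1

logical = physical = 0.0
for backup in range(1, 31):
    written = (1.0 / FIRST_COPY_REDUCTION if backup == 1     # first copy lands almost in full
               else DAILY_CHANGE / COMPRESSION)              # later copies store only changed blocks
    logical += 1.0            # each backup is one full logical copy of the data
    physical += written
    if backup in (1, 2, 7, 30):
        print(f"backup {backup:>2}: this copy ~{1.0 / written:.0f}:1, "
              f"overall store ~{logical / physical:.0f}:1")
# The first copy sits near the compression ratio; the per-backup ratio jumps from
# the second copy on, and the overall ratio climbs steadily as retention grows.
```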

7. You can't increase deduplication ratios.

It would be naive to believe that there is no way to artificially boost deduplication ratios. If your goal is to achieve the highest possible ratio, then store as many copies of your data as possible (long retention times). Your actual stored capacity on disk will increase as well, but your ratio will soar.

Changing your backup policy works as well, as a real-world example shows: comparing daily full backups against weekly full backups combined with either daily incremental or daily differential backups, the daily full policy drives the highest deduplication ratio, yet the actual space used on disk is similar for all three approaches. So be wary when a storage vendor promises extremely high deduplication ratios, since a change in your backup schedule might be required to achieve them.
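
Here's a simplified sketch of that effect. The numbers are assumptions (1TB of source data, 3% daily change, 2:1 compression, 30 days of retention), and the second policy is reduced to a single initial full plus daily incrementals, but the pattern matches the point above: the reported ratio differs dramatically while the disk actually consumed barely moves.

```python
# Reported dedupe ratio vs. actual disk used for two backup policies,
# under illustrative assumptions only.
SOURCE_TB, DAILY_CHANGE, COMPRESSION, DAYS = 1.0, 0.03, 2.0, 30

def simulate(daily_fulls: bool):
    """Return (logical TB sent to the appliance, physical TB stored) over DAYS."""
    logical = physical = 0.0
    for day in range(DAYS):
        if day == 0 or daily_fulls:
            logical += SOURCE_TB                       # a full backup is sent...
        else:
            logical += SOURCE_TB * DAILY_CHANGE        # ...or only the changed data
        # either way, only new or changed blocks actually land on disk, compressed
        physical += SOURCE_TB * (1.0 if day == 0 else DAILY_CHANGE) / COMPRESSION
    return logical, physical

for name, daily_fulls in (("daily fulls", True), ("incrementals", False)):
    logical, physical = simulate(daily_fulls)
    print(f"{name:<12} reported ratio ~{logical / physical:3.0f}:1, disk used ~{physical:.2f} TB")
# Daily fulls report a far higher ratio, yet both policies store roughly the same bytes.
```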

8. There is no way to predetermine deduplication ratios.

Every environment is different, so it's hard to accurately predict real-world deduplication ratios. However, vendors do offer lightweight primary storage and backup assessment tools that are easy to run and provide insight into data types, retention periods, and the like. These tools typically allow a reasonably accurate prediction of achievable deduplication ratios.

Also, vendors have information about the ratios their installed base has achieved, and they can even break that down by industry segment. While there's no guarantee that you'll see the same benefits, it should provide some peace of mind. And if peace of mind isn't enough, ask the vendor for a guarantee. Some vendors do offer deduplication guarantees under certain circumstances.

Finally, a proof-of-concept conducted on a representative subset of your data will provide even more accurate estimates.


Ready, set, start your deduping

There is no magic behind deduplication, but now that you understand the basics, you should be well equipped to maximize the effectiveness of deduplication technology on your storage arrays and appliances.

Let me know what sorts of ratios you have achieved on your data.

   

Meet Around the Storage Block blogger Tilman Walker, Manager, Technical Marketing Engineering, HPE Storage.

About the Author

StorageExperts

Our team of Hewlett Packard Enterprise storage experts helps you to dive deep into relevant infrastructure topics.

Comments
Howard Marks

I have to pick a bone with your math in point 1. The problem is you're comparing percentages as if they were absolute values. A percentage is a meaningless number if you don't specify what the percentage is OF.

You say 50:1 is a 98% savings and 10:1 is a 90% savings, and that the difference is only 8 percentage points. But 8% of what? Framing it as a difference in savings makes it sound like at 10:1 I'll only need 8% more disk or SSD space than at 50:1, when I'll actually need 5X as much disk. 1TB at 10:1 reduces to 100GB; at 50:1 it reduces to 20GB. So the difference in the thing I pay for, and provide space in my rack for, isn't 8%, it's 5X.

HPEStorageGuy

Howard! Thanks for reading the post and stopping by to pick bones. I saw your comment and told Tilman that I'd reply to you. Percentage quibbles aside, both you and Tilman are correct. I think the first graph he shared tells the story about the incremental savings, but I created another view to make the point.

[Chart: incremental capacity savings at increasing deduplication ratios]

One thing to point out is that my scale is not consistent: I started with ratios from 2:1 up to 10:1 and then used increments of 10 from there. The incremental savings from 10:1 to 50:1 is 8% of the original data. But you're also right that 10:1 leaves 5X as much data on disk as 50:1.

I think an example helps make it clear. If we have 100TB of backup data and get 10:1 deduplication, we'll have 10TB of data on disk. If we're getting 50:1 deduplication, we'd have 2TB. With 100TB as the starting point, the difference between 10TB and 2TB is 8% of the original data. It's also 5X as much stored data.

Let's look at it from a cost perspective. Let's say that storage costs $1/GB (or $1000/TB), and we'll use the same 100TB of data. Here are some different cost points:

  • 1:1 dedupe ratio: 100TB X $1000/TB = $100,000
  • 2:1 dedupe ratio: 50TB X $1000/TB = $50,000

So in this range, doubling the dedupe ratio cuts costs in half. The customer would save $50,000.

Now let's look at how this changes at higher dedupe ratios:

  • 10:1 dedupe ratio: 10TB X $1000/TB = $10,000
  • 20:1 dedupe ratio: 5TB X $1000/TB = $5,000

Again, we cut costs in half, but the absolute savings are only $5,000.

And since we talked about 50:1, here's what that looks like.  

  • 50:1 dedupe ratio: 2TB X $1000/TB = $2,000

Why is that important for a customer? If one vendor promises 1:1 dedupe and another 2:1, that will make a substantial difference for the customer when purchasing the solution. But if one vendor offers 10:1 and another 20:1, the cost differential will be considerably smaller and might not be enough to differentiate between the two vendors.

Again, thanks for picking a bone!