Around the Storage Block

What’s so great about StoreOnce federated deduplication?

By Ashwin Shetty, Product Marketing, HPE Storage

 

It has been more than two years since we announced federated deduplication. This technology addresses many of the shortcomings of other deduplication solutions, such as incompatible deduplication algorithms implemented in software and hardware, restore performance that lags well behind backup performance, and inefficient ways of scaling deduplication to meet ever-expanding capacity requirements. All of these challenges mean increased risk, additional cost, and significant management overhead for the user. So let’s take a closer look at HPE StoreOnce deduplication and the benefits it brings.

 


Check out our HP StoreOnce portfolio

 

Federated deduplication

StoreOnce federated deduplication is a technology developed in HP Labs that can be deployed across the entire storage infrastructure. It provides deployment independence, enabling data to move across various HPE systems without being rehydrated. Federated deduplication allows data reduction to occur in HPE Data Protector software, in HPE StoreOnce Backup appliances, in a virtual machine, or as a standalone deduplication API on a server.

 

Optimized data deduplication

HPE StoreOnce deduplication technology includes several innovations from HP Labs:

  • It is designed to eliminate the maximum amount of data redundancy while maintaining a small index, delivering the fastest performance.
  • It splits the backup data stream into a series of chunks using a variable-chunking deduplication algorithm, with a chunk size of 4K, the smallest in the industry; most vendors use chunk sizes of 8K, 16K, or 32K. Smaller chunks match data better: a higher percentage of “data matches” yields higher deduplication ratios, and the 4K chunk size also aligns better with mixed workloads. (A sketch of this style of variable chunking follows this list.)
  • It deploys locality sampling and sparse indexing to significantly lower I/O and RAM requirements without compromising performance.
  • It deploys intelligent data-matching techniques that enable optimal performance across multiple data types and backup software deployments with no added complexity.
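
To make the variable-chunking idea concrete, here is a minimal Python sketch of content-defined chunking. It is not HPE’s actual algorithm; the rolling hash, window size, and chunk-size bounds are illustrative assumptions chosen to target a roughly 4K average chunk.

# A minimal sketch of content-defined ("variable") chunking. Not HPE's
# algorithm -- the hash, window, and bounds are illustrative assumptions.
import hashlib
import os

MIN_CHUNK, AVG_CHUNK, MAX_CHUNK = 1024, 4096, 16384
WINDOW = 48                     # bytes of context the rolling hash looks at
BASE, MOD = 257, 1 << 32
POW = pow(BASE, WINDOW, MOD)    # precomputed BASE**WINDOW for the roll-out

def chunks(data: bytes):
    """Yield chunks whose boundaries depend on local content, so an edit
    early in the stream does not shift every later chunk boundary."""
    start = h = 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD                     # roll the new byte in
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD   # roll the old byte out
        size = i - start + 1
        # Cut where the hash hits a 1-in-4096 pattern, or at the hard cap.
        if (size >= MIN_CHUNK and h % AVG_CHUNK == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def dedupe_ratio(data: bytes) -> float:
    """Logical size divided by the size of unique chunks (keyed by SHA-256)."""
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks(data)}
    return len(data) / sum(unique.values())

stream = os.urandom(512 * 1024)             # ~512 KB of incompressible data
print(f"{dedupe_ratio(stream * 3):.1f}:1")  # three copies -> close to 3:1

Because the boundaries are chosen from the content itself, the second and third copies of the stream quickly re-align to the same chunks as the first, which is the property that lets smaller chunks translate into more matches.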

For restoring data, HPE’s approach pairs large-container technology with superior disk layouts. A high degree of fragmentation is avoided by not replacing small amounts of duplicate data with pointers to faraway places that hold no related data, and data is defragmented after deduplication. The result is faster restores, because reconstituting the data does not require many slow random seeks.

 

StoreOnce Catalyst

StoreOnce Catalyst is a key component of our federated deduplication architecture. It enables deduplication anywhere in the network, rather than only at the specific points a vendor’s technology allows, and it leverages a common algorithm across the enterprise. Deduplication can happen at the:

  • Production source or “client”
  • Backup or media server
  • Target appliance

Catalyst provides a single technology that can be used in multiple locations on the network without requiring rehydration when data is transferred between source server, backup device, and target appliance.
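
As a rough illustration of what source-side deduplication buys you, here is a toy protocol in the same spirit: the client hashes its chunks locally, asks the target which hashes are new, and ships only those. This is an assumption-laden sketch, not the Catalyst API or wire protocol.

# Toy source-side dedupe protocol (illustrative only, not Catalyst itself).
import hashlib

class TargetAppliance:
    """Stands in for the backup target's chunk store."""
    def __init__(self):
        self.store = {}                      # chunk hash -> chunk bytes

    def missing(self, hashes):
        """Return the subset of hashes the target does not yet hold."""
        return {h for h in hashes if h not in self.store}

    def put(self, chunk):
        self.store[hashlib.sha256(chunk).digest()] = chunk

def backup(client_chunks, target):
    """Send only chunks the target lacks; return bytes actually transferred."""
    hashes = [hashlib.sha256(c).digest() for c in client_chunks]
    need = target.missing(hashes)
    sent = 0
    for c, h in zip(client_chunks, hashes):
        if h in need:
            target.put(c)
            need.discard(h)                  # ship each repeated chunk once
            sent += len(c)
    return sent

target = TargetAppliance()
day1 = [b"os-image"] * 50 + [b"app-data", b"user-data"]
day2 = [b"os-image"] * 50 + [b"app-data", b"user-data-v2"]
print(backup(day1, target), "bytes sent on day 1")   # 25: three unique chunks
print(backup(day2, target), "bytes sent on day 2")   # 12: only the new chunk

The point is that the same hash-based matching can run at the client, the media server, or the target; only where the chunking and hashing happen changes, not the data format.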

 

Data replication

HPE StoreOnce deduplication enables network-efficient offsite data replication. All HPE StoreOnce Backup systems use StoreOnce federated deduplication to significantly reduce the amount of data that needs to be replicated, enabling the use of lower-bandwidth, lower-cost links to transmit data offsite. StoreOnce-enabled replication allows cost-effective centralized backup from remote sites or branch offices, and delivers a consolidated disaster recovery solution for the data center.
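
Some back-of-the-envelope math shows why this matters. All of the figures below (a 1 TB nightly backup, a 20:1 deduplication ratio on the replication stream, a 100 Mbit/s WAN link) are illustrative assumptions, not StoreOnce specifications.

# Rough bandwidth math for deduplicated replication (illustrative numbers).
nightly_backup_tb = 1.0
dedupe_ratio = 20.0            # only new, unique chunks cross the wire
link_mbit_s = 100.0

unique_gb = nightly_backup_tb * 1024 / dedupe_ratio     # ~51 GB
hours = unique_gb * 1024 * 8 / link_mbit_s / 3600       # GB -> Mbit -> hours

print(f"Replicate {unique_gb:.0f} GB instead of {nightly_backup_tb * 1024:.0f} GB")
print(f"~{hours:.1f} h on a {link_mbit_s:.0f} Mbit/s link, "
      f"vs ~{hours * dedupe_ratio:.0f} h without dedupe")

Under these assumptions the nightly replication fits comfortably in an overnight window on a modest link, where the undeduplicated transfer would take roughly a full day.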

 

Multiple StoreOnce appliances and virtual machines can replicate to a central StoreOnce appliance, with a fan-in of up to 384 remote offices to a single HPE StoreOnce 6500 target, delivering greater economies of scale for disaster recovery. HPE StoreOnce VSA can be deployed at:

  • Remote offices: Your remote offices can deploy local backup/recovery solutions on your existing IT infrastructure, without any dedicated backup appliance; both the backup server and the storage appliance can be delivered virtually. The deduplicated data within the VSA can then be replicated from the remote offices to a larger physical HPE StoreOnce appliance at your data center.
  • Cloud providers: Subscribers who do not have a deduplication solution can use HPE StoreOnce VSA for high-speed local recovery (with their existing backup software), and then replicate the data via HPE federated deduplication to the service provider’s HPE StoreOnce repository. As a cloud provider, you also have the option of providing an individual StoreOnce VSA per subscriber, giving each subscriber a completely autonomous deduplication appliance in the cloud without the need to replace backup software.

Meeting big data requirements

By combining federated deduplication with our unique scale-out cluster architecture, we deliver the industry’s only large-scale deduplication appliance with fully automated high-resiliency features such as high availability (dual controllers) and autonomic restart of failed backup jobs. Our solution is designed to simplify and speed up big data backup with scalable capacity and performance.

 

All that—and a deduplication guarantee too

We are so confident in our federated deduplication technology that when companies move from a legacy backup storage system to any HPE StoreOnce Backup solution, we offer a 20:1 deduplication guarantee through the HPE StoreOnce Get Protected Guarantee Program. A 20:1 ratio means storing only 1/20th of the backup data, a 95% reduction; if you don’t achieve it, HPE will make up the difference with free disk capacity and support.

 

About the Author

StorageExperts

Our team of Hewlett Packard Enterprise storage experts helps you to dive deep into relevant infrastructure topics.

Comments
Ravindra Tumu

I would like to know more about the deduplication that the HPE StoreOnce 6600 uses.

Which file types get more deduplication, and with which backup software have you seen good deduplication ratios?

Please let us know on all these.

Thanks,
Ravindra

Thanks for reading the blog - and for your questions.

StoreOnce deploys the same deduplication technology across the portfolio, including the StoreOnce 6600. As to your other questions...

Which file types get more deduplication?

This largely depends on the customer’s use of the data and internal policies.

Deduplication comes from three attributes:

  1. The retention of the data across multiple backup jobs (affected by the change rate).
  2. The proliferation of the same (or similar) data within a backup job.
  3. The tendency of the data format to preserve similarities at the block level.

For 1:

If you have data that remains the same in each backup job and you back it up with 30 days’ retention, that data is seen 30 times (30:1). If the same backup is retained for 90 days unchanged, that data is seen 90 times (90:1). So the longer the retention, the better the dedupe ratio.*

This is only true if the data remains relatively similar across backups. Highly transactional data (e.g. a database that sees 50% of its content change every hour), backed up nightly and stored for 180 days, still will not dedupe well.

*Unless the data is extremely static (like an archive), this doesn’t hold true indefinitely. Generally, dedupe won’t improve by retaining data for 5 years rather than 3 years (for example), because there will usually be some dramatic change over such a long retention, e.g. everyone migrates to a new version of Office that saves files in a new version of the format. While the document content may be very similar, the blocks change dramatically because of the new format.
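
A toy model makes the retention effect above easy to play with. It assumes one full backup per day where a fixed fraction of the data changes daily; the numbers are illustrative, not a StoreOnce sizing model.

# Toy model: how retention and daily change rate drive the dedupe ratio.
def retention_dedupe_ratio(retention_days, change_rate):
    logical = retention_days                       # N full backups of size 1
    # Day 1 stores everything; each later day adds only its changed fraction.
    stored = 1.0 + (retention_days - 1) * change_rate
    return logical / stored

for days in (30, 90, 180):
    for rate in (0.0, 0.01, 0.5):
        print(f"{days:3d} days at {rate:4.0%}/day change: "
              f"{retention_dedupe_ratio(days, rate):5.1f}:1")

With a 0% change rate this reproduces the 30:1 and 90:1 figures above, while at a 50% daily change rate even 180 days of retention yields only about 2:1, matching the transactional-data caveat.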

For 2:

Sticking with .doc as an example, compare a backup of user shares containing Word docs with an insurance company’s standard claim forms that have been completed electronically and stored in a massive repository.

The user shares might have some versions of docs that are similar, or even some that are identical across different user shares. However, across millions of files the ‘hit’ rate will be comparatively small.

The insurance company’s standard claim forms will each be unique, yet will have large portions that are identical, across potentially tens or even hundreds of thousands of files.

Therefore, even though the file format is the same, and possibly the number and size of the files are the same in both cases, the insurance company will see much better dedupe.

For 3:

In some formats, a small change at the user level (e.g. changing the colour of a single word on one slide in a 50-slide PPT deck) can cause a disproportionate change at the block level: it would be reasonable to assume only a few blocks change, but in some cases hundreds of KBs or even MBs of data within the file change at the block level. So, looking across formats, some are friendlier to dedupe than others, as the small demo below shows.
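
Here is a small self-contained demonstration of that block-level amplification. It uses compression as a stand-in for a format that rewrites itself on edit (PPTX and DOCX files are zip archives, so the analogy is reasonable); the synthetic corpus and 4K block comparison are illustrative assumptions, not StoreOnce behaviour.

# One tiny logical edit: compare how many fixed 4K blocks survive it,
# in the raw file versus a compressed container of the same content.
import hashlib, random, zlib

def block_hashes(data, size=4096):
    """SHA-256 of each fixed 4K block, the unit a dedupe index would match."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def shared_blocks(a, b):
    ha, hb = block_hashes(a), block_hashes(b)
    same = sum(x == y for x, y in zip(ha, hb))
    return f"{same}/{len(ha)} blocks unchanged"

random.seed(0)
words = [bytes(random.choices(range(97, 123), k=8)) for _ in range(500)]
doc = b" ".join(random.choices(words, k=100_000))    # ~880 KB "document"
edited = doc.replace(words[0], b"newwordx", 1)       # one small, same-size edit

print("raw file:  ", shared_blocks(doc, edited))     # almost all blocks match
print("compressed:", shared_blocks(zlib.compress(doc),
                                   zlib.compress(edited)))  # almost none do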

In general terms:

  • VMs dedupe extremely well (all those OS images and application files give you high hits on points 1 and 2 above).
  • Databases: depends on change rate and format, but they tend to be ‘OK’.
  • Office files: as mentioned above, highly dependent on change rate (1) and use case (2). Can be great or terrible.
  • Pictures/videos: tend to dedupe and compress very poorly because they are already compressed, and that compression makes two files that look visually similar different at the block level. (That said, I have seen some users get acceptable dedupe with static CCTV footage, for example, because so many of the files contained identical frames.)
  • Encrypted data: tends to dedupe and compress very poorly, since the whole point of encryption is that identical plaintext blocks should not produce identical encrypted blocks (so there should be no matches for dedupe). However, it depends where the encryption occurs:

  • A lot of people get confused about self-encrypting drives or encrypted networks. These are transparent to applications such as backup apps and dedupe, and have zero impact on the dedupe.
  • ‘Encryption’ is a very broad term and encompasses many different options for how much of the content is encrypted. E.g. encrypting an entire database with a single encryption key and output would result in zero dedupe (any change alters the entire output). However, SQL encryption, for example, encrypts at the page level, so unchanged pages do not change between backups, resulting in some dedupe.

Hope this helps. There is no simple answer, as the dedupe ratio achieved depends on multiple factors unique to a customer’s environment. However, there are a few things with HPE StoreOnce that can help:

  • We have a number of free tools that can analyze a backup environment and estimate the dedupe ratio (the Ninja tools).
  • We have the StoreOnce VSA, which uses the same dedupe algorithm as the appliances, so the dedupe ratio you achieve with the VSA will be the same as on a 6600. The VSA can be downloaded and tried for free for 60 days.
  • We have a guarantee program: if HPE assesses your environment, we will guarantee the space saving with dedupe (Get Protected Guarantee).

Which backup apps give better deduplication?

In terms of backup ISVs, the vendors that have integrated with Catalyst tend to get the best dedupe ratios. Today those would be:

  • Data Protector
  • Veeam
  • NetBackup
  • Backup Exec
  • BridgeHead

However, there is also an argument that ‘good’ is relative. For example, Spectrum Protect users frequently tell us that they achieve better dedupe with StoreOnce than with other dedupe solutions. Even though Spectrum Protect doesn’t give as high a dedupe ratio as, say, Data Protector, StoreOnce is still a great solution for Spectrum Protect users because it does better with Spectrum Protect than other solutions do.