Around the Storage Block

Achieving data reduction and data protection efficiency with file storage

Today’s AI and other data-intensive apps process vast amounts of file data. Learn how HPE GreenLake for File Storage powers these apps with unprecedented data reduction and data protection efficiency.

–By David Yu, HPE Storage Product Marketing

At the leading edge of technology today, AI and other data-intensive applications process huge volumes of file data. Because of the enormous storage footprint of that data, efficient data reduction and data protection are crucial. Without them, storage capacity demand and costs can skyrocket and severely impact the business.  

HPE GreenLake for File Storage is designed to eliminate data efficiency issues. In a previous blog in our series, we provided a high-level overview of the solution’s efficient data reduction and data protection algorithms. We’ll now take a more detailed look at how HPE GreenLake for File Storage achieves its remarkable data reduction and data protection efficiency.

Understanding what we mean by storage capacity

HPE GreenLake for File Storage offers storage starting at around 220 TB of usable capacity, with the ability to grow into a significantly larger footprint. Total capacity always excludes the storage class memory (SCM) drives used to store metadata and stage data. SCM is non-volatile random-access memory (NVRAM) that ensures no metadata or data is lost and that data is eventually stored safely on SSD media. It’s important to understand the terminology and definitions associated with capacity statements – particularly those factoring in data reduction (a short worked example follows the list):

  • Raw capacity refers to total aggregate capacity across NVMe drives, without factoring in space reservation, erasure coding, or data reduction.
  • Usable capacity is the capacity available after accounting for erasure coding and space reservation. It does not factor in any data reduction calculations.
  • Effective capacity is the capacity after factoring in all data reduction calculations on the array.
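
To make these terms concrete, here’s a minimal worked example in Python; the raw capacity, protection overhead, and data reduction ratio used below are illustrative assumptions, not published HPE figures.

```python
# Illustrative walk from raw to usable to effective capacity.
# All numbers below are assumptions for the example, not HPE specifications.

raw_tb = 240.0                 # assumed sum of all NVMe drive capacities
protection_overhead = 0.08     # assumed erasure coding + space reservation share
data_reduction_ratio = 3.0     # assumed dedupe + Similarity + compression ratio

usable_tb = raw_tb * (1 - protection_overhead)    # what you can actually write
effective_tb = usable_tb * data_reduction_ratio   # what it feels like after reduction

print(f"Raw:       {raw_tb:6.1f} TB")
print(f"Usable:    {usable_tb:6.1f} TB")
print(f"Effective: {effective_tb:6.1f} TB")
```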

To maximize effective capacity and minimize overhead – lowering physical storage costs, saving energy, and shrinking data center footprint – it’s important to have efficient data reduction and data protection algorithms in place.

Layered techniques provide efficient data reduction

HPE GreenLake for File Storage delivers significant advancements in data reduction. Over the past few decades, most if not all storage solutions have provided deduplication and lossless compression on data sets within the storage subsystem. HPE GreenLake for File Storage goes further by offering a unique approach to data reduction that applies the following techniques across a single global namespace to achieve superior data reduction ratios.

[Figure: The data reduction workflow]

Data reduction is an inline process in HPE GreenLake for File Storage. During data migration (commonly referred to as de-staging) from SCM to SSD, the data pipeline first goes through a process called adaptive chunking. The global deduplication engine then compares the resulting data blocks against the entire namespace. Next, the Similarity algorithm finds data blocks that are similar but not identical to reduce the data footprint even further. Finally, local compression kicks in. Even though these are all block-based reduction methods, there is no fixed block size as in traditional block storage.
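
As a rough mental model of that flow, here is a minimal Python sketch of a de-staging pipeline. Every function in it is a toy stand-in for the corresponding technique described below (fixed-size cuts instead of adaptive chunking, zlib instead of ZSTD, Similarity omitted), not HPE’s implementation.

```python
# Toy model of the inline de-staging pipeline. Each stage is a simplified
# stand-in for the technique described in the sections below, not HPE's code.
import hashlib
import zlib

def adaptive_chunking(stream: bytes, size: int = 16 * 1024) -> list[bytes]:
    # Placeholder: fixed 16 KB cuts stand in for content-aware 16-64 KB chunking.
    return [stream[i:i + size] for i in range(0, len(stream), size)]

def global_dedup(blocks: list[bytes], seen: dict) -> list[bytes]:
    # Keep only blocks whose hash has not been seen anywhere in the (toy) namespace.
    unique = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen:
            seen[digest] = True
            unique.append(block)
    return unique

def compress(blocks: list[bytes]) -> list[bytes]:
    # zlib stands in here for the ZSTD compression the product actually uses.
    return [zlib.compress(block) for block in blocks]

def destage(stream: bytes, seen: dict) -> list[bytes]:
    # Chunk, deduplicate globally (Similarity reduction elided in this toy), compress.
    return compress(global_dedup(adaptive_chunking(stream), seen))

namespace_index: dict[str, bool] = {}
payload = b"log line 42\n" * 50_000
stored = destage(payload, namespace_index)
print(f"{len(payload)} bytes in -> {sum(len(b) for b in stored)} bytes stored")
```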

Let’s look at each of these data reduction techniques in turn.

Adaptive block size chunking

Adaptive block size chunking takes advantage of a sliding block size to maximize the opportunity to perform Similarity reduction and deduplication. With an adaptive block size ranging between 16 KB and 64 KB, the algorithm adjusts block size based on the likelihood of yielding optimal data reduction. As a result, a file that is edited or modified at a later stage has minimal impact on the block set for the current file and still achieves optimal data reduction. With fixed block chunking, by contrast, a single edit can leave the entire set of file blocks mismatched from their original state.
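
To illustrate the general idea (this is a generic content-defined chunker, not HPE’s algorithm), the sketch below lets a rolling hash pick cut points within the 16 KB to 64 KB range, so an edit only disturbs the chunks immediately around it.

```python
# Toy content-defined chunker: a rolling hash picks cut points so that an edit
# near the start of a file does not shift every later chunk boundary.
# This illustrates the general technique only, not HPE's algorithm.
import random

MIN_CHUNK = 16 * 1024   # 16 KB lower bound, per the range described above
MAX_CHUNK = 64 * 1024   # 64 KB upper bound
MASK = (1 << 13) - 1    # assumed tuning: roughly one natural cut point per 8 KB

def chunk(data: bytes) -> list[bytes]:
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        # Cheap rolling hash over the bytes seen since the last cut.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = (rolling & MASK) == 0 and length >= MIN_CHUNK
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks

random.seed(1)
parts = chunk(random.randbytes(200_000))
print([len(p) for p in parts])        # variable chunk sizes between 16 KB and 64 KB
```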

Global deduplication

Each data block generates a cryptographic hash, and the hash values of all the data on disk are stored in SCM. Before the data is written to the NVMe SSD layer in the migration process, and after the data passes through adaptive block chunking, the algorithm compares data blocks against all existing hash values across the entire namespace. If a match is found, the data is stored as a pointer to the duplicate block found in the existing data. This deduplication process can be far more effective than local deduplication, because the shared-everything architecture gives all storage nodes access to the entire data set without the tradeoff of controller node cross-traffic consuming CPU cycles.
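
Conceptually, the lookup behaves like a global index keyed by block hash. Here is a minimal Python sketch under that assumption; the in-memory dictionary stands in for the hash index held in SCM, and the BlockRef type and location strings are hypothetical.

```python
# Toy global deduplication index: a dictionary keyed by each block's
# cryptographic hash stands in for the hash index held in SCM.
import hashlib
from dataclasses import dataclass

@dataclass
class BlockRef:
    """Hypothetical reference to where a block (or its existing duplicate) lives."""
    digest: str
    is_duplicate: bool   # True if this write was satisfied by a pointer only

class GlobalDedupIndex:
    def __init__(self) -> None:
        self._index: dict[str, str] = {}   # digest -> location of the stored block

    def write_block(self, block: bytes, location: str) -> BlockRef:
        digest = hashlib.sha256(block).hexdigest()
        if digest in self._index:
            # An identical block already exists somewhere in the namespace:
            # store only a pointer, no new data is written.
            return BlockRef(digest, is_duplicate=True)
        self._index[digest] = location
        return BlockRef(digest, is_duplicate=False)

index = GlobalDedupIndex()
first = index.write_block(b"hello" * 4000, "ssd-node-1/slab-42")
second = index.write_block(b"hello" * 4000, "ssd-node-2/slab-7")
assert not first.is_duplicate and second.is_duplicate
```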

Similarity reduction

Instead of finding an exact match for a block to deduplicate, Similarity reduction compares a calculated similarity hash to find blocks that have high similarity despite small differences. If two blocks are sufficiently similar in content – that is, they produce a matching similarity hash – they are stored as a pointer to a deduplicated baseline block plus byte-level deltas.

[Figure: Similarity reduction]

As covered in a previous blog, data reduction with Similarity goes beyond traditional compression and deduplication and delivers significant benefits in the unstructured world of files, where much data is similar but not identical.

Compression is fine-grained but local: you are reducing redundancy over a small piece of data, and it’s computationally intensive. Traditional deduplication is global over a large amount of data but very coarse: data is broken up into chunks of the same block size, and the system looks for exact matches. Similarity is more flexible, looking for data that is similar but not identical on both a global and fine-grained basis.

Consider the example of DNA data gathered for life sciences research. A DNA strand will have information that is mostly the same, but there will be differences with varying degrees of uniqueness. With a deduplication algorithm looking for identical matches, similar data chunks would still be considered mismatches. Similarity, however, will spot blocks of data that are mostly the same, compress them together, track the changes between them, and store only what is common. This gives you the best of both worlds: compression, which is fine-grained but local, and deduplication, which is global but coarse.
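
As a rough illustration of the idea (the fingerprint and delta format below are simplified stand-ins, not HPE’s similarity hash), this sketch derives a coarse fingerprint from each block and, when two fingerprints largely overlap, stores the second block as byte-level differences against the first.

```python
# Toy illustration of Similarity reduction: blocks whose coarse fingerprints
# largely overlap are stored as a baseline plus byte-level deltas instead of
# full copies. The fingerprint and delta format here are illustrative only.
import hashlib
import random

def fingerprint(block: bytes, window: int = 64) -> set:
    # Digest fixed-size windows; similar blocks share most window digests.
    return {
        hashlib.blake2b(block[i:i + window], digest_size=8).digest()
        for i in range(0, len(block), window)
    }

def similar(a: bytes, b: bytes, threshold: float = 0.9) -> bool:
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb) >= threshold   # Jaccard-style overlap

def byte_delta(baseline: bytes, block: bytes) -> list[tuple[int, int]]:
    # Record only the byte positions where the equal-length blocks differ.
    return [(i, y) for i, (x, y) in enumerate(zip(baseline, block)) if x != y]

random.seed(0)
baseline = random.randbytes(64 * 1024)       # a 64 KB baseline block
edited = bytearray(baseline)
edited[100:104] = b"ABCD"                    # a tiny in-place modification

if similar(baseline, bytes(edited)):
    delta = byte_delta(baseline, bytes(edited))
    print(f"Stored {len(delta)} byte deltas instead of a second 64 KB block")
```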

To illustrate Similarity’s superior data reduction more simply, here is a before and after comparison of a data footprint using Similarity vs. compression and deduplication:

[Figure: Before]

[Figure: After – Capacity savings from Similarity]

 Local compression

Data blocks are compressed with Zstandard (ZSTD), a fast lossless compression algorithm that targets real-time compression scenarios while still delivering strong compression ratios, to further reduce block size.
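
For a feel of what block-level ZSTD compression looks like, here is a minimal example using the open-source Python bindings (pip install zstandard); the payload and compression level are illustrative, and this is not the product’s internal code path.

```python
# Minimal ZSTD round trip on a single data block using the open-source
# Python bindings. Payload and compression level are illustrative only.
import zstandard as zstd

block = b'{"sensor": "A7", "value": 21.5}\n' * 2000   # a compressible ~64 KB block

compressor = zstd.ZstdCompressor(level=3)             # fast, real-time-friendly level
decompressor = zstd.ZstdDecompressor()

compressed = compressor.compress(block)
restored = decompressor.decompress(compressed)

assert restored == block                               # lossless round trip
print(f"{len(block)} bytes -> {len(compressed)} bytes "
      f"({len(block) / len(compressed):.1f}:1 ratio)")
```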

Unpacking efficient data protection

HPE GreenLake for File Storage leverages erasure coding for efficient data protection. Erasure coding on SSDs can be understood simply as wide striping and parity redundancy. An erasure coding stripe width is the number of drives a stripe is spread across. Wide striping decreases the overhead percentage and is therefore more advantageous with a large configuration: larger configurations can take advantage of an overhead as low as 2.7%.
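
To see why wider stripes shrink the overhead, here is a quick back-of-the-envelope calculation; the data-plus-parity geometries are hypothetical examples rather than HPE’s actual stripe layout, though a wide stripe such as 146+4 works out to roughly the 2.7% figure quoted above.

```python
# Protection overhead = parity strips / total strips in the stripe.
# The stripe geometries below are hypothetical examples, not HPE's layout,
# but they show why wider stripes drive the overhead percentage down.

def ec_overhead(data_strips: int, parity_strips: int) -> float:
    return parity_strips / (data_strips + parity_strips)

for data, parity in [(8, 2), (20, 2), (50, 4), (146, 4)]:
    print(f"{data}+{parity}: {ec_overhead(data, parity):.1%} overhead")

# 8+2:   20.0% overhead
# 20+2:   9.1% overhead
# 50+4:   7.4% overhead
# 146+4:  2.7% overhead
```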

With a design that includes no fixed grouping size, the solution easily accommodates varying drive sizes and expansion, and handles drive failures gracefully. As the solution expands with more storage nodes, the new configuration determines a new, larger stripe size, and previously written stripes with the smaller stripe size are gradually copied and rewritten with the new stripe size across the new configuration.

Although the erasure coding algorithm used in this solution provides robust data resiliency with wide striping, the locally decodable aspect of the algorithm offers an additional advantage: the system can reconstruct a corrupted fraction of the stripe without having to read across the entire wide stripe. Typically, only about one quarter of the data must be read during a reconstruction.
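
As a toy illustration of the locally decodable idea (far simpler than the actual code used, and with an assumed 8-strip geometry), the sketch below splits a stripe into local groups that each carry their own XOR parity, so rebuilding one lost strip only requires reading that strip’s group rather than the full stripe.

```python
# Toy locally repairable code: each local group of data strips carries its own
# XOR parity, so a single lost strip is rebuilt by reading only its group.
# This is a simplified illustration, not the erasure code the product uses.
from functools import reduce

def xor_parity(strips: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), strips)

# A stripe of 8 data strips split into 2 local groups of 4 (assumed geometry).
stripe = [bytes([i]) * 4096 for i in range(8)]
groups = [stripe[0:4], stripe[4:8]]
local_parity = [xor_parity(g) for g in groups]

# Lose strip 2 (in group 0): rebuild it from the 3 survivors in its group plus
# that group's parity -- 4 reads instead of reading across the whole stripe.
lost_index = 2
survivors = [s for i, s in enumerate(groups[0]) if i != lost_index]
rebuilt = xor_parity(survivors + [local_parity[0]])

assert rebuilt == stripe[lost_index]
print(f"Rebuilt strip {lost_index} by reading {len(survivors) + 1} strips, "
      f"not all {len(stripe)} data strips")
```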

Achieving data reduction and data protection at scale

With HPE GreenLake for File Storage, not only can you deliver enterprise performance at scale for AI and data-intensive apps, but you can also drive cost savings via efficient data reduction and data protection. This makes a significant difference, not just in terms of investments in capital equipment, but also in terms of sustainability. With the huge volumes of file data that modern apps process, the data reduction and data protection efficiencies at scale of HPE GreenLake for File Storage deliver substantial savings in data center footprint, power, and cooling. For all of your modern applications, including AI, HPE GreenLake for File Storage is the ideal solution for your file storage needs.

Want to learn more?

Read the other blogs in our file storage series:

Dig deeper on the topic


Watch:  HPE GreenLake for File Storage technical demo


Meet Storage Experts blogger David Yu, HPE Storage Product Marketing

David plays a key product marketing role in HPE’s storage business, covering areas such as file-and-object storage, scale-out storage, cloud-native data infrastructure, and associated cloud data services. Connect with David on LinkedIn.

 


Storage Experts
Hewlett Packard Enterprise

twitter.com/HPE_Storage
linkedin.com/showcase/hpestorage/
hpe.com/storage

About the Author

StorageExperts

Our team of Hewlett Packard Enterprise storage experts helps you dive deep into relevant data storage and data protection topics.