Around the Storage Block

Big Data update: Hadoop 3.0 HDFS erasure coding performance




There’s no question that Big Data is getting bigger every day. According to various reports, 90% of today’s data was created in the last 2 years [1]. Because of this, for most Big Data customers, $/TB is the most important purchasing decision factor. In this blog, we present a way of combining our HPE Elastic Platform for Analytics (EPA) Hadoop architecture with Hadoop 3.0 HDFS erasure coding and hardware compression to provide a >6x increase in HDFS usable space, without sacrificing performance or cost to obtain it. This in turn lowers the $/TB metric by >6x.

It is said that the three most important factors for a Big Data platform are storage cost, storage cost, and storage cost. While in truth other factors matter too, customers are acutely aware of the need for cost-effective storage given the volumes of data being retained. In fact, the early genesis of Big Data was the creation of the Google File System as a means to reliably manage large volumes of data more cost effectively than traditional storage arrays. The cost pressure is exacerbated by the fact that most Big Data solutions store data using replication, keeping 3 copies of every block. While replication can improve read throughput, its primary purpose is fault tolerance. It is very unusual to have any kind of backup solution on a Big Data platform, not only for cost reasons but because the data often arrives at a rate that makes it infeasible to back up. For this reason, Big Data is at odds with itself: storage cost is paramount, yet data durability requires us to overprovision storage by 3x. And as Big Data becomes more structured, it will carry more auxiliary structures such as indexes and aggregates, which increase write activity; with triple replication, that means 3x the writes.

Our solution to improve storage efficiency and lower the $/TB metric is to leverage the HPE EPA Workload and Density Optimized (WDO) Big Data architecture, combining the Hadoop Distributed File System (HDFS) Erasure Coding (EC) feature introduced in Hadoop 3.0 with transparent hardware compression, resulting in >6x more HDFS usable space and a >6x lower $/TB metric.

The diagram below shows the building blocks used to create an HPE EPA WDO solution. The most important aspect is the separation of storage and compute resources onto different physical servers.

[Figure: HPE Elastic Platform for Analytics (EPA) Workload and Density Optimized building blocks]

The default HDFS replication scheme uses 3 replicas, thereby incurring 200% overhead, whereas a typical EC scheme has an overhead of only 40-50% and can store double the amount of data compared to a 3x replication scheme. This saving alone is beneficial, but in addition, by making use of a hardware compression solution from MaxLinear (Model DX2040 [3]), you can gain another 3x improvement in storage efficiency for a total of >6x improvement. The processing overhead of EC encoding/decoding is mitigated in the solution by making use of Intel’s open source Intelligent Storage Acceleration Library (ISA-L), which exploits advanced processor instruction sets. The hardware compression solution seamlessly integrates into the HDFS stack and automatically compresses all files in a transparent manner.
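The arithmetic behind these overhead and gain figures can be sketched as follows. This is a back-of-the-envelope calculation using the RS(6,3) scheme discussed later in this post and the 3x compression factor cited above; the numbers are illustrative, not measured results.

```python
# Storage overhead and usable-space gain: 3x replication vs. RS(6,3) EC.
# The 3x compression multiplier is the figure cited in the text above.

def usable_fraction(data_units: int, parity_units: int) -> float:
    """Fraction of raw disk capacity available for user data."""
    return data_units / (data_units + parity_units)

replication_usable = 1 / 3                 # 3 copies -> 200% overhead, ~33% usable
rs_usable = usable_fraction(6, 3)          # RS(6,3) -> 50% overhead,  ~67% usable

ec_gain = rs_usable / replication_usable   # 2x usable space from EC alone
total_gain = ec_gain * 3                   # ~6x once 3x compression is added

print(f"Replication usable fraction: {replication_usable:.0%}")
print(f"RS(6,3) usable fraction:     {rs_usable:.0%}")
print(f"Gain from EC: {ec_gain:.1f}x; with 3x compression: {total_gain:.1f}x")
```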

In addition to the storage efficiency, erasure coding also improves performance by reducing the amount of I/O performed by applications. The solution also makes use of HPE’s innovative WDO architecture for Big Data, which lends itself very well to addressing the drawbacks of an EC scheme relative to replication, such as remote I/O and the potential interference of EC calculations with compute-intensive workloads, thus allowing us to use this solution to store all HDFS data in a compressed form, not just COLD data.

Erasure Coding schemes have been used in other storage solutions, including RAID, to provide high availability while being more efficient with respect to storage space. EC schemes encode the data with additional data, i.e. parity data, such that the original data can be reconstructed using the parity data in case of failures. There are a number of EC schemes available, Reed-Solomon (RS) being one of the most popular. RS uses sophisticated linear algebra operations to generate multiple parity data blocks and can tolerate multiple failures.
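The reconstruction principle is easiest to see with the simplest possible erasure code: a single XOR parity block. This toy sketch is for illustration only; Hadoop's RS codec operates over Galois fields and tolerates multiple simultaneous failures, unlike single-parity XOR.

```python
# Toy single-parity erasure code: one XOR parity block over N data blocks.
# Losing any ONE block (data or parity) is recoverable by XOR-ing the rest.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"block-1!", b"block-2!", b"block-3!"]
parity = xor_blocks(data)

# Simulate losing data block 1, then rebuild it from survivors + parity
lost = 1
survivors = [b for i, b in enumerate(data) if i != lost]
recovered = xor_blocks(survivors + [parity])

assert recovered == data[lost]
print("recovered:", recovered)
```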

Our solution makes use of the EC feature introduced in the upcoming Hadoop 3.0 release. For example, a RS(6,3) scheme will generate 3 parity blocks to encode 6 data blocks and provides protection against 3 failures. Thus, by using EC, our solution gains 2x usable HDFS space. Also, since the same data is now striped across 9 disks (6+3) instead of 3 disks for replication=3, we get more disk parallelism, which increases performance.
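For readers who want to try this, Hadoop 3.0 manages EC policies through the `hdfs ec` subcommand. The commands below show the general workflow; `/data/warehouse` is a hypothetical path, and policy names should be confirmed against your cluster's output of `-listPolicies`.

```shell
# List the erasure coding policies known to the cluster
hdfs ec -listPolicies

# Enable the built-in RS(6,3) policy (policies are disabled by default)
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply the policy to a directory; new files written there are
# striped and erasure-coded instead of 3x replicated
hdfs ec -setPolicy -path /data/warehouse -policy RS-6-3-1024k

# Verify which policy is in effect on the directory
hdfs ec -getPolicy -path /data/warehouse
```

Note that EC policies apply to files written after the policy is set; existing replicated files are not converted in place.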

The drawback of an EC scheme is that it needs compute cycles to encode the data while writing and decode it while reading. To address this issue, our solution makes use of the open source ISA-L library, which accelerates the linear algebra calculations using advanced instruction sets (e.g. SSE, AVX, and AVX2) available on the Xeon E5-2660 v4 processors used in the Apollo 4200 storage nodes. The ISA-L EC implementation outperforms the Java implementation by more than 4x [2]. Because of this acceleration, and due to the increased I/O parallelization mentioned above, erasure coding outperforms standard Hadoop replication.

To improve the storage efficiency of HDFS further, we make use of hardware compression technology, which has the advantage of offloading the compression/decompression tasks from the host CPU. We are currently making use of PCIe cards from MaxLinear (Model DX2040 [3]), which can provide up to 5 GB/sec throughput. The MaxLinear solution provides a transparent Linux file system filter driver that sits below the file system to automatically compress/decompress all files. The storage space gains from this hardware compression technology vary depending on the data being used, but we can expect a typical storage efficiency of ~5x with Hadoop data sets. The MaxLinear solution can also encrypt sensitive Hadoop data at the same time as compressing it, with no performance penalty; this is an area for future exploration. It should be emphasized that our solution to improve storage efficiency is transparent to Hadoop software components and does not require any application changes to reap its benefits. Moreover, since HPE EPA WDO uses separate storage nodes for HDFS, the hardware compression cards are needed only in those nodes, which provides cost savings.
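How much compression actually helps depends heavily on the data, which is why the ~5x figure above is an estimate rather than a constant. As a quick software illustration (using Python's zlib as a stand-in for the hardware codec; the ratios are illustrative, not DX2040 measurements), repetitive log-like text compresses dramatically while random bytes barely compress at all:

```python
import os
import zlib

# Compressibility varies hugely by data type, which is why any average
# compression ratio for HDFS data is workload dependent.
log_like = b"2024-01-01 INFO request served in 12ms status=200\n" * 2000
random_data = os.urandom(100_000)  # incompressible by construction

def ratio(payload: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(payload) / len(zlib.compress(payload))

print(f"log-like text: {ratio(log_like):.1f}x")
print(f"random bytes:  {ratio(random_data):.2f}x")  # close to 1x
```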

Finally, the most important element of our solution is the HPE EPA WDO architecture for Big Data. First, by disaggregating Storage nodes from Compute nodes, it allows us to add the hardware compression cards only to the Storage nodes, instead of to all nodes as in a traditional Hadoop architecture. Next, the EC striped block layout is exploited to the fullest because of the increased I/O parallelization afforded by the “storage dense” nature (i.e. many drives working in parallel) of the WDO architecture. Further, our solution eliminates a major objection to the current striped-block-layout EC approach compared to the replication model: that it foregoes the data-locality advantage of replication. In a WDO architecture, storage is disaggregated from compute and sufficient network bandwidth is allocated to accommodate remote I/O access. This allows us to use the solution for all HDFS data, not just COLD storage. In traditional architectures, one is forced to make new design decisions when EC usage is being contemplated to improve storage efficiency.

We configured an HPE EPA WDO cluster in our lab using 7 storage nodes to store HDFS data. We compared both capacity utilization and performance (Teragen 3TB benchmark test) between replication with 3 replicas and erasure coding with RS(5,2) and XOR(5,2), all environments being able to sustain the loss of 2 storage nodes without losing data. We chose these custom erasure coding schemes in order to showcase the custom EC capabilities available in Hadoop 3.0 and to better align with the number of nodes available for this test.

Testing using the Teragen 3TB benchmark shows a >6x increase in usable space in a transparent manner. There are no code changes necessary in order to take advantage of these new features, and all data is compressed transparently. The following table shows the amount of data used to store the 3TB dataset and the Teragen 3TB execution runtimes.

[Table: HDFS space consumed and Teragen 3TB runtimes for replication, RS(5,2), and XOR(5,2), with and without hardware compression]

You’ll notice that the RS encoding scheme doesn’t compress very well, mostly because the parity data generated by RS is not compressible. The MaxLinear hardware compression cards add less than 1% to the total solution list price, providing an excellent ROI given the significant decrease in $/TB they generate.

The Teragen dataset uses randomly generated binary data, which is not very compressible. In general, text data compression ratios are in the 5x-10x range, while log data compression ratios can get closer to 20x. Taking all this into consideration, an average compression ratio of ~5x for all data stored in HDFS seems reasonable, which in turn means this solution can increase the HDFS usable space by ~10x (2x from erasure coding multiplied by 5x from compression).
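The combined multiplier above is just the product of the two independent gains, so it scales directly with whatever compression ratio a given workload achieves. A minimal sketch of that arithmetic, using the 2x EC gain from this post and a range of assumed compression ratios:

```python
# Combined usable-space multiplier = EC gain x average compression ratio.
# 2x is the EC gain vs. 3x replication discussed in this post; the
# compression ratios below are assumed workload averages, not measurements.
EC_GAIN = 2.0

for compression_ratio in (1.0, 5.0, 10.0, 20.0):
    total = EC_GAIN * compression_ratio
    print(f"{compression_ratio:>4.0f}x compression -> {total:.0f}x usable space")
```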

In conclusion, combining our HPE EPA WDO architecture with HDFS erasure coding and hardware compression allows our customers to use erasure-coded storage and compression for all data without any code changes and benefit from significantly reduced $/TB and improved performance.

 Meet Around the Storage Block blogger Daniel Pol, Data and Analytics Architect, HPE.







