Servers & Systems: The Right Compute

HPE Elastic Platform for Analytics: Why infrastructure matters in big data pipeline design


Are you ready to begin creating infrastructure for a big data pipeline? Before you move forward, it’s important to understand why the right design matters.

Designing infrastructure for big data analytics brings with it no shortage of challenges. Not all enterprises have the in-house expertise to design and build large-scale (PB+) data lakes ready to move quickly into production. A litany of open source tools creates enormous complexity in design and integration. Most legacy data and analytics systems are ill-equipped to handle new data and workloads. And old design principles, such as using core-to-spindle ratios, are no longer a reliable guide for newer workloads.

Modern data pipelines will require extensive use of machine learning, deep learning, and artificial intelligence frameworks to perform real-time predictive analytics against both structured and unstructured data. Next-generation, real-time and near-real-time analytics require a scalable, flexible, high-performing platform.

Enterprises need to look beyond traditional commodity hardware, particularly with latency-sensitive tools like Spark, Flink, and Storm, along with NoSQL databases like Cassandra and HBase where low latency is mandatory. Data locality, data gravity, data temperature, and the network all have to be part of the overall design. Add in data protection and data governance, and you have a large number of variables to consider.

The traditional approach with Hadoop 1.0 was to co-locate compute and storage, which worked six to eight years ago when the focus was on batch analytics using HDFS and MapReduce. With the wave of technologies in the current Hadoop 3.0 ecosystem and beyond, co-locating compute and storage can be extremely inefficient, with negative implications for both performance and scaling.

Here’s the new reality: there is no typical or single “big data workload” on which to base design decisions. Different workloads have different resource requirements, ranging from batch processing (a balanced design) to interactive processing (more CPU) and machine learning (more GPUs). The traditional symmetric design (co-located storage and compute) leads to trapped resources and power/space constraints. You end up with multiple copies of data due to governance, security, and performance concerns. The transition must be to a flexible, scalable, high-performing architecture.
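To see why a symmetric design traps resources, consider a back-of-the-envelope sizing sketch. All node specs and workload figures below are hypothetical illustrations, not HPE sizing guidance: in a symmetric cluster, one node type must satisfy both the compute and the storage requirement, so whichever dimension is larger dictates the node count and the other dimension sits idle.

```python
import math

# Hypothetical workload requirements (illustrative numbers only)
required_cores = 800          # cores needed by compute-heavy jobs
required_storage_tb = 2000    # raw capacity needed by the data lake

# Hypothetical node specs
symmetric_node = {"cores": 32, "storage_tb": 48}   # co-located compute + storage
compute_node   = {"cores": 64, "storage_tb": 0}    # compute-optimized block
storage_node   = {"cores": 8,  "storage_tb": 96}   # density-optimized block

# Symmetric design: one node type must satisfy BOTH dimensions,
# so the larger of the two requirements dictates the node count.
sym_nodes = max(math.ceil(required_cores / symmetric_node["cores"]),
                math.ceil(required_storage_tb / symmetric_node["storage_tb"]))

# Decoupled design: size each tier independently.
n_compute = math.ceil(required_cores / compute_node["cores"])
n_storage = math.ceil(required_storage_tb / storage_node["storage_tb"])

# Cores bought but never needed in the symmetric design
trapped_cores = sym_nodes * symmetric_node["cores"] - required_cores
print(f"symmetric: {sym_nodes} nodes, {trapped_cores} idle cores")
print(f"decoupled: {n_compute} compute + {n_storage} storage nodes")
```

With these illustrative numbers, the storage requirement forces 42 symmetric nodes and leaves 544 cores stranded, while the decoupled design meets both requirements with 13 compute and 21 storage nodes, each tier sized to its own dimension.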

Address these needs with the HPE Elastic Platform for Analytics (EPA) architecture

HPE EPA is a modular infrastructure foundation designed to deliver a scalable, multi-tenant platform by enabling independent scaling of compute and storage through infrastructure building blocks that are optimized for density and for running disparate workloads.

HPE EPA environments allow for the independent scaling of compute and storage and employ higher-speed networking than previous-generation Hadoop clusters. They also enable consolidation and isolation of multiple workloads while sharing data, improving security and governance. In addition, workload-optimized nodes deliver optimal performance and density.

We recently worked with a customer that wanted to build a next-generation analytics environment for its business. Part of the challenge was changing architectural and business requirements: the initial design focused on Spark workloads, while the final design covered both Spark and Impala, with critical SLAs attached to response times on Impala table-scan queries.

The day-one cluster primarily ran Spark and Impala, with services like HBase and Kudu added over time. This is where an architecture like HPE EPA comes in handy. We were able to use purpose-built compute tiers for running Spark and Impala jobs and a separate storage tier for HDFS and Kudu. HPE EPA provided elastic scalability to grow and/or add workload-specific compute and storage nodes. Here is a pictorial representation of the customer scenario and solution.

[Figure: Challenges with a traditional cluster design]

[Figure: Solution with HPE EPA elastic cluster]

What exactly makes this architecture elastic?

HPE EPA allows the scaling of distinct nodes and resources independently which is critical with the diversity of tools and workloads in the big data ecosystem. It even allows you to change the node function on the fly (as described in the previous example). You can also add compute nodes without repartitioning the data. Containers enable rapid deployment and movement of workloads and models in line with fast data analytics requirements.
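The "no repartitioning" point can be illustrated with a toy model (the `ElasticCluster` class and its methods are invented purely for illustration, not an HPE API): because data blocks live only on the storage tier, growing the compute tier never triggers a data rebalance.

```python
from dataclasses import dataclass, field

@dataclass
class ElasticCluster:
    """Toy model of a decoupled cluster: block placement references
    the storage tier only, so compute changes never move any data."""
    compute_nodes: list = field(default_factory=list)
    storage_nodes: list = field(default_factory=list)
    block_map: dict = field(default_factory=dict)  # block id -> storage node

    def add_block(self, block_id: str) -> None:
        # Naive placement: round-robin over storage nodes only.
        node = self.storage_nodes[len(self.block_map) % len(self.storage_nodes)]
        self.block_map[block_id] = node

    def add_compute(self, name: str) -> dict:
        """Grow the compute tier; return which blocks moved (always none)."""
        before = dict(self.block_map)
        self.compute_nodes.append(name)
        return {b: n for b, n in self.block_map.items() if before[b] != n}

cluster = ElasticCluster(compute_nodes=["c1"], storage_nodes=["s1", "s2"])
for i in range(4):
    cluster.add_block(f"blk-{i}")
moved = cluster.add_compute("c2")
print(f"compute nodes: {len(cluster.compute_nodes)}, blocks moved: {len(moved)}")
```

The model is deliberately minimal, but it captures the design choice: in a co-located cluster, adding a node changes where data should live and forces a rebalance; in a decoupled design, the compute tier is stateless with respect to data placement, so scaling it is a fast, data-free operation.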

In summary, a multi-tenant, elastic, and scalable data lake built on the HPE EPA architecture, together with a well-designed big data pipeline, meets your next-generation requirements. Here is a pictorial representation.

[Figure: HPE EPA benefits]

Get more information on the HPE EPA architecture, refer to this reference architecture, or contact your local HPE sales representative.

For suggestions on optimized hardware based on workload, check out the HPE EPA Sizing Tool.

Meet Infrastructure Insights blogger Mandar Chitale, HPE Solution Engineering Team.

Mandar has two decades of experience in the IT industry. Currently, he is a Program Manager with the HPE Solution Engineering Team, which is focused on creating solution reference architectures for enterprise use cases based on the traditional and emerging digital technology landscape.
