Accelerate Spark performance with HPE Apollo 4200 Gen10 Plus servers

The HPE Apollo 4200 Gen10 Plus server combines balanced, high-density storage with high performance, making it an ideal platform for optimizing Spark jobs. Learn all about it here.

Apache Spark is the open-source cluster computing framework known for its ability to process massive data sets by distributing them among multiple systems in parallel. Spark provides native bindings for programming languages such as Python, R, Scala, and Java. It also supports machine learning, graph processing, and SQL databases. These are all good reasons why Spark is becoming the de facto framework for processing big data, analytics, and AI/ML models.

Here, I’ll discuss why the HPE Apollo 4200 Gen10 Plus server is the right platform to run Spark.

A brief Spark overview

Without delving into all the details of the Apache Spark architecture (by the way, an excellent overview of Spark is provided in this Stanford course), I can summarize by saying that Apache Spark revolves around two concepts:

  • Resilient Distributed Datasets (RDD) – RDD is a collection of immutable datasets distributed over a Spark cluster's nodes. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
  • Directed Acyclic Graph (DAG) – Spark translates the RDD transformations into a DAG, which can be thought of as the sequence of operations to perform on the data. A DAG consists of vertices and edges: vertices represent RDDs, and edges represent the computations to be performed on those RDDs. It is called a directed acyclic graph because there are no loops or cycles within the graph. (A minimal code sketch of both concepts follows this list.)
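
To make the two concepts concrete, here is a minimal PySpark sketch, assuming a local Spark installation (the input file name is hypothetical). Transformations only extend the DAG; nothing executes until an action is called.

```python
# Transformations are lazy and only build the execution graph (DAG);
# work starts when an action such as collect() is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("events.txt")                # RDD built from an external dataset (hypothetical file)
errors = lines.filter(lambda l: "ERROR" in l)    # transformation: new RDD, no computation yet
pairs = errors.map(lambda l: (l.split()[0], 1))  # transformation: still only extends the DAG
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide transformation: introduces a shuffle stage

result = counts.collect()                        # action: Spark now schedules the DAG and runs it
print(result[:10])

spark.stop()
```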

The RDD and DAG are key reasons for Spark's high performance (Spark is 10-100x faster than MapReduce[1]), but at the same time, they introduce some limitations too.

DAG's principal limitation

Spark operators are pipelined and executed in parallel processes inside each stage. At the end of each stage, all intermediate results are materialized (shuffled) to make them available to the following stage. A shuffle is a physical movement of data across the network so it can be written to disk. Shuffle can be a critical and costly operation in Spark because it causes network traffic, disk I/O, and data serialization.
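
As a hedged illustration of why shuffles matter, the sketch below compares two equivalent aggregations: groupByKey ships every (key, value) record across the network before summing, while reduceByKey performs a map-side combine first, so far less data is materialized at the stage boundary.

```python
# Operator choice changes shuffle volume even when the result is identical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("spark", 1), ("hpe", 1), ("apollo", 1)] * 100_000)

# Heavier shuffle: all raw pairs cross the network, then are summed per key.
heavy = pairs.groupByKey().mapValues(sum)

# Lighter shuffle: partial sums are computed locally before the shuffle.
light = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(light.collect()))
spark.stop()
```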

RDD's principal limitation

Spark keeps intermediate results in memory rather than on disk, which is very useful from a performance point of view. But when the data does not fit in memory, Spark offloads part of it to disk (its runtime engine is designed to work with both memory and disk), losing part of the benefit of in-memory processing.
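
A minimal sketch of how this spill behavior can be steered from the application side: persisting an intermediate RDD with MEMORY_AND_DISK keeps the partitions that fit in executor memory and writes the remainder to local disk instead of recomputing them, a penalty that is much smaller when the local disks are NVMe.

```python
# Control where cached intermediate results live when memory is tight.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(10_000_000)).map(lambda x: (x % 1000, x))

# Partitions that do not fit in executor memory are spilled to local disk
# rather than dropped and recomputed.
data.persist(StorageLevel.MEMORY_AND_DISK)

print(data.countByKey().get(0))   # first action materializes and caches the RDD
print(data.count())               # second action reuses the cached partitions
spark.stop()
```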

How HPE Apollo 4200 Gen10 Plus mitigates Spark performance issues

To mitigate Spark's limitations, you should proceed progressively: optimize the network, then the disks, the memory, and lastly the CPUs[2] and even the GPUs. Let's see the progressive strategies we can use to improve Spark performance and how the HPE Apollo 4200 Gen10 Plus system can help.
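
As a rough illustration of that ordering (shuffle/network, disk, memory, CPU), the sketch below sets a few common infrastructure-level knobs as SparkSession configs. The values and the NVMe paths are assumptions for illustration, not recommendations; they should be sized against the actual node's cores, memory, and drive layout.

```python
# Infrastructure-oriented tuning knobs, expressed as SparkSession configs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("infra-tuning-sketch")
    # Disk: point shuffle and spill files at fast local NVMe (paths are hypothetical).
    .config("spark.local.dir", "/nvme0/spark-tmp,/nvme1/spark-tmp")
    # Network/shuffle: size the number of shuffle partitions to the data volume.
    .config("spark.sql.shuffle.partitions", "400")
    # Memory: generous executor heap so intermediate results stay in RAM.
    .config("spark.executor.memory", "64g")
    .config("spark.memory.fraction", "0.6")
    # CPU: match executor cores to the physical cores available per node.
    .config("spark.executor.cores", "8")
    .getOrCreate()
)
```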

The HPE Apollo 4200 Gen10 Plus system is specifically designed to unlock the business value of data stemming from digital transformation (DX) and data infrastructure modernization at any scale and with ideal economics. It is designed for the whole spectrum of data-centric workloads – from deeper data lakes and archives to performance-demanding machine learning (ML), data analytics, hyper-converged infrastructure, and cache-intensive workloads.

Now here’s a detailed look at why the HPE Apollo 4200 Gen10 Plus System is the ideal platform for Spark.


The Apollo 4200 Gen10 Plus with the A2 GPU can be used as a symmetric node to accelerate Spark jobs beyond what a CPU-only compute node can deliver. Indeed, to demonstrate the benefits of using a GPU for analytics and AI, HPE engineering executed performance tests on the HPE Apollo 4200 Gen10 Plus equipped with an NVIDIA® A2 GPU. For the complete description and the test results, check out the technical white paper: Why the HPE Apollo 4200 Gen10 Plus server is an optimal system to run Spark.
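
For readers who want to experiment with GPU offload themselves, here is a minimal, hedged sketch using the NVIDIA RAPIDS Accelerator for Apache Spark. The white paper describes HPE's actual test configuration; the settings below are illustrative assumptions and require the rapids-4-spark jar on the classpath and a GPU visible to the executors.

```python
# Offload supported Spark SQL operators to the GPU via the RAPIDS Accelerator;
# unsupported operators fall back to the CPU automatically.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gpu-spark-sketch")
    # Load the RAPIDS plugin (rapids-4-spark jar must be on the classpath).
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Tell Spark how many GPUs each executor and each task may use (illustrative values).
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

df = spark.range(0, 100_000_000).selectExpr("id % 1000 AS k", "id AS v")
df.groupBy("k").sum("v").show(5)
```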

Why the HPE Apollo 4200 Gen10 Plus System is the optimal server to host Spark jobs

This balanced, high-density storage and high-performance system is perfect for optimizing Spark workloads for several key reasons:

For one thing, storage can be configured for the full spectrum of workloads, from data lakes to archives to cache- and performance-intensive workloads, through a large variety of media options (e.g., up to 28 Large Form Factor (LFF) drives with 4 NVMe-capable Small Form Factor (SFF) drives, 24 LFF with 12 NVMe-capable SFF, or 60 SFF hot-plug drive bays).

In addition, HPE Apollo 4200 Gen10 Plus servers can be equipped with:

  • Two 3rd generation Intel® Xeon® Scalable processors of up to 32 cores each
  • Selected GPU (NVIDIA A2)
  • FPGA accelerators
  • 50% more memory at 3200 MT/s speed
  • Intel® Optane™ Persistent Memory
  • Six FHHL PCIe Gen4 slots

And there's more:

  • The presence of the GPU and FPGA accelerators also makes the HPE Apollo 4200 Gen10 Plus System capable of efficiently running video analytics (e.g., image recognition).
  • From the security point of view, the HPE Apollo 4200 Gen10 Plus System implements HPE iLO 5 and HPE Silicon Root of Trust technology (for firmware protection, malware detection, and firmware recovery), HPE Smart Encryption for data at rest encryption, and Defective Media Retention Service for maximizing sensitive data control in the event of a drive failure.
  • The Apollo 4200 Gen10 Plus can also be acquired in a consumption-based IT model through HPE GreenLake Flex Capacity.

Find out more in this detailed content

And stay tuned to my blog series to learn more about HPE data store solutions for AI and advanced analytics.

[1] Hadoop vs Spark: A Deep Dive Comparison | StreamSets

[2] This blog covers only Spark optimizations from the infrastructure point of view. Spark optimizations at the application level (such as data de-normalization, Broadcast Hash Join, etc.) are out of scope for this blog.


Andrea Fabrizi
Hewlett Packard Enterprise

twitter.com/HPE_Storage
linkedin.com/showcase/hpestorage/
hpe.com/storage

About the Author


Andrea Fabrizi is the Strategic Portfolio Manager for Big Data and Analytics at HPE.