Accelerate Spark performance with HPE Apollo 4200 Gen10 Plus servers
The HPE Apollo 4200 Gen10 Plus server provides balanced high-density storage and a high-performance system perfect for optimizing Spark jobs. Learn all about it here.
Apache Spark is the open-source cluster computing framework known for its ability to process massive data sets by distributing them among multiple systems in parallel. Spark provides native bindings for programming languages such as Python, R, Scala, and Java. It also supports machine learning, graph processing, and SQL databases. These are all good reasons why Spark is becoming the de facto framework for processing big data, analytics, and AI/ML models.
Here, I’ll discuss why the HPE Apollo 4200 Gen10 Plus server is the right platform to run Spark.
A brief Spark overview
Without delving into all the details of the Apache Spark architecture (by the way, an excellent overview of Spark is provided in this Stanford course), I can summarize by saying that Apache Spark revolves around two concepts:
- Resilient Distributed Datasets (RDD) – RDD is a collection of immutable datasets distributed over a Spark cluster's nodes. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
- Directed Acyclic Graph (DAG) – Spark translates the RDD transformations into a DAG. A DAG can be considered a sequence of data actions. A DAG consists of vertices and edges: vertices represent RDDs, and edges represent computations to be performed on a specific RDD. It is called a directed acyclic graph because there are no loops or cycles within the graph.
The RDD and the DAG are key reasons for Spark's high performance (Spark is 10-100x faster than MapReduce[1]), but at the same time they introduce some limitations.
The DAG's principal limitation
Spark operators are pipelined and executed in parallel processes inside each stage. At the end of each stage, all intermediate results are materialized (shuffled) to make them available to the following stage. A shuffle is a physical movement of data across the network, which is then written to disk. The shuffle can be a critical and costly operation in Spark, as it causes network traffic, disk I/O, and data serialization.
The RDD's central limitation
Spark keeps intermediate results in memory rather than on disk, which is very useful from a performance point of view. But when the data does not fit in memory, Spark offloads part of the data to disk (its runtime engine is designed to work with both memory and disk), losing part of the benefit of in-memory processing.
How HPE Apollo 4200 Gen10 Plus mitigates Spark performance issues
To mitigate Spark's limitations, you should optimize the network first, then the disks, the memory, and lastly the CPUs[2] and even the GPUs. Let's look at the progressive strategies we can use to improve Spark performance and how the HPE Apollo 4200 Gen10 Plus system can help.
The HPE Apollo 4200 Gen10 Plus system is specifically designed to unlock the business value of data stemming from digital transformation (DX) and data infrastructure modernization at any scale and with ideal economics. It is designed for the whole spectrum of data-centric workloads – from deeper data lakes and archives to performance-demanding machine learning (ML), data analytics, hyper-converged infrastructure, and cache-intensive workloads.
Now here’s a detailed look at why the HPE Apollo 4200 Gen10 Plus System is the ideal platform for Spark.
The Apollo 4200 Gen10 Plus with the A2 GPU can be used as a symmetric node to accelerate Spark jobs beyond what a CPU-only compute node can deliver. To demonstrate the benefits of using a GPU for analytics and AI, HPE engineering ran performance tests on the HPE Apollo 4200 Gen10 Plus equipped with an NVIDIA® A2 GPU. For the complete description and the test results, check out the technical white paper: Why the HPE Apollo 4200 Gen10 Plus server is an optimal system to run Spark.
Why the HPE Apollo 4200 Gen10 Plus System is the optimal server to host Spark jobs
This balanced, high-density storage and high-performance system is perfect for optimizing Spark workloads for several key reasons:
For one thing, storage can be configured for the full spectrum of workloads, from data lakes and archives to cache- and performance-intensive workloads, through a large variety of media options (e.g., up to 28 Large Form Factor (LFF) with 4 NVMe-capable Small Form Factor (SFF), 24 LFF with 12 NVMe-capable SFF, or 60 SFF hot-plug drive bays).
In addition, HPE Apollo 4200 Gen10 Plus servers can be equipped with:
- Two 3rd generation Intel® Xeon® Scalable processors of up to 32 cores each
- Selected GPU (NVIDIA A2)
- FPGA accelerators
- 50% more memory at 3200 MT/s speed
- Intel® Optane™ Persistent Memory
- Six FHHL PCIe Gen4 slots
And there's more:
- The presence of GPU and FPGA accelerators also makes the HPE Apollo 4200 Gen10 Plus System capable of efficiently running video analytics (e.g., image recognition).
- From the security point of view, the HPE Apollo 4200 Gen10 Plus System implements HPE iLO 5 and HPE Silicon Root of Trust technology (for firmware protection, malware detection, and firmware recovery), HPE Smart Encryption for data-at-rest encryption, and the Defective Media Retention Service for maximizing sensitive data control in the event of a drive failure.
- The Apollo 4200 Gen10 Plus can also be acquired in a consumption-based IT model through HPE GreenLake Flex Capacity.
Find out more in this detailed content
- Apollo 4200 Gen10 Plus Overview
- Apollo 4200 Gen10 Plus QuickSpecs
- White paper: Why the HPE Apollo 4200 Gen10 Plus server is an optimal system to run Spark
And stay tuned to my blog series to learn more about HPE data store solutions for AI and advanced analytics.
[1] Hadoop vs Spark: A Deep Dive Comparison | StreamSets
[2] This blog covers only the Spark optimizations from the infrastructure point of view. The Spark optimizations at the application level (such as data de-normalization, Broadcast Hash Join, etc.) are out of scope for this blog.
Andrea Fabrizi
Hewlett Packard Enterprise
twitter.com/HPE_Storage
linkedin.com/showcase/hpestorage/
hpe.com/storage