Migrating to a post-Spark on YARN world
Migrating from a Spark on YARN (SoY) implementation to a Spark Operator running on Kubernetes can be challenging, says Chad Smykay, CTO of HPE Ezmeral Software at HPE. He recently published an article in The New Stack about how to make a successful journey to a more modern Spark implementation. He outlines three key strategies, which I highlight below.
1) What workloads have the simplest job and YARN container requirements?
Chad explains that you should concentrate first on the low-hanging fruit: workloads with the least-complex YARN configurations. Many articles, blog posts, and even custom calculators show how best to calculate YARN container configurations for your workload. Chad’s favorite is the Princeton Research Computing guide on Tuning Spark Applications; for the most part, he finds its explanations the simplest to follow when tuning Spark applications on YARN.
Figure 1. Calculating your YARN container configuration, Princeton Research Computing
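To make that concrete, here is a minimal sketch of the sizing arithmetic these guides walk through. The node sizes and the common 5-cores-per-executor rule of thumb are illustrative assumptions, not values from Chad’s article:

```python
def executor_sizing(node_cores: int, node_mem_gb: int, cores_per_executor: int = 5):
    """Derive per-executor cores and memory from a worker node's capacity."""
    usable_cores = node_cores - 1        # leave one core for the OS / NodeManager
    usable_mem_gb = node_mem_gb - 1      # leave ~1 GB for the OS and daemons

    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor_gb = usable_mem_gb / executors_per_node
    # spark.executor.memory is the heap only; YARN adds ~10% memoryOverhead on top
    heap_gb = int(mem_per_executor_gb * 0.9)
    return executors_per_node, heap_gb


# Example: 16-core, 64 GB worker nodes (assumed values)
executors, heap = executor_sizing(16, 64)
print(f"--executor-cores 5 --executor-memory {heap}g  ({executors} executors per node)")
```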
Chad advises, “Your simplest YARN container definitions should be moved first, as those will more easily translate to Kubernetes resource assignments (number of CPUs, memory, etc.). If you have more complex YARN scheduler definitions, such as those used with the fair scheduler or capacity scheduler, you should move those last, after you have considered how your Kubernetes resource assignment will be defined.”
(Note: A YARN implementation using a capacity scheduler more easily translates into shared resources within a single Kubernetes cluster deployment with multiple workloads.)
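As an illustration of that translation, here is a hedged sketch of how a simple YARN container definition might be expressed as Spark-on-Kubernetes configuration properties. The instance count, sizes, namespace, and app name are assumptions; in a Spark Operator deployment these values would typically live in the SparkApplication manifest instead:

```python
from pyspark.sql import SparkSession

# Illustrative translation of a YARN container definition (5 vcores, ~21 GB
# per container) into Spark-on-Kubernetes resource settings.
spark = (
    SparkSession.builder
    .appName("migrated-etl-job")
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "5")                     # was the container's vcores
    .config("spark.executor.memory", "18g")                   # was the container's heap
    .config("spark.kubernetes.executor.request.cores", "5")   # pod CPU request
    .config("spark.kubernetes.executor.limit.cores", "5")     # pod CPU limit
    .config("spark.kubernetes.namespace", "spark-jobs")
    .getOrCreate()
)
```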
2) What workloads have the least amount of data connectivity needs?
Chad explains that part of moving to a post-SoY implementation is gaining more freedom of choice in how Spark connects to your current and new data sources. Common methods include:
- Connecting to existing HDFS clusters
- Connecting to S3 API enabled storage
- Connecting to Cloud Object Storage providers
- Connecting to other filesystems using Kubernetes CSI
Many organizations are updating their standard data-access patterns, taking the time to define where data should be stored for each business use case or data type. The most common approach is storing all data on S3 API-enabled storage, such as HPE Ezmeral Data Fabric or a cloud provider’s object store. According to Chad, Kubernetes gives you greater flexibility in connecting to new and interesting data sources, and those should be accounted for in your data governance policies.
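As a simple illustration of the S3 API pattern, the following sketch points a PySpark job at S3 API-enabled object storage through the s3a connector. The endpoint, credentials, and bucket path are placeholders, and the hadoop-aws libraries are assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

# Sketch: reading from S3 API-enabled object storage over the s3a connector.
spark = (
    SparkSession.builder
    .appName("s3a-read-example")
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

events = spark.read.parquet("s3a://analytics-bucket/events/")  # hypothetical path
events.printSchema()
```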
3) What workloads need strict compute and storage latency?
Chad explains that one of the benefits of Hadoop-era workloads was the powerful combination of storage sitting right next to compute. In the early MapReduce days there were some issues with the shuffle phase of a workload, but you could control them if needed. Part of the benefit of SoY is keeping that combination of compute and storage, which means data transfers are reduced for most workloads. When you migrate to a Spark on Kubernetes workload, you must keep this in mind.
According to Chad, you should ask yourself the following questions:
- Do I have large files or data sets being read into my Spark jobs?
- Do I have a large number of files or data sets being read by my Spark jobs?
- If I introduce additional read or write latency to my Spark jobs, will that affect job time or performance?
He also says it’s important to run a sample job on your new Spark implementation, carefully noting your RDD read and write times. One way to establish a baseline between your current and new implementations is to turn off all “MEMORY_ONLY” settings on your RDDs: once you have a baseline of “DISK_ONLY” performance, your memory-backed RDD performance should be like for like, assuming you assign the same amount of resources in Kubernetes.
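One way to run that check is sketched below: persist the same DataFrame with DISK_ONLY and then MEMORY_ONLY storage levels and time the first materialization of each. The input path is a placeholder:

```python
import time

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-baseline").getOrCreate()
df = spark.read.parquet("s3a://analytics-bucket/events/")  # hypothetical path

# Time the first materialization with disk-only caching, then memory-only,
# to compare baseline read performance between the old and new clusters.
for name, level in [("DISK_ONLY", StorageLevel.DISK_ONLY),
                    ("MEMORY_ONLY", StorageLevel.MEMORY_ONLY)]:
    cached = df.persist(level)
    start = time.time()
    cached.count()                       # forces the read and the caching pass
    print(f"{name}: first pass took {time.time() - start:.1f}s")
    cached.unpersist(blocking=True)
```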
Chad notes that moving to a post-SoY world means you have to revisit your security policies and monitoring system implementation to properly secure and monitor Spark on Kubernetes resources. Fortunately, HPE Ezmeral has a single container platform for analytics that can support you on this central security and monitoring journey to your new workload.
If you or your organization are struggling to start the journey to a post-SoY implementation, HPE can help. Check out the HPE AMP Assessment Program, a proven best-practices migration methodology. And read Chad Smykay’s full article, Don’t Get Stuck: Migrating to a Post-Spark on YARN World.
Lola Tam
Hewlett Packard Enterprise
HPE Ezmeral on LinkedIn | @HPE_Ezmeral on Twitter
@HPE_DevCom on Twitter
Lola Tam is a senior product marketing manager, focused on content creation to support go-to-market efforts for the HPE Enterprise Software Business Unit. Areas of interest include application modernization, AI / ML, and data science, and the benefits these solutions bring to customers.