AI Unlocked
LolaTam

Migrating to a post-Spark on YARN world

Migrating from a Spark on YARN (SoY) implementation to a Spark Operator that runs on Kubernetes can be challenging, says Chad Smykay, CTO of HPE Ezmeral Software at HPE. He recently published an article in The New Stack about how to have a successful journey to a more modern implementation of Spark. He outlines three key strategies, which I highlight below.

1) What workloads have the simplest job and YARN container requirements?

Chad explains that you should concentrate first on the low-hanging fruit: workloads with the least-complex YARN configurations. Many articles, blog posts, and even custom calculators show you how best to calculate the YARN container configuration for your workload. Chad's favorite is from Princeton Research Computing on tuning Spark applications; for the most part, he finds their explanations the simplest to follow when tuning Spark applications on YARN.

Figure 1. Calculating your YARN container configuration (Princeton Research Computing)
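As a rough illustration of the arithmetic such calculators perform, the sketch below sizes executors for a hypothetical 32-core, 128 GB node. The rules of thumb used here (reserve a core and a gigabyte for the OS and Hadoop daemons, about 5 cores per executor, roughly 10% memory overhead) are common tuning guidance, not Chad's exact method:

```python
def size_executors(node_cores, node_mem_gb,
                   reserved_cores=1, reserved_mem_gb=1,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb YARN executor sizing (illustrative only)."""
    usable_cores = node_cores - reserved_cores
    usable_mem_gb = node_mem_gb - reserved_mem_gb
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor_gb = usable_mem_gb / executors_per_node
    # spark.executor.memory must leave headroom for off-heap overhead
    executor_memory_gb = int(mem_per_executor_gb / (1 + overhead_fraction))
    return {
        "executors_per_node": executors_per_node,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": executor_memory_gb,
    }

print(size_executors(node_cores=32, node_mem_gb=128))
# → {'executors_per_node': 6, 'executor_cores': 5, 'executor_memory_gb': 19}
```

Your own reserved-resource and overhead numbers will vary; the point is to make the per-container CPU and memory figures explicit, because those are exactly the numbers you will carry over to Kubernetes.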

Chad advises, “Your simplest YARN container definitions should be moved first as those will more easily translate to Kubernetes resource assignments (number of CPUs, memory, etc.). If you have more complex YARN scheduler definitions, such as those used with the fair scheduler or capacity scheduler, you should move those last after you have considered how your Kubernetes resource assignment will be defined.”

(Note: A YARN implementation using a capacity scheduler more easily translates into shared resources within a single Kubernetes cluster deployment with multiple workloads.)
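To make the translation concrete, here is a minimal sketch of mapping a YARN container definition (vcores plus memory in MB) onto the Kubernetes-style resource requests a Spark pod template or Spark Operator spec would carry. The mapping and overhead handling are illustrative assumptions, not an official conversion tool:

```python
def yarn_to_k8s(vcores, memory_mb, overhead_fraction=0.10):
    """Translate a YARN container definition into Kubernetes-style
    resource requests/limits (illustrative mapping only)."""
    # A Kubernetes pod must hold the executor heap plus its off-heap
    # overhead, so add that overhead back on top of the YARN figure.
    total_mem_mb = int(memory_mb * (1 + overhead_fraction))
    return {
        "requests": {"cpu": str(vcores), "memory": f"{total_mem_mb}Mi"},
        "limits":   {"cpu": str(vcores), "memory": f"{total_mem_mb}Mi"},
    }

print(yarn_to_k8s(vcores=5, memory_mb=19456))  # a 5-core, 19 GB container
```

Because the simple case is a near one-to-one mapping like this, it is the natural first candidate to move; fair-scheduler or capacity-scheduler queue definitions need a namespace and quota design first.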

2) What workloads have the least amount of data connectivity needs?

Chad explains that part of moving to a post-SoY implementation is greater freedom of choice in connecting to current or new data sources that Spark can use. Common methods include:

  • Connecting to existing HDFS clusters
  • Connecting to S3 API enabled storage
  • Connecting to Cloud Object Storage providers
  • Connecting to other filesystems using Kubernetes CSI

Many organizations are updating their standard data-access patterns, taking the time to define where data should be stored for each business use case or data type. The most common approach is storing all data on S3 API-enabled storage, such as HPE Ezmeral Data Fabric or a cloud provider's object store. According to Chad, Kubernetes will give you greater flexibility in connecting to new and interesting data sources, and those should be accounted for in your data governance policies.
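For the common S3 API case, the sketch below builds the Spark/Hadoop configuration entries typically used to reach S3-compatible storage through the `s3a://` connector. The endpoint and credentials are placeholders; exact keys and values should be checked against your connector version and storage provider:

```python
def s3a_conf(endpoint, access_key, secret_key, path_style=True):
    """Spark config entries commonly used to reach S3 API-enabled
    storage via the s3a:// connector (placeholder values)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # On-prem object stores often require path-style addressing
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
    }

conf = s3a_conf("https://objects.example.internal:9000", "MY_KEY", "MY_SECRET")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Workloads whose connectivity reduces to a handful of settings like these are easier to migrate than those that depend on HDFS-local paths or bespoke filesystem drivers, which is why Chad ranks them by data-connectivity needs.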

3) What workloads need strict compute and storage latency?

Chad explains that one of the benefits of Hadoop-era workloads was the powerful combination of storage located next to compute. In the initial MapReduce days, there were some issues with the shuffle tasks of your workload, but you could control them if needed. Part of the benefit of SoY is that combination of compute and storage, which means that for most workloads, data transfers should be reduced. When you migrate to a Spark on Kubernetes workload, you must keep this in mind.

According to Chad, you should ask yourself the following questions:

  1. Do I have large files or data sets that are read into my Spark jobs?
  2. Do I have a large number of files or data sets read by my Spark jobs?
  3. If I introduce additional read or write latency to my Spark jobs, will that affect my job time or performance?

He also says it’s important to run a sample job on your new Spark implementation, taking care to note your RDD read and write times. One way to establish a baseline for your current implementation versus your new one is to turn off all “MEMORY_ONLY” settings on your RDDs. If you can get a baseline of your “DISK_ONLY” performance, your memory-enabled RDD performance should be like for like, assuming you assign the same number of resources in Kubernetes.
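The comparison itself is simple arithmetic once you have the timings. This pure-Python harness sketches the idea of timing the same sample job on each implementation and reporting the slowdown ratio; in a real migration you would time actual Spark stages with `DISK_ONLY` persistence rather than an arbitrary callable, and the numbers below are hypothetical:

```python
import time

def baseline(job, runs=3):
    """Time a job callable several times and return the best wall-clock
    duration in seconds (illustrative harness, not a Spark API)."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        job()
        best = min(best, time.perf_counter() - start)
    return best

def slowdown(old_seconds, new_seconds):
    """Ratio > 1.0 means the new implementation is slower."""
    return new_seconds / old_seconds

# Hypothetical sample-job timings from each cluster:
print(f"slowdown: {slowdown(120.0, 138.0):.2f}x")  # → slowdown: 1.15x
```

A ratio well above 1.0 on the disk-only baseline is the signal that the latency questions above deserve a closer look before you migrate that workload.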

Chad notes that moving to a post-SoY world means you have to revisit your security policies and monitoring system implementation to properly secure and monitor Spark on Kubernetes resources. Fortunately, HPE Ezmeral has a single container platform for analytics that can support you on this central security and monitoring journey to your new workload.

If you or your organization are struggling to start your journey on a post-SoY implementation, HPE can help. Check out the HPE AMP Assessment Program, a proven best practices migration methodology. Read the full article by Chad Smykay, Don’t Get Stuck: Migrating to a Post-Spark on YARN World.

Lola Tam

Hewlett Packard Enterprise

HPE Ezmeral on LinkedIn | @HPE_Ezmeral on Twitter

@HPE_DevCom on Twitter 

 

 

About the Author

LolaTam

Lola Tam is a senior product marketing manager focused on content creation to support go-to-market efforts for the HPE Enterprise Software Business Unit. Her areas of interest include application modernization, AI/ML, and data science, and the benefits these solutions bring to customers.