<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Brainstorm on Memory issues for Spark in HPE Ezmeral Software platform</title>
    <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7160733#M215</link>
    <description>&lt;P&gt;I recently encountered Spark always creating 200 partitions after wide transformations. Sometimes I needed fewer and sometimes more. To resolve this I enabled Spark 3.0's adaptive query execution.&lt;/P&gt;&lt;P&gt;Spark 3.0 provides&amp;nbsp;&lt;A href="https://sparkbyexamples.com/spark/spark-3-0-adaptive-query-execution/?swcfpc=1" target="_blank" rel="noopener"&gt;Adaptive Query Execution&lt;/A&gt;, which improves query performance by re-optimizing the query plan at runtime. You can enable it by setting&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spark.conf.set("spark.sql.adaptive.enabled", true)&lt;/SPAN&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;Spark 3 dynamically determines the optimal number of partitions by looking at the metrics of the completed stage. To use this, you also need to enable the following configuration.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", true)&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks&lt;/DIV&gt;</description>
    <pubDate>Thu, 17 Feb 2022 02:32:28 GMT</pubDate>
    <dc:creator>vathi106</dc:creator>
    <dc:date>2022-02-17T02:32:28Z</dc:date>
    <item>
      <title>Brainstorm on Memory issues for Spark</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7118157#M20</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;Have you encountered any kind of memory issue with Spark?&lt;/P&gt;&lt;P&gt;If so, would you like to share your troubleshooting tips?&lt;/P&gt;&lt;P&gt;Thanks~&lt;/P&gt;</description>
      <pubDate>Thu, 21 Jan 2021 18:49:21 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7118157#M20</guid>
      <dc:creator>Hao_Zhu</dc:creator>
      <dc:date>2021-01-21T18:49:21Z</dc:date>
    </item>
    <item>
      <title>Re: Brainstorm on Memory issues for Spark</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7118640#M26</link>
      <description>&lt;P&gt;I would say the memory issue is a subtopic of the more general optimization issue in Spark.&lt;/P&gt;&lt;P&gt;Since Spark was designed as an in-memory computation framework, it is naturally more demanding of RAM than legacy MapReduce. Therefore it is always a good idea to design your cluster specification with this in mind.&lt;/P&gt;&lt;P&gt;However, there is no recipe to make your cluster highly utilised and never hit OOM. It is always speculative and subject to change over time. I would argue this is about the balance between stability and cost. With time you gain an understanding of what a reasonable capacity for your workloads is. This is an iterative and dynamic process.&lt;/P&gt;&lt;P&gt;There are multiple layers of memory you should consider before taking action on an OOM issue.&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;1. Physical memory: this is what the OS sees when the job is launched. In Linux you check it with "top", "free", etc.&lt;/P&gt;&lt;P&gt;If you're submitting Spark jobs with the YARN ResourceManager, you can diagnose this type of OOM in the container logs:&lt;/P&gt;&lt;PRE&gt;Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding limits.
12.4 GB of 12.3 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding limits.
4.5 GB of 3 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;As suggested, consider boosting "spark.yarn.executor.memoryOverhead". Typically, allocating 1/10 of spark.executor.memory gets rid of it.&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;2. Virtual memory: this is your app's physical memory + swap (paged files).&lt;/P&gt;&lt;P&gt;This is managed by the RM and diagnosed by the message below:&lt;/P&gt;&lt;PRE&gt;Container killed by YARN for exceeding memory limits.
1.1gb of 1.0gb virtual memory used. Killing container.&lt;/PRE&gt;&lt;P&gt;It can be solved by disabling the vmem check on the NodeManager:&lt;/P&gt;&lt;PRE&gt;"yarn.nodemanager.vmem-check-enabled":"false"&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;3. Java heap space: this is the memory available to the Spark JVM itself (driver/executor).&lt;/P&gt;&lt;P&gt;It can be detected in container logs as the message below:&lt;/P&gt;&lt;PRE&gt;WARN TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;You request memory slots for your Spark app from the RM by setting the intrinsic config spark.executor.memory.&lt;/P&gt;&lt;P&gt;In many cases, if this runs out, Spark will try to spill data to disk and no OOM occurs. As a Spark app developer, you can choose not to use disk at all for performance reasons. Then your app fails fast with OOM instead of occupying your cluster's resources.&lt;/P&gt;&lt;P&gt;There are numerous optimisation techniques, however, to lower the memory footprint.&lt;/P&gt;&lt;P&gt;Here are useful links that cover this subject:&lt;/P&gt;&lt;P&gt;&lt;A href="https://0x0fff.com/spark-memory-management/" target="_blank" rel="noopener"&gt;https://0x0fff.com/spark-memory-management/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/" target="_blank" rel="noopener"&gt;https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://g1thubhub.github.io/spark-memory.html" target="_blank" rel="noopener"&gt;https://g1thubhub.github.io/spark-memory.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jan 2021 13:17:20 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7118640#M26</guid>
      <dc:creator>idyptan</dc:creator>
      <dc:date>2021-01-26T13:17:20Z</dc:date>
    </item>
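The 1/10 rule of thumb from the reply above matches Spark's documented default for `spark.yarn.executor.memoryOverhead`: max(10% of executor memory, 384 MiB). A minimal Python sketch of the arithmetic (the helper names are mine, not from the thread or Spark's API):

```python
# Sketch of how YARN executor memory overhead is typically sized,
# mirroring Spark's documented default: max(10% of executor memory, 384 MiB).

def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    """Overhead YARN adds on top of spark.executor.memory."""
    return max(int(executor_memory_mb * 0.10), 384)

def total_container_memory_mb(executor_memory_mb: int) -> int:
    """Executor heap plus overhead: what YARN actually reserves per container."""
    return executor_memory_mb + default_memory_overhead_mb(executor_memory_mb)

if __name__ == "__main__":
    # A 12 GiB executor gets ~1.2 GiB of overhead, so the container
    # needs ~13.2 GiB. Small executors hit the 384 MiB floor instead.
    print(default_memory_overhead_mb(12 * 1024))   # 1228
    print(total_container_memory_mb(12 * 1024))    # 13516
    print(default_memory_overhead_mb(1024))        # 384 (floor applies)
```

This is why a container can be killed for "12.4 GB of 12.3 GB physical memory used" even though the heap itself never overflowed: the overhead portion (off-heap buffers, thread stacks, native allocations) counts against the container limit too.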
    <item>
      <title>Re: Brainstorm on Memory issues for Spark</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7119447#M33</link>
      <description>&lt;P&gt;By default, Spark uses on-heap memory only. The size of the on-heap memory is configured by the --executor-memory or spark.executor.memory parameter when the Spark application starts. The concurrent tasks running inside an Executor share the JVM's on-heap memory.&lt;/P&gt;&lt;P&gt;The on-heap memory area in the Executor can be roughly divided into the following four blocks:&lt;/P&gt;&lt;P&gt;Storage Memory: mainly used to store Spark cache data, such as the RDD cache, broadcast variables, unroll data, and so on.&lt;BR /&gt;Execution Memory: mainly used to store temporary data during the calculation of shuffle, join, sort, aggregation, etc.&lt;BR /&gt;User Memory: mainly used to store the data needed for RDD transformations, such as RDD dependency information.&lt;BR /&gt;Reserved Memory: memory reserved for the system, used to store Spark's internal objects.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://support.datafabric.hpe.com/s/article/Spark-Troubleshooting-guide-Memory-Management-How-to-troubleshooting-out-of-memory-OOM-issues-on-Spark-Executor?language=en_US" target="_blank" rel="noopener"&gt;https://support.datafabric.hpe.com/s/article/Spark-Troubleshooting-guide-Memory-Management-How-to-troubleshooting-out-of-memory-OOM-issues-on-Spark-Executor?language=en_US&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Feb 2021 07:23:48 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7119447#M33</guid>
      <dc:creator>Vinayak_Meghraj</dc:creator>
      <dc:date>2021-02-02T07:23:48Z</dc:date>
    </item>
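The four regions described above follow Spark's unified memory model (Spark 1.6+). A rough Python sketch of the split, using the documented defaults (300 MiB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5); the function and key names are mine, and Spark computes the real values from the actual JVM heap size:

```python
# Rough sketch of how Spark's unified memory manager splits the executor heap,
# using documented defaults: 300 MiB reserved, spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5. Figures are approximate.
RESERVED_MB = 300

def memory_regions_mb(heap_mb: int,
                      memory_fraction: float = 0.6,
                      storage_fraction: float = 0.5) -> dict:
    usable = heap_mb - RESERVED_MB        # heap left after the reserved slice
    unified = usable * memory_fraction    # shared storage + execution pool
    storage = unified * storage_fraction  # cached RDDs/broadcasts (evictable)
    execution = unified - storage         # shuffle/join/sort/aggregation buffers
    user = usable - unified               # user data structures, UDF objects
    return {"reserved": RESERVED_MB, "storage": storage,
            "execution": execution, "user": user}

if __name__ == "__main__":
    # A 4 GiB executor heap: ~1.1 GiB each for storage and execution,
    # ~1.5 GiB of user memory, 300 MiB reserved.
    for name, mb in memory_regions_mb(4096).items():
        print(f"{name:>9}: {mb:7.1f} MB")
```

Note that storage and execution borrow from each other at runtime: the storage fraction is only a soft boundary below which cached blocks are protected from eviction.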
    <item>
      <title>Re: Brainstorm on Memory issues for Spark</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7126956#M49</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Spark is not an "in-memory" solution.&lt;/P&gt;&lt;P&gt;Spark was created by AMPLab to reduce the latency between the map and reduce cycles found in MR1 and MR2.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Spark does rely more on memory (both heap and non-heap), but it also caches to local disk.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Mar 2021 19:26:23 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7126956#M49</guid>
      <dc:creator>Michael_Segel</dc:creator>
      <dc:date>2021-03-23T19:26:23Z</dc:date>
    </item>
    <item>
      <title>Re: Brainstorm on Memory issues for Spark</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7126957#M50</link>
      <description>&lt;P&gt;&lt;a href="https://community.hpe.com/t5/user/viewprofilepage/user-id/2031754"&gt;@Hao_Zhu&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are many reasons why you could have memory issues with your Spark applications.&lt;/P&gt;&lt;P&gt;You could have very inefficient code, along with issues in the sizing of your Spark job and even of the cluster container space if you're running Spark on your cluster.&lt;/P&gt;&lt;P&gt;There are a lot of factors and places to look.&lt;/P&gt;&lt;P&gt;Can you be more specific about where you are having problems?&lt;/P&gt;&lt;P&gt;Also, which version of Spark and which features are you using?&amp;nbsp; (e.g. Spark SQL, Spark Structured Streaming, etc...)&lt;/P&gt;</description>
      <pubDate>Tue, 23 Mar 2021 19:29:10 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7126957#M50</guid>
      <dc:creator>Michael_Segel</dc:creator>
      <dc:date>2021-03-23T19:29:10Z</dc:date>
    </item>
    <item>
      <title>Re: Brainstorm on Memory issues for Spark</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7160733#M215</link>
      <description>&lt;P&gt;I recently encountered Spark always creating 200 partitions after wide transformations. Sometimes I needed fewer and sometimes more. To resolve this I enabled Spark 3.0's adaptive query execution.&lt;/P&gt;&lt;P&gt;Spark 3.0 provides&amp;nbsp;&lt;A href="https://sparkbyexamples.com/spark/spark-3-0-adaptive-query-execution/?swcfpc=1" target="_blank" rel="noopener"&gt;Adaptive Query Execution&lt;/A&gt;, which improves query performance by re-optimizing the query plan at runtime. You can enable it by setting&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;spark.conf.set("spark.sql.adaptive.enabled", true)&lt;/SPAN&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;Spark 3 dynamically determines the optimal number of partitions by looking at the metrics of the completed stage. To use this, you also need to enable the following configuration.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", true)&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks&lt;/DIV&gt;</description>
      <pubDate>Thu, 17 Feb 2022 02:32:28 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/brainstorm-on-memory-issues-for-spark/m-p/7160733#M215</guid>
      <dc:creator>vathi106</dc:creator>
      <dc:date>2022-02-17T02:32:28Z</dc:date>
    </item>
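The two `spark.conf.set` calls in the post above can equally be set once for every job in `spark-defaults.conf` (a config fragment, not from the thread; both keys are documented Spark 3.x settings):

```
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
```

With both enabled, AQE coalesces the post-shuffle partitions based on the completed stage's runtime statistics, so the fixed `spark.sql.shuffle.partitions=200` default no longer dictates the partition count after wide transformations.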
  </channel>
</rss>

