Yet another guide to debugging Spark OOM errors

Because there can never be enough of those ;-)

Just another Spark error

I have recently had some trouble saving a Spark dataframe to a Hive managed table (working on Cloudera 2.4.0-cdh6.3.4) due to out of memory (OOM) errors. In this post I am documenting some steps that helped me solve the problem. Maybe they help you?

First, how to tell you have an OOM error? Well, with Spark there is always a good chance it is an OOM error 😏 OOM can occur in two contexts:

  1. On the driver
  2. On the executors

Let’s have a look at both of those.

OOM error on the driver

This is the easy one. Your driver can run out of memory primarily for two reasons. First, you may have asked for too much data. If your Spark dataframe contains a lot of data and you collect that data back to the driver, it can easily exceed the memory you have. So as a general rule:

# don't do:
sdf.collect()

# do:
sdf.limit(100).collect()

Next, be careful when you use broadcast. Broadcasting is a great way to serve small tables to all executors (in order to avoid a costly shuffle), but make sure your driver has enough memory to host the table. If in doubt, disable broadcasting and see if the OOM persists; if it does, broadcasting was not the problem. To disable it, add .config("spark.sql.autoBroadcastJoinThreshold", -1) to your Spark session config. Now, on to more complicated problems!
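As a quick sketch of what that looks like in practice (the app name is just a placeholder, and this is a config fragment rather than a full job):

```python
from pyspark.sql import SparkSession

# Disable automatic broadcast joins so you can check whether
# broadcasting a "small" table is what blows up the driver.
spark = (
    SparkSession.builder
    .appName("oom-debugging")  # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", -1)
    .getOrCreate()
)
```

If the OOM goes away with this setting, the table you were broadcasting was bigger than your driver could handle.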

OOM error on the executor(s)

This one is trickier. OOM errors on executors like to play hide and seek. There are some dead giveaways though: an exit code of 143 points towards memory problems, and "YARN killed container due to memory overhead" is another one. Even error messages that seem to point to a distinct problem are often memory-related; if the error message seems fishy, there is a good chance it is an OOM in my experience. Before we go into the details of how to fix each issue, let's recap the general architecture here.

Executor memory architecture

Nodes have a defined amount of memory, and executors live in this memory as containers. Each executor's memory is split between the memory available for your data and some YARN overhead memory. So when you are struggling with OOM errors, you are likely breaching one of the two.
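A back-of-the-envelope calculation makes the split concrete. The numbers below assume the defaults described in this post: a 10% overhead factor with a 384 MiB minimum.

```python
def container_size_mb(executor_memory_mb, overhead_fraction=0.10, overhead_min_mb=384):
    """Total memory a YARN container requests for one executor:
    executor memory plus the overhead slice on top of it."""
    overhead_mb = max(overhead_min_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead_mb

# An 8 GiB executor actually asks YARN for more than 8 GiB:
print(container_size_mb(8192))  # 8192 + 819 = 9011
```

This is why an executor configured with exactly the node's memory will be rejected (or killed) by YARN: the overhead sits on top of, not inside, the executor memory you request.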

To fix this, the following might help:

  1. Increase the number of partitions (spark.sql.shuffle.partitions)

By default, this is set to 200, which depending on your application may be too few or too many. In your code you can also call df.repartition(num_partitions). Partitioning splits your data into subsets, one per task. If there are too few partitions, too much data is crammed into each partition, potentially causing OOM errors. Increasing the number of partitions decreases the amount of data in any single partition. If you use RDDs, increase spark.default.parallelism instead. Don’t just set it to some absurdly large value, as this will significantly slow down your job (or even break it). If you are at around ~2000 partitions, it might be worth setting it to 2001, as Spark switches to a more highly compressed format for tracking shuffle output beyond that point.

  2. Reduce the number of cores per executor (spark.executor.cores)

Note that all cores on an executor share the memory of that executor, including the overhead memory. Irrespective of your partition size, OOM can occur if that memory does not suffice while N cores run tasks on N partitions at once, since each running task carries overhead that is largely independent of the partition size. This is usually referred to as high concurrency. It can also happen if you add a lot of columns without triggering a new stage (i.e. without the potential to repartition). To preserve the original execution speed you may want to increase the number of executors available to your job.

  3. Increase YARN memory overhead (spark.yarn.executor.memoryOverhead)

By default this is set to 10% of the executor's memory (with a floor of 384 MiB). If you use pyspark like I do, or SparkR, the memory used by the Python or R worker processes counts toward this overhead. If you get an error like “YARN killed container due to memory overhead”, try increasing this value.
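The three steps above can be combined in one session config. This is a sketch, not a recommendation: the concrete numbers are illustrative assumptions you would tune for your own cluster, and note that spark.yarn.executor.memoryOverhead takes a plain number of MiB on Spark 2.x.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Step 1: more (and therefore smaller) shuffle partitions;
    # 2001 also triggers the more compressed shuffle tracking.
    .config("spark.sql.shuffle.partitions", 2001)
    # Step 2: fewer concurrent tasks per executor...
    .config("spark.executor.cores", 2)
    # ...with more executors to keep overall parallelism up.
    .config("spark.executor.instances", 10)
    # Step 3: more overhead room for off-heap / Python memory (MiB).
    .config("spark.yarn.executor.memoryOverhead", 2048)
    .getOrCreate()
)
```

Change one knob at a time and re-run; otherwise you won't know which of the three actually fixed your OOM.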

So, that’s it for now. Hope this helps! Let me know if you have some cool Spark tips to debug OOM, I am sure I will need them in the future 😏

Jannic Alexander Cutura
Software ∪ Data Engineer

My interests include distributed computing and cloud as well as financial stability and regulation.
