Why Apache Spark Is Fast and How to Make It Run Faster

Mahdi Karabiben
Towards Data Science
Dec 3, 2019 · 10 min read

Spark does things fast. That has always been the framework’s main selling point since it was first introduced back in 2010.

Offering an in-memory alternative to MapReduce gave the Big Data ecosystem a major boost, and over the past few years it has been one of the key reasons companies adopted Big Data systems.

With its vast range of use cases, its ease of use, and its record-setting capabilities, Spark rapidly became everyone's go-to framework for data processing within a Big Data architecture.

Part I: the Spark ABC

One of Spark's key components is its SparkSQL module, which lets you write batch Spark jobs as SQL-like queries. To do so, Spark relies behind the scenes on a complex mechanism to run these queries through the execution engine. The centerpiece of this mechanism is Catalyst, Spark's query optimizer, which does much of the heavy lifting by generating the job's physical execution plan.

Even though every step of this process has been meticulously refined to optimize every aspect of the job, there is still plenty you can do on your end of the chain to make your Spark jobs run even faster. But before getting into that, let's take a deeper dive into how Catalyst does things.

First of all, let’s start with the basics

Spark offers multiple ways to interact with its SparkSQL interfaces, the main APIs being the DataSet and the DataFrame. These high-level APIs were built on top of the object-oriented RDD API, keeping its main characteristics while adding key features such as schemas. (For a detailed comparison, please refer to this article on the Databricks blog.)

The choice of API depends mainly on the language you're using: the DataSet API is only available in Scala and Java, where it has replaced the DataFrame since the release of Spark 2.0 (a DataFrame is now simply an alias for Dataset[Row]), and each API offers its own perks and advantages. The good news is that Spark uses the same execution engine under the hood to run your computations, so you can switch easily from one API to another without worrying about what's happening on the execution level.

That means that no matter which API you’re using, when you submit your job it’ll go through a unified optimization process.
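
To make the distinction concrete, here's a minimal sketch in Scala that reads the same (hypothetical) JSON file both as an untyped DataFrame and as a typed Dataset; the file path and the Person case class are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type matching the JSON file's fields.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("api-comparison").getOrCreate()
import spark.implicits._

// DataFrame: untyped rows with a schema (in Scala, an alias for Dataset[Row]).
val df: DataFrame = spark.read.json("people.json")

// Dataset: the same data with a compile-time type attached.
val ds: Dataset[Person] = df.as[Person]

// Both run on the same execution engine, so you can mix and match freely.
df.filter($"age" > 21).show()
ds.filter(_.age > 21).show()
```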

How Spark sees the world

The operations you can do within your Spark application are divided into two types:

  • Transformations: these are the operations that, when applied to an RDD, return a reference to a new RDD created via the transformation. Some of the most used transformations are filter and map. (Here’s a complete list of the available transformations)
  • Actions: When applied to an RDD, these operations return a non-RDD value. A good example is the count action, which returns the number of elements in an RDD to the Spark driver, or collect, which sends the contents of an RDD to the driver. (Please refer to this link for a complete list of the actions that can be applied to RDDs.)

The DataFrame and DataSet operations are divided into the same categories since these APIs are built upon the RDD mechanism.
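
Here's a minimal sketch (Scala, spark-shell style) of the difference: the transformations below only describe new RDDs, while the actions at the end return plain values to the driver.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("transformations-vs-actions").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 100)

// Transformations: each one returns a new RDD, and nothing is computed yet.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Actions: these trigger the computation and return non-RDD values to the driver.
val total     = squares.count()  // number of elements
val firstFive = squares.take(5)  // a plain Array[Int] on the driver
```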

The next distinction to make is between the two types of transformations:

  • Narrow transformations: When these transformations are applied to an RDD, there is no data movement between partitions. The transformation is applied to the data of each partition of the RDD and results in a new RDD with the same number of partitions, as shown in the illustration below. For example, filter is a narrow transformation: the filter is applied to the data of each partition, and the resulting data represents a partition within the newly created RDD.
A narrow transformation (Source: Databricks)
  • Wide transformations: These transformations require data movement between partitions, known as a shuffle. The data is moved across the network, and the partitions of the newly created RDD are based on the data of multiple input partitions, as illustrated below. A good example is the sortBy operation, in which data from all of the input partitions is sorted based on a certain column, generating an RDD with new partitions. (A short code sketch contrasting the two types follows below.)
A wide transformation (Source: Databricks)
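
Here's a small sketch contrasting the two: filter stays within each partition, while sortBy forces a shuffle, which shows up as a shuffle boundary in the RDD's lineage.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("narrow-vs-wide").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000, numSlices = 8)

// Narrow: filter is applied partition by partition, no data crosses the network.
val filtered = rdd.filter(_ % 3 == 0)

// Wide: sortBy must redistribute records so that each output partition holds
// a contiguous range of values, which requires a shuffle.
val sorted = filtered.sortBy(identity)

// The lineage shows the shuffle boundary introduced by the wide transformation.
println(sorted.toDebugString)
```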

So when you submit a job to Spark, what you're submitting is essentially a set of actions and transformations that Catalyst turns into the job's logical plan before generating an optimized physical plan.

Part II: The Spark Magic

Now that we know how Spark sees the jobs that we submit to it, let’s go through the mechanisms that turn that list of actions and transformations into the job’s physical execution plan.

Spark is a lazy magician

First of all, an important concept to remember when working with Spark is that it relies on lazy evaluation. That means that when you submit a job, Spark will only do its magic when it has to, i.e. when it receives an action (like when the driver asks for some data or when data needs to be stored in HDFS).

Instead of running the transformations one by one as soon as it receives them, Spark stores these transformations in a DAG (Directed Acyclic Graph), and as soon as it receives an action, it runs the whole DAG and delivers the requested output. This enables it to optimize its execution plan based on the job’s DAG, instead of running the transformations sequentially.
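
As a small illustration (the input path is hypothetical): none of the lines below trigger any work until count() arrives, at which point Spark executes the whole DAG.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-evaluation").getOrCreate()
val sc = spark.sparkContext

// Only recorded in the DAG: nothing is read or computed yet.
val lines    = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
val errors   = lines.filter(_.contains("ERROR"))
val messages = errors.map(_.split("\t").last)

// The action triggers the execution of the whole DAG.
val errorCount = messages.count()
println(s"Found $errorCount error lines")
```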

How it all happens

Spark relies on Catalyst, its optimizer, to perform the necessary optimizations to generate the most efficient execution plan. At its core, Catalyst includes a general library dedicated to representing trees and applying rules to manipulate them. It leverages functional programming constructs in Scala and offers libraries specific to relational query processing.

Catalyst's main data type is a tree composed of node objects, to which it applies sets of rules in order to optimize it. These optimizations are performed in four phases, as shown in the diagram below:

Catalyst’s optimization phases (source: Databricks)

Logical/Physical plan

One distinction that may not be very clear at first is the usage of the terms “logical plan” and “physical plan”. To put it simply, a logical plan consists of a tree describing what needs to be done, without implying how to do it, whereas a physical plan describes exactly what every node in the tree would do.

For example, a logical plan simply indicates that there’s a join operation that needs to be done, while the physical plan fixes the join type (e.g. ShuffleHashJoin) for that specific operation.
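
A quick way to see both plans is to call explain(true) on a query. In this sketch the two toy DataFrames are assumptions; the physical plan in the output shows the concrete join implementation Spark picked (for example a BroadcastHashJoin or a SortMergeJoin).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("logical-vs-physical").getOrCreate()
import spark.implicits._

val orders    = Seq((1, "book"), (2, "pen")).toDF("customer_id", "item")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// The logical plan only records that a join on customer_id is needed;
// the physical plan commits to a concrete join strategy.
val joined = orders.join(customers, Seq("customer_id"))

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
joined.explain(true)
```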

Now let’s go through these four steps and delve deeper into Catalyst’s logic.

Step 1: Analysis

The starting point of the Catalyst optimization pipeline is a set of unresolved attribute references or relations. Whether you're using SQL or the DataFrame/Dataset APIs, SparkSQL initially knows nothing about your data types, or even whether the columns you're referring to actually exist (this is what "unresolved" means). If you submit a select query, SparkSQL first uses Catalyst to determine the type of every column you pass and whether the columns you're using actually exist. To do so, it relies mainly on Catalyst's trees and rules mechanisms.

It first creates a tree for the unresolved logical plan, then starts applying rules on it until it resolves all of the attribute references and relations. Throughout this process, Catalyst relies on a Catalog object that tracks the tables in all data sources.
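
You can see the analysis phase at work by referencing a column that doesn't exist: resolution fails before any data is touched. The tiny table below is just an assumption for the example.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder().appName("analysis-phase").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
people.createOrReplaceTempView("people")

// "agee" is an unresolved attribute that the analysis phase cannot find
// in the catalog, so the query fails before anything is executed.
try {
  spark.sql("SELECT agee FROM people").show()
} catch {
  case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
}
```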

Step 2: Logical optimization

In this phase, Catalyst gets some help. With the release of Spark 2.2 in 2017, a cost-based optimizer framework was introduced. Unlike rule-based optimizations, a cost-based optimizer uses statistics and cardinalities to find the most efficient execution plan, instead of simply applying a set of rules.

The output of the analysis step is a logical plan that then goes through a series of rule-based and cost-based optimizations in this second step. Catalyst applies all of the optimization rules on the logical plan and works with the cost-based optimizer to deliver an optimized logical plan to the next step.
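
The cost-based optimizer is disabled by default and only has statistics to work with if you compute them, so enabling it typically looks like the sketch below (the sales table and its columns are hypothetical and need to live in a catalog such as the Hive metastore).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-example")
  // Turn on the cost-based optimizer (off by default) and join reordering.
  .config("spark.sql.cbo.enabled", "true")
  .config("spark.sql.cbo.joinReorder.enabled", "true")
  .getOrCreate()

// The CBO relies on table and column statistics, which must be computed beforehand.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```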

Step 3: Physical planning

Just like in the previous step, SparkSQL uses both Catalyst and the cost-based optimizer for physical planning. It generates multiple candidate physical plans from the optimized logical plan, then leverages a set of physical rules and statistics to select the most efficient one.
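
One concrete way to observe (and influence) this choice is the broadcast hint: with small toy DataFrames like the assumed ones below, comparing the two explain outputs shows which join strategy the planner selected.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("physical-planning").getOrCreate()
import spark.implicits._

val facts = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val dims  = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// Spark picks the physical join strategy it estimates to be the cheapest.
facts.join(dims, Seq("customer_id")).explain()

// The broadcast hint nudges the planner towards a BroadcastHashJoin,
// which avoids shuffling the larger side.
facts.join(broadcast(dims), Seq("customer_id")).explain()
```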

Step 4: Code generation

Finally, Catalyst uses quasiquotes, a special feature offered by Scala, to generate the Java bytecode to run on each machine. Catalyst uses this feature by transforming the job’s tree into an abstract syntax tree (AST) that is evaluated by Scala, which then compiles and runs the generated code.
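
If you're curious about the result of this phase, the debug helpers in org.apache.spark.sql.execution.debug let you print the code generated for a query, as in this small sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._   // exposes debugCodegen()

val spark = SparkSession.builder().appName("codegen-example").getOrCreate()
import spark.implicits._

val query = Seq(1, 2, 3).toDF("value").filter($"value" > 1)

// Prints the Java source that whole-stage code generation produced for this plan.
query.debugCodegen()
```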

To sum up

Spark SQL relies on a sophisticated pipeline to optimize the jobs that it needs to execute, and it uses Catalyst, its optimizer, in all of the steps of this process. This optimization mechanism is one of the main reasons for Spark’s astronomical performance and its effectiveness.

Part III: Getting Spark to the Next Level

Now that we've examined Spark's sophisticated optimization process, it's clear that Spark relies on a meticulously crafted mechanism to achieve its mind-boggling speed. But it would be a mistake to think that Spark will give you optimal results no matter how you do things on your side.

The assumption is easy to make, especially when migrating from another data-processing tool. A 50% drop in processing time compared to the tool you've been using can make you believe that Spark is running at full speed and that you can't reduce the execution time any further. The thing is, you can.

Spark SQL and its optimizer, Catalyst, can do wonders on their own via the process we discussed above, but with a few twists and techniques you can take Spark to the next level. So let's discuss how you can optimize Spark jobs from your end.

Always take a look under the hood

The first thing to keep in mind when working with Spark is that execution time doesn't mean much on its own. To evaluate a job's performance, it's important to know what's happening under the hood while it runs. During the development and testing phases, you should frequently use the explain function to see the physical plan generated for the statements you wish to analyze; for an in-depth analysis, you can add the extended flag to see the different plans that Spark SQL produced (from the parsed logical plan to the physical plan). This is a great way to detect potential problems and unnecessary stages without even having to execute the job.
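
For instance, on a (hypothetical) Parquet dataset, explain() prints the physical plan, where you can check details such as pushed-down filters, while explain(true) prints every plan from the parsed logical plan down to the physical one.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("explain-example").getOrCreate()
import spark.implicits._

// Hypothetical Parquet dataset; the path is only for illustration.
val events = spark.read.parquet("/data/events")

val query = events
  .filter($"event_type" === "click")
  .select("user_id", "event_time")

query.explain()      // physical plan only
query.explain(true)  // parsed, analyzed, and optimized logical plans + physical plan
```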

Know when to use the cache

Caching is very important when dealing with large datasets and complex jobs. It lets you save the datasets that you plan to use in subsequent stages so that Spark doesn't recreate them from scratch. This advantage sometimes pushes developers into “over-caching”, to the point where the cached datasets become a burden that slows the job down instead of speeding it up. To decide which datasets to cache, you have to build out the whole job first, then use testing to figure out which datasets are actually worth caching and at which point you can unpersist them to free the memory they occupy. Using the cache efficiently allows Spark to run certain computations 10 times faster, which can dramatically reduce the total execution time of your job.
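
A typical pattern looks like the sketch below (the dataset, columns, and output paths are assumptions): the filtered dataset is reused by two aggregations, so it's persisted once and released when no longer needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("caching-example").getOrCreate()
import spark.implicits._

// Reused by both aggregations below, so it's worth caching once.
val cleaned = spark.read.parquet("/data/transactions")
  .filter($"amount" > 0)
  .persist(StorageLevel.MEMORY_AND_DISK)

val byDay      = cleaned.groupBy("day").count()
val byCustomer = cleaned.groupBy("customer_id").sum("amount")

byDay.write.parquet("/output/by_day")
byCustomer.write.parquet("/output/by_customer")

// Free the memory once the cached dataset is no longer needed.
cleaned.unpersist()
```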

Know your cluster, and your data

A key element to getting the most out of Spark is fine-tuning its configuration according to your cluster. Relying on the default configuration may be the way to go in certain situations, but usually you’re one parameter away from getting even more impressive results. Selecting the appropriate number of executors, the number of cores per executor, and the memory size for each executor are all elements that could greatly influence the performance of your jobs, so don’t hesitate to perform benchmark testing to see if certain parameters could be optimized.
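
These settings can be passed through spark-submit or set when building the session; the values below are purely illustrative and would need benchmarking against your own cluster and workload.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: the right numbers depend on your cluster and data.
val spark = SparkSession.builder()
  .appName("tuned-job")
  .config("spark.executor.instances", "10")      // number of executors
  .config("spark.executor.cores", "4")           // cores per executor
  .config("spark.executor.memory", "8g")         // memory per executor
  .config("spark.sql.shuffle.partitions", "200") // partitions used after a shuffle
  .getOrCreate()
```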

Finally, an important factor to keep in mind is that you need to know the data you're dealing with and what to expect from every operation. When one stage takes too long even though it's processing less data than the other stages, you should inspect what's happening on the data side. Spark is great at doing the heavy lifting and running your code, but only you can detect the business-related issues hiding in the way you defined your job.

If you apply all of these rules while developing and implementing your Spark jobs, you can expect the record-breaking processing tool to reward you with jaw-dropping results.

These recommendations are merely a first step towards mastering Apache Spark. In upcoming articles, we’ll discuss its different modules in detail to get an even better understanding of how Spark functions.

This article was first published on INVIVOO’s tech blog.

For more data engineering content you can subscribe to my bi-weekly newsletter, Data Espresso, in which I discuss various topics related to data engineering and technology in general.
