
A new major release of Apache Spark was made available on June 10, 2020. Version 3.0, the result of more than 3,400 resolved tickets, builds on top of version 2.x and comes with numerous new features, bug fixes and performance improvements.
Ten years after its initial release as an open source project, Apache Spark has become one of the core technologies of the Big Data era. A growing number of production applications are built with Spark, as it offers a unified engine for processing fairly large amounts of data in a reasonable amount of time.
Spark 3.0 has shipped a number of exciting new features and performance improvements. Here are the five most promising ones:
1. Adaptive Query Execution (AQE) enhancements
Runtime adaptivity is crucial in Spark because the characteristics of the input data directly affect the efficiency of an application, so execution plans benefit from being optimized at runtime rather than fixed in advance. A good example that demonstrates the importance of dynamically adapting execution plans is broadcasting: the adaptive execution mode can turn a shuffle join into a broadcast join if the table is small enough (i.e. if its size does not exceed the broadcast limit). For some data inputs this is feasible, while for others it is not. Another good example that showcases the importance of AQE is data skewness. When Adaptive Query Execution is enabled, Spark can also dynamically change the partitioning used in upcoming stages.
Spark 3.0 introduces two major improvements to Adaptive Query Execution that simplify Spark parameter tuning even further:
- AQE now coalesces small partitions, so users no longer need to worry as much about tuning the number of shuffle partitions, since it is adjusted dynamically at runtime.
- AQE automatically splits skewed partitions into smaller ones once data skewness is detected (see the configuration sketch after this list).
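As a rough illustration, both behaviors are controlled by a handful of configuration flags. Below is a minimal sketch of enabling them on an existing session; the property names are the ones documented for Spark 3.0, but defaults and finer-grained tuning knobs may vary between releases:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# Turn on Adaptive Query Execution as a whole
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce many small shuffle partitions into fewer, larger ones at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split heavily skewed partitions detected during shuffle joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")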
For the full list of optimizations introduced in Spark 3.0, you can refer to this JIRA ticket.
2. Improvements on pandas UDF API
Pandas UDFs (User-Defined Functions), added in Spark 2.3, are probably one of the most significant features ever added to Spark, as they allow users to leverage the pandas API inside Apache Spark.
The newest release of Apache Spark introduces a new Pandas UDF interface based on Python type hints. In earlier releases, pandas UDF definitions were not particularly consistent or easy to follow and use. The type hints introduced in version 3.0 will definitely help eliminate confusion among developers. Below, you can find an example of a pandas UDF using type hints to denote the type of the expected input(s) and output:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# The type hints (pd.Series -> pd.Series) tell Spark how to feed data
# into the UDF; 'long' is the type of the resulting column.
@pandas_udf('long')
def pandas_subtract_unit(s: pd.Series) -> pd.Series:
    return s - 1
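A minimal usage sketch, assuming an active SparkSession bound to spark (the column name v is a placeholder):

spark.createDataFrame([(1,), (2,), (3,)], ["v"]) \
    .select(pandas_subtract_unit("v").alias("v_minus_1")) \
    .show()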
The new interface allows pandas UDFs to infer the type from the given Python type hints in the Python function definition. Currently four distinct cases are supported for the type hints in Pandas UDFs. These are:
- Series -> Series (like the example above)
- Iterator of Series -> Iterator of Series (see the sketch after this list)
- Iterator of multiple Series -> Iterator of Series
- Series -> scalar
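To illustrate the second case, here is a minimal sketch of the iterator variant, which comes in handy when the UDF needs expensive one-time initialization; the function name and the comment about setup are illustrative:

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def subtract_unit_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Expensive setup (e.g. loading a model) could run once here,
    # then be reused across all incoming batches of the partition.
    for s in batches:
        yield s - 1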
That’s definitely a very good starting point; however, the community should consider supporting more type hints, as the currently supported ones are just a small fraction of the many possible combinations.
On another note, Spark 3.0 offers more Pythonic error handling by suppressing the JVM stack trace, which is rarely useful to Python developers.
3. New User Interface for Structured Streaming
The Web UI in Apache Spark 3.0 comes with an extra tab dedicated to Structured Streaming, which simplifies the monitoring of streaming jobs.
This tab displays scheduling delay and processing time for each micro-batch in the data stream, which can be useful for troubleshooting the streaming application – Spark Docs
The statistics page of a particular streaming query currently contains five metrics:
- Input Rate
- Process Rate
- Input Rows
- Batch Duration
- Operation Duration

4. More than 30 new built-in functions
Spark 3.0 comes with a bunch of new built-in functions, from bit counts to hyperbolic functions (e.g. hyperbolic sin/cos/tan) and CSV operations. For the full list of newly added functions, refer to the release notes.
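As a quick illustration, a couple of the additions can be tried directly from SQL. This is a sketch assuming an active SparkSession bound to spark; bit_count and from_csv are among the functions documented as new in 3.0:

# bit_count returns the number of set bits in its integer argument;
# from_csv parses a CSV string into a struct using the given schema.
spark.sql(
    "SELECT bit_count(42) AS bits, from_csv('1,0.8', 'a INT, b DOUBLE') AS parsed"
).show()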
5. Hydrogen: Deep Learning improvements
It is well known that building AI/ML models that perform really well requires massive amounts of training data. One of the biggest challenges of the past few years has been the compatibility between data processing frameworks (like Spark) and distributed Deep Learning frameworks. While Spark jobs are split into multiple independent tasks, most Deep Learning frameworks follow quite different execution logic (e.g. tasks that depend on each other).
Project Hydrogen is a Spark initiative that aims to unify big data processing and machine learning model training, and is split into three main pillars:
- Barrier execution mode
- Optimized data exchange
- Accelerator-aware scheduling
The initial implementation of barrier execution mode was made available in Spark 2.4, while the work on the other two pillars was still under development.
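For context, barrier execution mode exposes a gang-scheduling style API on RDDs: all tasks in a barrier stage are launched together and can synchronize. The snippet below is a minimal sketch, not a full training job; the train function body is a placeholder, and spark is assumed to be an active SparkSession:

from pyspark import BarrierTaskContext

def train(iterator):
    ctx = BarrierTaskContext.get()
    # Every task in the stage waits here until all tasks have arrived.
    ctx.barrier()
    return iterator  # placeholder for actual distributed training logic

rdd = spark.sparkContext.parallelize(range(8), numSlices=2)
rdd.barrier().mapPartitions(train).collect()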
Spark 3.0 comes with an enhanced scheduler that makes the cluster manager accelerator-aware. Deep Learning frameworks are hugely dependent on accelerators such as GPUs to speed up their workloads, and Spark is now able to identify available GPUs and assign them to tasks properly.
This is great news for people who were previously using Spark to load and process their data but then had to switch to alternative solutions to train their models, since Spark wasn’t aware of the GPUs in the cluster.
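In practice, accelerator-aware scheduling is driven by resource configuration at submit time, and each task can then look up the GPU addresses assigned to it. The following is a rough sketch based on the Spark 3.0 resource APIs; the discovery script path and function name are placeholders:

# Submit-time configuration (placeholder script path):
#   --conf spark.executor.resource.gpu.amount=1
#   --conf spark.task.resource.gpu.amount=1
#   --conf spark.executor.resource.gpu.discoveryScript=/path/to/getGpus.sh

from pyspark import TaskContext

def use_gpu(iterator):
    # Each task can inspect which GPU address(es) were assigned to it.
    gpus = TaskContext.get().resources()["gpu"].addresses
    # ... hand the addresses to the Deep Learning framework ...
    return iterator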
Spark 3.0.1
In early September 2020, the first maintenance release (3.0.1) was also made available; it mostly contains stability fixes. Make sure to keep an eye on the latest releases so that you don’t hit recently discovered bugs.
Conclusion
The newest major version of Spark comes with numerous features and performance improvements, and upgrading to the latest version is definitely worth considering. This article discussed only a small subset of the new features of Apache Spark 3.0; for the full list, refer to the official documentation.