About Joins in Spark 3.0

Tips for efficient joins in Spark SQL

David Vrba
Towards Data Science
9 min read · Jun 24, 2020

Joining two DataFrames is one of the most frequent transformations in Spark SQL. The syntax is simple, but it may not be clear what happens under the hood or whether the execution is as efficient as it could be.

Spark provides several algorithms for join execution and chooses one of them according to its internal logic. This choice may not be the best in all cases, and a proper understanding of the internal behavior may allow us to steer Spark towards better…

Senior ML Engineer at Socialbakers and Apache Spark trainer and consultant. I teach Spark trainings and workshops and give public talks related to Spark.