Best Practices for Bucketing in Spark SQL

The ultimate guide to bucketing in Spark.

David Vrba
Towards Data Science
21 min read · Apr 25, 2021

Bucketing is a feature supported by Spark since version 2.0. It is a way to organize data in the filesystem and leverage that organization in subsequent queries.
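
To make the idea concrete, here is a minimal sketch in plain Python (not Spark code) of what "organizing data into buckets" means: each row is assigned to one of a fixed number of buckets based on a hash of its key, so rows with the same key always end up together. The helper names `bucket_id` and `bucketize`, the sample `orders` rows, and the use of CRC32 are assumptions made for illustration only; Spark computes bucket ids with its own Murmur3-based hash and writes each bucket to files, typically via `DataFrameWriter.bucketBy(...)` followed by `saveAsTable(...)`.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    """Map a key to a bucket number in [0, num_buckets).

    CRC32 is only a deterministic stand-in here; Spark uses a
    Murmur3-based hash, so real bucket ids will differ.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key, num_buckets):
    """Group rows (dicts) by the bucket id of the given key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key], num_buckets)].append(row)
    return dict(buckets)


# Hypothetical sample data, bucketed by user_id into 4 buckets.
orders = [
    {"user_id": 1, "amount": 10},
    {"user_id": 2, "amount": 25},
    {"user_id": 1, "amount": 7},
    {"user_id": 3, "amount": 40},
]
order_buckets = bucketize(orders, "user_id", 4)

for bucket, rows in sorted(order_buckets.items()):
    print(bucket, rows)
```

Because the assignment depends only on the key and the number of buckets, a later query that groups or joins on the same key can rely on all matching rows already sitting in the same bucket, which is the property the rest of the article builds on.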

Many resources explain the basic idea of bucketing. In this article, we will go one step further and describe bucketing in more detail: we will look at its various aspects and explain how it works under the hood, how it evolved over time and, most importantly, how to efficiently…

Senior ML Engineer at Socialbakers and Apache Spark trainer and consultant. I teach Spark trainings and workshops and give public talks related to Spark.