Best Practices for Bucketing in Spark SQL

The ultimate guide to bucketing in Spark.

Published in Towards Data Science

21 min read · Apr 25, 2021

Bucketing is a feature that Spark has supported since version 2.0. It is a way to organize data in the filesystem so that subsequent queries can take advantage of that layout.
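To make the idea concrete, here is a minimal pure-Python sketch of what organizing data into buckets means. It is not Spark code and not Spark's implementation: Spark assigns rows to buckets with its own Murmur3-based hash, while this sketch uses `zlib.crc32` as a stand-in. The names `NUM_BUCKETS`, `bucket_id`, and `write_bucketed`, as well as the sample rows, are hypothetical and exist only for illustration.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket number with a deterministic hash.

    crc32 is a stand-in here; Spark uses its own Murmur3-based hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows, key_field, num_buckets=NUM_BUCKETS):
    """Group rows by bucket number; each bucket stands in for one file."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key_field], num_buckets)].append(row)
    return dict(buckets)


# Hypothetical sample data: two "tables" bucketed by the same key.
orders = [
    {"user_id": "u1", "amount": 10},
    {"user_id": "u2", "amount": 25},
    {"user_id": "u1", "amount": 7},
]
users = [
    {"user_id": "u1", "name": "Ann"},
    {"user_id": "u2", "name": "Bob"},
]

orders_b = write_bucketed(orders, "user_id")
users_b = write_bucketed(users, "user_id")

# Equal keys always land in the same bucket number, so a join only needs
# to compare bucket i of one table with bucket i of the other.
joined = [
    (o["user_id"], u["name"], o["amount"])
    for b in orders_b.keys() & users_b.keys()
    for o in orders_b[b]
    for u in users_b[b]
    if o["user_id"] == u["user_id"]
]
```

Because both tables are laid out with the same hash function and the same number of buckets, the join never has to look across buckets. That is the property a later query can exploit when the data was bucketed at write time.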

Many resources explain the basic idea of bucketing. In this article, we will go one step further and describe bucketing in more detail: we will look at its various aspects and explain how it works under the hood, how it evolved over time, and, most importantly, how to efficiently…

Written by David Vrba