distinct() vs dropDuplicates() in Apache Spark

What’s the difference between distinct() and dropDuplicates() in Spark?

Giorgos Myrianthous
Towards Data Science
3 min read · Feb 21, 2021


Photo by Juliana on unsplash.com

The Spark DataFrame API provides two methods for removing duplicate rows from a DataFrame: distinct() and dropDuplicates(). Although both do much the same job, they differ in one respect, which is…
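To make the two methods concrete, here is a minimal plain-Python sketch of how they behave. It is a model of the semantics, not Spark code: the helper names `distinct` and `drop_duplicates` and the sample rows are hypothetical, chosen only for illustration. It relies on the documented PySpark signatures, where `DataFrame.distinct()` takes no arguments and `DataFrame.dropDuplicates()` accepts an optional list of column names to compare on.

```python
from typing import Dict, Hashable, List, Optional, Sequence

# A row is modelled as a plain dict of column name -> value (an assumption
# made for this sketch; real Spark rows are pyspark.sql.Row objects).
Row = Dict[str, Hashable]


def drop_duplicates(rows: List[Row], subset: Optional[Sequence[str]] = None) -> List[Row]:
    """Model of DataFrame.dropDuplicates(subset): rows count as duplicates
    when they agree on the columns in `subset` (all columns if omitted)."""
    seen = set()
    result = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            # This model keeps the first row seen for each key. On a
            # distributed DataFrame, do not rely on which row survives.
            result.append(row)
    return result


def distinct(rows: List[Row]) -> List[Row]:
    """Model of DataFrame.distinct(): always compares every column."""
    return drop_duplicates(rows)


# Hypothetical sample data.
rows = [
    {"id": 1, "name": "Andrew", "country": "UK"},
    {"id": 1, "name": "Andrew", "country": "UK"},  # exact duplicate of the first row
    {"id": 2, "name": "Maria", "country": "UK"},
]

print(len(distinct(rows)))                             # 2
print(len(drop_duplicates(rows)))                      # 2
print(len(drop_duplicates(rows, subset=["country"])))  # 1
```

With no arguments the two calls return the same rows. Passing a subset changes what counts as a duplicate, so fewer rows can survive.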


I strive to build data-intensive systems that are not only functional, but also scalable, cost effective and maintainable over the long term.