The world’s leading publication for data science, AI, and ML professionals.

How You Can Make Pandas Joins 5 Times Faster

A faster alternative to pd.merge()

Photo by Jonathan Petersson on Unsplash
Photo by Jonathan Petersson on Unsplash

Working with real-world datasets has always intrigued Data Scientists to build optimized, efficient, and accurate systems at scale. Undoubtedly, optimization is the key to triumph in building real-world software solutions. While I understand that not everyone is building solutions at scale, awareness about optimization techniques is nevertheless helpful and applicable to even generic Data Science use-cases.

Therefore, in this post, I will consider one of the most sought-after functions in Pandas, i.e., Joins, and share how you can optimize the Pandas Merge operation by close to 5 times with experimental proofs. If you are unfamiliar with Joins in Pandas, I would highly recommend this post to understand them before proceeding further.


Experiment 1: Using pd.merge()

Code snippet for pd.merge (image created by author using snappify.io)
Code snippet for pd.merge (image created by author using snappify.io)

Here, we create two dummy Dataframes and merge them using the pd.merge() function – explicitly specifying which columns to execute the merge operation on.

Experiment 2: Indexing columns and using df.join()

Code snippet for pd.join (image created by author using snappify.io)
Code snippet for pd.join (image created by author using snappify.io)

Similar to experiment 1, we create two dummy DataFrames. However, instead of proceeding directly towards a Join, we first index the DataFrames and then adopt the JOIN method to merge the DataFrames.

Experimental setting:

  • In both experiments, namely, experiment 1 and experiment 2, both the DataFrames had an equal number of rows and two columns each – one of which was used for joining the DataFrames.
  • I considered different-sized DataFrames starting from a million rows to ten million rows and incremented the number of rows by 1 million with each experiment.
  • I repeated an experimental trial on a fixed number of rows ten times to eliminate any randomness.
  • Below, I have reported the average run-time for merge operation across these ten trials.

Results:

Now, it’s time for mind-boggling results.

Experimental results depicting run-time of join operations (image created by author)
Experimental results depicting run-time of join operations (image created by author)
  • The above plot depicts the time taken (in milliseconds) to perform the merge operation in experiment 1 and the (index + merge) operation in experiment 2.
  • As we see from the plot, there is a significant difference between the run-time of both experiments— experiment 2 being close to 5 times as fast as experiment 1.
  • As the size of the DataFrame increases, so does the difference between the run-time of both experiments. This indicates that indexing should always be your way to proceed with larger datasets.
  • Both JOIN operations almost linearly increase with the size of the DataFrame. However, the rate at which the run-time of experiment 2 increases is much lower than that of experiment 1.

To conclude, I would highly recommend everyone reading this post to immediately STOP using pd.merge() to perform JOIN operation, especially if you are someone working with large amounts of data.

Thanks for reading. I hope this post was helpful.


Meme created by author
Meme created by author

Related Articles