
Working with real-world datasets has always intrigued Data Scientists to build optimized, efficient, and accurate systems at scale. Undoubtedly, optimization is the key to triumph in building real-world software solutions. While I understand that not everyone is building solutions at scale, awareness about optimization techniques is nevertheless helpful and applicable to even generic Data Science use-cases.
Therefore, in this post, I will consider one of the most sought-after functions in Pandas, i.e., Joins, and share how you can optimize the Pandas Merge operation by close to 5 times with experimental proofs. If you are unfamiliar with Joins in Pandas, I would highly recommend this post to understand them before proceeding further.
Experiment 1: Using pd.merge()

Here, we create two dummy Dataframes and merge them using the pd.merge() function – explicitly specifying which columns to execute the merge operation on.
Experiment 2: Indexing columns and using df.join()

Similar to experiment 1, we create two dummy DataFrames. However, instead of proceeding directly towards a Join, we first index the DataFrames and then adopt the JOIN method to merge the DataFrames.
Experimental setting:
- In both experiments, namely, experiment 1 and experiment 2, both the DataFrames had an equal number of rows and two columns each – one of which was used for joining the DataFrames.
- I considered different-sized DataFrames starting from a million rows to ten million rows and incremented the number of rows by 1 million with each experiment.
- I repeated an experimental trial on a fixed number of rows ten times to eliminate any randomness.
- Below, I have reported the average run-time for merge operation across these ten trials.
Results:
Now, it’s time for mind-boggling results.

- The above plot depicts the time taken (in milliseconds) to perform the merge operation in experiment 1 and the (index + merge) operation in experiment 2.
- As we see from the plot, there is a significant difference between the run-time of both experiments— experiment 2 being close to 5 times as fast as experiment 1.
- As the size of the DataFrame increases, so does the difference between the run-time of both experiments. This indicates that indexing should always be your way to proceed with larger datasets.
- Both JOIN operations almost linearly increase with the size of the DataFrame. However, the rate at which the run-time of experiment 2 increases is much lower than that of experiment 1.
To conclude, I would highly recommend everyone reading this post to immediately STOP using pd.merge() to perform JOIN operation, especially if you are someone working with large amounts of data.
Thanks for reading. I hope this post was helpful.
