It’s Time To Say Goodbye To The Merge Method in Pandas

Why I stopped using the merge method in Pandas and why you should too

Avi Chawla
Towards Data Science

--

Photo by Alain Pham on Unsplash

The merge() method in Pandas is undoubtedly among the most frequently used methods by data scientists in their data science projects.

Derived from the idea of table joins in SQL and extended to joining tables in a pythonic environment, the method merges two Pandas DataFrames based on the matching values in one or more columns.

This is illustrated in the diagram below:

Diagrammatic Overview of joining tables (Image by Author)

The intuitive nature of the merge() method makes it ideal for Pandas users to join DataFrames.

However, when it comes to the run-time, there is a relatively better alternative available in Pandas that you should prefer over the merge() method.

You can find the code for this article here.

Let’s explore 🚀!

Methods to Merge Tables

Method 1: Using merge()

As discussed above, the traditional and the most commonplace way of merging DataFrames in Pandas is using the merge() method.

As demonstrated in the code block above, the method accepts two DataFrames, df1 and df2.

Further, we specify the kind of join we wish to perform using the how argument ("left" in the example above).

Lastly, we specify the columns to be considered for matching values from the first DataFrame (df1) with the left_on argument and that from the second DataFrame (df2) using the right_on argument.

Method 2: Using join()

The join() method is similar to the merge() method in Pandas in terms of its objective but with a few differences in the implementation.

  1. The join() method performs a lookup at the index of df2 and df1. However, the merge() method is primarily used to join using entries in a column.
  2. The join() method performs a left join by default. Whereas the merge() method resorts to an inner join in its default behavior.
Joining tables on Index values (Image by Author)

The code block below demonstrates the join() method.

As specified above, the join() method performs an index lookup to join two DataFrames. That is, rows corresponding to the same index values are merged.

Therefore, while using the join() method, you should first set the column(s) you wish to execute join on as the index of the DataFrame and then call the join() method.

Experimental Setup

To evaluate the run-time performance of the merge() method in Pandas, we shall compare it with the join() method.

Specifically, we will create two dummy DataFrames and perform a join using both the methods — merge() and join().

The implementation of this experiment is shown below:

  • First, we set the values of the integers from (-high, +high). We shall compare the performance of both the methods on different sizes of the DataFrame with the number of rows from rows_list and columns as n_columns. Lastly, we shall run each experiment repeat times.
  • The create_df method accepts a series of arguments and returns a random DataFrame.
  • In the code below, we measure the run-time of the merge() method and the join() method on the same DataFrame df1 and df2.

Note that to use the join() method, you should first set the column(s) as the index of the DataFrame.

Results

Next, let’s look at the results.

Experimental results of Join vs Merge Method (Image by Author)
  • The blue line-plot depicts the run-time of the merge() method, and the yellow line-plot represents the run-time of the join() method.
  • We vary the number of rows from 1 million to 10 Million and notice that the run-time of both methods is positively correlated with the number of rows.
  • However, the join() method provides significant improvement in the run-time over the traditional merge() method.
  • As the number of rows increases, so does the difference between the run-time of both methods. This indicates that you should always use the join() method to merge DataFrames, especially in the case of larger datasets.

Conclusion

To conclude, in this post, we compared the performance of the Pandas’ merge() and join() method on a dummy DataFrame.

Experimental results suggest that merging on the index column using the join() method is efficient in terms of run-time over the merge() method — providing a performance boost of up to 4 to 5 times.

You can find the code for this article here.

Thanks for reading!

🧑‍💻 Become a Data Science PRO! Get the FREE Data Science Mastery Toolkit with 450+ Pandas, NumPy, and SQL questions.

✉️ Sign-up to my Email list to never miss another article on data science guides, tricks and tips, Machine Learning, SQL, Python, and more. Medium will deliver my next articles right to your inbox.

--

--

👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.