The world’s leading publication for data science, AI, and ML professionals.

How To Merge Pandas DataFrames

Performing left, right, inner and anti joins with pandas DataFrames

Photo by Anders Jildén on Unsplash
Photo by Anders Jildén on Unsplash

Introduction

Usually, we may have to merge together pandas DataFrames in order to build a new DataFrame containing columns and rows from the involved parties, based on some logic that will eventually serve the purpose of the task we are working on.

In today’s article we will showcase how to merge pandas DataFrames together and perform LEFT, RIGHT, INNER, OUTER, FULL and ANTI joins. Pandas merging is the equivalent of joins in SQL and we will take an SQL-flavoured approach to explain merging as this will help even new-comers follow along. More specifically, we will showcase how to perform

  • LEFT JOIN (aka LEFT OUTER JOIN)
  • RIGHT JOIN (aka RIGHT OUTER JOIN)
  • INNER JOIN
  • FULL OUTER JOIN
  • LEFT ANTI-JOIN (aka LEFT-EXCLUDING JOIN)
  • RIGHT ANTI-JOIN (aka RIGHT-EXCLUDING JOIN)
  • FULL ANTI-JOIN

Apart from the different join/merge types, in the sections below we will also cover how to

  • merge on a single column (with the same name on both dfs)
  • merge on multiple columns
  • merge on columns with different name
  • rename mutual column names used in the join
  • select only some columns from the DataFrames involved in the join

First, let’s create a couple of DataFrames that will be using throughout this tutorial in order to demonstrate the various join types we will be discussing today.

import pandas as pd
df1 = pd.DataFrame(
    [
       (1, 345, 'B', True),
       (2, 100, 'C', False),
       (3, 300, 'B', False),
       (4, 151, 'A', False),
       (5, 212, 'A', True),
       (6, 121, 'C', False),
       (7, 333, 'B', True),
       (8, 456, 'C', True),
    ],
    columns=['id', 'value', 'colC', 'colD']
)
df2 = pd.DataFrame(
    [
       (1, 111, 10.1, 3),
       (9, 56, 3.33, 10),
       (10, 17, 18.0, 8),
       (3, 567, 19.1, 4),
       (11, 98, 2.1, 1),
       (6, 31, 3.14, 12),
    ],
    columns=['id', 'value', 'colE', 'colF']
)
print(df1)
   id  value colC   colD
0   1    345    B   True
1   2    100    C  False
2   3    300    B  False
3   4    151    A  False
4   5    212    A   True
5   6    121    C  False
6   7    333    B   True
7   8    456    C   True
print(df2)
   id  value   colE  colF
0   1    111  10.10     3
1   9     56   3.33    10
2  10     17  18.00     8
3   3    567  19.10     4
4  11     98   2.10     1
5   6     31   3.14    12

INNER JOIN

An INNER JOIN between two pandas DataFrames will result into a set of records that have a mutual value in the specified joining column(s).

INNER JOIN: Use intersection of keys from both frames

Selected records when an INNER JOIN is performed over two DataFrames - Source: Author
Selected records when an INNER JOIN is performed over two DataFrames – Source: Author

In order to perform an inner join between two DataFrames using a single column, all we need is to provide the on argument when calling merge().

df1.merge(df2, on='id')

Note that by default, the [merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) method performs an inner join (how='inner') and thus you don’t have to specify the join type explicitly. The output will contain all the records that have a mutual id in both df1 and df2:

   id  value_x colC   colD  value_y   colE  colF
0   1      345    B   True      111  10.10     3
1   3      300    B  False      567  19.10     4
2   6      121    C  False       31   3.14    12

LEFT JOIN (aka LEFT OUTER JOIN)

The LEFT JOIN (or LEFT OUTER JOIN) will take all the records from the left DataFrame along with records from the right DataFrame that have matching values with the left one, over the specified joining column(s).

Any missing value from the records of the right DataFrame that are included in the result, will be replaced with NaN.

LEFT OUTER JOIN: Use keys from the left frame only

Selected records when a LEFT JOIN is performed over two DataFrames - Source: Author
Selected records when a LEFT JOIN is performed over two DataFrames – Source: Author

To perform a left join between two pandas DataFrames, you now to specify how='left' when calling merge().

df1.merge(df2, on='id', how='left')

As mentioned, the resulting DataFrame will contain every record from the left DataFrame along with the corresponding values from the right DataFrame for these records that match the joining column. The remaining column values of the result for these records that didn’t match with a record from the right DataFrame will be replaced by NaNs.

   id  value_x colC   colD  value_y   colE  colF
0   1      345    B   True    111.0  10.10   3.0
1   2      100    C  False      NaN    NaN   NaN
2   3      300    B  False    567.0  19.10   4.0
3   4      151    A  False      NaN    NaN   NaN
4   5      212    A   True      NaN    NaN   NaN
5   6      121    C  False     31.0   3.14  12.0
6   7      333    B   True      NaN    NaN   NaN
7   8      456    C   True      NaN    NaN   NaN

RIGHT JOIN (aka RIGHT OUTER JOIN)

The RIGHT JOIN(or RIGHT OUTER JOIN) will take all the records from the right DataFrame along with records from the left DataFrame that have matching values with the right one, over the specified joining column(s).

Any missing value from the records of the left DataFrame that are included in the result, will be replaced with NaN.

RIGHT OUTER JOIN: Use keys from the right frame only

Selected records when a RIGHT JOIN is performed over two DataFrames - Source: Author
Selected records when a RIGHT JOIN is performed over two DataFrames – Source: Author

To perform a left join between two pandas DataFrames, you now to specify how='right' when calling merge().

df1.merge(df2, on='id', how='right')

The result of a right join between df1 and df2 DataFrames is shown below.

   id  value_x colC   colD  value_y   colE  colF
0   1    345.0    B   True      111  10.10     3
1   9      NaN  NaN    NaN       56   3.33    10
2  10      NaN  NaN    NaN       17  18.00     8
3   3    300.0    B  False      567  19.10     4
4  11      NaN  NaN    NaN       98   2.10     1
5   6    121.0    C  False       31   3.14    12

FULL OUTER JOIN

The FULL OUTER JOIN will essentially include all the records from both the left and right DataFrame. This type of join will uses the keys from both frames – for any missing rows, NaN values will be inserted.

FULL OUTER JOIN: Use union of keys from both frames

Selected records when a FULL OUTER JOIN is performed over two DataFrames - Source: Author
Selected records when a FULL OUTER JOIN is performed over two DataFrames – Source: Author

To perform a full outer join between two pandas DataFrames, you now to specify how='outer' when calling merge().

df1.merge(df2, on='id', how='outer')

The output of a full outer join using our two example frames is shown below.

    id  value_x colC   colD  value_y   colE  colF
0    1    345.0    B   True    111.0  10.10   3.0
1    2    100.0    C  False      NaN    NaN   NaN
2    3    300.0    B  False    567.0  19.10   4.0
3    4    151.0    A  False      NaN    NaN   NaN
4    5    212.0    A   True      NaN    NaN   NaN
5    6    121.0    C  False     31.0   3.14  12.0
6    7    333.0    B   True      NaN    NaN   NaN
7    8    456.0    C   True      NaN    NaN   NaN
8    9      NaN  NaN    NaN     56.0   3.33  10.0
9   10      NaN  NaN    NaN     17.0  18.00   8.0
10  11      NaN  NaN    NaN     98.0   2.10   1.0

LEFT ANTI-JOIN (aka LEFT-EXCLUDING JOIN)

A LEFT ANTI-JOIN will contain all the records of the left frame whose keys don’t appear in the right frame.

LEFT ANTI-JOIN: Use only keys from the left frame that don’t appear in the right frame

Selected records when a LEFT ANTI-JOIN is performed over two DataFrames - Source: Author
Selected records when a LEFT ANTI-JOIN is performed over two DataFrames – Source: Author

A left anti-join in pandas can be performed in two steps. In the first step, we need to perform a LEFT OUTER JOIN with indicator=True:

indicator bool or str, default False

If True, adds a column to the output DataFrame called '_merge' with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of 'left_only' for observations whose merge key only appears in the left DataFrame, 'right_only' for observations whose merge key only appears in the right DataFrame, and 'both' if the observation’s merge key is found in both DataFrames.

  • Python Docs
df1.merge(df2, on='id', how='left', indicator=True)
   id  value_x colC   colD  value_y   colE  colF     _merge
0   1      345    B   True    111.0  10.10   3.0       both
1   2      100    C  False      NaN    NaN   NaN  left_only
2   3      300    B  False    567.0  19.10   4.0       both
3   4      151    A  False      NaN    NaN   NaN  left_only
4   5      212    A   True      NaN    NaN   NaN  left_only
5   6      121    C  False     31.0   3.14  12.0       both
6   7      333    B   True      NaN    NaN   NaN  left_only
7   8      456    C   True      NaN    NaN   NaN  left_only

In the second step, we simply need to query() the result from the previous expression in order to keep only rows coming from the left frame only, and filter out those that also appear in the right frame.

If we combine both steps together, the resulting expression will be

df1.merge(df2, on='id', how='left', indicator=True) 
    .query('_merge == "left_only"') 
    .drop('_merge', 1)

And the corresponding result

   id  value_x colC   colD  value_y  colE  colF
1   2      100    C  False      NaN   NaN   NaN
3   4      151    A  False      NaN   NaN   NaN
4   5      212    A   True      NaN   NaN   NaN
6   7      333    B   True      NaN   NaN   NaN
7   8      456    C   True      NaN   NaN   NaN

RIGHT ANTI-JOIN (aka RIGHT-EXCLUDING JOIN)

Similarly, a RIGHT ANTI-JOIN will contain all the records of the right frame whose keys don’t appear in the left frame.

RIGHT ANTI-JOIN: Use only keys from the right frame that don’t appear in the left frame

Selected records when a RIGHT ANTI-JOIN is performed over two DataFrames - Source: Author
Selected records when a RIGHT ANTI-JOIN is performed over two DataFrames – Source: Author

A right anti-join in pandas can be performed in two steps. In the first step, we need to perform a Right Outer Join with indicator=True:

indicator bool or str, default False

If True, adds a column to the output DataFrame called _'_merge' with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of 'left_only' for observations whose merge key only appears in the left DataFrame, 'right_only' for observations whose merge key only appears in the right DataFrame, and 'both'_ if the observation’s merge key is found in both DataFrames.

  • Python Docs
df1.merge(df2, on='id', how='right', indicator=True)
   id  value_x colC   colD  value_y   colE  colF      _merge
0   1    345.0    B   True      111  10.10     3        both
1   9      NaN  NaN    NaN       56   3.33    10  right_only
2  10      NaN  NaN    NaN       17  18.00     8  right_only
3   3    300.0    B  False      567  19.10     4        both
4  11      NaN  NaN    NaN       98   2.10     1  right_only
5   6    121.0    C  False       31   3.14    12        both

In the second step, we simply need to query() the result from the previous expression in order to keep only rows coming from the right frame only, and filter out those that also appear in the left frame.

If we combine both steps together, the resulting expression will be

df1.merge(df2, on='id', how='right', indicator=True) 
  .query('_merge == "right_only"') 
  .drop('_merge', 1)

And the resulting frame using our example DataFrames will be

   id  value_x colC colD  value_y   colE  colF
1   9      NaN  NaN  NaN       56   3.33    10
2  10      NaN  NaN  NaN       17  18.00     8
4  11      NaN  NaN  NaN       98   2.10     1

FULL ANTI-JOIN

A FULL ANTI-JOIN will contain all the records from both the left and right frames that don’t have any common keys.

FULL ANTI-JOIN: Take the symmetric difference of the keys of both frames

Selected records when a FULL ANTI-JOIN is performed over two DataFrames - Source: Author
Selected records when a FULL ANTI-JOIN is performed over two DataFrames – Source: Author

Again, this can be performed in two steps like the two previous anti-join types we discussed.

df1.merge(df2, on='id', how='outer', indicator=True) 
  .query('_merge != "both"') 
  .drop('_merge', 1)

And the result using our example frames is shown below.

    id  value_x colC   colD  value_y   colE  colF
1    2    100.0    C  False      NaN    NaN   NaN
3    4    151.0    A  False      NaN    NaN   NaN
4    5    212.0    A   True      NaN    NaN   NaN
6    7    333.0    B   True      NaN    NaN   NaN
7    8    456.0    C   True      NaN    NaN   NaN
8    9      NaN  NaN    NaN     56.0   3.33  10.0
9   10      NaN  NaN    NaN     17.0  18.00   8.0
10  11      NaN  NaN    NaN     98.0   2.10   1.0

Changing the suffix of columns with the same name

Also note how the column(s) with the same name are automatically renamed using the _x and _y suffices respectively. You can change the default values by providing the suffixes argument with the desired values. For example,

df1.merge(df2, on='id', suffixes=('_df1', '_df2'))

Now every column from the left and right DataFrames that were involved in the join, will have the specified suffix.

   id  value_df1 colC   colD  value_df2   colE  colF
0   1        345    B   True        111  10.10     3
1   3        300    B  False        567  19.10     4
2   6        121    C  False         31   3.14    12

Merging on different column names

Now let’s consider another use-case, where the columns that we want to merge two pandas DataFrames don’t have the same name. In this case, instead of providing the on argument, we have to provide left_on and right_on arguments to specify the columns of the left and right DataFrames to be considered when merging them together.

As an example, let’s suppose we want to merge df1 and df2 based on the id and colF columns respectively. The following command will do the trick:

df1.merge(df2, left_on='id', right_on='colF')

And the resulting DataFrame will look as below

   id_x  value_x colC   colD  id_y  value_y  colE  colF
0     1      345    B   True    11       98   2.1     1
1     3      300    B  False     1      111  10.1     3
2     4      151    A  False     3      567  19.1     4
3     8      456    C   True    10       17  18.0     8

Merging on multiple columns

If you want to merge on multiple columns, you can simply pass all the desired columns into the on argument as a list:

df1.merge(df2, on=['colA', 'colB', ..])

If the columns in the left and right frame have different names then once again, you can make use of right_on and left_on arguments:

df1.merge(df2, left_on=['colA', 'colB'], right_on=['colC', 'colD])

Selecting only some of the columns from the DataFrames involved

Now let’s say that we want to merge together frames df1 and df2 using a left outer join, select all the columns from df1 but only column colE from df2.

In order to do so, you can simply use a subset of df2 columns when passing the frame into the merge() method.

df1.merge(df2[['id', 'colE']], on='id')

This is going to exclude all columns but colE from the right frame:

   id  value colC   colD   colE
0   1    345    B   True  10.10
1   3    300    B  False  19.10
2   6    121    C  False   3.14

Final Thoughts

In this tutorial we discussed about merging pandas DataFrames and how to perform LEFT OUTER, RIGHT OUTER, INNER, FULL OUTER, LEFT ANTI, RIGHT ANTI and FULL ANTI joins.

Additionally, we also discussed a few other use cases including how to join on columns with a different name or even on multiple columns. Furthermore, we also showcased how to change the suffix of the column names that are having the same name as well as how to select only a subset of columns from the left or right DataFrame once the merge is performed.

You can use this article as a cheatsheet every time you want to perform some joins between pandas DataFrames so fell free to save this article or create a bookmark on your browser!


Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.

Join Medium with my referral link – Giorgos Myrianthous


Related articles you may also like

apply() vs map() vs applymap() in Pandas


Speeding Up the Conversion Between PySpark and Pandas DataFrames


loc vs iloc in Pandas


Related Articles