
Introduction
Usually, we may have to merge together pandas DataFrames in order to build a new DataFrame containing columns and rows from the involved parties, based on some logic that will eventually serve the purpose of the task we are working on.
In today’s article we will showcase how to merge pandas DataFrames together and perform LEFT
, RIGHT
, INNER
, OUTER
, FULL
and ANTI
joins. Pandas merging is the equivalent of joins in SQL and we will take an SQL-flavoured approach to explain merging as this will help even new-comers follow along. More specifically, we will showcase how to perform
- LEFT JOIN (aka LEFT OUTER JOIN)
- RIGHT JOIN (aka RIGHT OUTER JOIN)
- INNER JOIN
- FULL OUTER JOIN
- LEFT ANTI-JOIN (aka LEFT-EXCLUDING JOIN)
- RIGHT ANTI-JOIN (aka RIGHT-EXCLUDING JOIN)
- FULL ANTI-JOIN
Apart from the different join/merge types, in the sections below we will also cover how to
- merge on a single column (with the same name on both dfs)
- merge on multiple columns
- merge on columns with different name
- rename mutual column names used in the join
- select only some columns from the DataFrames involved in the join
First, let’s create a couple of DataFrames that will be using throughout this tutorial in order to demonstrate the various join types we will be discussing today.
import pandas as pd
df1 = pd.DataFrame(
[
(1, 345, 'B', True),
(2, 100, 'C', False),
(3, 300, 'B', False),
(4, 151, 'A', False),
(5, 212, 'A', True),
(6, 121, 'C', False),
(7, 333, 'B', True),
(8, 456, 'C', True),
],
columns=['id', 'value', 'colC', 'colD']
)
df2 = pd.DataFrame(
[
(1, 111, 10.1, 3),
(9, 56, 3.33, 10),
(10, 17, 18.0, 8),
(3, 567, 19.1, 4),
(11, 98, 2.1, 1),
(6, 31, 3.14, 12),
],
columns=['id', 'value', 'colE', 'colF']
)
print(df1)
id value colC colD
0 1 345 B True
1 2 100 C False
2 3 300 B False
3 4 151 A False
4 5 212 A True
5 6 121 C False
6 7 333 B True
7 8 456 C True
print(df2)
id value colE colF
0 1 111 10.10 3
1 9 56 3.33 10
2 10 17 18.00 8
3 3 567 19.10 4
4 11 98 2.10 1
5 6 31 3.14 12
INNER JOIN
An INNER JOIN between two pandas DataFrames will result into a set of records that have a mutual value in the specified joining column(s).
INNER JOIN: Use intersection of keys from both frames

In order to perform an inner join between two DataFrames using a single column, all we need is to provide the on
argument when calling merge()
.
df1.merge(df2, on='id')
Note that by default, the [merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
method performs an inner join (how='inner'
) and thus you don’t have to specify the join type explicitly. The output will contain all the records that have a mutual id in both df1
and df2
:
id value_x colC colD value_y colE colF
0 1 345 B True 111 10.10 3
1 3 300 B False 567 19.10 4
2 6 121 C False 31 3.14 12
LEFT JOIN (aka LEFT OUTER JOIN)
The LEFT JOIN (or LEFT OUTER JOIN) will take all the records from the left DataFrame along with records from the right DataFrame that have matching values with the left one, over the specified joining column(s).
Any missing value from the records of the right DataFrame that are included in the result, will be replaced with NaN
.
LEFT OUTER JOIN: Use keys from the left frame only

To perform a left join between two pandas DataFrames, you now to specify how='left'
when calling merge()
.
df1.merge(df2, on='id', how='left')
As mentioned, the resulting DataFrame will contain every record from the left DataFrame along with the corresponding values from the right DataFrame for these records that match the joining column. The remaining column values of the result for these records that didn’t match with a record from the right DataFrame will be replaced by NaN
s.
id value_x colC colD value_y colE colF
0 1 345 B True 111.0 10.10 3.0
1 2 100 C False NaN NaN NaN
2 3 300 B False 567.0 19.10 4.0
3 4 151 A False NaN NaN NaN
4 5 212 A True NaN NaN NaN
5 6 121 C False 31.0 3.14 12.0
6 7 333 B True NaN NaN NaN
7 8 456 C True NaN NaN NaN
RIGHT JOIN (aka RIGHT OUTER JOIN)
The RIGHT JOIN(or RIGHT OUTER JOIN) will take all the records from the right DataFrame along with records from the left DataFrame that have matching values with the right one, over the specified joining column(s).
Any missing value from the records of the left DataFrame that are included in the result, will be replaced with NaN
.
RIGHT OUTER JOIN: Use keys from the right frame only

To perform a left join between two pandas DataFrames, you now to specify how='right'
when calling merge()
.
df1.merge(df2, on='id', how='right')
The result of a right join between df1
and df2
DataFrames is shown below.
id value_x colC colD value_y colE colF
0 1 345.0 B True 111 10.10 3
1 9 NaN NaN NaN 56 3.33 10
2 10 NaN NaN NaN 17 18.00 8
3 3 300.0 B False 567 19.10 4
4 11 NaN NaN NaN 98 2.10 1
5 6 121.0 C False 31 3.14 12
FULL OUTER JOIN
The FULL OUTER JOIN will essentially include all the records from both the left and right DataFrame. This type of join will uses the keys from both frames – for any missing rows, NaN
values will be inserted.
FULL OUTER JOIN: Use union of keys from both frames

To perform a full outer join between two pandas DataFrames, you now to specify how='outer'
when calling merge()
.
df1.merge(df2, on='id', how='outer')
The output of a full outer join using our two example frames is shown below.
id value_x colC colD value_y colE colF
0 1 345.0 B True 111.0 10.10 3.0
1 2 100.0 C False NaN NaN NaN
2 3 300.0 B False 567.0 19.10 4.0
3 4 151.0 A False NaN NaN NaN
4 5 212.0 A True NaN NaN NaN
5 6 121.0 C False 31.0 3.14 12.0
6 7 333.0 B True NaN NaN NaN
7 8 456.0 C True NaN NaN NaN
8 9 NaN NaN NaN 56.0 3.33 10.0
9 10 NaN NaN NaN 17.0 18.00 8.0
10 11 NaN NaN NaN 98.0 2.10 1.0
LEFT ANTI-JOIN (aka LEFT-EXCLUDING JOIN)
A LEFT ANTI-JOIN will contain all the records of the left frame whose keys don’t appear in the right frame.
LEFT ANTI-JOIN: Use only keys from the left frame that don’t appear in the right frame

A left anti-join in pandas can be performed in two steps. In the first step, we need to perform a LEFT OUTER JOIN with indicator=True
:
indicator
bool
orstr
, defaultFalse
If
True
, adds a column to the output DataFrame called'_merge'
with information on the source of each row. The column can be given a different name by providing a string argument. The column will have aCategorical
type with the value of'left_only'
for observations whose merge key only appears in the left DataFrame,'right_only'
for observations whose merge key only appears in the right DataFrame, and'both'
if the observation’s merge key is found in both DataFrames.
- Python Docs
df1.merge(df2, on='id', how='left', indicator=True)
id value_x colC colD value_y colE colF _merge
0 1 345 B True 111.0 10.10 3.0 both
1 2 100 C False NaN NaN NaN left_only
2 3 300 B False 567.0 19.10 4.0 both
3 4 151 A False NaN NaN NaN left_only
4 5 212 A True NaN NaN NaN left_only
5 6 121 C False 31.0 3.14 12.0 both
6 7 333 B True NaN NaN NaN left_only
7 8 456 C True NaN NaN NaN left_only
In the second step, we simply need to query()
the result from the previous expression in order to keep only rows coming from the left frame only, and filter out those that also appear in the right frame.
If we combine both steps together, the resulting expression will be
df1.merge(df2, on='id', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1)
And the corresponding result
id value_x colC colD value_y colE colF
1 2 100 C False NaN NaN NaN
3 4 151 A False NaN NaN NaN
4 5 212 A True NaN NaN NaN
6 7 333 B True NaN NaN NaN
7 8 456 C True NaN NaN NaN
RIGHT ANTI-JOIN (aka RIGHT-EXCLUDING JOIN)
Similarly, a RIGHT ANTI-JOIN will contain all the records of the right frame whose keys don’t appear in the left frame.
RIGHT ANTI-JOIN: Use only keys from the right frame that don’t appear in the left frame

A right anti-join in pandas can be performed in two steps. In the first step, we need to perform a Right Outer Join with indicator=True
:
indicator
bool
orstr
, defaultFalse
If
True
, adds a column to the output DataFrame called _'_merge'
with information on the source of each row. The column can be given a different name by providing a string argument. The column will have aCategorical
type with the value of'left_only'
for observations whose merge key only appears in the left DataFrame,'right_only'
for observations whose merge key only appears in the right DataFrame, and'both'
_ if the observation’s merge key is found in both DataFrames.
- Python Docs
df1.merge(df2, on='id', how='right', indicator=True)
id value_x colC colD value_y colE colF _merge
0 1 345.0 B True 111 10.10 3 both
1 9 NaN NaN NaN 56 3.33 10 right_only
2 10 NaN NaN NaN 17 18.00 8 right_only
3 3 300.0 B False 567 19.10 4 both
4 11 NaN NaN NaN 98 2.10 1 right_only
5 6 121.0 C False 31 3.14 12 both
In the second step, we simply need to query()
the result from the previous expression in order to keep only rows coming from the right frame only, and filter out those that also appear in the left frame.
If we combine both steps together, the resulting expression will be
df1.merge(df2, on='id', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1)
And the resulting frame using our example DataFrames will be
id value_x colC colD value_y colE colF
1 9 NaN NaN NaN 56 3.33 10
2 10 NaN NaN NaN 17 18.00 8
4 11 NaN NaN NaN 98 2.10 1
FULL ANTI-JOIN
A FULL ANTI-JOIN will contain all the records from both the left and right frames that don’t have any common keys.
FULL ANTI-JOIN: Take the symmetric difference of the keys of both frames

Again, this can be performed in two steps like the two previous anti-join types we discussed.
df1.merge(df2, on='id', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1)
And the result using our example frames is shown below.
id value_x colC colD value_y colE colF
1 2 100.0 C False NaN NaN NaN
3 4 151.0 A False NaN NaN NaN
4 5 212.0 A True NaN NaN NaN
6 7 333.0 B True NaN NaN NaN
7 8 456.0 C True NaN NaN NaN
8 9 NaN NaN NaN 56.0 3.33 10.0
9 10 NaN NaN NaN 17.0 18.00 8.0
10 11 NaN NaN NaN 98.0 2.10 1.0
Changing the suffix of columns with the same name
Also note how the column(s) with the same name are automatically renamed using the _x
and _y
suffices respectively. You can change the default values by providing the suffixes
argument with the desired values. For example,
df1.merge(df2, on='id', suffixes=('_df1', '_df2'))
Now every column from the left and right DataFrames that were involved in the join, will have the specified suffix.
id value_df1 colC colD value_df2 colE colF
0 1 345 B True 111 10.10 3
1 3 300 B False 567 19.10 4
2 6 121 C False 31 3.14 12
Merging on different column names
Now let’s consider another use-case, where the columns that we want to merge two pandas DataFrames don’t have the same name. In this case, instead of providing the on
argument, we have to provide left_on
and right_on
arguments to specify the columns of the left and right DataFrames to be considered when merging them together.
As an example, let’s suppose we want to merge df1
and df2
based on the id
and colF
columns respectively. The following command will do the trick:
df1.merge(df2, left_on='id', right_on='colF')
And the resulting DataFrame will look as below
id_x value_x colC colD id_y value_y colE colF
0 1 345 B True 11 98 2.1 1
1 3 300 B False 1 111 10.1 3
2 4 151 A False 3 567 19.1 4
3 8 456 C True 10 17 18.0 8
Merging on multiple columns
If you want to merge on multiple columns, you can simply pass all the desired columns into the on
argument as a list:
df1.merge(df2, on=['colA', 'colB', ..])
If the columns in the left and right frame have different names then once again, you can make use of right_on
and left_on
arguments:
df1.merge(df2, left_on=['colA', 'colB'], right_on=['colC', 'colD])
Selecting only some of the columns from the DataFrames involved
Now let’s say that we want to merge together frames df1
and df2
using a left outer join, select all the columns from df1
but only column colE
from df2
.
In order to do so, you can simply use a subset of df2
columns when passing the frame into the merge()
method.
df1.merge(df2[['id', 'colE']], on='id')
This is going to exclude all columns but colE
from the right frame:
id value colC colD colE
0 1 345 B True 10.10
1 3 300 B False 19.10
2 6 121 C False 3.14
Final Thoughts
In this tutorial we discussed about merging pandas DataFrames and how to perform LEFT OUTER, RIGHT OUTER, INNER, FULL OUTER, LEFT ANTI, RIGHT ANTI and FULL ANTI joins.
Additionally, we also discussed a few other use cases including how to join on columns with a different name or even on multiple columns. Furthermore, we also showcased how to change the suffix of the column names that are having the same name as well as how to select only a subset of columns from the left or right DataFrame once the merge is performed.
You can use this article as a cheatsheet every time you want to perform some joins between pandas DataFrames so fell free to save this article or create a bookmark on your browser!
Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.
Related articles you may also like
Speeding Up the Conversion Between PySpark and Pandas DataFrames