Merging enables combination of data from different sources into a unified structure. It’s an essential operation when working with tabular data because it’s not possible or feasible to store all data in a single data table or DataFrame.
Understanding how to effectively merge DataFrames in Pandas is a crucial skill for any data scientist or analyst.
Merging means combining DataFrames based on the values in a shared column or columns.

In this article, we will walk through a comprehensive set of 20 examples that will illuminate the nuances of merging operations. We will begin with basic merge functions and gradually delve into more complex scenarios, covering all the details about merging DataFrames with Pandas.
If you’d like to learn more about Pandas, visit my course 500 Exercises to Master Python Pandas.
The functions we will cover are:
merge
merge_asof
merge_ordered
Let’s start by creating two DataFrames to be used in the examples.
import numpy as np
import pandas as pd
names = pd.DataFrame(
{
"id": [1, 2, 3, 4, 10],
"name": ["Emily", "Jane", "Joe", "Matt", "Lucas"],
"age": np.random.randint(20, 30, size=5)
}
)
scores = pd.DataFrame(
{
"id": np.arange(1, 8),
"score": np.random.randint(80, 100, size=7),
"group": list("ABCAACA")
}
)

Example 1
The basic syntax of the merge
function is as follows. The on
parameter indicates the column or columns to be used when comparing rows.
merged_df = names.merge(scores, on="id")

Example 2 – how parameter
There are different merge types. The how
parameter defines it from one of the following types:
- left: use only keys from left DataFrame
- right: use only keys from right DataFrame
- outer: use union of keys from both DataFrames
- inner: use intersection of keys from both DataFrames
- cross: creates the cartesian product from both DataFrames
The default value of the how
parameter is inner
so in the previous example, the merged DataFrame contains an intersection of keys.
The keys are the values in the column(s) specified with the on
parameter.
Let’s do a left merge.
merged_df = names.merge(scores, on="id", how="left")

The merged DataFrame includes all the keys from the left DataFrame. The non-matching rows are filled with NaN
, the standard missing value representation.
Example 3 – right merge
It is the opposite of the left merge but I would not recommend using the right merge as it can be achieved by changing the order of the DataFrames and using a left merge.
# followings are the same
merged_df = names.merge(scores, on="id", how="left")
merged_df = scores.merge(names, on="id", how="right")
Example 4— outer merge
merged_df = names.merge(scores, on="id", how="outer")

The merged DataFrame includes all the keys from both DataFrames.
Example 5— indicator parameter
The indicator
parameter creates a column in the merged DataFrame that indicates where the key value in rows come from.
- both: key value exists in both DataFrames
- left_only: only left DataFrame
- right_only: only right DataFrame
merged_df = names.merge(scores, on="id", how="outer", indicator=True)

Example 6 – indicator parameter
The indicator
parameter also takes string values as argument, which is used as the name of the column.
merged_df = names.merge(scores, on="id", how="left", indicator="source")

Example 7 – left_on and right_on parameters
If the column(s) used for merging DataFrames have different names, we can use the left_on
and right_on
parameters.
# rename the id column in the scores DataFrame
scores = scores.rename(columns={"id": "id_number"})
merged_df = names.merge(scores, left_on="id", right_on="id_number")

Example 8— merge on multiple columns
We will create two new DataFrames for this example.
products = pd.DataFrame(
{
"pg": ["A", "A", "A", "B", "B", "B"],
"id": [101, 102, 103, 101, 102, 104],
"price": np.random.randint(50, 80, size=6),
"cost": np.random.randint(40, 50, size=6),
"discount": [0.1, 0.1, 0, 0, 0.2, 0]
}
)
sales = pd.DataFrame(
{
"pg": ["A", "A", "A", "B", "B", "B"],
"id": [101, 102, 105, 101, 102, 106],
"sales_qty": np.random.randint(1, 10, size=6),
"discount": [0, 0.1, 0.1, 0.2, 0, 0]
}
)

To merge DataFrames on multiple columns, we write the column names as a Python list.
merged_df = products.merge(sales, on=["pg", "id"])

Example 9— suffix parameter
In the previous example, the merged DataFrame has the columns discount_x
and discount_y
. The x and y suffixes are added to separate the columns that exist in both DataFrames with the same name. The x
is used for the left DataFrame and y
for the right.
We can use customized suffixes to make the output easier to understand.
merged_df = products.merge(sales, on=["pg", "id"], suffixes=["_products", "_sales"])

Example 10 – multiple columns
Just like the on
parameter, the right_on
and left_on
parameters take a list as argument in the case of having different column names.
# rename the id column
sales = sales.rename(columns={"id": "product_id"})
merged_df = products.merge(
sales,
left_on=["pg", "id"],
right_on=["pg", "product_id"],
how="left",
suffixes=["_products", "_sales"]
)

Example 11 – merge on index
We can also merge DataFrames on their index values. We will create two new DataFrames for this example.
df1 = pd.DataFrame(
np.random.randint(0, 10, size=(5, 4)),
columns=list("ABCD")
)
df2 = pd.DataFrame(
np.random.randint(0, 10, size=(5, 4)),
columns=list("EFGH"),
index=[2, 3, 4, 5, 6]
)

As we see in the screenshot above, the DataFrames have different index values. One starts from 0 and the other one starts from 2.
To merge on index, we use the left_index
and right_index
parameters.
merged_df = df1.merge(df2, left_index=True, right_index=True)

Since we used an inner merge, the merged DataFrame includes only the indices that exist in both DataFrames.
Example 12 – how parameter with merging on index
We can use the how
parameter when merging on indices as well.
merged_df = df1.merge(df2, left_index=True, right_index=True, how="left")

Example 13 – merging time-series data
Time-series data might include measurements taken at very short time periods (e.g. at the level of seconds). Therefore, when we merge two DataFrames consisting of time series data, we may encounter measurements off by a second or two.
For such cases, Pandas provide a "smart" way of merging via the merge_asof
function.
Assume we are merging DataFrames A and B. If a row in the left DataFrame does not have a matching row in the right DataFrame, merge_asof
allows for taking a row whose value is close to the value in the left DataFrame.
This is similar to a left-merge except that we match on the nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
- A "forward" search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
- A "nearest" search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
Let’s create two new DataFrames that contain time-series data.
df1 = pd.DataFrame(
{
"time": pd.date_range(start="2022-12-09", periods=7, freq="2S"),
"left_value": np.round(np.random.random(7), 2)
}
)
df2 = pd.DataFrame(
{
"time": pd.date_range(start="2022-12-09", periods=6, freq="3S"),
"right_value": np.round(np.random.random(6), 2)
}
)

Some values in the time column overlap whereas some others differ by seconds.
Let’s see what happens when we merge them.
merged_df = pd.merge_asof(df1, df2, on="time")

The right DataFrame (df2) does not have a value for 00:00:02 so in the merged DataFrame the value at 00:00:00 is used as the right value.
Example 14 – direction parameter
In the previous example, the merge_asof
function looked for the previous value for non-matching rows because the default value of the direction
parameter is "backward".
Let’s change it to "nearest" and see what happens.
merged_df = pd.merge_asof(df1, df2, on="time", direction="nearest")

The right value in the first row is 0.36 because the next value (00:00:03) is closer to the value in the left DataFrame (00:00:02) than the previous value (00:00:00).
Example 15 – tolerance parameter
We can also set a tolerance to be used when checking the previous and next values.
In the following example, direction is forward so the next value is checked for non-matching rows. We also set a tolerance of 1 second so, in order to use the next value, it needs to be off by at most 1 second.
merged_df = pd.merge_asof(
df1,
df2,
on="time",
direction="forward",
tolerance=pd.Timedelta("1s")
)

Take a look at the third and sixth rows in the merged DataFrame. The right value is NaN
because the next value in the right DataFrame is off by 2 seconds for these rows.
- left: 00:00:04, the next value in right: 00:00:06
- left: 00:00:10, the next value in right: 00:00:12
Example 16— allow_exact_matches parameter
We also have the option not to allow exact matches to be in the merged DataFrame. By default, the exact matches exist in the merged DataFrame but this can be changes using the allow_exact_matches
parameter.
merged_df = pd.merge_asof(df1, df2, on="time", allow_exact_matches=False)

The time values in the first rows are the same but the merged DataFrame has a NaN
value in the first row of the right value column because we set the value of the allow_exact_matches
parameter as False.
Example 17 – by parameter
The by
parameter can be used for separating groups when merging data points with the previous or next value.
Let’s add a group column to our DataFrames.
df1["group"] = ["AA"] * 4 + ["BB"] * 3
df2["group"] = ["AA"] * 3 + ["BB"] * 3

Let’s say we want to use merge_asof
but only within groups so the values in a particular group cannot be merged with any value in a different group. To do this, we can use the by
parameter.
merged_df = pd.merge_asof(df1, df2, on="time", by="group")

The right value in the first row for group BB is NaN
. We’re merging based on "backward" direction and the previous value belongs to a different group.
Example 18 – ordered merge
The merge_ordered
function performs a merge for ordered data with optional filling/interpolation. It’s designed for ordered data such as time-series.
It’s easier to understand with an example:
merged_df = pd.merge_ordered(df1, df2)

The rows are ordered by the time column. If one of the DataFrames does not have a particular time value, the columns coming from it are filled with NaN
.
Example 19 – fill_method parameter
When doing an ordered merge with merge_ordered
, we can use the fill_method
parameter to define an interpolation method.
The default value is NaN
and the only other option we can use is "ffill", which means forward fill.
merged_df = pd.merge_ordered(df1, df2, fill_method="ffill")

Compare the output with the previous example and you will notice how NaN
values are replaced with the previous value.
Example 20— left_by parameter
We can also do an ordered merge within each group separately. The left_by
parameter groups left DataFrame by group columns and merge piece by piece with right DataFrame.
merged_df = pd.merge_ordered(df1, df2, fill_method="ffill", left_by="group")

Unlike the previous example, the right value in the first row of group "BB" is NaN
because we cannot use values from the other group.
Final words
Through these 20 examples, we’ve explored a variety of merging scenarios, from the simplest to the more complex. With these practical examples, you’re ready to tackle any merging task that comes your way.
You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.
Thank you for reading. Please let me know if you have any feedback.