The world’s leading publication for data science, AI, and ML professionals.

20 Examples to Master Merging DataFrames in Python Pandas

A comprehensive practical guide.

Merging enables combination of data from different sources into a unified structure. It’s an essential operation when working with tabular data because it’s not possible or feasible to store all data in a single data table or DataFrame.

Understanding how to effectively merge DataFrames in Pandas is a crucial skill for any data scientist or analyst.

Merging means combining DataFrames based on the values in a shared column or columns.

merging DataFrames (image by author)
merging DataFrames (image by author)

In this article, we will walk through a comprehensive set of 20 examples that will illuminate the nuances of merging operations. We will begin with basic merge functions and gradually delve into more complex scenarios, covering all the details about merging DataFrames with Pandas.


If you’d like to learn more about Pandas, visit my course 500 Exercises to Master Python Pandas.


The functions we will cover are:

  • merge
  • merge_asof
  • merge_ordered

Let’s start by creating two DataFrames to be used in the examples.

import numpy as np
import pandas as pd

names = pd.DataFrame(

    {
        "id": [1, 2, 3, 4, 10],
        "name": ["Emily", "Jane", "Joe", "Matt", "Lucas"],
        "age": np.random.randint(20, 30, size=5)
    }

)

scores = pd.DataFrame(

    {
        "id": np.arange(1, 8),
        "score": np.random.randint(80, 100, size=7),
        "group": list("ABCAACA")
    }

)
names and scores DataFrames (image by author)
names and scores DataFrames (image by author)

Example 1

The basic syntax of the merge function is as follows. The on parameter indicates the column or columns to be used when comparing rows.

merged_df = names.merge(scores, on="id")
(image by author)
(image by author)

Example 2 – how parameter

There are different merge types. The how parameter defines it from one of the following types:

  • left: use only keys from left DataFrame
  • right: use only keys from right DataFrame
  • outer: use union of keys from both DataFrames
  • inner: use intersection of keys from both DataFrames
  • cross: creates the cartesian product from both DataFrames

The default value of the how parameter is inner so in the previous example, the merged DataFrame contains an intersection of keys.

The keys are the values in the column(s) specified with the on parameter.

Let’s do a left merge.

merged_df = names.merge(scores, on="id", how="left")
(image by author)
(image by author)

The merged DataFrame includes all the keys from the left DataFrame. The non-matching rows are filled with NaN , the standard missing value representation.


Example 3 – right merge

It is the opposite of the left merge but I would not recommend using the right merge as it can be achieved by changing the order of the DataFrames and using a left merge.

# followings are the same
merged_df = names.merge(scores, on="id", how="left")
merged_df = scores.merge(names, on="id", how="right")

Example 4— outer merge

merged_df = names.merge(scores, on="id", how="outer")
(image by author)
(image by author)

The merged DataFrame includes all the keys from both DataFrames.


Example 5— indicator parameter

The indicator parameter creates a column in the merged DataFrame that indicates where the key value in rows come from.

  • both: key value exists in both DataFrames
  • left_only: only left DataFrame
  • right_only: only right DataFrame
merged_df = names.merge(scores, on="id", how="outer", indicator=True)
(image by author)
(image by author)

Example 6 – indicator parameter

The indicator parameter also takes string values as argument, which is used as the name of the column.

merged_df = names.merge(scores, on="id", how="left", indicator="source")
(image by author)
(image by author)

Example 7 – left_on and right_on parameters

If the column(s) used for merging DataFrames have different names, we can use the left_on and right_on parameters.

# rename the id column in the scores DataFrame
scores = scores.rename(columns={"id": "id_number"})

merged_df = names.merge(scores, left_on="id", right_on="id_number")
(image by author)
(image by author)

Example 8— merge on multiple columns

We will create two new DataFrames for this example.

products = pd.DataFrame(

    {
        "pg": ["A", "A", "A", "B", "B", "B"],
        "id": [101, 102, 103, 101, 102, 104],
        "price": np.random.randint(50, 80, size=6),
        "cost": np.random.randint(40, 50, size=6),
        "discount": [0.1, 0.1, 0, 0, 0.2, 0]
    }

)

sales = pd.DataFrame(

    {
        "pg": ["A", "A", "A", "B", "B", "B"],
        "id": [101, 102, 105, 101, 102, 106],
        "sales_qty": np.random.randint(1, 10, size=6),
        "discount": [0, 0.1, 0.1, 0.2, 0, 0]
    }

)
(image by author)
(image by author)

To merge DataFrames on multiple columns, we write the column names as a Python list.

merged_df = products.merge(sales, on=["pg", "id"])
(image by author)
(image by author)

Example 9— suffix parameter

In the previous example, the merged DataFrame has the columns discount_x and discount_y . The x and y suffixes are added to separate the columns that exist in both DataFrames with the same name. The x is used for the left DataFrame and y for the right.

We can use customized suffixes to make the output easier to understand.

merged_df = products.merge(sales, on=["pg", "id"], suffixes=["_products", "_sales"])
(image by author)
(image by author)

Example 10 – multiple columns

Just like the on parameter, the right_on and left_on parameters take a list as argument in the case of having different column names.

# rename the id column
sales = sales.rename(columns={"id": "product_id"})

merged_df = products.merge(
    sales, 
    left_on=["pg", "id"], 
    right_on=["pg", "product_id"],
    how="left",
    suffixes=["_products", "_sales"]
)
(image by author)
(image by author)

Example 11 – merge on index

We can also merge DataFrames on their index values. We will create two new DataFrames for this example.

df1 = pd.DataFrame(
    np.random.randint(0, 10, size=(5, 4)),
    columns=list("ABCD")
)

df2 = pd.DataFrame(
    np.random.randint(0, 10, size=(5, 4)),
    columns=list("EFGH"),
    index=[2, 3, 4, 5, 6]
)
(image by author)
(image by author)

As we see in the screenshot above, the DataFrames have different index values. One starts from 0 and the other one starts from 2.

To merge on index, we use the left_index and right_index parameters.

merged_df = df1.merge(df2, left_index=True, right_index=True)
(image by author)
(image by author)

Since we used an inner merge, the merged DataFrame includes only the indices that exist in both DataFrames.


Example 12 – how parameter with merging on index

We can use the how parameter when merging on indices as well.

merged_df = df1.merge(df2, left_index=True, right_index=True, how="left")
(image by author)
(image by author)

Example 13 – merging time-series data

Time-series data might include measurements taken at very short time periods (e.g. at the level of seconds). Therefore, when we merge two DataFrames consisting of time series data, we may encounter measurements off by a second or two.

For such cases, Pandas provide a "smart" way of merging via the merge_asof function.

Assume we are merging DataFrames A and B. If a row in the left DataFrame does not have a matching row in the right DataFrame, merge_asof allows for taking a row whose value is close to the value in the left DataFrame.

This is similar to a left-merge except that we match on the nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

  • A "backward" search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
  • A "forward" search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
  • A "nearest" search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

Let’s create two new DataFrames that contain time-series data.

df1 = pd.DataFrame(

    {
        "time": pd.date_range(start="2022-12-09", periods=7, freq="2S"),
        "left_value": np.round(np.random.random(7), 2)
    }

)

df2 = pd.DataFrame(

    {
        "time": pd.date_range(start="2022-12-09", periods=6, freq="3S"),
        "right_value": np.round(np.random.random(6), 2)
    }

)
(image by author)
(image by author)

Some values in the time column overlap whereas some others differ by seconds.

Let’s see what happens when we merge them.

merged_df = pd.merge_asof(df1, df2, on="time")
(image by author)
(image by author)

The right DataFrame (df2) does not have a value for 00:00:02 so in the merged DataFrame the value at 00:00:00 is used as the right value.


Example 14 – direction parameter

In the previous example, the merge_asof function looked for the previous value for non-matching rows because the default value of the direction parameter is "backward".

Let’s change it to "nearest" and see what happens.

merged_df = pd.merge_asof(df1, df2, on="time", direction="nearest")
(image by author)
(image by author)

The right value in the first row is 0.36 because the next value (00:00:03) is closer to the value in the left DataFrame (00:00:02) than the previous value (00:00:00).


Example 15 – tolerance parameter

We can also set a tolerance to be used when checking the previous and next values.

In the following example, direction is forward so the next value is checked for non-matching rows. We also set a tolerance of 1 second so, in order to use the next value, it needs to be off by at most 1 second.

merged_df = pd.merge_asof(
    df1, 
    df2, 
    on="time", 
    direction="forward", 
    tolerance=pd.Timedelta("1s")
)
(image by author)
(image by author)

Take a look at the third and sixth rows in the merged DataFrame. The right value is NaN because the next value in the right DataFrame is off by 2 seconds for these rows.

  • left: 00:00:04, the next value in right: 00:00:06
  • left: 00:00:10, the next value in right: 00:00:12

Example 16— allow_exact_matches parameter

We also have the option not to allow exact matches to be in the merged DataFrame. By default, the exact matches exist in the merged DataFrame but this can be changes using the allow_exact_matches parameter.

merged_df = pd.merge_asof(df1, df2, on="time", allow_exact_matches=False)
(image by author)
(image by author)

The time values in the first rows are the same but the merged DataFrame has a NaN value in the first row of the right value column because we set the value of the allow_exact_matches parameter as False.


Example 17 – by parameter

The by parameter can be used for separating groups when merging data points with the previous or next value.

Let’s add a group column to our DataFrames.

df1["group"] = ["AA"] * 4 + ["BB"] * 3

df2["group"] = ["AA"] * 3 + ["BB"] * 3
(image by author)
(image by author)

Let’s say we want to use merge_asof but only within groups so the values in a particular group cannot be merged with any value in a different group. To do this, we can use the by parameter.

merged_df = pd.merge_asof(df1, df2, on="time", by="group")
(image by author)
(image by author)

The right value in the first row for group BB is NaN . We’re merging based on "backward" direction and the previous value belongs to a different group.


Example 18 – ordered merge

The merge_ordered function performs a merge for ordered data with optional filling/interpolation. It’s designed for ordered data such as time-series.

It’s easier to understand with an example:

merged_df = pd.merge_ordered(df1, df2)
(image by author)
(image by author)

The rows are ordered by the time column. If one of the DataFrames does not have a particular time value, the columns coming from it are filled with NaN .


Example 19 – fill_method parameter

When doing an ordered merge with merge_ordered , we can use the fill_method parameter to define an interpolation method.

The default value is NaN and the only other option we can use is "ffill", which means forward fill.

merged_df = pd.merge_ordered(df1, df2, fill_method="ffill")
(image by author)
(image by author)

Compare the output with the previous example and you will notice how NaN values are replaced with the previous value.


Example 20— left_by parameter

We can also do an ordered merge within each group separately. The left_by parameter groups left DataFrame by group columns and merge piece by piece with right DataFrame.

merged_df = pd.merge_ordered(df1, df2, fill_method="ffill", left_by="group")
(image by author)
(image by author)

Unlike the previous example, the right value in the first row of group "BB" is NaN because we cannot use values from the other group.


Final words

Through these 20 examples, we’ve explored a variety of merging scenarios, from the simplest to the more complex. With these practical examples, you’re ready to tackle any merging task that comes your way.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.


Related Articles