The world’s leading publication for data science, AI, and ML professionals.

Joining Pandas DataFrames

Learn how to merge Pandas Dataframes easily

Photo by CHUTTERSNAP on Unsplash
Photo by CHUTTERSNAP on Unsplash

Very often, your data comes from different sources. In order to help with your analytics, you often need to combine data from different sources so that you can obtain the data you need. In this article, I will talk about how you can merge (join) Pandas Dataframes. Most articles on this topic use simplistic dataframes to illustrate concepts on dataframe joining – inner, outer, left, and right join. For me, a much better way to understand this topic is to use a more realistic example so that you can understand and be able to better retain the concepts.

Let’s get started!

Creating the DataFrames

The first thing to do is to create the two data frames. The first creates a list of flight numbers and the airports they are departing from:

import pandas as pd
df_flights = pd.DataFrame(
    dict(
        AIRPORT_CODE=['MIA','MIA','LAX','DCA','SIN'],
        FLIGHT_NO=['3322','3213','4223','5678','1234']
    )
)
df_flights

The df_flights dataframe looks like this:

The next dataframe contains a list of airport codes and their respective airport names:

df_airports = pd.DataFrame(
    dict(
        AIRPORT_CODE=['MIA','LAX','DCA','HEL','SFO'],
        AIRPORT_NAME=['Miami International Airport',
                      'Los Angeles International Airport',
                      'Ronald Reagan Washington',
                      'Helsinki-Vantaa Airport',
                      'San Francisco International Airport']
    )
)
df_airports

The df_airports dataframe looks like this:

Listing the airport names of all the flights

Suppose you need to get all the departing airport names for each flight. Since the information is in two dataframes, you need to join these two dataframes. In Pandas, you join dataframes using the merge() method. For this requirement, you can perform a ‘left‘ join on the two dataframes based on the AIRPORT_CODE column:

pd.merge(df_flights, df_airports, on='AIRPORT_CODE', how='left')

The ‘on‘ parameter specifies the column to join on. The ‘left‘ refers to the first dataframe – df_flights. The result of the above join function is as follows:

Note that all the rows from the df_flights dataframe (the ‘left‘ join) are in the result. Also observe that as the airport code SIN does not have an entry in the df_airports dataframe, so it has a NaN value for the AIRPORT_NAME column.

The ‘left’ join ensures that the result has all the airport codes that are available in the first dataframe.

Listing all the flights from the various airports

What if you want to list all the flights from all the airports? In this case, you can perform a ‘right‘ join:

pd.merge(df_flights, df_airports, on='AIRPORT_CODE', how='right')

The ‘right‘ refers to the second dataframe – df_airports. The result of the above join function is as follows:

Observe that the result now contains all the airports contained in the df_airports dataframe (the ‘right‘ join). In addition, as there are two flights from the MIA airport, the result would contain two rows for the MIA airport. Also, since the HEL and SFO airports do not have any departing flights in the df_flights dataframe, the FLIGHT_NO column for the result would contain NaNs.

The ‘right’ join ensures that the result has all the airport codes that are available in the second dataframe.

Listing the available airport names of flights

You saw earlier that the ‘left’ join result contains the entry for the flight departing from the SIN airport:

However, you may wish to omit flights that does not have a valid airport code. In this case, you can perform an ‘inner‘ join:

pd.merge(df_flights, df_airports, on='AIRPORT_CODE', how='inner')

The result of the ‘inner’ join would now only contain rows that have airport code in both dataframes:

The ‘inner’ join ensures that the result has airport codes that are available in both the first and second dataframes. The default join for the merge() method is ‘inner’ if you do not specify the ‘how’ parameter.

Listing all the airport names and flights

The opposite of ‘inner‘ join is the ‘outer‘ join, where the values in both dataframes are available in the result.

Using our example, if you want to get all the airport names and flights, you can perform an ‘outer‘ join:

pd.merge(df_flights, df_airports, on='AIRPORT_CODE', how='outer')

The result would now contain all airports, regardless of whether there is a corresponding airport name or whether there is a flight from an airport:

The ‘outer’ join ensures that the result has airport codes that are available in the first, as well as those from the second dataframe.

Joining based on different columns names

So far our joins have been quite straight-forward where two dataframes have the same column names that we want to join on. In real life, it is far more common for dataframes to have different column names.

Assuming that the AIRPORT_CODE column for the df_flights dataframe is now changed to IATA_CODE:

import pandas as pd
df_flights = pd.DataFrame(
    dict(
        IATA_CODE=['MIA','MIA','LAX','DCA','SIN'],
        FLIGHT_NO=['3322','3213','4223','5678','1234']
    )
)
df_flights

To perform the join as we did previously, you now need to explicitly specify the column name of the individual dataframes using the left_on and right_on parameters:

pd.merge(df_flights, df_airports, 
         left_on='IATA_CODE',          # column for df_flights
         right_on='AIRPORT_CODE',      # column for df_airports
         how='left')

The result would now contain the join columns (IATA_CODE and AIRPORT_CODE) from both dataframes:

You can now proceed to drop one of the columns. However, before you do that, observe the IATA_CODE and AIRPORT_CODE columns for the last row— one has a value of ‘SIN‘ and one has NaN. In this case you should drop the AIRPORT_CODE column (if your intention is to show the airport code and airport names for all departing flights):

pd.merge(df_flights, df_airports, 
         left_on='IATA_CODE', 
         right_on='AIRPORT_CODE', 
         how='left').drop(columns='AIRPORT_CODE')

The result would now look like this:

Joining using multiple columns

In addition to joining dataframes with different column names, you can also join dataframes based on multiple columns. A good way to understand this is with an example. Consider the following dataframes:

df_confirmed = pd.DataFrame(
    dict(
        country=['Mainland China','Mainland China',
                 'US','Canada','US'],
        state=['Hunan','Anhui','Seattle, WA',
                 'Toronto, ON','Montana'],
        confirmed=[879,830,1,2,20]
    )
)
df_locations = pd.DataFrame(
    dict(
        country=['Bulgaria','US','Mainland China',
                 'Mainland China','US','Canada'],
        state=['Montana','Montana', 'Hunan','Anhui',
               'Seattle, WA','Toronto, ON'],
        lat=[43.4125, 46.965260 ,27.61041, 31.82571, 
             47.7511, 43.6532],
        lng=[23.225, -109.533691, 111.7088, 117.2264, 
             -120.74, -79.3832]    
    )
)

The df_confirmed dataframe looks like this:

While the df_locations dataframe looks like this:

If you want to find the locations of each state, you might be tempted to join the two dataframes by ‘state’:

pd.merge(df_confirmed, df_locations, on='state', how='left')

However, the result will contain six rows as there are two state of Montana – one in the US and one in Bulgaria:

The correct way to join would be to join based on the country and state columns (supplied as a list):

pd.merge(df_confirmed, df_locations, on=['country','state'],
         how='left')

The result would now be correct:

You can also join using multiple columns using the left_on and right_on parameters.

Conclusion

That’s it! I hope you now have a clearer picture of how joining dataframes work. Here’s a quick summary of the things we have discussed in this topic:

  • The ‘left‘ join returns all rows from the first dataframe
  • The ‘right‘ join returns all rows from the second dataframe
  • The ‘inner‘ join returns all rows that are available in both dataframes
  • The ‘outer‘ join returns all rows that are in the first dataframe, as well as the second dataframe
  • You can join dataframes based on different column names using the left_on and right_on parameters
  • You can also join dataframes using multiple column names

Related Articles