VLOOKUP implementation in Python in three simple steps

Himalaya Bir Shrestha
Towards Data Science
4 min readSep 18, 2021

--

Oftentimes, in the field of data analytics, it’s the data cleaning and processing that takes the most amount of time and effort. While the steps such as data cleaning, filtering, and pre-processing are generally under-evaluated or overlooked as compared to more juicy components such as the model development, it’s the quality of data that determines the quality of output. As it is rightly said, Garbage In = Garbage Out, and vice versa.

Functions such as VLOOKUP, HLOOKUP, XLOOPUP, and INDEX MATCH are highly useful in Excel to find the required data that meet the given conditions from a data table or a range. In this article, I am going to describe a simple way to replicate the VLOOKUP function of Excel in three easy steps in Python. The script used for this article is available in this GitHub repository. Let’s get started.

Image from Unsplash by Jonnelle Yancovic

Data

First, I am going to describe the data we use for this article.

The World Bank Open Data is a great source of free and open access data on global development. One can find hundreds of datasets on topics related to global development ranging from economics, demography, climate change, health, infrastructure, energy, and so on from the databank of the World Bank. Analyzing the time series data of these indicators is crucial to formulate the policies, institutional and regulatory framework, which is apt for a country or a region.

The World Bank data could be downloaded directly from the database in the Excel or CSV format, or it could also be accessed directly from Python using pandas_datareader as shown in the script below:

For this article, I am going to use selective indicators for certain countries for a particular year. I select GDP, GDP per capita, access to electricity, population, and CO2 emissions data from the World Bank database by providing the code for the same indicators which can also be accessed from the website. Next, I select 12 random countries by providing their ISO codes and download the data for 2018. I rename the columns with the indicator names instead of codes and select only countries in the index. The dataframe df looks as follows:

Development indicators selected from World Bank database for 2018

VLOOKUP implementation in Excel

In this section, I am going to describe how the VLOOKUP function works in Excel to get a required subset of data from a data table or range. As shown in the Excel screenshot below, columns A to F represents the same data as in df. In columns I to L, I want to get the CO2 emissions and population data from the first table.

VLOOKUP function implemented in Excel in the table in right (I1:L6) to get the value from table in the left (A1:F13)

In the formula bar denoted by red rectangle in the Excel screenshot, we can see the formula used for cell J2.

=VLOOKUP($I2, $A$1:$F$13,6,FALSE) * 1000

wherein,

$I2 represents the country Bhutan that we want to look up in the first table;

$A$1:$F$13 represents the entire first table where the lookup value and data to be returned are to be found

6 represents the column number (herein, representing the kt CO2 column) from where the corresponding value is to be returned (for Bhutan), and

FALSE represents the exact match.

Three easy steps to implement it in Python

Herein, I describe the implementation of the same function used in Excel above, but in Python in three simple steps:

First, we create a dataframe called df_target where we want to have the desired subset data. Initially, the dataframe comprises the names of the five desired countries. And then we create empty columns for the indicators- CO2 emissions (tonnes) and Population that we want to return from df.

Step 1

The countries column in df_target is common with the country column of df. Therefore, second, we set the countries column in df_target as the index.

Step 2

Third, this is the main step we map the index in df_target against df to get the data for required columns as output. For example, the values in kt CO2 column of df multiplied by 1000 is returned for the CO2 emissions (tonnes) column of df_target. The map() function maps the value of Series according to input correspondence and is used for substituting each value in a Series with another value, that could be extracted from a function, a dictionary or a Series.

Step 3

In this article, we learned an easy technique to implement the VLOOKUP function of Excel in three simple steps in Python. The functions and modules of different packages in Python have enabled to automate several steps, that otherwise would have cost a lot of time and effort, and in this sense, thus have been a real blessing to optimize costs and resources.

--

--

I write about the intersection of data science with sustainability in simple words. Views reflected are of my own, and don’t reflect that of my employer.