Intro to Pandas for Excel Super Users
I have been in a serious relationship with Microsoft Excel for many years. It was there for me throughout college and two analyst jobs. I got to know Excel very well; all its curves and edges, its shortcuts and imperfections. Although it wasn’t the most efficient tool out there, it got the job done without requiring me to learn how to code.
Then, about eight months ago, I started graduate school and was introduced to Python. We became instant friends. Along with Python came something called Pandas. I wasn’t sure how I felt about Pandas at first. To be honest, it felt a bit like this:
Analyses that took me 10 seconds in Excel would require 10+ minutes of Google searches and trial and error. Over time, though, Pandas grew on me. It does everything that Excel can do and more, only faster, more flexible, and fully reproducible. I continue to learn something new about Pandas every day, but have yet to reach the level of intimacy that I had with Excel.
In the meantime, I’d like to spread the love by sharing some common Excel functions/tools for analysis that can be done in Pandas. For the following examples, we will be exploring school progress report data from Chicago Public Schools for the 2016–2017 school year. The data contains detailed information at the school level, with one school per row.
Excel: Data is represented in an Excel sheet with columns, rows, and cells. We can have different sheets for different data sets.
Pandas: A “table” of data is stored in a DataFrame. We can create a DataFrame from scratch, or more commonly, import the data from a csv file:
import pandas as pd
sy1617 = pd.read_csv('CPS-Progress-Reports_SY1617.csv', index_col='School_ID')
Columns will be labeled as they were in the csv. Rows are labeled with unique identifiers as well, called the “index.” We have selected the School_ID column to serve as the index, since each school ID uniquely identifies a school row. If we do not explicitly state a column to use as the index when loading the csv, Pandas would automatically generate an index that labels the rows starting at 0.
To quickly view the top of a data frame:
Excel: One benefit of Excel is that the data is right in front of us at all times. We use basic point and click commands or keyboard shortcuts to select data.
Pandas: There are a few different ways to access specific rows, columns, and cells.
To access an individual column, use square brackets. The output is a Series object, which is a one dimensional array:
To access multiple columns, specify a list of column names. The output is now a DataFrame:
We can access a row by its index label:
We can also specify the row number instead:
To access a single cell, simply subset by row and column:
Basic Summary Statistics
Excel: Apply count, average, median, percentile, etc. functions across desired columns or rows.
Pandas: The describe method can display summary statistics for selected columns. The types of statistics outputted will depend on the data types of the columns.
Text columns will output count, unique, most common, and frequency of the most common:
Numerical columns will output count, mean, std, min, max, and lower, 50, and upper percentiles:
Excel: Apply filters to column(s) to subset data by a specific value or by some condition.
Pandas: Subset a DataFrame by some condition. First, we apply a conditional statement to a column and obtain a Series of True/False booleans. We then put those results into square brackets to subset the DataFrame for only rows that meet the condition (i.e. are True).
For example, filter the DataFrame for schools that are of type “Charter”:
is_charter = sy1617['School_Type'] == 'Charter'
We can look for multiple values in a column, such as “Charter” and “Magnet” schools, using the isin method:
charter_magnet = sy1617['School_Type'].isin(['Charter','Magnet'])
Filter for schools with student survey response rates of at least 80%:
gt80 = sy1617['School_Survey_Student_Response_Rate_Pct'] >= 80
We can combine multiple conditions with & and |:
sy1617[is_charter & gt80]
Excel: Sort the data by a certain column or set of columns.
Pandas: Sort the data using the sort_values method. For example, sort alphabetically by primary/middle/high school and school name:
Excel: Pivot tables are my go-to #1 in Excel. The drag and drop functions make it easy to aggregate and filter the data in any way. Here is a sample pivot table that groups by School_Type in the rows and Primary_Category in the columns, and calculates average School_Survey_Student_Response_Rate_Pct within the table.
Pandas: We can produce the same table in Pandas using the pivot_table function. The resulting DataFrame is not as pretty, but it gets the job done!
pd.pivot_table(sy1617, values='School_Survey_Student_Response_Rate_Pct', index='School_Type', columns=['Primary_Category'], aggfunc=np.mean)
Excel: VLOOKUPs are my go-to #2 in Excel. In a nutshell, the VLOOKUP function searches for a specific value in a range of cells, and then returns a value that lies in the same row as where the value is found.
I used to rely heavily on VLOOKUPs to join the relevant columns of one data set with another. For example, suppose we have the school progress report for Chicago Public Schools from the prior year, in a sheet titled “SY1516.” We want to know the SY1516 Student_Attainment_Rating for every school in the SY1617 data set, so that we may analyze the change in rating between the two years.
We could bring in this data by creating a column of VLOOKUPs, using School_ID as the lookup value:
If we wanted to bring in any other columns from the SY1516 sheet, we would need to add an additional VLOOKUP column for each.
Pandas: Joining two data sets is much simpler in Pandas. There are a ton of things we can do with DataFrames, and you can find some great examples of merges, joins, and concatenations here.
For now, let’s use Pandas to replicate the above VLOOKUP example. The merge function allows us to combine the two data sets using their indices (School_ID) as a sort of “lookup value.”
# read in the csv for prior year's data
sy1516 = pd.read_csv('Chicago_Public_Schools_-_School_Progress_Reports_SY1516.csv', index_col='School_ID')
sy1617_short = sy1617[['Student_Attainment_Rating', 'Long_Name']]
sy1516_short = sy1516[['Student_Attainment_Rating']]
pd.merge(sy1516_short, sy1617_short, how='right', left_index=True, right_index=True, suffixes=('_1516','_1617'))
Note: how=’right’ ensures that we are capturing all the school IDs from the second/right (SY1617) DataFrame; this is similar to a right outer join in SQL.
Alternatively, since School_ID is the index for both DataFrames, we can use the join method to achieve the same results:
sy1617_short.join(sy1516_short, how='left', lsuffix='_1617', rsuffix='_1516')
Using merges and joins, we can easily pull in as many columns from the previous year’s data as necessary.
These are just a few of the many ways that Pandas can perform Excel functionality, and much more. I hope this inspires you to consider straying from Excel and giving Pandas a chance!