The world’s leading publication for data science, AI, and ML professionals.

Analysis of US Elections 2020 with Pandas

A practical pandas guide with many examples

Photo by Markus Winkler on Unsplash
Photo by Markus Winkler on Unsplash

The US elections have been followed closely by many people all around the world. After a period of vagueness, Joe Biden has been announced as the next president of the US.

I have come across a dataset about the 2020 US elections on Kaggle and decided to write this article. It will serve as a practical pandas guide and also provide more insight into the elections.

I’m far away from politics so the insights will be solely based on the numbers.

Let’s start by importing the libraries and reading the dataset into a pandas dataframe.

import numpy as np
import pandas as pd
elections = pd.read_csv("/home/soner/Downloads/datasets/elections.csv")
elections.head()
elections dataframe (image by author)
elections dataframe (image by author)

For each state and county, the dataset contains the number of votes for all candidates. The last column indicates who won the election in a particular county.

I wonder how many counties exist in the US. There are many different ways to check it. Here are two common ways:

elections.county.nunique()
3007
elections.county.value_counts().size
3007

Number of counties won by each candidate

Let’s check how many counties won by each candidate. I will show you two different ways to accomplish this task.

The first way is to use the groupby function and count the number of "True" values for each candidate.

winner = elections[['candidate','won','state']]
.groupby(['candidate','won'], as_index=False).count()
winner = winner[winner.won == True]
winner.rename(columns={'state':'won_county'}, inplace=True)
winner
(image by author)
(image by author)

We first select the "candidate" and "won" columns along with a random column (e.g. "state") and group them by the "candidate" and "won" columns. Then, we count the rows that belong to each group.

The next line filters the rows where the value of won column is True. As you can see in the resulting dataframe, only Donald Trump and Joe Biden won counties.

The second way is simpler.

elections[['candidate','won']].groupby('candidate')
.sum().sort_values(by='won', ascending=False)[:2]
(image by author)
(image by author)

We take the "candidate" and "won" columns and group by the "candidate" column. When we apply the sum function, the result is the number of won counties for each candidate (True is 1, False is 0).


The correct number of counties in the US

You may have noticed the number of counties we calculated at the beginning (3007) is much less than the total number of counties won by Donald Trump or Joe Biden. They must be equal.

What comes to mind is that there are counties with the same name in different states. To find out, we can count the number of unique county names in each state and sum them up. The result must be 4633 (3125 + 1508).

elections[['state','county']].groupby('state').nunique().sum()
county 4633

The result is 4633 as we expected.


Top 10 candidates based on the number of votes

We can easily find the number of total votes for candidate by using the groupby and sum functions.

elections[['candidate','total_votes']].groupby('candidate')
.sum().sort_values(by='total_votes', ascending=False)[:10]
Top 10 candidates (image by author)
Top 10 candidates (image by author)

The difference between Joe Biden and Donald Trump

Let’s find out the difference in the number of votes between Joe Biden and Donald Trump for each state.

We first need to calculate the total number of votes in each state for both candidates.

states = elections[['state','candidate','total_votes']]
.groupby(['state','candidate'], as_index=False).sum()
states = states[states.candidate.isin(['Donald Trump',
'Joe Biden'])].reset_index(drop=True)
states.head()
States dataframe (image by author)
States dataframe (image by author)

The states dataframe contains the total number of votes in each state for Donald Trump and Joe Biden.

We apply the groupby function to the states dataframe and use a lambda expression to calculate the difference.

states[['state','total_votes']].groupby('state')
.apply(lambda x: x.max() - x.min())
.sort_values(by='total_votes', ascending=False)[:10]
The top 10 states based on the difference in total votes (image by author)
The top 10 states based on the difference in total votes (image by author)

After calculating the difference, we use the sort values function and indexing to get the top 10 states based on the difference in total votes.

The biggest difference was observed in California.


The counties that Joe Biden and Donald Trump got the most votes

We can filter the elections dataframe using the candidate name and then sort values based on the total votes.

Here are the top 5 counties that Joe Biden got the most votes.

elections[elections.candidate == 'Joe Biden']
.sort_values(by='total_votes', ascending=False)[:5]
(image by author)
(image by author)

Similarly, for Donald Trump:

elections[elections.candidate == 'JDonald Trump']
.sort_values(by='total_votes', ascending=False)[:5]
(image by author)
(image by author)

Donald Trump lost all 5 counties where he got his most votes. On the other hand, Joe Biden all 5 of them. This is an indication that Joe Biden won relatively in more populated counties.


Counties in which candidates have less than 50000 votes

In the previous example, we made an educated guess that indicates Joe Biden won in relatively more populated counties.

Let’s do an example on the smaller counties. We will filter the dataframe based on the total votes of candidates in a county.

There are many options to filter a dataframe. One of the simpler ones is the query function that allows passing the condition as a string.

Let’s first find the counties where the total number of votes is less than 100000.

small_counties = elections[['county','total_votes']]
.groupby('county', as_index=False)
.sum().query('total_votes < 100000')['county']

We can now use the small_counties series to filter the elections dataframe. The isin method can be used to filter the dataframe based on the values in a series or list.

elections[elections.county.isin(small_counties)][['candidate','won']].groupby('candidate').sum()
.sort_values(by='won', ascending=False)[:2]
(image by author)
(image by author)

Donald Trump has won much more counties than Joe Biden where the total number of votes in the county is less than 100000.


We have covered some useful functions and methods of Pandas. There is, of course, much more you can do with Pandas. It provides numerous functions to enhance and expedite the Data Analysis and manipulation process.

One of things I really like about Pandas is that there are almost always more than one way to accomplish a given task. It proves the flexibility of Pandas.

The best way to comprehend all these functions and methods is through practice. Thus, I suggest to practice as much as you can.

Thank you for reading. Please let me know if you have any feedback.


Related Articles