Political Polling Data from Wikipedia with Python

Using Requests, pandas, and regular expressions in Python to get and clean Irish political polling data from Wikipedia to make it ready for analysis.

Liam Connors
Towards Data Science


Introduction

In the Irish general election of 2011, the parties of the outgoing government (Fianna Fáil and the Green Party) saw a collapse in their vote. Fianna Fáil, who had won 77 seats at the previous general election in 2007, ended up with just 20 seats. The Green Party lost all six seats they had won in 2007.

I wanted to get polling data to understand how gradual the falls or rises in public support for Ireland's political parties were over that period. Where did each party peak? Did changes in support coincide with particular events of the time, such as the unemployment rate or new party leaders?

Requirements

In this article, I focus on collecting and cleaning the polling data.

We are going to use the Requests library to fetch the data from Wikipedia, and then load it into a dataframe with pandas. The data won't be perfect, so we will need to clean it up before we can use it.

The re module is part of the Python standard library, but if you need to install pandas or Requests, check out the install sections in their docs.
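If you use pip, a typical install looks like the line below. Note that because we will later call read_html() with flavor='bs4', the BeautifulSoup and html5lib parsers need to be installed too; adjust the command to your own environment.

pip install requests pandas beautifulsoup4 html5lib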

I am using Jupyter Notebooks (as part of Anaconda Navigator) in my examples to present the results. For information on how to install this, see the docs here.

Finding the data

The first step in getting the data is finding a source to work with. Wikipedia has great data on elections and politics, and the articles for many Irish general elections include the polling data from the period since the previous general election. Here is the article I worked with.

Step 1 — Making the HTTP request

First, we import the libraries and modules that we will need.

import pandas as pd
import requests
import re

Next, we store the URL of the page that holds our data, so we can easily pass it to Requests.

url = "https://en.wikipedia.org/wiki/2011_Irish_general_election"

Now, we make the request and assign the response to a variable called data:

data = requests.get(url)

The data variable should now hold the HTML of the page at our URL, along with lots of other information about the request. For example, we can check the status code to ensure the request was a success:

data.status_code
>>200

In this case it was, and the part of the response we are interested in is the HTML content:

data.content
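As an optional defensive step (a small sketch using only standard Requests features), we could make the request fail fast and loudly if anything goes wrong:

data = requests.get(url, timeout=10)  # don't hang indefinitely on a slow connection
data.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses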

Step 2 — Getting the data into a pandas dataframe

pandas can read HTML tables using its read_html() function. It finds all the tables in the content you pass in, converts each one into a dataframe, and returns them in a list.

We could get all tables on the page by passing in data.content, but we can be more specific using the match parameter.

As the pandas docs say of the match parameter:

The set of tables containing text matching this regex or string will be returned.

On the Wikipedia page itself, we can see that this is essentially the only table containing polling data.

Polling data table on Wikipedia page

We will add “Polling Agency” as the match parameter.

tables = pd.read_html(data.content, flavor='bs4', match="Polling Agency")

Now, in tables we should have a list of (hopefully only one) dataframes. If we run len() on it, we can see this is the case:

len(tables)
>>1

So, it’s going to be the first and only dataframe in the list. We save it in the variable below:

polling_data_2011 = tables[0]

Let’s look at the first 20 rows:

polling_data_2011[0:20]

We can see that pandas has done a pretty good job. There doesn't seem to be much missing data, but there are some updates we will need to make.

Jupyter Notebook of initial dataframe
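Rather than just eyeballing the missing data, we can count the NaN values per column:

polling_data_2011.isna().sum()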

Step 3 — Cleaning the data

So we have the data, but as we can see, there are some issues with it.

  1. It would be good to also store the date in three separate columns (day/month/year). That would make it easier to see change from year to year.
  2. In the Source column we have the Wikipedia reference numbers. We will want to get rid of those as they are not actually part of the source name.
  3. We want to remove the %s so we can work with the numerical data.
  4. Finally, you'll notice there are some NaN values in the Polling Agency column. In this case, it is because those rows are not opinion polls at all but actual election results: we have the general election results, along with some European and local election results. What we will do is create another column to categorise each row as either election data or opinion polling data. This way, we will be able to easily work with just the opinion polling data if we want.

Parsing the date into three columns

We create the three regular expressions that we want to match: one for the day, one for the month, and one for the year. We then loop through each field in the Date column, getting its index and value, and checking each regex against the value. Where a regex matches, we insert the matched value into its own new column. We do this for all three.

day_re = "[0-9]+"
month_re = "[A-Z][a-z]+"
year_re = "[0-9]{4}"

for index, value in polling_data_2011["Date"].items():
    # day
    match = re.search(day_re, value)
    if match:
        polling_data_2011.loc[index, "Poll_day"] = match.group()

    # month
    match = re.search(month_re, value)
    if match:
        polling_data_2011.loc[index, "Poll_month"] = match.group()

    # year
    match = re.search(year_re, value)
    if match:
        polling_data_2011.loc[index, "Poll_year"] = match.group()

We now have the three new columns:

Jupyter Notebook showing three new columns
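As an aside, pandas can do the same extraction without an explicit loop. The vectorized str.extract() method applies a regex to every field and keeps the first match, so the following sketch should behave identically on this data:

polling_data_2011["Poll_day"] = polling_data_2011["Date"].str.extract(r"([0-9]+)", expand=False)
polling_data_2011["Poll_month"] = polling_data_2011["Date"].str.extract(r"([A-Z][a-z]+)", expand=False)
polling_data_2011["Poll_year"] = polling_data_2011["Date"].str.extract(r"([0-9]{4})", expand=False)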

Cleaning the Source column

In the Source column, after each actual source name, the reference number begins with [.

So we loop through the fields in the Source column, similar to what we did for the date. This time, however, we match on the [ character and set new_name to everything before it in each field.

Note that in regex_for_reference we need to escape the [ character, as it has a special meaning in regex.

regex_for_reference = r"\["

for index, value in polling_data_2011["Source"].items():
    ref_match = re.search(regex_for_reference, value)
    if ref_match:
        location = ref_match.span()[0]
        new_name = value[0:location]
        polling_data_2011.loc[index, "Source"] = new_name
    else:
        polling_data_2011.loc[index, "Source"] = value

It looks tidier:

Jupyter notebook with references removed
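For reference, the same cleanup can be done in one vectorized line with str.replace(), deleting everything from the first [ onwards (a sketch):

polling_data_2011["Source"] = polling_data_2011["Source"].str.replace(r"\[.*", "", regex=True)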

I want to check that there aren't any other obvious errors. Wikipedia has many contributors, and different contributors may have entered names differently.

Let’s see all unique values in the Source column:

polling_data_2011["Source"].unique()
Unique values

We see one of the sources is expressed differently in at least one row — we need to update the one with the period (The Sunday Business Post.) to be the same as the other (The Sunday Business Post).

Let's adapt the regular expression approach we used before, this time truncating at the period wherever it appears:

regex_for_period = r"\."

for index, value in polling_data_2011["Source"].items():
    period_match = re.search(regex_for_period, value)
    if period_match:
        location = period_match.span()[0]
        new_name = value[0:location]
        polling_data_2011.loc[index, "Source"] = new_name
    else:
        polling_data_2011.loc[index, "Source"] = value

I also did the same check on the Polling Agency column. There were no issues.
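For reference, that check is the same one-liner applied to the other column:

polling_data_2011["Polling Agency"].unique()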

Turning the columns with % into numerical data we can work with

Here we will remove the % from each field that has it, and we'll do this for each of the political party columns. We loop through the columns, then through the fields in each column, keeping only the text before the %.

Note: We could use find() here (and in some other cases) instead of re, but we will stick with re to be consistent.

regex_to_find_percent = "%"
columns = ["Fianna Fáil", "Fine Gael", "Labour Party", "Green Party", "Sinn Féin", "Ind./Others"]

for column in columns:
    for index, value in polling_data_2011[column].items():
        match = re.search(regex_to_find_percent, value)
        if match:
            match_index = match.span()[0]
            new_value = value[0:match_index]
            polling_data_2011.loc[index, column] = new_value

It looks good:

Jupyter Notebook with issue in Source column fixed

We now need to convert those columns to float type so we can use methods such as the pandas describe() method on them to see information such as the min, max and mean values.

for column in columns:
    polling_data_2011[column] = polling_data_2011[column].astype(float)
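As an aside, had we not already done the cleanup in two passes, the % stripping and the float conversion could be collapsed into one vectorized step per column (a sketch, applied to the original string columns):

for column in columns:
    polling_data_2011[column] = polling_data_2011[column].str.rstrip("%").astype(float)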

Now the describe() method gives us some descriptive statistics on the data.

polling_data_2011.describe()
Jupyter Notebook describe() of the data

Separating the opinion polling data from the election data

In the above table, we know there are a few rows that are not opinion polling data. Let's categorise each row so we know what we are working with. Then we will be able to work just with the opinion polling data if we want.

I am actually going to separate this into three categories: election results, opinion polls, and exit polls. The reason I am doing this is that although the latter two are polls, they are slightly different. An exit poll is based on interviewing people after voting, while standard opinion polls are about how a person intends to vote.

We will check for instances of “election” and “exit poll” in each field in the Source column, also accounting for when either might appear at the start of a sentence with a capital letter. If we match election, we put an E in a new column called Poll_type. If we match exit poll, we put EP in the Poll_type column. If we match neither, we assume the row is an opinion poll and put OP.

election_regex = "[Ee]lection"
exit_poll_regex = "[Ee]xit poll"

for index, value in polling_data_2011["Source"].items():
    election_match = re.search(election_regex, value)
    exit_poll_match = re.search(exit_poll_regex, value)
    if election_match:
        polling_data_2011.loc[index, "Poll_type"] = "E"
    if exit_poll_match:
        polling_data_2011.loc[index, "Poll_type"] = "EP"
    if not election_match and not exit_poll_match:
        polling_data_2011.loc[index, "Poll_type"] = "OP"
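A quick way to see how many rows landed in each category is value_counts():

polling_data_2011["Poll_type"].value_counts()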

Step 4 — Checking our data

Now that we have our final table, let’s do a few checks to make sure the data is correct.

From a quick scan of the Wikipedia page, we can see that there was one exit poll in the period, as well as general elections in 2007 and 2011. Let's build a pivot table by Poll_year and Poll_type to ensure we match the original table.

polling_data_2011.pivot_table(index=["Poll_year","Poll_type"])
Jupyter Notebook — pivot table by poll year and poll type
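Note that pivot_table() aggregates with the mean by default, so each row of the pivot shows average support per year and poll type; since there is only one election result row per year, its mean is simply that result. To count the rows in each category instead, a groupby sketch works:

polling_data_2011.groupby(["Poll_year", "Poll_type"]).size()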

Looking at the pivot table, we can see that the election results in each of the years match what was reported on the Wikipedia page. For example, here is 2011:

Wikipedia screenshot of first row of polling data table

And we have the elections and exit polls in the years we would expect based on the original data. The data now looks clean and ready to use.
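If you want to persist the cleaned table for the analysis stage, it can be written out to CSV (the filename here is just an example):

polling_data_2011.to_csv("polling_data_2011_clean.csv", index=False)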

The process for getting this data was straightforward enough. It was very easy to get the specific table I wanted and to do the necessary data cleaning.

In the future I would like to look at other options to manage the date column and also explore how easy it would be to get additional data using what I’ve already done.

All the images of the code and the tables were taken by the author. The Jupyter Notebook for this article can be found here:
