It’s one of the closest Premier League title races in years – with just a handful of games remaining, who will take the honors this season? The race for the illustrious Champions League places is also being closely contested.
With this in mind, I thought I would showcase how we can use the Python Pandas library to parse the Premier League Table, and show how we can start performing some initial exploratory analysis using Python.
Web Scraping
Pandas has a built-in function, read_html(), which uses the lxml and Beautiful Soup libraries to automatically parse data out of HTML files as DataFrame objects. To begin, some additional libraries used by read_html() need to be installed. In the terminal, type:
pip install lxml
pip install beautifulsoup4 html5lib
The pandas read_html function has a number of custom options, but by default it searches for and attempts to parse all tabular data contained within `<table>` tags. This results in a list of DataFrame objects.
I import the pandas library and use the read_html function to parse the Premier League Table, assigning the result to the variable prem_table. This returns a list, of which I take the first element, which points to the Premier League Table as of 20/04/19 22:00 BST. As much focus centers around the battle for the Top 4 places, I have decided to visualize the top 6 entries in this Table using the .head() method. To further confirm things are looking okay, I use the shape attribute, which tells me that there are 21 rows and 12 columns. This is a useful check, because there are only 20 Premier League teams, so I seem to have one row too many. As things looked good at the head of my DataFrame, the problem must lie in an extra row at the tail of the DataFrame.
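The steps above can be sketched as follows. Since I can’t embed the live page here, a small inline HTML table (with made-up standings) stands in for the scraped URL:

```python
from io import StringIO

import pandas as pd

# Minimal sketch of the parsing step. read_html() finds every <table>
# in the HTML and returns a list of DataFrames; the inline HTML below
# (with made-up standings) stands in for the live league-table page.
html = """
<table>
  <tr><th>Team</th><th>P</th><th>Pts</th></tr>
  <tr><td>Man City</td><td>35</td><td>86</td></tr>
  <tr><td>Liverpool</td><td>35</td><td>85</td></tr>
</table>
"""
prem_table = pd.read_html(StringIO(html))[0]  # first table in the list

print(prem_table.head())   # inspect the top of the table
print(prem_table.shape)    # (rows, columns)
```

Against the real page, `prem_table.shape` is where the surplus 21st row first shows up.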
As the task is to begin some exploratory analysis, I initially want to ‘clean’ the data. It seems unnecessary to have the column ‘Unnamed: 1’ in the DataFrame. To drop a column, I use the drop method and pass in the keyword argument axis=1 for a column-wise drop.
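As a sketch, with toy data and the junk column name taken from the article:

```python
import pandas as pd

# Toy stand-in for the parsed table; 'Unnamed: 1' mimics the junk
# column produced by the scrape.
table = pd.DataFrame({
    "Unnamed: 1": ["-", "-"],
    "Team": ["Man City", "Liverpool"],
    "Pts": [86, 85],
})
table = table.drop("Unnamed: 1", axis=1)  # axis=1 -> drop a column
print(table.columns.tolist())             # ['Team', 'Pts']
```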
The datatypes in the DataFrame
Each column in the DataFrame can be thought of as a Series, where every value in a column has to be of the same type. Each row, however, can contain values of different types. As the task involves performing some exploratory analysis, I need to decipher what type each Series is. A simple glance at the Table would inform me that the ‘Team’ column is a Python string. The pandas equivalent of a Python string is the pandas object dtype. Furthermore, I would expect from visual inspection that the ‘P’ column, denoting the number of games played, is a Python float or int. However, assumptions aside, it is best to confirm this using either the dtypes attribute or the info method.
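For example (toy data; read_html typically yields string columns, so I build them as strings here):

```python
import pandas as pd

# Columns deliberately built as strings to mimic read_html()'s output.
table = pd.DataFrame({"Team": ["Man City"], "P": ["35"], "F": ["91"]})
print(table.dtypes)   # every column reports as 'object'
table.info()          # same information, plus non-null counts
```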
This output informs me that the dtypes of the Series are ‘objects’, i.e. Python strings. These types will have to be changed in order to perform numeric calculations between the different columns. Thankfully, this is an easy task in Python: we simply use the pandas to_numeric function. pandas.to_numeric() is one of the general functions in pandas, used to convert its argument to a numeric type. Here, I am changing the columns which I suspect I might like to analyse into the pandas float (float64) data type. To confirm, I can check my DataFrame’s dtypes:
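A minimal sketch of the conversion, with made-up numbers and column names taken from the article’s table:

```python
import pandas as pd

# Toy table with numeric-looking string columns ('F' goals for,
# 'A' goals against).
table = pd.DataFrame({"Team": ["Man City"], "F": ["91"], "A": ["21"]})
for col in ["F", "A"]:
    # to_numeric parses the strings; astype forces the float64 dtype
    table[col] = pd.to_numeric(table[col]).astype("float64")
print(table.dtypes)   # 'F' and 'A' are now float64
```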
Cleaning the DataFrame
The shape attribute earlier told me I had one row too many, assuming of course that 20 teams still compete in the English Premier League! As things looked good at the head of my DataFrame, the problem must reside at the tail.
Clearly I have some unnecessary metadata, and a few NaN (Not a Number) entries populating the columns. Again I can make use of the drop method, but this time, importantly, I omit the axis=1 parameter. I want to drop a row on this occasion, not a column. Things look much improved now. I can validate that this row dropped successfully using the shape attribute.
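Sketched with a toy table (the stray row sits at index 20 in the article’s 21-row table; index 2 in this 3-row stand-in):

```python
import pandas as pd

# Toy table with a stray metadata row at the tail.
table = pd.DataFrame({
    "Team": ["Man City", "Liverpool", "Table notes / metadata"],
    "Pts": [86.0, 85.0, float("nan")],
})
table = table.drop(2)   # no axis=1, so drop() removes the row labelled 2
print(table.shape)      # (2, 2)
```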
Whilst things look improved, it seems odd to have the team’s league position column titled ‘Unnamed: 0’. It is not representative of the contents of the column. To change this, I use the pandas DataFrame.rename() method and pass inplace=True, which makes the changes in the original DataFrame.
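For example (‘Position’ is my own choice of name for the renamed column):

```python
import pandas as pd

table = pd.DataFrame({"Unnamed: 0": [1, 2],
                      "Team": ["Man City", "Liverpool"]})
# inplace=True mutates table directly instead of returning a copy
table.rename(columns={"Unnamed: 0": "Position"}, inplace=True)
print(table.columns.tolist())   # ['Position', 'Team']
```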
Creating New Columns
The DataFrame is well set up now. At this point, it might be worthwhile creating some new columns and testing out whether the ‘string-to-float’ conversion I created earlier will work!
I have created a new column called ‘Goal Ratio’. This divides the number of goals scored by the number of goals conceded. To create a new column, I type the name of my DataFrame, then use square bracket notation and input my new column title in quotations. On the right side of the equation, I select which columns I want from the Table DataFrame using the same technique. Here, I divide each entry in column ‘F’ by the corresponding entry in column ‘A’ and round to one decimal place. By default, each new column that is created is appended to the end of the DataFrame.
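A minimal sketch with made-up figures (‘F’ goals for, ‘A’ goals against, as in the article’s table):

```python
import pandas as pd

table = pd.DataFrame({
    "Team": ["Man City", "Liverpool"],
    "F": [91.0, 85.0],
    "A": [21.0, 20.0],
})
# Element-wise division of two Series, rounded to one decimal place
table["Goal Ratio"] = round(table["F"] / table["A"], 1)
print(table)   # the new column is appended on the right
```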
To re-order the columns, I simply put the column names in a list in the order I want, and re-assign the selection to the Table DataFrame. The ‘Goal Ratio’ column fits more appropriately now. I have also removed the ‘Form’ column to make the Table a little neater.
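Sketched with toy data:

```python
import pandas as pd

table = pd.DataFrame({
    "Team": ["Man City"], "Pts": [86],
    "Goal Ratio": [4.3], "Form": ["WWWWW"],
})
# Select columns in the new order; 'Form' is simply left out.
table = table[["Team", "Goal Ratio", "Pts"]]
print(table.columns.tolist())   # ['Team', 'Goal Ratio', 'Pts']
```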
Filtering the DataFrame
The Table is well set up and amenable to analysis. Let’s start by asking a simple question: which teams have played one fewer game than their rivals? As teams have played either 34 or 35 games, I simply filter for teams that have played 34 games (as of 20/04/19) and the results appear. To filter, I use Boolean indexing. To see how to do this, read my article ‘How to Filter Rows of a Pandas DataFrame by Column Value’.
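The filter looks like this (toy data, made-up games-played figures):

```python
import pandas as pd

table = pd.DataFrame({
    "Team": ["Man City", "Liverpool", "Spurs"],
    "P": [34.0, 35.0, 34.0],
})
# Boolean indexing: the comparison yields a True/False Series,
# and indexing with it keeps only the True rows.
games_in_hand = table[table["P"] == 34]
print(games_in_hand["Team"].tolist())   # ['Man City', 'Spurs']
```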
Now let’s determine who has the most number of draws.
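One way to surface the draw specialists is to sort on the draws column (illustrative numbers; ‘D’ is the draws column from the table):

```python
import pandas as pd

# Made-up draw counts for illustration.
table = pd.DataFrame({
    "Team": ["Southampton", "Man City", "Wolves"],
    "D": [9.0, 2.0, 8.0],
})
most_draws = table.sort_values("D", ascending=False)
print(most_draws.iloc[0]["Team"])   # team with the most draws on top
```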
With Southampton attempting to avoid the drop this season, could all those draws be costly?
Finally, let’s create two new columns, ‘Goals/game’ and ‘Goals conceded/game’, and sort the values using the sort_values method.
With Huddersfield and Fulham already condemned, Cardiff have the third-worst Goal Ratio and Goals conceded/game, and the second-worst Goals/game. They also have a worse goal difference than their nearest relegation rival, Brighton.
With Cardiff faring the worst here, will they survive the drop? Did their recruits not possess enough of a goal threat? Is their defense too vulnerable?
With Python’s Pandas, we can find out!
Conclusion
This article has showcased some features which can help towards introductory analysis using pandas. I enjoyed using pandas to delve into the Premier League stats. If you would like more Python snippets, give exit_code_zero a follow on Instagram.
Lastly, some advice. I was parsing this Table whilst games were being played and results were coming in. This gave me slightly different results each time I ran the analysis, which confused me at first! For a comfortable life, don’t perform the analysis whilst games are in progress; put your feet up and watch the game instead!