The world’s leading publication for data science, AI, and ML professionals.

Football Wages: A 15-Year Analysis with Python’s Beautiful Soup

A Multiple Page Web Scraping Tutorial using Beautiful Soup

Image Courtesy of Nathan Dumlao via Unsplash

Introduction

Web scraping enables the automatic gathering of a rich data set. If you can view some data in your web browser, you will be able to access it and retrieve it through a program. If you can access it through a program, the data can be stored, cleaned and used in any way.

Web scraping provides some advantages over Application Programming Interfaces (APIs). Web scraping is free, is typically not rate-limited, exposes you to all the data you wish to obtain, and, most importantly, the website you wish to extract data from may not offer an API at all. This is when web scraping enters the picture.

This tutorial will explore how to write a simple web scraper that gathers data on the average transfer income received per player over the last 15 years in the Premier League.

Getting started

To begin, it is necessary to navigate to transfermarkt. The next step is viewing the underlying HTML of this page, by right-clicking and selecting ‘inspect’ in Chrome.

What is particularly helpful here is the fact that you can hover over the HTML tags in the Elements tab, and Chrome will depict a transparent box over the representation on the web page itself. This can quickly help us find the pieces of content we are searching for.

Alternatively, you can also right click any element on the web page and click on ‘Inspect element’. This immediately highlights the corresponding HTML code in the Elements tab.

Here, I want to extract: Income, Income per Club and Income per Player. In order to do so, I find the wrapper by hovering over the ‘div’ tag with the class ‘transferbilanz’.

The Beautiful Soup Library

To parse the income information from the web page, I use the Beautiful Soup library. Beautiful Soup is installed easily using pip.
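Note that the package on PyPI is named beautifulsoup4, and the requests library used below can be installed at the same time:

```shell
pip install beautifulsoup4 requests
```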

The URL of the page where the information resides is placed in quotations in a variable called url. The requests.get() method is used to ‘get’ the URL page, and the result is saved in a response variable (remember to install and import the requests library into the script).

To confirm a connection has been established, I use the .status_code attribute and check that a HyperText Transfer Protocol (HTTP) status code of 200 was returned by the server. This means ‘success’: the page has been retrieved.

The response.text, which contains the raw HTML content for the current page, is saved to the variable financial_data. Now I create a BeautifulSoup object called soup, passing the raw HTML financial_data as the first argument and ‘html.parser’ as the second. This parser is built into Python, requires no installation, and is used to parse the HTML tags.
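The fetch-and-parse steps above can be sketched as follows. The URL here is a hypothetical placeholder (the real transfermarkt address and query parameters may differ), and the User-Agent header is an addition of this sketch, since some sites refuse requests without one:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL -- the real transfermarkt address and its query
# parameters may differ from this sketch.
url = "https://www.transfermarkt.com/premier-league/einnahmenausgaben/wettbewerb/GB1"

def fetch_soup(page_url):
    """GET a page and return a parsed BeautifulSoup object, or None on failure."""
    response = requests.get(page_url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code != 200:  # 200 means 'success': the page was retrieved
        return None
    # response.text holds the raw HTML; 'html.parser' ships with Python
    return BeautifulSoup(response.text, "html.parser")
```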

Once I know that a successful connection is made to the web page, I can take advantage of the structure of the URL.

In the URL shown below, the only portion that changes from one football season to the next is the year. When I first checked the connection, the year was 2018 (highlighted in red and in bold below). To retrieve the same information going back 15 seasons, I simply change the season number back to 2004!

I create a simple for loop to iterate through a list of years from 2004 to 2018 to retrieve the income figures for each year (remember that the end value, 2019, is excluded from Python’s range).
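That loop can be sketched like this; the saison_id query parameter is an assumed stand-in for wherever the year appears in the real transfermarkt URL:

```python
# Build one URL per season; 'saison_id' is an assumed parameter name,
# not necessarily how the real transfermarkt URL encodes the year.
base_url = ("https://www.transfermarkt.com/premier-league/einnahmenausgaben/"
            "wettbewerb/GB1?saison_id={year}")

# range(2004, 2019) stops at 2018 -- the end value 2019 is excluded
season_urls = [base_url.format(year=year) for year in range(2004, 2019)]
```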

Within this for loop, the three pieces of information I require are found within span tags with the class ‘greentext’, as shown below.

To retrieve these figures, I find the first wrapper containing all this information using the soup.find method, and save the result into a variable called grouped_data.

I then use the find_all method on grouped_data to find all the span tags with a class of ‘greentext’. As the find_all method returns a list-like object, I can index this list to retrieve the values I require, i.e. [0], the first element in the list, points to ‘Income’ for the season (see the GitHub gist below).
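On a stand-in snippet the lookup works like this; the markup and the figures are assumptions that mirror the structure described above, not the page’s exact HTML:

```python
from bs4 import BeautifulSoup

# Minimal snippet mirroring the structure described above -- an assumption,
# not the real page's exact markup or values.
html = """
<div class="transferbilanz">
  <span class="greentext">123.456.789 €</span>
  <span class="greentext">6.172.839 €</span>
  <span class="greentext">308.641 €</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

grouped_data = soup.find("div", class_="transferbilanz")    # first matching wrapper
values = grouped_data.find_all("span", class_="greentext")  # list-like result

income = values[0].text             # [0] -> Income for the season
income_per_club = values[1].text    # [1] -> Income per club
income_per_player = values[2].text  # [2] -> Income per player
```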

I need to remove the space, the euro symbol and the periods within the number in order to convert the value to a floating-point number for numerical analysis later on. I also multiply each value by 0.89, the exchange rate between euros and pound sterling at the time of writing (13/06/2019), as I would like the results in the DataFrame to be in pound sterling.
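A minimal cleaning helper might look like this; the figure format ‘1.234.567 €’, with periods as thousands separators, is an assumption about how the site prints its numbers:

```python
EUR_TO_GBP = 0.89  # EUR -> GBP rate on 13/06/2019, as used in the article

def clean_to_gbp(raw):
    """Turn a scraped figure such as '123.456.789 €' into a float in pounds.

    Assumes periods are thousands separators; strips spaces and the euro
    symbol, then applies the fixed exchange rate.
    """
    cleaned = raw.replace(" ", "").replace("€", "").replace(".", "")
    return float(cleaned) * EUR_TO_GBP
```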

With each year (each iteration of my for loop), I append the data to appropriately labelled lists, e.g. income_per_player_list.

Once all iterations are complete, I create a DataFrame using the pd.DataFrame method (it is required to import the pandas module into the script), and write it to a CSV file.
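The final assembly step might look like this; the list contents and column labels are dummy placeholders, not the scraped values:

```python
import pandas as pd

# Dummy figures standing in for the lists filled by the scraping loop.
years = [2016, 2017, 2018]
income_per_player_list = [4.8, 5.6, 5.2]  # £m, illustrative only

income_df = pd.DataFrame(
    {"Season": years, "Income per player (£m)": income_per_player_list}
)
income_df.to_csv("Income_euro_to_pounds.csv", index=False)
```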

I then read the DataFrame back in:

Income2004_18_df = pd.read_csv('Income_euro_to_pounds.csv')

This enables us to begin drawing conclusions:

(Bar-plots produced using Python’s Seaborn library)
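A bar-plot along these lines can be sketched with seaborn; the data and column names here are dummy stand-ins for the scraped DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Dummy data standing in for the scraped DataFrame.
df = pd.DataFrame({"Season": [2016, 2017, 2018],
                   "Income per player (£m)": [4.8, 5.6, 5.2]})

ax = sns.barplot(data=df, x="Season", y="Income per player (£m)")
ax.set_title("Average transfer income per player by season")
plt.savefig("income_per_player.png")
```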

We can see a steady climb in the transfer income per player received each season, with the exception of 2018. Interestingly, this change mirrors the Premier League TV rights deal signed in 2016. More money is now in the game than ever before. This enables higher transfer fees to be paid for players, and a corresponding rise in average income per player per season for the football clubs!

Sky and BT Sport have paid a record £5.136bn for live Premier League TV rights for three seasons from 2016–17.

Conclusion

This tutorial has demonstrated how to perform a multi-page web scrape and transform the data into a Pandas DataFrame amenable to analysis.

Essentially, we have thrown out our web browser and surfed the web to extract the information we require using a Python program!

