Formula One: Extracting and analysing historical results

Web scraping and data analysis of an F1 season with Beautiful Soup and Pandas

Ciarán Cooney
Towards Data Science

--


Data is everywhere

Sometimes as a data scientist, you won’t have the data you require at your fingertips, or there is no existing pipeline feeding you the data you need. At other times, you might simply be interested in exploring an area that doesn’t have a freshly scrubbed dataset available. Fortunately, the internet has an intergalactic ocean of data (in case you didn’t already know) on virtually any subject you could wish to analyse, and Python has all the tools you need to scrape and format that data for your chosen project.

For me, the subject I’m going to explore here, and in some future posts, is historical Formula One results. For the uninitiated, F1 is the world’s elite motorsport category, in which legends like Ayrton Senna and Michael Schumacher wrestled their 200mph steeds in wheel-to-wheel combat!

Luckily for me, the official Formula 1 website contains archived data from F1 championships, beginning in 1950 and running up to the present day. Race results, qualifying times, championship positions and many other results are there in the archive (see below), and it is easy to navigate the website to find any information you’re interested in.

Source: screenshot of formula1.com webpage

Before I get you revved up too high on all this F1 stuff, let’s use Python to extract some information from the archive and display it with Pandas.

Beautiful Soup

Beautiful Soup is a great package for parsing the HTML that makes up a webpage into a more readable and usable format. I used Beautiful Soup, urllib and Pandas to scrape data from the F1 archive and present it in a DataFrame. Some of the historical data is a little sparse the further back you go towards 1950, so for the moment I am going to begin in 1990: modern era F1, with comprehensive data on results.

Starting with something simple, I decided to scrape the overall championship placings from 1990 and plot the results. First, I used urllib.request.urlopen() to open the webpage containing the 1990 drivers’ championship results (you can navigate to the relevant webpage and copy the link). Then I used the find_all() method with the ‘table’ argument to search for tables contained in the webpage (here there is only one). Finally, I used the Pandas function read_html() to transform the data into a DataFrame. This is a really useful function for anyone scraping data from a webpage for subsequent analysis, as it lets you begin working with Pandas immediately.
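The pattern looks roughly like this. I’ve used a tiny inline HTML snippet standing in for the downloaded page (the markup and values here are hypothetical and much simpler than the real archive page), and I build the DataFrame by hand from the table rows; on the real page, passing str(table) to pd.read_html does that last step in one call:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for urllib.request.urlopen(url).read() — a trimmed, made-up
# sample of a standings table (the real page has more columns and rows).
html = """
<table>
  <tr><th>Pos</th><th>Driver</th><th>PTS</th></tr>
  <tr><td>1</td><td>Ayrton Senna</td><td>78</td></tr>
  <tr><td>2</td><td>Alain Prost</td><td>71</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]  # the standings page holds a single table

# Pull the text out of each row's cells; the header row becomes the columns.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])
```

Note that the cell values come out as strings, so numeric columns like ‘PTS’ need converting (e.g. with pd.to_numeric) before any arithmetic.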

Source: plot made for this post (Ciaran Cooney)

Okay, so I successfully extracted some data from a website, and although we can see that Ayrton Senna beat Alain Prost to the 1990 F1 world title, there is nothing terribly interesting here!

Rather than taking the final championship standings as above, I thought it might be more interesting to extract results from each individual race, build my own championship table showing all results, and then plot the drivers’ progress over the year. Before doing this I wanted to see how data for each race was stored on the webpage, so I navigated to the 1990 race results page here.

Source: screenshot of formula1.com webpage

You can see from the screenshot above that the first race of 1990 was in the USA, and I used this race to view the race results format, using similar code to the previous example.

The resulting DataFrame contains some useful columns, some less useful columns and also some unreadable columns. For the purposes of constructing my own championship table, I only really required ‘Driver’, ‘Car’, ‘No’ and ‘PTS’ columns for the moment. Rather than manually clicking through webpages to extract data or find links to all the race results, it is much better to automate this process using a web crawler.

Crawling for links

To do this I had to search through all the links in the 1990 race results webpage (the page from the screenshot above) and retain only those I wanted, based on certain conditions. I applied conditions to the list of race URLs because webpages often contain unrelated or duplicate links. To figure out which conditions to use, I looked through the source file for the webpage (ctrl + u) and found that each race result link contained the year and the string ‘race-result’ (the red arrows below point towards these). This would allow me to iteratively search the links and retain only those matching both conditions.

Source: screenshot of formula1.com webpage

To extract all race results, I created a function which takes the relevant year as an argument and returns a list of URLs. URLs are added to the list when they meet the conditions (‘1990’ and ‘race-result’) and are not already contained in the list. Note: ‘a’ tags — as in soup.find_all(‘a’) — are hyperlinks, and tell the browser to render a link to another webpage.
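A minimal version of that crawler might look like the sketch below. It takes the already-downloaded page HTML rather than fetching it (the urllib.request.urlopen step is omitted), and the link paths in the usage example are hypothetical, not the archive’s real URL scheme:

```python
from bs4 import BeautifulSoup

def get_race_links(page_html, year):
    """Collect unique links to individual race results for a given season.

    page_html is the season-results page as a string; a link is kept only
    if it contains both the year and 'race-result' and isn't already listed.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for a in soup.find_all("a"):  # 'a' tags are the page's hyperlinks
        href = a.get("href", "")
        if str(year) in href and "race-result" in href and href not in links:
            links.append(href)
    return links
```

For example, a page containing a duplicate race link and an unrelated drivers link (made-up paths) would yield only the two unique race-result URLs:

```python
sample = ('<a href="/results/1990/races/usa/race-result.html">USA</a>'
          '<a href="/results/1990/races/usa/race-result.html">USA again</a>'
          '<a href="/results/1990/drivers.html">Drivers</a>'
          '<a href="/results/1990/races/brazil/race-result.html">Brazil</a>')
get_race_links(sample, 1990)
```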

That returned a list of URLs, each corresponding to a single set of race results. I was then able to use this list to iteratively load each race result into a DataFrame and combine the whole season’s results into a single DataFrame containing all classified drivers and all points scored.

On the first iteration, after loading race results into a DataFrame, I created a new DataFrame for storing results for the entire season, containing the driver and car columns for those classified in the first race (season_results_df = pd.DataFrame(df[['Driver','Car']], columns=['Driver','Car'], index=df.index)). On each subsequent iteration, I added any drivers who had not previously been classified. Points scored by each driver were then extracted from each race and added to the season_results_df DataFrame. Finally, I formatted the DataFrame by sorting by driver number (‘No’), filling any NaN values with zeros and reformatting the car manufacturers’ names to a three-letter version by applying lambda and map functions.
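That loop can be sketched as follows, assuming each race’s results have already been loaded into a DataFrame with ‘No’, ‘Driver’, ‘Car’ and ‘PTS’ columns. The function name and the exact column handling are my reconstruction, not the post’s original code:

```python
import pandas as pd

def build_season_table(race_dfs, race_names):
    """Merge per-race result tables (each with 'No', 'Driver', 'Car', 'PTS'
    columns) into one season table: a row per driver, a points column per race."""
    season = None
    for name, df in zip(race_names, race_dfs):
        if season is None:
            # First race: seed the table with the classified drivers.
            season = df[["No", "Driver", "Car"]].copy()
        else:
            # Add any drivers not classified in an earlier race.
            new = df.loc[~df["Driver"].isin(season["Driver"]),
                         ["No", "Driver", "Car"]]
            season = pd.concat([season, new], ignore_index=True)
        # Attach this race's points, matched on driver name.
        season[name] = season["Driver"].map(df.set_index("Driver")["PTS"])
    # Sort by driver number, zero-fill races a driver wasn't classified in,
    # and shorten manufacturer names to three letters (the lambda/map step).
    season = season.sort_values("No").reset_index(drop=True)
    season[race_names] = season[race_names].fillna(0)
    season["Car"] = season["Car"].map(lambda c: c[:3].upper())
    return season
```

A driver absent from a race simply gets a zero for that column, which is what the fillna step is for.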

As you can see, this returned a DataFrame containing the points scored by each driver at each race. Note that the DataFrame is currently sorted according to the driver numbers (Prost, as reigning world champion, was number 1). I could have added a ‘total’ column and summed each driver’s points for the entire season, but that would have provided much the same information as I had in the beginning.

Instead, I decided to create a DataFrame tracking each driver’s cumulative total over the course of the season using the cumsum() method available in Pandas. I also rearranged the DataFrame to show the final championship positions from Senna down. One of the reasons for plotting my own championship table with cumulative results was to make it easy to see if rivalries fluctuated throughout the season. For the purpose of tidying up the figure legend, I transformed the driver names to the first 3 letters of their last name only.
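A sketch of that step, assuming the season table built earlier with one numeric points column per race (the helper name is mine):

```python
import pandas as pd

def cumulative_table(season, race_cols):
    """Cumulative points per driver across the season, ordered by final total."""
    cum = season.set_index("Driver")[race_cols].cumsum(axis=1)
    # Order rows by the last race's cumulative total, champion first.
    cum = cum.sort_values(cum.columns[-1], ascending=False)
    # First three letters of each surname, to keep the plot legend compact.
    cum.index = [name.split()[-1][:3] for name in cum.index]
    return cum
```

Plotting is then just cumulative_table(season, race_cols).T.plot(), with races along the x-axis and one line per driver.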

Now I can see the accumulation of points over the season and results from the final race of the year in Australia indicate the final championship standings. However, it is not easy to get a general sense of how the season progressed from looking at the DataFrame, so I plotted the data as well.

Source: plot made for this post (Ciaran Cooney)

Now, you can see without having to look too hard that in 1990 there were only ever really two drivers in contention for the title (Senna and Prost) and that Prost did actually get in front of Senna around the midpoint of the season before the lead was stretched again. It is also clear from looking at this that the total points scored is dominated by a relatively small number of drivers — out of 36 drivers, only 9 scored more than 10 points. Another thing I was able to notice from looking at the cumulative graph was the form of Jean Alesi (gold). He started the season in great form with 13 points from the first 4 races but then completely flatlined for the remainder of the season.

You might have noticed that the final scores for some of the drivers differed from the final championship standings I plotted above. This wasn’t something I had anticipated, so I had to do a bit of digging to find out what was going on. It turns out that the 1990 F1 championship employed a ‘best 11’ rule: championship standings were based on each driver’s best 11 results from the 16 races (wiki). Given this, I decided to look at the data using that formulation and created the function below to calculate cumulative scores based on the best 11.

The best_11_cumsum() function simply computes the cumulative scores for the first 11 races. From the 12th race onwards, it checks whether the points scored at that race were greater than the minimum of the current best 11. If so, the previous minimum is removed, the new score added and a revised cumulative total calculated.
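That logic reads roughly like this — the function name matches the one mentioned above, but the list-based bookkeeping is my own sketch:

```python
def best_11_cumsum(points):
    """Running championship score counting only a driver's best 11 results.

    points is one driver's score at each race, in season order; returns the
    'best 11' cumulative total after each race.
    """
    best, totals = [], []
    for pts in points:
        if len(best) < 11:
            # First 11 races: every result counts.
            best.append(pts)
        elif pts > min(best):
            # Later races: a new score only displaces the worst of the best 11.
            best.remove(min(best))
            best.append(pts)
        totals.append(sum(best))
    return totals
```

Applying it row-by-row to the season table (e.g. with DataFrame.apply along the points columns) gives the corrected standings.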

You can see now that the points scores match the final championship standings above, and from the plot below that the overall effect of the ‘best 11’ scoring system was minimal. One of its only real effects was to reduce Nelson Piquet’s final score by one point, taking him from outright third into a tie with Gerhard Berger.

Form is ephemeral but it can nevertheless be analysed. I wanted to look at the fluctuations in performances across the season, particularly that of Senna and Prost as they fought for the championship. Plotting a rolling average is a rudimentary way of tracking data points over time and I decided to use a 3 race rolling average to see how the drivers’ form differed.
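Pandas makes this almost a one-liner via the rolling() method; a sketch assuming the season table from earlier (the helper name is mine):

```python
import pandas as pd

def rolling_form(season, race_cols, window=3):
    """Per-driver rolling average of race points, with races as rows."""
    per_race = season.set_index("Driver")[race_cols].T  # one row per race
    return per_race.rolling(window=window).mean()
```

The first window - 1 races come out as NaN, since a full window of results isn’t available yet; rolling_form(season, race_cols).plot() then shows each driver’s form curve.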

Source: plot made for this post (Ciaran Cooney)

This plot is a little noisier at first glance, but it does show some fairly substantial shifts in momentum between the two title contenders. When Prost’s form peaked, Senna’s reached its nadir, and vice versa. As Prost hit his peak and took over the championship lead, it must have appeared that he was on an inexorable roll. That, of course, didn’t last: Prost’s demise coincided with a clear resurgence in Senna’s form, which finally sealed the title.

You may have noticed from the DataFrames and plots that grand prix winners received 9 points in 1990. When I began watching F1 later in the ’90s, drivers were awarded 10 points for a win, and now it is 25 (albeit with an overhauled points structure; I guess this is sort of my domain knowledge). I’ve often wondered whether these points differences actually change who comes out on top in the championship fight. I did a quick test of this for 1990 with a simple if statement: if pts == 9: pts = 10. Anyway, it wasn’t an issue that year: Senna would have gained 6 points and Prost 5, and everything would have remained the same.
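In Pandas the same remapping can be applied across the whole points table at once with replace(); a toy example with made-up scores:

```python
import pandas as pd

# Points columns of a season table (hypothetical example values).
pts = pd.DataFrame({"USA": [9, 6], "BRA": [6, 9]},
                   index=["Senna", "Prost"])

# Swap the 9-point win for the later 10-point win; other scores are untouched.
pts_10 = pts.replace(9, 10)
totals = pts_10.sum(axis=1)
```

Comparing totals under each scheme makes it easy to check whether the championship order would have changed.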

Where to next?

What I have shown here is a fairly basic intro to web scraping and some exploratory data analysis over a single F1 season, but there is much more that can be done. I could look at how the form of drivers and teams varies over the course of seasons and even decades. I could attempt to wrangle some information out of qualifying times to make some assertions about who were the fastest drivers in different eras. I can map the different points systems onto different seasons to see how they actually affect final outcomes and I could perhaps look at longer term effects such as how competitive the series has been during different epochs. I could even look at building a predictive machine learning model that could perhaps use qualifying times and previous race results to make forward predictions about race results.

You might even like to have a go at some of these things yourself.

All the code I used for this post is available in the form of a Jupyter Notebook here.

