Sports Analytics
Football (or soccer, for US readers) is an amazing sport. It can’t be the world’s most popular sport by coincidence.
Football brings people together, and it’s an excuse to disconnect from our busy lives, because game time is fun time. We order some fast food and eat it while Messi makes magic with the ball – how lucky we are to have been able to enjoy him. And we get to watch so many amazing teams, like 2010’s Barça or even 2023’s Manchester City.
Many will say no two matches are alike. It’s football, and there’s nothing like it. But I’d say that’s wrong.
As outstanding as it is, it is still governed by math. Like everything else.
Life is full of mathematical models. And football is no exception.
I’ve been a die-hard Barça fan my entire life. Add that to where I currently find myself professionally, and the result is a genuine interest in sports analytics – obviously inclined toward football.
This is the first post I’ll be writing about sports analytics, so I’ll keep it relatively simple. However, I plan on writing many more, both to learn how math applies to football (and potentially other sports, like handball) and to share the insights with you all.
The number of data scientists being hired for sports analytics roles is growing quickly, and it doesn’t seem to be slowing down anytime soon. Using data in sports makes more sense than ever, especially given that the amount of data being generated is also growing fast.
So, this post should be a useful intro for aspiring sports analysts and data folks interested in sports.
Here, I’ll be using StatsBomb’s[1] open and free data[2] to inspect the 2015–2016 La Liga season, which I chose at random. I invite you to do the same analysis and see if it holds true for other seasons and leagues as well.
So let’s dig in!
Preparing The Data
There’s a wonderful Python module that will allow us to get all the data we need: statsbombpy[3]. The first thing we need to do is, naturally, install it:
pip install statsbombpy
Then open your Python file or notebook and start by importing the following modules:
import pandas as pd
from statsbombpy import sb
import seaborn as sns
Next, we’ll want to get all the La Liga games from the 2015–16 season:
# Fetch the competitions table once instead of requesting it three times
competitions = sb.competitions()
competition_row = competitions[
    (competitions['competition_name'] == 'La Liga')
    & (competitions['season_name'] == '2015/2016')
]
competition_id = pd.unique(competition_row['competition_id'])[0]
season_id = pd.unique(competition_row['season_id'])[0]
# One row per match of the selected competition and season
matches = sb.matches(competition_id=competition_id, season_id=season_id)
This is what the matches DF looks like:
Now, we want to inspect goals, but we don’t have a column showing the number of goals in a game. We can create it ourselves by adding the home_score and away_score columns:
matches['goals'] = matches['home_score'] + matches['away_score']
We’re now ready to start analyzing.
Do Previous Goals Influence Future Goals?
Before doing any calculations, I like to visualize things. It’s the best way to understand the data you’re playing with. So let’s build a histogram:
# One bar per goal count (seaborn was already imported as sns above)
sns.histplot(
    x='goals',
    data=matches,
    bins=matches['goals'].nunique(),
    binwidth=0.9
)
As expected, most matches had between 1 and 4 goals. We all know the most common results are usually 1–0, 2–0, 2–1, 3–0, and 3–1 (and the same for the away team). The clear outlier is the 12-goal game: Real Madrid vs Rayo Vallecano, which ended 10–2.
Nothing new so far, just interesting facts.
The average number of goals per match that season was 2.74 (rounded). That translates to an average of 0.030497 (rounded) goals per minute, or one goal every 32.79 minutes. In other words, we can say there’s a 0.030497 chance of a goal in each of the 90 one-minute slots of a match.
The fun part comes now: we’re going to simulate the season. As we’re only focusing on match goals – not game winners or season standings – the only parameter we’ll use is the per-minute probability of a goal.
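The arithmetic behind those figures can be checked in a couple of lines. This is only a sketch that starts from the rounded per-minute value quoted above rather than from the StatsBomb data, and it assumes a match lasts exactly 90 minutes (no stoppage time):

```python
# Rounded figure quoted in the post; not recomputed from the data here
goals_per_minute = 0.030497
minutes_per_match = 90  # assumption: stoppage time is ignored

mean_goals_per_match = goals_per_minute * minutes_per_match
minutes_per_goal = 1 / goals_per_minute

print(round(mean_goals_per_match, 2))  # 2.74
print(round(minutes_per_goal, 2))      # 32.79
```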
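# Creating a simulation
import numpy as np

mean_goals = matches['goals'].mean()
# Probability of a goal in any single minute (at most one goal per minute)
goal_prob = mean_goals / 90

def simulate_match():
    """Simulate one match as 90 independent one-minute trials."""
    goals = 0
    for _ in range(90):
        # Draw 1 (goal) with probability goal_prob, otherwise 0
        goals += np.random.choice([0, 1], p=[1 - goal_prob, goal_prob])
    return goals

def simulate_season(n_games):
    """Return the simulated number of goals for each of n_games matches."""
    return [simulate_match() for _ in range(n_games)]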
We now have the simulate_match() and simulate_season() functions. Each one is a simple loop: simulate_season() loops over the matches, and simulate_match() loops over the 90 minutes of a match and randomly decides whether there was a goal in each minute.
To do so, we use np.random.choice(), which chooses between 0 and 1 with the specified probabilities (~0.030497 for 1, ~0.969503 for 0).
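As a quick sanity check on that per-minute coin flip, here is a minimal standard-library sketch. It is not the code used for the plots: simulate_match_stdlib is a hypothetical helper, and the probability is the rounded value quoted earlier instead of one computed from the matches DataFrame. The simulated average should land close to the 2.74 goals per match we started from:

```python
import random

# Assumption: rounded per-minute goal probability quoted in the post
GOAL_PROB = 0.030497
MINUTES = 90

def simulate_match_stdlib(rng, goal_prob=GOAL_PROB, minutes=MINUTES):
    """Hypothetical helper: count the minutes in which a goal is drawn."""
    return sum(rng.random() < goal_prob for _ in range(minutes))

rng = random.Random(42)  # fixed seed so the run is repeatable
simulated = [simulate_match_stdlib(rng) for _ in range(2000)]
average_goals = sum(simulated) / len(simulated)

print(f"simulated average: {average_goals:.2f} goals per match")
print(f"expected average:  {GOAL_PROB * MINUTES:.2f} goals per match")
```

The two numbers will not match exactly, because the simulation is random, but they should get closer as the number of simulated matches grows.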
Using the histogram we saw earlier, let’s draw the simulated distribution as a line on top of it:
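import numpy as np
from scipy.stats import poisson

goals_per_game = simulate_season(len(matches))
# The Poisson parameter mu is the mean number of goals per simulated match
mu = np.mean(goals_per_game)
pmf = poisson.pmf(goals_per_game, mu)
# Assumption: most_repeated_count is the height of the tallest histogram bar,
# used to rescale the PMF from probabilities to match counts
most_repeated_count = matches['goals'].value_counts().max()
pmf *= (most_repeated_count / pmf.max())
sns.lineplot(
    x=goals_per_game,
    y=pmf,
    color='red',
    label="Simulated"
).set(ylabel='count')
sns.histplot(
    x='goals',
    data=matches,
    bins=matches['goals'].nunique(),
    binwidth=0.9
).set(title='Goal vs Simulated Poisson distribution')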
The solid red line shows the Poisson distribution fitted to the simulated data. This distribution fits well whenever the timing of previous events has no effect on future events.
In other words, using Wikipedia: "A Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event."[4]
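To see why a minute-by-minute simulation ends up looking like a Poisson distribution, we can compare the two formulas directly. The sketch below assumes the rounded per-minute probability quoted earlier and uses only the standard library; binomial_pmf and poisson_pmf are illustrative helpers written out from the textbook formulas, not part of any package used in this post:

```python
import math

n_minutes = 90
p = 0.030497        # rounded per-minute goal probability quoted in the post
mu = n_minutes * p  # mean goals per match

def binomial_pmf(k, n, p):
    """Probability of exactly k goals in n independent one-minute trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """Probability of exactly k goals under a Poisson model with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

rows = [(k, binomial_pmf(k, n_minutes, p), poisson_pmf(k, mu)) for k in range(9)]
for k, b, q in rows:
    print(f"{k} goals: binomial={b:.4f}  poisson={q:.4f}")

# Largest difference between the two models over 0 to 8 goals
max_gap = max(abs(b - q) for _, b, q in rows)
print(f"largest gap: {max_gap:.4f}")
```

With many short slots and a small chance of a goal in each, the two columns stay very close, which is why the Poisson curve is a reasonable summary of the simulation.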
The takeaway from this plot is straightforward: the model is generated by random simulations, and it’s pretty close to the actual outcome.
So we have our answer, at least for this season: the goal counts look just like what we’d get if goals weren’t influenced by the number of goals so far or by the amount of time played. They are unpredictable in time.
As David Sumpter put it in Soccermatics[5]: "It is the unpredictability of a football match from one minute to the next that produces the Poisson distribution after 90 minutes. We know the average number of goals scored in a match, but their timing is unpredictable. As a result, some scorelines become much more likely than others. The paradox here is that scores are explained by randomness. The fact that goals are very random in time makes the pattern in the results predictable".
Conclusion
Goals aren’t random per se; they happen for a reason. But they are certainly unpredictable. It’s when certain conditions are met that we have goals: defenders make mistakes, the attackers move the ball well, Messi gets the ball, the goalkeeper can’t stop it from going in…
A single play can completely change the direction of a match at any moment. And today’s numbers back that up: the goal distribution is what we’d expect if goals weren’t influenced by previous goals or by the time at which they occur.
But we shouldn’t stick to football for our conclusions. This simple analysis can be performed on data not linked to football in any way. And I’m not only talking about hockey goals, handball goals, or basketball points.
Data scientists can expect a Poisson distribution whenever it is reasonable to assume that events can happen unexpectedly, at any time, independently of how many events have already happened.
Do you see how powerful this is? A simple, yet powerful mathematical modeling tool.
If you don’t trust me, go ahead and try it with any data you like. For example, try to analyze how many people go to the supermarket on a given day, the number of accidents in a factory, or the number of cars that cross the border between two countries. You’ll see they can’t escape the Poisson distribution either.
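One rough way to try this on any list of counts is to compare its variance with its mean: for Poisson-like data, the ratio (often called the index of dispersion) should be close to 1. The sketch below uses made-up data, not real supermarket or border figures. count_events and dispersion_index are hypothetical helpers, and the counts are generated by letting events arrive at random times, with exponentially distributed gaps between them:

```python
import random
import statistics

def count_events(rate, rng):
    """Count events in one unit of time when gaps between events are exponential."""
    count = 0
    t = rng.expovariate(rate)
    while t < 1.0:
        count += 1
        t += rng.expovariate(rate)
    return count

def dispersion_index(counts):
    """Variance divided by mean; close to 1 for Poisson-like counts."""
    return statistics.pvariance(counts) / statistics.fmean(counts)

rng = random.Random(7)  # fixed seed so the run is repeatable
# Assumption: 2.74 events per unit of time, mirroring the goals-per-match average
daily_counts = [count_events(rate=2.74, rng=rng) for _ in range(5000)]
ratio = dispersion_index(daily_counts)

print(f"mean={statistics.fmean(daily_counts):.2f}, dispersion index={ratio:.2f}")
```

To use it on your own data, replace daily_counts with your list of counts. A ratio far from 1 suggests the events are not as independent as the Poisson model assumes.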
Again, a powerful tool for data scientists to understand the data we’re working with.
Getting back to football, it obviously isn’t a random sport. It’s much more than that: it’s about setbacks, comebacks, lineups, formations, tactics, skill…
If we didn’t want to go deeper, randomness would probably be enough to study the number of goals over a season. But football is way more interesting than just interpreting the goal distribution.
In my next soccer analytics posts, we’ll go further, beyond simple randomness.
Thanks for reading the post!
I really hope you enjoyed it and found it insightful.
@polmarin
Resources
[1] StatsBomb
[2] StatsBomb’s License – GitHub
[3] statsbombpy
[4] Poisson Distribution – Wikipedia
[5] Soccermatics: Mathematical Adventures in the Beautiful Game