
In most sports, it is widely believed that the home team has a significant advantage over the visiting team, commonly known as the home advantage. This can be due to a variety of factors but two of the most popular reasons cited are fan presence and the players’ familiarity and comfort with the environment. An iconic example of the latter is the "Green Monster" wall at Fenway Park for the Boston Red Sox. However, some research and data show that player familiarity is a surprisingly weak predictor of home advantage effects.
Therefore, that leaves one of the most supported reasons, which is fan presence. Interestingly enough, most sports stadiums were closed to the general public at the onset of the Coronavirus pandemic, and home advantage seems to have gradually faded. This has created an ideal scenario for analyzing this natural experiment.
The question I intend to answer is simple: can I use data analytics and statistical inference to measure the impact fans have on home teams in football (soccer)? I’ll do just that, focusing on data from England’s Premier League.
Why football? And why the Premier League?
Before choosing the sport to investigate, I considered some of the major sports in the top leagues of the world, and each had a few disadvantages:
- Basketball – The NBA at first seemed like a good choice given the long season and the indoor stadium drawing in loud and energetic fans. However, all games in their restart of the 19–20 season were played in the Orlando Bubble so there was no real "home court".
- Baseball – The MLB season had not started when the pandemic began and generally home advantage isn’t as high relative to other sports.
- American Football – The NFL season is very short relative to other sports which would not provide us with a lot of data for experiment purposes.
Soccer appears to be the best sport to examine, especially since its home advantage is among the strongest of any sport and the number of matches played behind closed doors was sufficient. I’ll choose the Premier League because it tends to have more balanced teams, meaning more teams can beat any other team on a given day and finish near the top of the league table in a given season. In other major leagues like the Bundesliga or La Liga, 2–3 teams seem to dominate every year. You know which teams I’m referring to.
Initial Hypotheses or Expectations
Before collecting the data I need, I have some guesses on what I’ll see when looking at the data for matches with no fans:
- A reduced advantage for home teams meaning fewer goals and points won
- A more balanced game with fewer fouls given to the away team since the referee isn’t pressured by loud or upset fans
- Less effort from home teams resulting in fewer shots taken and fewer goals in the later minutes of a match
The Data: Collection in Python
Data: Matches from 2019–2021 (8/9/2019–3/15/2021)
I will use the Python & Selenium scraper created by Otávio Simões Silveira and build upon it to meet my specific needs.
Specifically, I’ll want to collect the following data for both the home & away team in each match:
- Final and halftime scores
- Match stats (possession, yellow cards, shots taken, etc.)
- Specific minutes marker indicating when goals were scored

You can find the scraper I used here.
The Data: Preprocessing in R
The first thing I will want to do is preprocess in R and do some data wrangling which will allow me to carry out the desired analysis and experiments.
Let’s load in the necessary packages and our data, ensuring that the dates for each match are read correctly.
library(dplyr)
library(ggplot2)
library(ggpubr)
library(lubridate)
library(stringr)
library(zoo)
df <- read.csv('data/premier_league_matches.csv') %>% mutate(date = lubridate::as_date(date))
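Since the later filters depend on the dates being parsed correctly, a quick check is worth having. The following is a minimal sketch in base R, assuming the CSV stores dates as month/day/year strings; the article does not show the raw format, so `raw_dates` and the format string are hypothetical.

```r
# Hypothetical raw date strings; the real CSV's date format is not shown in the article.
raw_dates <- c("8/9/2019", "3/15/2021", "not a date")

# Parse with an explicit month/day/year format; values that do not match become NA.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Count failures so that silent NAs do not leak into the date filters later on.
n_failed <- sum(is.na(parsed))

print(parsed)
print(n_failed)
```

If `n_failed` is greater than zero on the real data, the format string probably does not match what the scraper wrote.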

Next, we’ll remove any matches with limited fans during the 20–21 season. Only a handful were played in December 2020 when the league allowed up to 2,000 spectators in certain stadiums before reverting to fully closed doors again. These are outliers with unknown effects so it’s better to leave them out.
# matches with limited attendance (2,000 fans) in December 2020
limited_fans_matches <- c(59005, 58999, 59003, 59000, 58997, 59008, 59014, 59007, 59009, 59006, 59024, 59033, 59030, 59026, 59041)
df <- df %>% subset(!match_id %in% limited_fans_matches)
Let’s also create new columns to track fouls, points won, and results for each team.
df <- df %>%
  mutate(home_yellow_cards_pct = home_yellow_cards / (home_yellow_cards + away_yellow_cards),
         away_yellow_cards_pct = away_yellow_cards / (home_yellow_cards + away_yellow_cards),
         home_fouls_conceded_pct = home_fouls_conceded / (home_fouls_conceded + away_fouls_conceded),
         away_fouls_conceded_pct = away_fouls_conceded / (home_fouls_conceded + away_fouls_conceded),
         home_points_won = ifelse(home_ft_score > away_ft_score, 3, ifelse(home_ft_score == away_ft_score, 1, 0)),
         away_points_won = ifelse(away_ft_score > home_ft_score, 3, ifelse(away_ft_score == home_ft_score, 1, 0)),
         home_result = ifelse(home_ft_score > away_ft_score, 'W', ifelse(home_ft_score == away_ft_score, 'D', 'L')),
         away_result = ifelse(away_ft_score > home_ft_score, 'W', ifelse(away_ft_score == home_ft_score, 'D', 'L')))
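To see what these derived columns look like, here is a small sketch on made-up matches, written in base R so it runs without any packages. The toy scores and card counts are invented for illustration only.

```r
# Toy matches (made-up scores and card counts, not real Premier League data).
toy <- data.frame(
  home_ft_score     = c(2, 1, 0),
  away_ft_score     = c(0, 1, 3),
  home_yellow_cards = c(1, 0, 2),
  away_yellow_cards = c(3, 0, 2)
)

# Points: 3 for a win, 1 for a draw, 0 for a loss.
toy$home_points_won <- ifelse(toy$home_ft_score > toy$away_ft_score, 3,
                       ifelse(toy$home_ft_score == toy$away_ft_score, 1, 0))

# Share of the match's yellow cards shown to the home team.
total_cards <- toy$home_yellow_cards + toy$away_yellow_cards
toy$home_yellow_cards_pct <- toy$home_yellow_cards / total_cards

print(toy)
```

Note that a match with no yellow cards at all produces 0 / 0, which R stores as NaN, so any later average of the percentage columns needs `na.rm = TRUE`.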
To track the different minute intervals that goals are scored in, we’ll need to transform the format of the minutes in home_goals_mins and away_goals_mins so that each value is comma-separated.
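df <- df %>%
  # express possession as a proportion; drop the first and last characters
  # (the enclosing delimiters) and all spaces from the goal-minute strings
  mutate(home_possession = home_possession / 100,
         away_possession = away_possession / 100,
         home_goals_mins = str_replace_all(str_sub(home_goals_mins, 2, -2), fixed(" "), ""),
         away_goals_mins = str_replace_all(str_sub(away_goals_mins, 2, -2), fixed(" "), "")) %>%
  # a team that did not score ends up with an empty string; store it as NA instead
  mutate(across(c('home_goals_mins', 'away_goals_mins'),
                ~ifelse(. == "", NA, as.character(.))))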
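Once the minutes are comma-separated, they can be split and grouped into intervals. The sketch below shows one way to do that in base R for a single match. The example string, the "45+2" stoppage-time notation, and the choice of 15-minute intervals are all assumptions, since the article does not show the scraped values or the interval width used in its plots.

```r
# Hypothetical cleaned value for one team in one match, in comma-separated form.
# The "45+2" stoppage-time notation is an assumption about the scraped data.
goal_mins <- "23,45+2,90+4"

# Split into individual minute markers.
parts <- strsplit(goal_mins, ",", fixed = TRUE)[[1]]

# Keep the regulation minute only, so "45+2" counts towards minute 45.
base_min <- as.integer(sub("\\+.*$", "", parts))

# Assign each goal to a 15-minute interval of the match.
interval <- cut(base_min,
                breaks = seq(0, 90, by = 15),
                labels = c("1-15", "16-30", "31-45", "46-60", "61-75", "76-90"))

print(table(interval))
```

Folding stoppage-time goals into the last interval of each half is a design choice; it is also why those intervals tend to collect more goals than the others.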
Finally, we’ll create two equal datasets (288 matches each) for our experiment given that the league began playing with no fans on June 17, 2020:
- Control set: only matches with fans
- Test set: only matches with no fans
no_fans_df <- df %>%
filter(date >= '2020-06-17') %>% arrange(date) %>% head(288)
no_fans_df['fans_present'] <- 'N'
fans_df <- df %>%
filter(date <= '2020-03-09')
fans_df['fans_present'] <- 'Y'
matches_df <- rbind(fans_df, no_fans_df)
Now we have a final data frame with a label for each match type in our experiment. This will make it easy to compare the two!
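Before moving on, it can help to confirm that the labels split the matches the way we expect. Here is a rough sketch of that check on a toy data frame with made-up dates; the two cutoff dates come from the filters above, while the names `last_with_fans` and `first_without_fans` are hypothetical.

```r
# Toy stand-in for the combined data (made-up dates); the real frame has 288 rows per group.
toy <- data.frame(
  date = as.Date(c("2020-02-01", "2020-03-07", "2020-06-17", "2020-07-01"))
)

last_with_fans     <- as.Date("2020-03-09")  # upper bound used for the control set
first_without_fans <- as.Date("2020-06-17")  # lower bound used for the test set

# Label each match; anything between the two cutoffs would be left as NA.
toy$fans_present <- ifelse(toy$date <= last_with_fans, "Y",
                    ifelse(toy$date >= first_without_fans, "N", NA))

counts <- table(toy$fans_present)
print(counts)
```

On the real data, running `table(matches_df$fans_present)` should show 288 matches in each group.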
Exploratory Data Analysis
The first thing we’ll look at is the distribution of goals by minute of the match for both home and away teams. Goals often come in the first or last minutes of a half due to lack of focus, fatigue, and more risk-taking to win the game. Fan energy can also be high at various points of the match.

Unfortunately, there doesn’t seem to be any real difference between the two match types. Most goals were scored in the middle of the match and at the end, which is probably due to stoppage time being added.
Next – we’ll look at the match result differences between the two datasets.

As guessed, home teams won fewer games and lost more games when fans were not present. The number of draws was roughly the same. At first glance, this seems to indicate that audiences have positive impacts on home teams.
Let’s look at home success a bit differently by looking at the ratio of points won out of total possible points. We’ll call this the home advantage and aggregate it on a monthly basis.

Again, this visual supports our notion above that home advantage diminished gradually as stadiums closed.
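The monthly home-advantage ratio described above can be computed along these lines. This is a minimal base R sketch on made-up results, not the code behind the chart; it assumes three points are available to the home team in every match.

```r
# Toy data (made-up results): date of each match and points won by the home side.
toy <- data.frame(
  date            = as.Date(c("2020-01-04", "2020-01-18", "2020-02-01", "2020-02-22")),
  home_points_won = c(3, 1, 0, 3)
)

# Label each match with its calendar month.
toy$month <- format(toy$date, "%Y-%m")

# Home advantage = points won by home teams / points available (3 per match).
points_won     <- tapply(toy$home_points_won, toy$month, sum)
matches_played <- tapply(toy$home_points_won, toy$month, length)
home_advantage <- points_won / (3 * matches_played)

print(round(home_advantage, 3))
```

A value of 1 would mean home teams won every match that month, and 0 would mean they lost every one.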
If we plot the yellow cards given to the away team in a similar way over time, we see an even greater difference between the two match datasets. Away teams appear to have it a lot easier when fans are absent!

One last area we’ll want to look at is the total shots taken by the home team.

Home teams took about 500 more shots in total when fans were present. How many of those shots are actually on target though? Let’s check.

Hypothesis Testing
Although it is easy to understand the data visualizations, we can’t rely on them alone to determine the significance of fan presence or not. For that, we’ll need to carry out statistical inference tests on our experiment.
Statistical inference allows us to be more certain that any difference we see between sample data is significant to a certain degree and that the results observed weren’t due to chance alone.
We’ll run four tests to observe the following:
- The difference in means of points won by the home team (main test)
- The difference in means of fouls conceded by the home team
- The difference in means of yellow cards conceded by the away team
- The difference in means of shots taken by the home team
For each test, we will first calculate the statistical power, which is the probability that the test finds a statistically significant difference when such a difference actually exists in the population. In other words, power is the probability of rejecting the null hypothesis when it is false. Then we’ll carry out the actual hypothesis test using two-sample t-tests (R’s t.test runs Welch’s version by default, which does not assume equal variances). We’ll use a 90% confidence level, i.e. a significance level of 0.10, for each one.
Power Tests
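First, let’s calculate the power for our main test using the pwr library in R. We’ll need the mean of each sample, the standard deviation of the combined data, and a value called effect size, which is defined as:

$$d = \frac{\bar{x}_{F} - \bar{x}_{NF}}{s}$$

where $\bar{x}_{F}$ and $\bar{x}_{NF}$ are the sample means with and without fans, and $s$ is the standard deviation.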
pwr.t.test solves for whichever one of its parameters (n, d, sig.level, or power) is set to NULL, given the values of the others. Here we leave power as NULL.
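library(pwr)
# mean home points per match in each sample
meanF <- mean(fans_df$home_points_won)
meanNF <- mean(no_fans_df$home_points_won)
# standard deviation across both samples combined (named sd_all to avoid masking sd())
sd_all <- sd(matches_df$home_points_won)
effect_size <- (meanF - meanNF) / sd_all
# power = NULL tells pwr.t.test to solve for power
pwr.t.test(n = 288, d = effect_size, sig.level = 0.1, power = NULL, type = "two.sample")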
#######OUTPUT########
Two-sample t test power calculation
n = 288
d = 0.130097
sig.level = 0.1
power = 0.4665975
alternative = two.sided
NOTE: n is number in *each* group
And we’ll do the same for the remaining tests and consolidate the results.

How do we interpret the power? A common convention is to aim for power of at least 80%. For our main test on home points won, the power is about 47%, meaning we have only a 47% chance of detecting a difference of the observed size if it really exists. This is not a good sign. But the power for the other tests is 95% or more, which is promising.
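A natural follow-up is how many matches the main test would need to reach 80% power. The sketch below uses `power.t.test` from base R's stats package rather than the pwr package used above; setting `sd = 1` makes `delta` equivalent to the effect size d. The effect size is copied from the output above, and the result should be read as a rough guide rather than a replacement for the original calculation.

```r
# Effect size reported in the pwr.t.test output above.
d <- 0.130097

# Power for 288 matches per group at a 0.10 significance level.
# With sd = 1, delta is expressed in standard-deviation units (Cohen's d).
current <- power.t.test(n = 288, delta = d, sd = 1, sig.level = 0.10,
                        type = "two.sample", alternative = "two.sided")

# Matches per group needed to reach the conventional 80% power for the same d.
needed <- power.t.test(power = 0.80, delta = d, sd = 1, sig.level = 0.10,
                       type = "two.sample", alternative = "two.sided")

print(round(current$power, 3))
print(ceiling(needed$n))
```

The required sample size comes out at well over twice the 288 matches per group available here, which is why the main test is underpowered.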
Statistical T-Tests
Next, we’ll run a t-test to conclude whether or not there is a significant difference between the average home points won with and without fans.
- Null hypothesis: The difference in means between the two samples is 0
- Alternative hypothesis: The difference in means is not equal to 0.
t.test(fans_df$home_points_won, y=no_fans_df$home_points_won, alternative = 'two.sided', conf.level=0.90)
#######OUTPUT#######
Welch Two Sample t-test
data: fans_df$home_points_won and no_fans_df$home_points_won
t = 1.5631, df = 573.85, p-value = 0.1186
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
-0.009373031 0.356595254
sample estimates:
mean of x mean of y
1.593750 1.420139
And once again, we will run the test for all of our various hypotheses.

To reach significance, the calculated p-value must be less than the significance level, 0.10 in this case, which allows us to reject the null hypothesis. For our main test on the difference in points won by the home team with and without fans, the p-value is 0.1186, so the result is not statistically significant. However, the remaining tests are all highly significant with p-values close to 0!
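To make the link between the t statistic, the p-value, and the confidence interval concrete, the numbers in the main test's output can be reproduced by hand. This sketch only uses values copied from the Welch t-test output above; the variable names are hypothetical.

```r
# Values copied from the Welch t-test output above.
t_stat  <- 1.5631
df      <- 573.85
mean_f  <- 1.593750   # home points per match, fans present
mean_nf <- 1.420139   # home points per match, no fans

# Two-sided p-value: probability of a |t| at least this large under the null.
p_value <- 2 * pt(-abs(t_stat), df)

# Recover the standard error from t = difference / SE, then the 90% interval.
diff_means <- mean_f - mean_nf
se         <- diff_means / t_stat
ci         <- diff_means + c(-1, 1) * qt(0.95, df) * se

print(round(p_value, 4))
print(round(ci, 4))
```

Because the 90% interval includes 0, the test cannot reject the null hypothesis at the 0.10 level, which is the same conclusion the p-value gives.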
Summary: Key Findings and Takeaways
Overall, this experiment shows that commonly observed home advantage in football is minimized in some ways when fans are absent.
- Home teams tend to score fewer goals (and earn fewer points as a result) without fans, but the difference in points is not statistically significant at the 0.10 level (p = 0.1186), and the test’s low power means a real effect could easily have been missed.
- However, the data does support the observation that home teams concede more fouls without fans and, relatedly, that away teams receive fewer yellow cards. This makes sense, as another key factor in a match’s result is the referee, who is subject to pressure from fans to award or not award fouls.
- Further, home teams take far fewer shots in the absence of fan pressure, but that doesn’t necessarily result in a big difference in shots on target or goals.
These elements can all contribute to a more balanced game and therefore reduced home advantage.
Outside of fan presence, many factors could have impacted the results observed. For example, the fixtures in the two datasets could be unbalanced: if higher-quality teams had more away games in one sample than in the other, this would bias the comparison. Many may argue that player and coach skill matters more to a match result than whether the match is home or away.
Building models that address these biases and gathering more data on this rare occurrence in the sports world may help provide more answers. Until then, fans seem to impact the home advantage in football, but not in the purest way we would expect.
Hope you enjoyed the project. If interested in a similar article digging into the breakdown of the data by teams, check out my other post.
And thank you to Ken Jee for the inspiration in pursuing a sports analytics project. If you have any questions please feel free to comment below or contact me via LinkedIn.
References
- Balmer, N., Nevill, A., & Wolfson, S. (Eds.). (2005). Home advantage [Special issue]. Journal of Sports Sciences, 23(4).
- Loughead, T., Carron, A., Bray, S., & Kim, A. (2003). Facility familiarity and the home advantage in professional sports. International Journal of Sport and Exercise Psychology, 1, 264–274. doi:10.1080/1612197X.2003.9671718
- McLeod, S. (2019, July 10). Effect size. Simply Psychology. www.simplypsychology.org/effect-size.html
- Silveira, O. S. (2020, October 20). How to build a football dataset with web scraping. Medium, Level Up Coding. python.plainenglish.io/how-to-build-a-football-dataset-with-web-scraping-d4deffcaa9ca