Can Linear Models predict a Footballer’s Value?

Shubham Maurya
Towards Data Science
Jun 22, 2018

Photo by Emilio Garcia on Unsplash

In the spirit of the World Cup 2018, I’ve decided to present a project I did recently, which combines my two strongest interests: data science and football! The aim is to see if there’s a relationship between a player’s popularity and his market value in the English Premier League, given that it is difficult to properly value a player simply by his statistics. A simple example is how defensive midfielders generally post less impressive statistics, but are very valuable to any team nonetheless. I’ve also digressed into some interesting observations about players and the top 6 teams!

The data used here has been scraped from a variety of sources, including transfermarkt.com and Fantasy Premier League (FPL). It contains every player listed on the FPL site for each team who also has a corresponding market value. For example, Scott McTominay is listed in United’s FPL squad, but didn’t have a market value on transfermarkt.com, so he was excluded from the dataset. Apart from such cases, it covers the players competing in the Premier League in the 2017/18 season whose squad places were confirmed by 20th July. Consequently, players who signed later may be missing.
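The inclusion rule above amounts to an inner join between the FPL squad lists and the market-value listings. Here is a minimal sketch of that idea, assuming two hypothetical in-memory structures (`fpl_squads` and `market_values`). The placeholder players and values are made up for illustration and are not taken from the dataset.

```python
# Hypothetical inputs: placeholder names and values, for illustration only.
fpl_squads = {
    "Manchester United": ["Player A", "Scott McTominay"],
    "Everton": ["Player B"],
}
# Market value in millions; a missing key means no listed valuation.
market_values = {"Player A": 30.0, "Player B": 12.0}


def build_dataset(squads, values):
    """Keep only FPL-listed players that also have a market value."""
    rows = []
    for club, players in squads.items():
        for name in players:
            if name in values:  # inner join: drop players without a valuation
                rows.append(
                    {"name": name, "club": club, "market_value": values[name]}
                )
    return rows


dataset = build_dataset(fpl_squads, market_values)
```

With these inputs, `dataset` holds two rows, and the player without a valuation is dropped.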

The scraping uses some cool techniques with rvest and Selenium; click here for more details.

Some Preliminary Analysis

Who are the most valuable players in the EPL?

As expected, some of the biggest names in the game are playing in the English Premier League.

Who are the most popular players?

Rooney is an obvious bet for number 1, playing for Manchester United and being a Premier League legend.

Distribution of Market Value

Clearly not a normal distribution, but this was expected. Teams tend to have few elite players, and a large number of low- and mid-value players in their squads. An analysis of a team’s first 15 would probably look more like a normal distribution, since we’d be excluding low-value fringe / youth players.

Does it look different for the Top 6?

1 indicates top 6, 0 indicates other

Interesting. The top 6 seem to have a wide spread of player values, whereas the others have a large majority of players worth under 10 million (Transfermarkt’s valuation, not mine).
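One way to put a number on that observation is the share of players under a given value in each group. The sketch below assumes a hypothetical list of `(big_club, market_value)` pairs, and the values are invented for illustration.

```python
from collections import defaultdict

# Illustrative (big_club flag, market value in millions) pairs, not real data.
players = [(1, 50.0), (1, 8.0), (1, 25.0), (0, 3.0), (0, 7.5), (0, 12.0), (0, 1.0)]


def share_below(rows, threshold=10.0):
    """Return {group: fraction of players valued under `threshold`}."""
    total = defaultdict(int)
    below = defaultdict(int)
    for group, value in rows:
        total[group] += 1
        if value < threshold:
            below[group] += 1
    return {group: below[group] / total[group] for group in total}


shares = share_below(players)
```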

Distribution of popularity

Similar distribution to market value, except for the 2 outliers at the end: Wayne Rooney and Paul Pogba. While Rooney is already the most well-known (popular is debatable) current English footballer, he also happened to break Sir Bobby Charlton’s record for most goals for Manchester United. This, alongside the constant speculation over his United career, very likely contributed to his heightened page views. Paul Pogba’s views, on the other hand, reflect a combination of intense scrutiny as the world’s most expensive transfer (Update: not anymore!), his return to Manchester United (I can definitely see people looking him up for that), and the fact that he is a very marketable, visible player.

Top 6 vs the rest

Graph 1 indicates top 6 teams, 0 indicates other teams

Again, the top 6 clubs seem to have a wider spread of player popularity. Also, Wayne Rooney is at Everton now, which explains the outlier among the other teams.

Detailed Analysis

Clearly, the case I’m trying to build is that there seems to be evidence of a player’s market value being correlated with how popular he is. This is interesting because ability and performance are notoriously difficult to quantify in football. They vary with the position, the manager’s tactics, the opposition, the league, the ability of a player’s teammates, and so on. Consequently, valuing a player is very hard to do, though it has to be done anyway.
Websites like WhoScored have a score for each player for each match, and Fantasy Premier League places a value on each player’s head. It would be interesting to see if popularity can be used as a basic proxy for ability, which is what I’ll attempt through a regression model.

FPL Valuation

There seems to be nice agreement between the FPL value and the Transfermarkt value, despite the fact that FPL valuation is decidedly shorter term, so age should be less of a factor in it. I was expecting to see more players in the bottom right: older players with low market value but high FPL value, theoretically like Petr Cech and Yaya Toure. Maybe there’s a better way of highlighting that.

This seems about right. If FPL valuation were equivalent to transfer market value, we’d see a constant ratio across age groups. But because the lowest FPL value is 4 million, very young and unproven players have a low ratio. Similarly, at the other end, old players have very low market values, but they may still be valuable over the next season.
What’s interesting is how the ratio for forwards falls off a cliff beyond 32, possibly implying very low market valuations for them.
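To make the ratio concrete, here is a rough sketch of how it could be averaged per age band. It assumes the ratio is market value divided by FPL value (my reading of the discussion above), and both the age cut points and the rows are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def age_band(age):
    """Hypothetical age bands; the cut points are assumptions."""
    if age < 22:
        return "under 22"
    if age < 29:
        return "22-28"
    if age < 33:
        return "29-32"
    return "33+"


def mean_ratio_by_band(rows):
    """Average market_value / fpl_value within each age band."""
    ratios = defaultdict(list)
    for age, market_value, fpl_value in rows:
        ratios[age_band(age)].append(market_value / fpl_value)
    return {band: mean(values) for band, values in ratios.items()}


# Illustrative rows: (age, market value in millions, FPL value in millions).
rows = [(19, 1.0, 4.0), (20, 2.0, 4.5), (26, 30.0, 8.0), (27, 20.0, 6.5), (34, 2.0, 5.0)]
ratios = mean_ratio_by_band(rows)
```

In this toy data the youngest and oldest bands have the lowest ratios, which is the shape described above.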

Market Value with Age

It is fairly intuitive that older players will, on average, have lower market values. A rough illustration -

The high value players are clustered around the age of 24–32, peaking at about 27. It’s important to note that this is in no way a linear relationship, which is why I use age categories in the regression model that follows. An alternative would be to do a change-point regression, which means building 2 models, one where age < threshold, one where age >= threshold.
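For the curious, the change-point alternative can be sketched in a few lines. This is not the approach used in the model that follows. The helper names, the threshold of 27 and the data points are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def change_point_fit(ages, values, threshold):
    """Fit one line for age < threshold and another for age >= threshold."""
    young = [(a, v) for a, v in zip(ages, values) if a < threshold]
    old = [(a, v) for a, v in zip(ages, values) if a >= threshold]
    return fit_line(*zip(*young)), fit_line(*zip(*old))


# Illustrative data only: value rises towards a peak at 27, then falls.
ages = [19, 21, 23, 25, 27, 29, 31, 33]
values = [2.0, 6.0, 10.0, 14.0, 18.0, 13.0, 8.0, 3.0]
young_fit, old_fit = change_point_fit(ages, values, threshold=27)
```

Each fit is an `(intercept, slope)` pair. On this toy data the slope is positive below the threshold and negative above it, which a single straight line could not capture.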

Who’s stocking up at which position?

Manchester City have forwards and attacking midfielders with huge potential, but their defence is very weak (no longer the case since Mendy, Walker and Danilo arrived, but they aren’t in this dataset). How does each of the top 6 stack up in terms of positional strength?

The total market value of Manchester City’s attack is a long way ahead of the others. However, City’s and Liverpool’s defences are markedly weaker, which City have since rectified. United’s keeping duo of De Gea and Romero is clearly the best amongst the Top 6.
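The comparison above boils down to summing market value per club and position category. A minimal sketch, assuming hypothetical rows of `(club, position, market_value)` with placeholder clubs and invented values:

```python
from collections import defaultdict

# Illustrative rows: (club, position category, market value in millions).
rows = [
    ("Club A", "attack", 60.0),
    ("Club A", "attack", 45.0),
    ("Club A", "defence", 12.0),
    ("Club B", "attack", 30.0),
    ("Club B", "defence", 25.0),
    ("Club B", "goalkeeper", 40.0),
]


def total_value_by_position(records):
    """Sum market value for every (club, position) pair."""
    totals = defaultdict(float)
    for club, position, value in records:
        totals[(club, position)] += value
    return dict(totals)


totals = total_value_by_position(rows)
```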

Popularity as a proxy for Ability

As explained in the next section, we test the hypothesis that there is a relationship between ability and popularity. Ability is difficult to measure and compare through performance indicators. For the purpose of this section, I assume FPL valuation is a fair measure of ability. While this may not be perfect, we should still be able to see a relationship between ability and popularity.

There seems to be a nice, linear relationship between FPL valuation and popularity, with a few notable exceptions (Wayne Rooney, sigh). Wonderful! This will help in the model below.
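A quick numeric check for that kind of relationship is the Pearson correlation coefficient. The sketch below computes it by hand on made-up numbers, so the values are illustrative and not from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative values only: FPL value (millions) and page views (thousands).
fpl_values = [4.5, 5.0, 6.5, 8.0, 11.5]
page_views = [150, 300, 900, 1800, 4000]
r = pearson_r(fpl_values, page_views)
```

A value of `r` close to 1 indicates a strong positive linear relationship. A single extreme outlier can distort it noticeably, so it is worth checking with and without such points.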

Regression Model

The main aim is to see whether market value can be determined using popularity as a proxy for ability. A player’s market value can intuitively be represented as -

market value ~ ability + position + age

This should be read as market value is a function of ability, position and age.

The last 2 are easily observable, but ability is a difficult attribute to measure. There are a variety of metrics used for this, but I’ve decided to use a simple proxy for it — popularity (or more specifically, Wikipedia page views over the last year). I chose Wikipedia views for the following reasons -

  • Better than Twitter/Facebook since it’s not dependent on whether the player has a profile or not.
  • Better than Facebook/Instagram followers since those are subject to how engaging the players’ posts are, as well.
  • Was easy to get for the timeframe required — I wanted to exclude May — July, since it would inflate the popularity of players linked with a transfer in 2016/17.
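Excluding the transfer window is a simple filter once the views are available per month. A minimal sketch, assuming a hypothetical `monthly_views` mapping from "YYYY-MM" strings to counts (the format and the numbers are made up for illustration):

```python
def popularity(monthly_views, excluded_months=(5, 6, 7)):
    """Sum page views, skipping the calendar months in `excluded_months`.

    `monthly_views` maps "YYYY-MM" strings to view counts (assumed format).
    """
    return sum(
        views
        for month, views in monthly_views.items()
        if int(month.split("-")[1]) not in excluded_months
    )


# Illustrative counts only.
monthly_views = {
    "2016-08": 800,
    "2017-03": 1000,
    "2017-04": 1200,
    "2017-05": 5000,  # May, June and July are dropped by default
    "2017-06": 9000,
    "2017-07": 7000,
}
total_views = popularity(monthly_views)
```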

Using page views has its own problems of correlation with other factors -

  1. Players from England itself may get more hits, since they’re playing in their home league, i.e. the nationality of the player may matter.
  2. Different categories of players get different levels of attention — forwards are definitely much more popular than defenders!
  3. New signings may get more attention, even beyond the transfer season.
  4. The top clubs have a much larger international audience.
  5. Breakout players may get a surge of hits, since they were virtually unknown before that. Think Marcus Rashford in 2016/17.
  6. Players with long-term injuries may have far fewer hits, simply because they haven’t been playing.

In the model, I control for 1–4, but not for 5 and 6. Both 5 and 6 would require extensive work identifying breakouts and long-term injuries, which might be useful future additions to the model.

For factors 1–4:

  1. Retrieved the nationality of each player, and put them into 4 buckets, stored in a new column called region (a factor with 4 levels):
     • 1 for England
     • 2 for EU (Brexit made this a natural classification)
     • 3 for Americas
     • 4 for Rest of World
  2. Included an interaction term for page views and position category.
  3. Marked the new signings of 2016/17, and interacted that with page views.
  4. Created a column big_club comprising United, City, Chelsea, Arsenal, Liverpool and Tottenham. This was interacted with page views as well.

Apart from these interactions, age is also included as a categorical variable (due to its non-linear relationship with market value).
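The controls above can be sketched as a small feature-building step. This is an illustration and not the project’s actual code: the lookup table, the position categories, the field names and the example player are all assumptions.

```python
# Hypothetical lookup table; a real one would cover every nationality in the
# data. Anything missing falls through to Rest of World (4).
NATIONALITY_TO_REGION = {"England": 1, "France": 2, "Spain": 2, "Brazil": 3, "Argentina": 3}

# Assumed position categories; the first one acts as the reference level.
POSITIONS = ("goalkeeper", "defender", "midfielder", "forward")


def region_code(nationality):
    """Map a nationality to 1=England, 2=EU, 3=Americas, 4=Rest of World."""
    return NATIONALITY_TO_REGION.get(nationality, 4)


def build_features(player):
    """Turn one player record (a dict) into a flat dict of model inputs."""
    views = player["page_views"]
    features = {
        "page_views": views,
        "big_club": player["big_club"],
        "new_signing": player["new_signing"],
        # Interactions: the effect of page views may differ for these groups.
        "views_x_big_club": views * player["big_club"],
        "views_x_new_signing": views * player["new_signing"],
    }
    region = region_code(player["nationality"])
    for code in (2, 3, 4):  # England (1) is the reference level
        features[f"region_{code}"] = int(region == code)
    for position in POSITIONS[1:]:
        is_position = int(player["position"] == position)
        features[f"pos_{position}"] = is_position
        features[f"views_x_{position}"] = views * is_position
    return features


# Made-up player record for illustration.
example = build_features({
    "page_views": 1200, "big_club": 1, "new_signing": 0,
    "nationality": "Brazil", "position": "forward",
})
```

Dropping one level of each categorical variable (England, goalkeeper) avoids perfect collinearity with the intercept. In R, declaring the column as a factor does this automatically.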

Dataset Modifications

  1. The newly-promoted clubs are excluded from the dataset, simply because the Premier League offers a much higher level of publicity, which these clubs weren’t exposed to in the previous year.
  2. New signings from abroad for the 17/18 season are also excluded, for the same reason. However, players who were transferred within the Premier League are retained. This means Lindelof is excluded, but Lukaku is not.
  3. The square root of market_value is taken, because market_value has a heavy right tail, which could lead to heteroscedasticity.
  4. However, this leads to the relationship between sqrt(market_value) and page_views looking like this -

I apply a sqrt transform on page_views as well, to get the following graph -

This looks roughly linear, with Wayne Rooney a major outlier.
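To see why a square-root transform helps with a heavy right tail, compare the sample skewness before and after. The values below are invented to mimic a right-skewed distribution and are not from the dataset.

```python
from math import sqrt


def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean_x = sum(xs) / n
    m2 = sum((x - mean_x) ** 2 for x in xs) / n
    m3 = sum((x - mean_x) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Illustrative right-skewed market values, in millions.
market_value = [0.5, 1, 2, 3, 5, 8, 12, 20, 35, 75]
sqrt_value = [sqrt(v) for v in market_value]

skew_before = skewness(market_value)
skew_after = skewness(sqrt_value)
```

The transformed values are still right-skewed in this toy example, but noticeably less so.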

Fitting a multiple linear regression model to this data yields an R² of over 0.7 (that is, the model explains more than 70% of the variance in sqrt(market_value))! Further, the coefficient of page_views is highly statistically significant. This supports a linear relationship between sqrt(market_value) and sqrt(page_views).
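As a simplified illustration of what R² measures, here is a one-predictor version on the square-root scale. The real model has several more terms, and the numbers below are invented, so treat this as a sketch of the calculation only.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Illustrative data only: page views (thousands) and market value (millions).
page_views = [100, 400, 900, 1600, 2500, 3600]
market_value = [1.0, 4.0, 10.0, 15.0, 26.0, 35.0]

x = [sqrt(v) for v in page_views]
y = [sqrt(v) for v in market_value]
intercept, slope = fit_line(x, y)
r2 = r_squared(x, y, intercept, slope)
```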

What can residual plots tell us?

The residual plots should be able to tell us whether we have a heteroscedasticity problem in our data.

Error Distribution Plot
Q-Q Plot

The residual plot shows errors scattered randomly with no obvious pattern, which suggests heteroscedasticity is not a major problem, and the Q-Q plot suggests that they are approximately normally distributed.
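For reference, the points on a Q-Q plot can be computed with the standard library alone. This sketch pairs sorted, standardized residuals with theoretical normal quantiles. The residuals are made up, and the `(i + 0.5) / n` plotting positions are one common convention among several.

```python
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Pair each theoretical normal quantile with a sorted standardized residual."""
    n = len(residuals)
    center = mean(residuals)
    spread = pstdev(residuals)
    ordered = sorted((r - center) / spread for r in residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Illustrative residuals only.
residuals = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 1.5, -0.8, 0.2, -0.2]
points = qq_points(residuals)
```

If the residuals are roughly normal, the points lie close to the line y = x. Systematic curvature at the ends points to heavy or light tails.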

EPL Popularity

An interesting by-product is to see how popular the Premier League is, compared to other leagues. Due to the small number of inward-transfers from foreign leagues, this remains a rough method. However, the differences are large enough to be greater than just noise.

The model generally undervalues players from other leagues, which is what we would expect if it captures the popularity effect. The reasoning is thus: a 20 million player in the EPL gets more hits than a 20 million player in Ligue 1. Because of this, the value of each page view is far lower in the EPL. But since the model is built using EPL data, the coefficient of page views is derived from the EPL. Consequently, foreign players from less popular leagues are undervalued.
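The argument can be worked through with a toy calculation. It assumes a stripped-down model, sqrt(value) = b * sqrt(views), calibrated on a single EPL player. The page-view counts are hypothetical and chosen only to show the direction of the effect.

```python
from math import sqrt

# Illustrative numbers, not estimates from the model.
true_value = 20.0        # both players are worth 20 million
epl_views = 1_000_000    # hypothetical yearly page views for the EPL player
ligue1_views = 400_000   # hypothetical, lower views for the Ligue 1 player

# Calibrate the one-coefficient model sqrt(value) = b * sqrt(views) on the EPL player.
b = sqrt(true_value) / sqrt(epl_views)


def predict_value(views, coefficient=b):
    """Predicted market value (millions) under the toy model."""
    return (coefficient * sqrt(views)) ** 2


predicted_epl = predict_value(epl_views)
predicted_ligue1 = predict_value(ligue1_views)
```

Under these assumptions the Ligue 1 player is predicted at about 8 million, well below his 20 million value, simply because each page view is priced at the EPL rate.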

I hope you’ve found this to be as exciting as I did. Please upvote if you found it interesting!

To access the data and codebase used in this project, click here.
