
Predicting the Market Value of FIFA Soccer Players with Regression

A case study with Linear, LASSO, Ridge, Elastic Net and Polynomial Regression.

Photo by Fauzan Saari on Unsplash

Three weeks have passed since the Metis Data Science Bootcamp started, and the journey has been intense yet exciting. In this article, I detail the project I built using the web scraping and regression skills I learnt over the past two weeks. After touring through a variety of websites, I finally settled on an interesting topic – prediction of FIFA soccer players’ market value!

What is Market Value of a Soccer Player?

When we talk about the market value of a soccer player, we mean an estimate of the fee his club could receive for selling or transferring his contract to another club. Soccer clubs can pay astronomical amounts to obtain the top players. As we shall see later, the distribution of market values is heavily right-skewed and roughly exponential, indicating that a small subset of players is highly valuable. Hence, the ability to predict market values may offer a commercial advantage to some of these rich soccer clubs.

What are some of the most important factors that drive market value? Upon doing some research, I list the following factors:

  • Footballing Skills: As no soccer player is 100% versatile, footballing skills are categorized into several domains such as Defending, Shooting and Goalkeeping. Having high ratings in any one of these domains can raise a player’s market value.
  • On-Field Position: On average, strikers are valued more than midfielders, followed by defenders.
  • Age: Or the Coefficient of Youth. Naturally, younger players tend to command higher market values, as they have greater potential for growth and can serve longer terms.
  • Media Coverage: Or the Media Coefficient. Generally, having popular players, who make a greater media impact, can generate more revenue for the clubs.

Nonetheless, because some of the data on these factors are difficult to obtain, I will attempt to create a viable regression model to predict market value based on only footballing skills and age. Below I detail the process and results obtained from this project.

Web Scraping with Beautiful Soup

I scraped the FIFA Index website with the classic Python library Beautiful Soup. In total, I collected data on 19,401 players, including their height, weight, age, preferred foot and skill ratings. Each skill is sub-divided into domains, each scored from 0–100, and I computed each skill rating as the mean of its domain scores. The skills and their respective domains are:

  • Ball Skills: Ball Control, Dribbling
  • Passing: Crossing, Short Pass, Long Pass
  • Defense: Marking, Slide Tackle, Stand Tackle
  • Mental: Aggression, Reactions, Attack Position, Interceptions, Vision, Composure
  • Physical: Acceleration, Stamina, Strength, Balance, Sprint Speed, Agility, Jumping
  • Shooting: Heading, Shot Power, Finishing, Long Shots, Curve, Free Kick Accuracy, Penalties, Volleys
  • Goalkeeping: Positioning, Diving, Handling, Kicking, Reflexes
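As a rough illustration of how a skill rating can be derived from its domains, here is a minimal sketch. The player scores are made up, and the function name `skill_ratings` is my own for illustration; it is not taken from the scraping script.

```python
from statistics import mean

# Hypothetical domain scores (0-100) for one player; the numbers are invented.
player_domains = {
    "Ball Skills": {"Ball Control": 88, "Dribbling": 84},
    "Passing": {"Crossing": 70, "Short Pass": 82, "Long Pass": 76},
    "Defense": {"Marking": 40, "Slide Tackle": 35, "Stand Tackle": 45},
}

def skill_ratings(domains_by_skill):
    """Collapse each skill's domain scores into a single mean rating."""
    return {
        skill: mean(scores.values())
        for skill, scores in domains_by_skill.items()
    }

ratings = skill_ratings(player_domains)
print(ratings)
```

Averaging keeps one feature per skill, which reduces the number of highly correlated columns fed into the regression.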

Below is the script that I used to scrape all my data:

After some light processing and removing a few players without market value data, here is the summary of the data frame:

Exploratory Data Analysis

Plotting the heatmap of features and target (Market Value) reveals some interesting trends: Height, Weight, Age, Preferred Foot and Goalkeeping seem to be uncorrelated with the target variable:

Image by Author

Thus, the question is: should we remove these features? We already understand from domain knowledge that age is an important predictor of Market Value. How about Goalkeeping skills? Note that Goalkeeping is negatively correlated with the other skill metrics. This is not surprising, as goalkeepers are specialists who seldom play outfield. However, some goalkeepers are highly valued, as shown in the pair plot:

Image by Author

Therefore, removing the Goalkeeping feature would probably cause the model to underpredict the market value of goalkeepers and also make the model unstable. The pair plot also shows that the distribution of market value is roughly exponential, revealing a winner-takes-all scenario.

Feature Selection and Engineering

Hence, we take the logarithm of the Market Value in an attempt to linearize the data. Furthermore, the features Height, Weight and Preferred Foot are dropped. As shown in the OLS regression summaries produced with the Statsmodels package, the model improved drastically:

Image by Author
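To see why taking the logarithm helps, here is a minimal sketch on synthetic data. It assumes a single hypothetical `rating` feature and an exponential relationship with made-up coefficients, and it uses NumPy least squares rather than the Statsmodels OLS used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the scraped data: value grows exponentially with rating.
rating = rng.uniform(50, 90, size=500)
log_value = 6.0 + 0.12 * rating + rng.normal(0, 0.3, size=500)
market_value = np.exp(log_value)

def r_squared(x, y):
    """R^2 of a one-feature least-squares line (with intercept) fitted to (x, y)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()

r2_raw = r_squared(rating, market_value)          # straight line through a curve
r2_log = r_squared(rating, np.log(market_value))  # straight line through a line
print(f"R^2 on raw value: {r2_raw:.2f}, on log value: {r2_log:.2f}")
```

On data like this, a straight line fits the log target far better than the raw target, which is the same effect the OLS summaries show on the real data.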

Not only did the Adjusted R² improve from 0.32 to 0.67, but the Log-Likelihood also increased, indicating a better fit. Furthermore, the residuals follow a much more normal distribution, with the Skew falling to almost 0 and the Kurtosis to almost 3 (the values expected of a normal distribution), which supports the normality assumption. In addition, the correlations between the features and the target variable improved substantially.
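For readers who want to check these two diagnostics by hand, here is a small sketch of how skew and kurtosis can be computed from residuals. It uses population moments and synthetic residuals; the function name `skew_and_kurtosis` is my own, and Statsmodels may apply slightly different corrections.

```python
import numpy as np

def skew_and_kurtosis(residuals):
    """Return (skewness, kurtosis) from population moments.

    Kurtosis is the non-excess (Pearson) version, so a normal
    distribution scores about 3.
    """
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    m2 = np.mean(centered**2)
    m3 = np.mean(centered**3)
    m4 = np.mean(centered**4)
    return m3 / m2**1.5, m4 / m2**2

rng = np.random.default_rng(0)
normal_like = rng.normal(size=20_000)        # roughly symmetric residuals
right_skewed = rng.exponential(size=20_000)  # long right tail

print(skew_and_kurtosis(normal_like))
print(skew_and_kurtosis(right_skewed))
```

The right-skewed sample produces a clearly positive skew and a kurtosis well above 3, which is the pattern the untransformed Market Value produces.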

Other insights that we can observe from the OLS summaries are the negative coefficients of the Age and Passing features. While we already know that Age is inversely correlated with Market Value, the negative coefficient for Passing comes as a surprise. This means that, holding the other features constant, a player who improves his Passing skills will, on average, see his Market Value fall! A possible explanation is that the market rewards more dominant players, and players who are better at passing might be weaker in other areas.

Model Selection

Now, to fit the data, we compare Linear Regression, LASSO Regression, Ridge Regression, Elastic Net Regression and Polynomial Regression. Note that LASSO, Ridge and Elastic Net are simply regularized versions of Linear Regression that add a penalty term to the cost function being minimized. Each penalty term contains a hyper-parameter Alpha, whose value is chosen during model selection. Before proceeding, we randomly split the entire data frame into an 80% Training Set and a 20% Test Set, holding out the Test Set for final model evaluation. For model selection, we further split the Training Set randomly into 5 folds for cross-validation.

Image by Author
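For reference, the cost functions of the regularized models can be written roughly as follows. The notation is my own: β are the coefficients, p is the number of features, and ρ is the Elastic Net mixing weight between the two penalties. Libraries differ in their scaling constants (for example, dividing the squared error by 2n), so treat this as a sketch of the general form rather than any package's exact definition.

```latex
\begin{aligned}
\text{Linear:} \quad & J(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 \\
\text{Ridge:} \quad & J(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \\
\text{LASSO:} \quad & J(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{Elastic Net:} \quad & J(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \alpha \left( \rho \sum_{j=1}^{p} \lvert \beta_j \rvert + (1 - \rho) \sum_{j=1}^{p} \beta_j^2 \right)
\end{aligned}
```

In every case, setting Alpha to 0 removes the penalty and recovers plain Linear Regression.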

By averaging the Mean Absolute Error over the validation folds of the 5-Fold CV, the optimal Alphas are selected for LASSO, Ridge and Elastic Net. The optimal degree for Polynomial Regression is chosen in the same way. The optimal Alphas and degree are therefore:

  • LASSO Regression: Approx. 0
  • Ridge Regression: 0.93
  • Elastic Net Regression: Approx. 0
  • Polynomial Regression: Degree 4
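The selection procedure above can be sketched as follows for Ridge. This is a minimal illustration on synthetic data, using a closed-form Ridge fit in NumPy instead of the library used in the project; the helper names `ridge_fit` and `cv_mae` and the Alpha grid are my own assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_mae(X, y, alpha, n_folds=5, seed=0):
    """Mean validation MAE over a shuffled K-fold split of the training set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[val] @ coef + intercept
        scores.append(np.mean(np.abs(y[val] - pred)))
    return float(np.mean(scores))

# Synthetic stand-in for the training set (not the scraped FIFA data).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(400, 8))
y_train = X_train @ rng.normal(size=8) + rng.normal(scale=0.5, size=400)

alphas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical search grid
mae_by_alpha = {a: cv_mae(X_train, y_train, a) for a in alphas}
best_alpha = min(mae_by_alpha, key=mae_by_alpha.get)
print(best_alpha, mae_by_alpha[best_alpha])
```

Every Alpha is scored on the same folds because the seed is fixed, so differences in MAE reflect the penalty strength rather than a different shuffle.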

Because the Alpha values for LASSO and Elastic Net are negligible, both models are effectively equivalent to the Linear Regression model. Comparing the performance of the remaining models, it is clear that Polynomial Regression (Degree 4) is the winner:

Model Evaluation

It’s time to score our Polynomial Regression model on the Test Set! Scored on Log Market Value, the test results remain comparable to those from cross-validation, with an r² of 0.930 and a Mean Absolute Error of 0.279, although the Kurtosis is around 1.

Image by Author

The Q-Q plot is also fairly linear, indicating a good model fit. However, before we get too happy with the results, recall that we took the logarithm of the target variable earlier. What happens when we "un-log" our predicted target variable and compare it with the original market value in the Test Set?

Image by Author

Scoring the test set on Market Value, without taking the logarithm, we get an r² of 0.836, a Mean Absolute Error of €668,684, and a Kurtosis of 234! The results are surprising; however, I should highlight that this is the best of the regression models compared here, given the limited features that we chose.
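The "un-log" step can be sketched as below. This assumes the natural logarithm was used (the base is not stated above) and runs on synthetic values, so the numbers it prints are not the project's results; the helper names `mae` and `r2` are my own.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic log market values and log-scale predictions (made-up numbers).
rng = np.random.default_rng(0)
log_actual = rng.normal(loc=13.5, scale=1.3, size=2000)
log_pred = log_actual + rng.normal(scale=0.35, size=2000)

# Undo the natural log to get back to the original currency scale.
actual, pred = np.exp(log_actual), np.exp(log_pred)

print(f"log scale: R^2={r2(log_actual, log_pred):.3f}, MAE={mae(log_actual, log_pred):.3f}")
print(f"EUR scale: R^2={r2(actual, pred):.3f}, MAE={mae(actual, pred):,.0f}")
```

A constant error on the log scale becomes a multiplicative error after exponentiating, so the most valuable players dominate the errors on the original scale.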

Below is a snapshot of the final data frame of the Test Set comparing the actual and predicted market values. While the predictions generally follow the trend of the actual Market Value, we can observe some outliers. In addition, the model also appears to favor Defenders, as evidenced by its placing Virgil van Dijk and Giorgio Chiellini in the top 5 rows. A more accurate model would have favored Strikers instead.

Conclusion & Future Work

Nonetheless, the results of the final model are rather respectable given the limited features that I used. In the future, should other data sources become available, important features such as players’ on-field position and media coefficient should be considered.

With that being said, I look forward to learning more Machine Learning techniques from the Metis Data Science Bootcamp in the coming weeks! Who knows? With more machine learning tools such as neural networks, we may revisit the project and engineer an even better model.

Here is the link to my GitHub, which contains all the code and presentation slides for this project.

What do you think of my model? Reach me on my LinkedIn or comment here below to discuss!

P.S. Thanks for reading my work on Regression. If you are also interested in learning about classification, check out my next project:

Predicting Satisfaction of Airline Passengers with Classification

Support me! – If you are not subscribed to Medium, and like my content, do consider supporting me by joining Medium via my referral link.

Join Medium with my referral link – Tan Pengshi Alvin

