Buying A Soccer Team: A Machine Learning Approach

Baban Deep Singh
Towards Data Science
15 min read · Aug 14, 2020


An approach that beats random guessing or hand-picking players from a pool of 18,000 professionals.

Photo by Thomas Serer on Unsplash

Sports have become a vital part of our lives and a hot market for investors seeking better returns, audience engagement, and visibility. The surge in sports viewership has led to more tournaments, and capitalizing on them is a difficult task for an investor. We took up the challenge of helping a major investor pick the best players from roughly 18,000 professional soccer players to build a dream team that can participate in, and outperform other clubs in, major leagues. We leveraged machine learning algorithms to classify potential team members for our club and to estimate the budget an investor needs in order to optimize their market gains. The result is a strategy for building the best possible team while keeping in mind the investor's budget limit of 1 billion euros.

INTRODUCTION

We work with a FIFA dataset that contains, among others, the columns rating, release clause, and wages. We assume these variables will not be available in upcoming out-of-time datasets. They can be used in several ways, for example to label a player as a strong performer, a moderate one, or not up to the mark, or to decide which players should be invited to club gatherings, events, and so on. We built two models. The first uses supervised learning on the rating variable, which we turned into a classification problem by splitting it into two classes: greater than or equal to 70 (a potential club member) and less than 70 (not a potential club member). We chose 70 as our threshold because most major clubs only field players rated above 70; to compete with them, we restrict ourselves to players above this threshold. The second model predicts the annual cost to investors of offering a player a club membership: it uses the rating class predicted by our best classifier (instead of the actual rating) among its features, and takes the combination of release_clause and annual wages, i.e., the cost to investors, as its dependent variable.

DATASET

We are using the FIFA 2019 and FIFA 2020 data from the Kaggle FIFA complete player dataset, which contains 18k+ unique players and 100+ attributes extracted from the latest edition of FIFA. It contains:

  • Files in CSV format.
  • FIFA 2020: 18,278 unique players and 104 attributes per player (test dataset).
  • FIFA 2019: 17,770 unique players and 104 attributes per player (train dataset).
  • Player positions, with the role in the club and in the national team.
  • Player attributes with statistics such as Attacking, Skills, Defense, Mentality, GK Skills, etc.
  • Player personal data such as Nationality, Club, DateOfBirth, Wage, Salary, etc.

DATA CLEANING

  • In some places, the two datasets have different data types for the same features. After reading the data dictionary, we brought them into sync.
  • Some variables have in-built formulas, so we corrected their formatting.
  • We removed ‘sofifa_id’, ‘player_url’, ‘short_name’, ‘long_name’, ‘real_face’, ‘dob’, ‘gk_diving’, ‘gk_handling’, ‘gk_kicking’, ‘gk_reflexes’, ‘gk_speed’, ‘gk_positioning’, and ‘body_type’ based on the dictionary definitions or because they are repeated columns, as they add no useful information to our analysis.
  • We converted the overall rating into 2 binary classes at a threshold of 70 (a cut-off many big clubs use when recruiting players); this binary rating is treated as our dependent variable (a short sketch follows this list).
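A minimal sketch of this cleaning step, assuming the Kaggle CSVs are loaded with pandas (the file names and the exact drop list here are illustrative):

```python
import pandas as pd

# Illustrative file names from the Kaggle "FIFA complete player dataset"
train = pd.read_csv("players_19.csv")   # FIFA 2019 -> training data
test = pd.read_csv("players_20.csv")    # FIFA 2020 -> test data

drop_cols = ["sofifa_id", "player_url", "short_name", "long_name", "real_face",
             "dob", "gk_diving", "gk_handling", "gk_kicking", "gk_reflexes",
             "gk_speed", "gk_positioning", "body_type"]
train = train.drop(columns=drop_cols, errors="ignore")
test = test.drop(columns=drop_cols, errors="ignore")

# Binarize the overall rating: 1 if the player clears the club threshold of 70
train["rating"] = (train["overall"] >= 70).astype(int)
test["rating"] = (test["overall"] >= 70).astype(int)
```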

EXPLORATORY DATA ANALYSIS

We first considered various interesting statistics for performing our exploratory data analysis.

  • Univariate statistics: the percentage of missing values across the data (to guide missing-value treatment), and summary statistics of the continuous variables (count, mean, std, min, max, skewness, kurtosis, unique, missing, IQR) along with their distributions.
  • Bivariate statistics: correlation among the features and t-tests for continuous variables, and the chi-square test and Cramer’s V for categorical variables.

Univariate

We performed univariate analysis on the continuous variables to get a sense of the distribution of each field in our dataset. Based on the summary statistics (mean, std, skewness, kurtosis, etc.), we observed that many key features follow a roughly normal distribution. Moreover, the interquartile range (IQR) was used to detect outliers with Tukey’s method.
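As an illustration, Tukey's fences can be computed per column roughly as follows (a minimal sketch; the example column 'wage_eur' is just an assumed continuous field):

```python
import pandas as pd

def tukey_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Example: flag potential outliers in an assumed continuous column
# outlier_mask = tukey_outliers(train["wage_eur"])
# print(outlier_mask.sum(), "potential outliers")
```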

For the categorical variables, the univariate analysis covers their count, the number of unique values, the most frequent category (top) and its frequency, and the number of missing values. From the categorical table, we can see that player_tags, loaned_from, nation_position, and player_traits have more than 54% missing values; it would not be easy to impute these with any reliable values.

Bivariate

For continuous variables

We built a correlation matrix to get a sense of the strength of the linear relationship between rating and the other explanatory variables, and of which variables can be excluded at later stages. We used the seaborn package in Python to create the heat map above.
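A minimal version of such a heat map (the figure size and color map are arbitrary choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation among the continuous features (including the binary rating)
corr = train.select_dtypes("number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0, square=True)
plt.title("Correlation matrix of continuous features")
plt.tight_layout()
plt.show()
```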

T-test

We also performed a t-test to check whether the mean of each variable when rating = 1 is significantly different from its mean when rating = 0. After this stage, we removed variables that are either not significant or have no correlation at all with the dependent variable.
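A sketch of this check with scipy (Welch's t-test; the 0.05 significance cut-off and the list of candidate columns are assumptions):

```python
from scipy import stats

def t_test_by_rating(df, cols, alpha=0.05):
    """Split each column by rating class and test whether the means differ."""
    significant, not_significant = [], []
    for col in cols:
        a = df.loc[df["rating"] == 1, col].dropna()
        b = df.loc[df["rating"] == 0, col].dropna()
        _, p = stats.ttest_ind(a, b, equal_var=False)
        (significant if p < alpha else not_significant).append(col)
    return significant, not_significant

# continuous_cols = [...]  # continuous candidates from the univariate step
# keep, drop = t_test_by_rating(train, continuous_cols)
```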

For categorical variables

We performed a chi-square test to check the significance of each categorical variable with respect to the dependent variable, rating. The table below contains the p-values for the categorical variables; we found that preferred_foot is not significant in our analysis.

To measure the association between the categorical variables and the dependent variable, we applied Cramer’s V (a computational sketch follows the list below).

V equals the square root of chi-square divided by the sample size n times m, where m is the smaller of (rows − 1) and (columns − 1): V = sqrt(χ² / (n · m)).

  • Interpretation: V may be viewed as the association between two variables as a percentage of their maximum possible variation. V² is the mean square canonical correlation between the variables. For 2-by-2 tables, V equals the chi-square-based measure of association (the phi coefficient).
  • Symmetricalness: V is a symmetrical measure. It does not matter which is the independent variable.
  • Data level: V may be used with nominal data or higher.
  • Values: Ranges from 0 to 1.
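Cramer's V can be computed from a contingency table roughly like this (a sketch built on scipy's chi-square test):

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = stats.chi2_contingency(table)
    n = table.to_numpy().sum()
    m = min(table.shape[0] - 1, table.shape[1] - 1)
    return float(np.sqrt(chi2 / (n * m)))

# Example: association between a categorical feature and the rating class
# print(cramers_v(train["preferred_foot"], train["rating"]))
```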

In this scenario, we kept the columns that showed a decent correlation with the dependent variable: ‘club_new’, ‘Pos’, ‘attack_rate’, and ‘nation’.

FEATURE ENGINEERING:

1. Re-categorizing/imputing Variables

  • Since team_jersey_number and nation_jersey_number are not actually continuous variables, we decided to treat them as categorical variables.
  • We imputed the missing values of team_position with ‘not played’ and re-categorized the players into defender, attacker, goalkeeper, resting, Mid-Fielder, substitute, and not played, reducing 29 unique values to 7 levels.
  • We conjecture that a goalkeeper will have the minimum values for ‘pace’, ‘shooting’, ‘passing’, ‘dribbling’, ‘defending’, and ‘physic’, so we imputed those fields accordingly.
  • Moreover, two variables, nationality and club, have very high cardinality. Based on their volume and event rate, we re-categorized them into lower-cardinality variables (see the sketch after this list).
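A rough sketch of these steps (the position-to-group mapping shown is only partial, and the exact re-categorization rules are assumptions):

```python
# Treat jersey numbers as categorical rather than numeric
for col in ["team_jersey_number", "nation_jersey_number"]:
    train[col] = train[col].astype("object")

# Impute and collapse team_position from 29 raw values to 7 broad levels
position_map = {
    "GK": "goalkeeper", "SUB": "substitute", "RES": "resting",
    # ...remaining raw positions mapped to attacker / defender / Mid-Fielder
}
train["team_position"] = (
    train["team_position"].fillna("not played").replace(position_map)
)

# Goalkeepers: impute the missing skill columns with the column minimum
gk_mask = train["team_position"] == "goalkeeper"
for col in ["pace", "shooting", "passing", "dribbling", "defending", "physic"]:
    train.loc[gk_mask, col] = train.loc[gk_mask, col].fillna(train[col].min())
```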

2. Creating Variables:

  • From the data, we observed that ‘player_positions’ lists each player's multiple playing positions, so we assigned each player the total count of on-field positions they can cover as ‘playing_positions’.
  • A player’s work_rate is given by his attack and defense rates, so we split it into two separate variables.
  • We also calculated how long an individual player has been associated with the club, to better understand their loyalty to the club.
  • We used one-hot encoding to put the categorical variables in a form that ML algorithms can use for better predictions (a sketch follows this list).
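A sketch of these derived features (column names such as player_positions, work_rate, and joined follow the Kaggle dataset; the exact parsing rules and the list of columns to encode are assumptions):

```python
import pandas as pd

# Count how many on-field positions a player can cover
train["playing_positions"] = train["player_positions"].str.split(",").str.len()

# Split work_rate (e.g. "High/Medium") into attack and defense rates
train[["attack_rate", "defense_rate"]] = train["work_rate"].str.split("/", expand=True)

# Years with the current club as a loyalty proxy
train["club_tenure"] = 2019 - pd.to_datetime(train["joined"]).dt.year

# One-hot encode the categorical features for the ML algorithms
categorical_cols = ["team_position", "attack_rate", "defense_rate", "nation"]
train = pd.get_dummies(train, columns=categorical_cols, drop_first=True)
```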

3. MODEL 1

Here, Y = rating, with a population event rate of 31.23% for class 1.

3.1. Logistic Regression:

For the logistic regression model, we first performed the classification without regularization, followed by ridge (L2) and lasso (L1) regularized versions. L1-regularized logistic regression requires solving a convex optimization problem; however, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings.

With a penalty term, the objective of logistic regression is to minimize the negative log-likelihood plus a regularization term:

L(w) = −Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ] + λ‖w‖₁ (lasso / L1) or + λ‖w‖₂² (ridge / L2),

where pᵢ = 1 / (1 + e^−(w·xᵢ + b)) is the predicted probability of class 1.
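A minimal sketch of fitting the three variants with scikit-learn, where X and y stand for the engineered feature matrix and the binary rating label (solver choices and C values are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "no regularization": LogisticRegression(penalty=None, max_iter=1000),  # penalty="none" on older scikit-learn
    "ridge (L2)": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "lasso (L1)": LogisticRegression(penalty="l1", C=1.0, solver="saga", max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")
```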

The best results from the logistic regression models, before and after regularization (L1 and L2), are summarized below:

3.2. KNN:

kNN is a case-based learning method that keeps all the training data for classification. One of the evaluation standards for different algorithms is their performance. Since kNN is a simple yet effective classification method, and is convincingly one of the most effective, we are motivated to build a kNN model and improve its efficiency while preserving its classification accuracy.

Looking at Figure 1, a training dataset of 11 data points with two classes {square, triangle} is distributed in a 2-dimensional data space. If we use Euclidean distance as our similarity measure, many data points with the same class label lie close to each other in a local area.

For instance, if we take the region for k = 3, represented by the solid circle, and take a majority vote among the classes, our query point {circle} is classified as a triangle. However, if we increase k to 5, represented by the dotted circle, the point is classified as a square. This motivates us to tune the k-nearest neighbors algorithm to find the k at which the classification error is minimal.

Experiment:

We initially trained our k-NN model with k = 1, splitting the data 70%-30% into training and validation sets. From Table 2, we observe that the training accuracy is 1, which means the model fits the training data perfectly; however, the accuracy and AUC on the test data are higher than on the validation data, which is indicative of overfitting, so we need to perform parameter tuning.

Optimization:

We used the elbow method to find the k with the lowest error. After searching for the best k, we found the lowest error rate at k = 7. Although the optimized model performed better on the training and validation data, the test AUC decreased.
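A sketch of that elbow search (the range of k and the choice of validation error as the metric are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 26)
error_rate = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rate.append(1 - knn.score(X_val, y_val))  # misclassification rate on validation data

best_k = k_values[int(np.argmin(error_rate))]
print("best k:", best_k)
```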

Even though test accuracy is reduced, precision and recall have increased, indicating that the model now classifies more of class 1 correctly, which is our target class (players rated above 70).

3.3. DECISION TREE:

The decision tree method is a powerful statistical tool for classification, prediction, interpretation, and data manipulation that has several potential applications in many fields.

Using decision tree models has the following advantages:

  • Simplifies complex relationships between input variables and target variables by dividing original input variables into significant subgroups.
  • A non-parametric approach with no distributional assumptions, so it is easy to understand and interpret.

The main disadvantage is that it can be subject to overfitting and underfitting, particularly when using a small data set.

Experiment:

We trained a decision tree classifier from the sklearn library without passing any parameters. From the table, we observed overfitting of the data, so we must tune the parameters to get optimized results.

Optimization:

We worked with the following parameters:

  • criterion: string, optional (default="gini"): The function to measure the quality of a split, either "gini" or "entropy".
  • max_depth: int or None, optional (default=None):

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split: int, float, optional (default=2):

The minimum number of samples required to split an internal node.

  • min_weight_fraction_leaf: float, optional (default=0.0):

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

From the experiments above, we see that Gini outperforms entropy across all variants of the experimental parameters, so our criterion is Gini. Similarly, from the other parameters, max_depth = 10, min_samples_split = 17.5, and min_weight_fraction_leaf = 0 with Gini give higher accuracy. Training the model with these parameters, we observe no overfitting, and we capture more true positives in the class 1 category.
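A sketch of that tuning loop with scikit-learn's GridSearchCV (the grid values here are illustrative rather than the exact grid used):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 10, 20],
    "min_weight_fraction_leaf": [0.0, 0.01, 0.05],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```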

3.4. SUPPORT VECTOR MACHINES:

The folklore view of SVMs is that they find an “optimal” hyperplane as the solution to the learning problem. The simplest formulation of the SVM is the linear one, where the hyperplane lies in the space of the input data x.

In this case, the hypothesis space is a subset of all hyperplanes of the form:

f(x) = w⋅x +b.

Hard Margin Case:

The maximum-margin separating hyperplane is found by solving:

minimize ½‖w‖² over (w, b), subject to yᵢ(w·xᵢ + b) ≥ 1 for every training point i.

Soft Margin Case:

Slack variables ξᵢ enter the objective function as well:

minimize ½‖w‖² + C Σᵢ ξᵢ, subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for every training point i.

The cost coefficient C>0 is a hyperparameter that specifies the misclassification penalty and is tuned by the user based on the classification task and dataset characteristics.

RBF SVMs

In general, the RBF kernel is a reasonable first choice. This kernel nonlinearly maps samples into a higher-dimensional space, so it, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF since the linear kernel with a penalty parameter Ĉ has the same performance as the RBF kernel with some parameters (C, γ). The second reason is the number of hyperparameters which influences the complexity of model selection.

Experiments:

We trained a linear SVM classifier on our data without tuning it for soft margins. The results look promising; the reason for the good score is that the data is almost linearly separable most of the time, with very few misclassifications.

Optimization:

We decided to run a grid search over linear and radial basis function kernels with varying C and γ to train our model efficiently. From the grid search, the best linear-kernel estimator we obtained was

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)

And for the radial basis function, we got our best estimators as

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)

Using the validation error as an estimate of the generalization error (expected loss), which in turn approximates the population error, we observed that the validation error of the RBF-kernel model is the smallest among all our models. This is therefore our best model, as it fits the data better than the rest.

The RBF kernel implicitly maps the data into an infinite-dimensional space, which helped our model stand out. The precision-recall curve shows how well the positive class is predicted, with an AUC of 0.961.
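A sketch of the kind of grid search described above (the candidate values for C and γ are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
]
svm_grid = GridSearchCV(SVC(probability=True), param_grid, scoring="roc_auc", cv=5)
svm_grid.fit(X_train, y_train)
print(svm_grid.best_estimator_)
```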

4. MODEL 2:

Here, X is the same feature set plus the predicted rating from Model 1, and Y = release clause + 52 × wage is the cost to investors (the wage is given weekly, hence the factor of 52).

After selecting significant variables, from Univariate and Bi-variate analysis as earlier, we plotted a scatter plot of independent variables with the dependent variables.

It is clearly visible that the variables follow a relationship, but it does not appear linear. We confirmed this by developing a linear model.

4.1. Linear Model:

Results:

  • R square (train): 0.54
  • R square (validation): 0.55
  • R square (test): 0.54

R square measures how close the predictions are to perfect prediction. Here, R square is not good.

Checking linearity from the residuals: the residuals should be randomly scattered, but here we found that they are not random. This means a linear model would never be a good choice for this problem.
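A sketch of that residual check, assuming a scikit-learn LinearRegression and Model 2 train splits named X_train2 and y_train2 (both names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(X_train2, y_train2)
fitted = lin_reg.predict(X_train2)
residuals = y_train2 - fitted

plt.scatter(fitted, residuals, s=5, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Predicted cost to investors")
plt.ylabel("Residual")
plt.title("Residuals vs. fitted values")
plt.show()
```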

4.2. Decision Trees: a better choice than the linear model in this scenario.

Results (Baseline):

  • Train data: R square 0.99, RMSE 0.05
  • Validation data: R square 0.54, RMSE 8.05
  • Test data: R square 0.59, RMSE 7.35

There was a clear indication of overfitting; the model was not performing as expected. Therefore, we ran a grid search over min_samples_split, max_depth, min_weight_fraction_leaf, and the splitting criterion.
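A sketch of that grid search with scikit-learn's DecisionTreeRegressor, again with illustrative grid values and the assumed X_train2/y_train2 names:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "criterion": ["squared_error", "friedman_mse", "absolute_error"],  # "mse"/"mae" in older scikit-learn
    "max_depth": [5, 10, 15, 20],
    "min_samples_split": [2, 3, 5, 10],
    "min_weight_fraction_leaf": [0.0, 0.01, 0.05],
}
reg_grid = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, scoring="r2", cv=5)
reg_grid.fit(X_train2, y_train2)
print(reg_grid.best_params_)
```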

As shown above, Entropy performed better with min_split=3 and max_depth=15.

Results after Grid Search: (Main Model)

  • Train data: R square 0.85, RMSE 4.40
  • Validation data: R square 0.69, RMSE 6.59
  • Test data: R square 0.70, RMSE 6.26

The R square values look far better now, the RMSE is lower, and the overfitting problem is resolved.

Hence, Decision Trees performed better here in order to predict the cost to investors.

Final Strategy:

The final step was to make a strategy to pick players for our team keeping in mind:

  • Rating should be greater than 70 (i.e., class 1).
  • Budget: 1 billion euros, with around 30 players.

Firstly, we selected only the players with ratings above the threshold of 70. Number of players left: 5,276.

Secondly, we performed a decile-style analysis of the cost to investors: we split the remaining pool into buckets of approximately 30 players each and sorted the buckets by cost to investors in descending order.
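A sketch of that bucketing step (the column names pred_rating and pred_cost are assumed outputs of Model 1 and Model 2):

```python
# Keep only players predicted as class 1 (rating above the 70 threshold)
pool = test[test["pred_rating"] == 1].copy()

# Sort by predicted cost to investors and cut into buckets of ~30 players
pool = pool.sort_values("pred_cost", ascending=False).reset_index(drop=True)
pool["bucket"] = pool.index // 30 + 1

# Total cost (in euros) of picking an entire bucket of ~30 players
bucket_cost = pool.groupby("bucket")["pred_cost"].sum()
print(bucket_cost.head(12))
```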

Here, we can observe that picking the whole team from the first bucket would cost 3.45 billion euros, which is out of budget, so we cannot simply pick the top 30 players. Picking the team from the 11th bucket would cost 0.945 billion euros, which is within budget. However, picking all the players from that bucket alone would be the wrong strategy, because we would leave out almost 300 higher-valued players above it. The best solution is to pick 8-10 core players from the top buckets and the rest of the players from the medium- and low-valued buckets.

This decision can be easily made by the above analysis and it is up to the investors and team managers to decide what kind of players they want in their team.

5. CONCLUSION:

In this work, we constructed two models that use machine learning algorithms to benefit investors: first classifying players as good performers, then regressing their cost against the investor’s budget. The combined classification-and-regression fit is a new, supervised selection model for building a team that can outperform other clubs. Ultimately, we have narrowed down the player selection process within a club, which is considerably better than selecting at random.

Future Scope: We could also apply time series techniques, as both of our dependent variables, rating and cost, depend on previous years’ data. For example, if a player has a rating of 85 in Dec ’19, his rating in Jan ’20 would be around 85 ± 3. Therefore, time series techniques might be useful for this data.



I am currently enrolled in the Master’s in Business Analytics program at the University of Illinois at Chicago, and I am aiming to work in research or pursue a PhD in data science.