
Replacing Lewa!

Using Machine Learning to find potential successors to the great Robert Lewandowski.

Succeeding Lewa!

Using K-Means Clustering for finding players similar to Robert Lewandowski of Bayern Munich and visualising the results.

Photo by Nur Bayraktepe on Unsplash

This one is for all the football-crazy data scientists! Watching Robert Lewandowski bang in goal after goal every week has felt normal for a good number of years now. His career has been filled with goals at every stage, culminating last season in the magnificent treble. Though he shows no signs of slowing down, we must acknowledge that he is 32 now, well beyond a striker’s prime, and is likely to start declining in the near future. When the time comes, Bayern will find it quite difficult to find an apt replacement, and rightly so: players like him are a rarity. For me, he is the best no. 9 in world football. This article uses K-Means Clustering on a FIFA 21 dataset to find players with attributes similar to Lewa’s. Let’s dig in!


The Dataset and K-Means

I have used the FIFA 21 dataset from Kaggle for this approach. The link contains datasets from FIFA 15 up to FIFA 21. Now, on to K-Means Clustering.

Since the problem requires finding players with traits similar to Lewandowski’s, I decided to use clustering, an unsupervised technique, to find the similarities.

K-Means Clustering is a technique in which, given the number of clusters N, the data is divided into N clusters based on the features of each data entry. The features are numeric and the clusters are formed iteratively: each point is assigned to its nearest centroid, the centroids are recomputed, and the process repeats until the assignments stop changing or an iteration limit is reached. Data within each cluster is more similar to other data in the same cluster than to data in other clusters. Similarity can be measured with different distance metrics, such as the Euclidean Distance and the Manhattan Distance; standard K-Means uses the Euclidean Distance. To know more about K-Means please refer to this link.

The features that I chose for clustering the players are physical and technical attributes, with a focus on a striker’s attributes. So don’t expect Messi to show up in the same cluster as Lewandowski, since they are miles apart physically!


Looking into the Code!

Part 1: Understanding the data and selecting features

In this part we first inspect the dataset. Then, based on its columns, we build a dictionary whose keys are the features we would like to use and whose values are the corresponding column indices in the dataframe.

Here are some rows and some columns from the dataset

We then drop the rows which have empty values. Feel free to go through the features I have chosen and add or remove any you want. All the features are numeric and contribute to a player’s overall game. I have included defensive and midfield features too, to compare players in much more detail.
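As a rough illustration of the feature dictionary and the row-dropping step, here is a minimal sketch on a toy dataframe. The column names (`short_name`, `shooting`, `pace`, `attacking_finishing`) and the values are assumptions made up for the example, not the actual 38 features used in this article.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the FIFA 21 table; names and values are illustrative assumptions.
players = pd.DataFrame({
    "short_name": ["Striker A", "Striker B", "Keeper C", "Winger D"],
    "overall": [91, 88, 89, 85],
    "shooting": [91.0, 85.0, np.nan, 80.0],
    "pace": [78.0, 84.0, np.nan, 93.0],
    "attacking_finishing": [94, 88, 13, 79],
})

# Hypothetical feature choice for the example.
feature_names = ["shooting", "pace", "attacking_finishing"]

# Dictionary: feature name -> column index in the dataframe.
feature_index = {name: players.columns.get_loc(name) for name in feature_names}

# Drop rows that have an empty value in any of the chosen features.
clean = players.dropna(subset=feature_names).reset_index(drop=True)

# Numeric feature matrix that a clustering step could consume.
X = clean.iloc[:, list(feature_index.values())].to_numpy(dtype=float)

print(feature_index)
print(X.shape)
```

Keeping the names and indices in one dictionary makes it easy to add or remove a feature in a single place.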


Part 2: Determining the number of clusters and applying K-Means

K-Means requires us to specify the number of clusters up front, and choosing the right number is integral to the algorithm. A common way to figure out the right number of clusters is to calculate the Within-Cluster-Sum-of-Squares (WCSS). WCSS is the sum of the squared distances between each data point and the centroid of its cluster, taken over all clusters. For a fixed N, K-Means tries to minimise this sum. WCSS always shrinks as N grows, so rather than looking for the smallest value we look for the point where adding more clusters stops helping much.

The WCSS formula
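To make the definition concrete, here is a small sketch that computes WCSS by hand with NumPy for four made-up points split into two clusters. The function name `wcss` and the toy data are assumptions for illustration only.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its own cluster centroid."""
    diffs = X - centroids[labels]  # each row minus the centroid of its cluster
    return float(np.sum(diffs ** 2))

# Four toy points, two per cluster.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# The centroid of each cluster is the mean of its points.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

total = wcss(X, labels, centroids)
print(total)  # 1 + 1 + 4 + 4 = 10.0
```

For a fitted scikit-learn `KMeans` model, the same quantity is exposed as the `inertia_` attribute.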

We then plot the WCSS for a range of N and observe the resulting graph, called the elbow curve.

The elbow graph for our case

Now we must choose the value of N (on the x-axis) after which the drop in WCSS becomes minimal or roughly linear. As we can see, that value in this graph is 6. Hence we apply K-Means with N = 6, which gives us 6 clusters.
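The elbow procedure can be sketched as follows. This is an illustration on synthetic data with three well-separated blobs (so the elbow lands at N = 3 here, not 6); the data, the range of N and the random seeds are all assumptions, and the WCSS values come from scikit-learn's `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the player features: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

candidate_n = range(1, 9)
wcss_values = []
for n in candidate_n:
    model = KMeans(n_clusters=n, init="k-means++", n_init=10, random_state=0)
    model.fit(X)
    wcss_values.append(model.inertia_)  # WCSS for this value of N

# Drop in WCSS when going from N to N + 1; the elbow is where this flattens out.
drops = -np.diff(wcss_values)

for n, value in zip(candidate_n, wcss_values):
    print(f"N={n}: WCSS={value:.1f}")
```

Plotting `wcss_values` against `candidate_n` with any plotting library gives the elbow curve.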

I used scikit-learn to carry out the clustering, and it ran pretty fast. The parameters used are the number of clusters, the number of iterations and the initialisation method. Please note that it is important to initialise the cluster centroids well. Random initialisation can leave the algorithm stuck in a poor local optimum, and to avoid this we use the k-means++ initialisation.

I will not go into the details of the initialisation but curious readers can look this up. Below is the code for this part
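A minimal sketch of this step, not the exact code behind the results in this article, might look like the following. The random matrix stands in for the 38 player features, and the values of `max_iter`, `n_init` and `random_state` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: 200 "players" with 38 rating-like features (made up).
rng = np.random.default_rng(42)
X = rng.uniform(30, 95, size=(200, 38))

kmeans = KMeans(
    n_clusters=6,        # chosen from the elbow curve
    init="k-means++",    # spreads the initial centroids apart
    max_iter=300,        # assumed cap on iterations per run
    n_init=10,           # assumed number of restarts; the best run is kept
    random_state=42,
)

# Cluster label (0 to 5) for every row of X.
labels = kmeans.fit_predict(X)

print(np.bincount(labels))  # number of players in each cluster
```

With a dataframe of players, `labels` can be stored as a new column so that every player carries a cluster id.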


Part 3: Finding strikers in the same cluster as Lewandowski and using PCA to visualise some similar strikers.

Finding players in Lewandowski’s cluster is easily achieved through a few smart loops. I made sure to filter out players who were not out-and-out strikers from the list.
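The same lookup can also be written without explicit loops using pandas boolean masks. The sketch below is an alternative illustration, not the loop-based code described above; the column names, the toy rows and the rule "primary position is ST" for an out-and-out striker are all assumptions.

```python
import pandas as pd

# Toy table with an already-assigned cluster column (values are made up).
players = pd.DataFrame({
    "short_name": ["R. Lewandowski", "Player A", "Player B", "Player C", "Player D"],
    "player_positions": ["ST", "ST", "CAM, CM", "ST, LW", "ST"],
    "overall": [91, 85, 88, 84, 80],
    "cluster": [2, 2, 2, 2, 4],
})

target = "R. Lewandowski"

# Cluster id of the target player.
target_cluster = players.loc[players["short_name"] == target, "cluster"].iloc[0]

# Everyone else in the same cluster.
same_cluster = players[
    (players["cluster"] == target_cluster) & (players["short_name"] != target)
]

# Assumed rule: a striker is a player whose first listed position is ST.
primary_position = same_cluster["player_positions"].str.split(",").str[0].str.strip()
similar_strikers = same_cluster[primary_position == "ST"].sort_values(
    "overall", ascending=False
)

print(similar_strikers["short_name"].tolist())
```

Sorting by `overall` makes it easy to take the top few strikers for plotting later.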

Since our data has many features (38 to be precise), we can’t visualise the results directly. We must reduce these 38 dimensions to 2, while making sure that the 2 dimensions capture as much of the variance in the 38 features as possible, so that projecting into this space causes minimal loss of information. This, in a nutshell, is dimensionality reduction, and it can be achieved using the PCA algorithm. I will not go deep into PCA, especially its derivation, since it is replete with linear algebra which not everyone may be familiar with. Readers who love exploring the mathematical nuances can refer to this comprehensive article describing everything about PCA.

Here are the simple steps for PCA

  1. Standardise the data so that each feature has zero mean and unit variance.
  2. Compute the covariance matrix.
  3. Find the eigenvalues and eigenvectors of this matrix.
  4. Sort the eigenvalues in descending order of magnitude and then sort the eigenvectors corresponding to these eigenvalues in the same order.
  5. Pick the first n eigenvectors. (n is the number of dimensions in the projection plane.)
  6. Return the matrix product of (eigenvectors.T,data.T).T where T is the transpose of a matrix. (Refer to the code for the shapes of the elements involved)
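The six steps above can be sketched in NumPy as follows. This is an illustrative implementation on random data, not the article's own code; the function name `pca_project` is hypothetical, and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project data of shape (n_samples, n_features) onto its top principal components."""
    # Step 1: standardise each feature to zero mean and unit variance.
    scaled = (data - data.mean(axis=0)) / data.std(axis=0)

    # Step 2: covariance matrix of the features, shape (n_features, n_features).
    cov = np.cov(scaled, rowvar=False)

    # Step 3: eigenvalues and eigenvectors (eigenvectors are the columns).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 4: sort both in descending order of eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Step 5: keep the first n eigenvectors, shape (n_features, n_components).
    top = eigenvectors[:, :n_components]

    # Step 6: (eigenvectors.T @ data.T).T, shape (n_samples, n_components).
    projected = (top.T @ scaled.T).T

    # Share of the total variance captured by each kept component.
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, explained

# Random stand-in for 100 players with 38 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 38))

projected, explained = pca_project(data)
print(projected.shape)
```

The two columns of `projected` can be used directly as x and y coordinates in a scatter plot.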

After getting the transformed vectors, I picked the 15 top strikers (by overall rating) for visualisation. The graph adds value to mere numbers and puts together a cogent story.


Here is the result!

The graph showing 14 players similar to Lewandowski
The details of all the similar players

Conclusion and Analysis

I was pretty happy with the results, since I would personally pick a few players from this list! The most interesting thing to note is how close Aubameyang is to Lewandowski. This makes sense, as both are goal-scoring machines, similar in attributes and exactly the same age! Others like Lukaku and Kane are a bit further away, probably due to the difference in age. Personally, I would say that Kane would be the best possible candidate, but too expensive! He is in his absolute prime and a very well-rounded player! Lukaku and Icardi would do well too! Eden Hazard was a surprise here, as he is physically very different from Lewandowski. It was nice to see Ilicic, who was so vital to Atalanta’s success last season, and Dzeko, who is still going strong at 34, included in the list. Immobile makes a strong case too! Hopefully Lewa keeps scoring for a long time though.

This is my first Medium article, so please let me know what you thought of it. Please leave some claps if you enjoyed it! Check out my GitHub for some other projects. You can contact me here. Thank you for your time!

The entire code is given below

