Sports Analytics

A few days ago, I was fortunate to take part in a football analytics hackathon organized by xfb Analytics[1], Transfermarkt[2], and Football Forum Hungary[3].
As we recently received permission to share our work, I decided to write a blog post about the approach I used.
The goal was to pick a Premier League team, analyze their playing style, highlight two flaws, and prepare two lists of 5 players each that could help the team improve. The premise was that we had to look to fill two different positions (hence the "two lists of 5 players each").
Then, from those two lists, we had to pick the top target in each and further explain why they were the best fit for their respective positions.
The final result had to be realistic and the sum of both players’ prices had to be below 60M (we were given their Transfermarkt valuations).
Now that you know what it was about, I want to talk about my approach. I’m a data science guy who loves football so I had to perform some sort of technical analysis or modeling with Python.
Here’s how I’ll structure this post:
- Introductory Analysis
- Player Clustering
- Picking the Defensive Midfielder
- Picking the Striker
- Conclusion
Keep in mind that this was a hackathon: time was limited and resources were scarce. With more time and more data, the results would likely have been better.
Introductory Analysis
When it comes to player recruitment, data is probably our best friend. It doesn’t guarantee anything in the future, but it allows us to understand the past and present of a player in a purely objective manner: his playing style, his profile, advantages and disadvantages…
For that reason, I wanted this project to be 90% based on data, and let common sense reign over the remaining 10%.
The team that I chose was Fulham. Why Fulham? First, I didn’t want any top team (e.g. Liverpool, Man City, Arsenal…). I had no preference among the remaining 15 teams and knew very little about them. Each had its flaws and strengths, so all were valid options.
I did what anyone in my position would do: let another person choose for me. It was my girlfriend, and she picked Fulham because they were sitting in 12th position, our number.
Having never watched Fulham, I started to dig in. I was surprised to see that I actually knew some of their players: Adama Traoré, Willian, Raúl Jiménez, Bernd Leno, and João Palhinha. Not bad, right?
After reading some posts and watching some videos, here are some takeaways from their playing style:
- They play in a 4–2–3–1 formation that turns into a 4–4–2 (or 4–4–1–1) on defense.
- They like to play in a possession-based style, building from the back.
- A lot of their attacking play is focused on wide play and getting crosses into the box. On counterattacks, they also try to move the ball towards the flanks.
- You rarely see Fulham pressing high. They tend to let the opponent come out before facing their mid-block.
Okay, as in any data science project, getting to know the context of the data is crucial. At this point, I feel like we know enough. It’s now time to find the problems we need to solve (i.e. discover two of their flaws).
Fulham has had major struggles this season in two different areas:
- Build-up: They’ve lost a lot of balls this season trying to build from the back. The way other teams press them forces them to rely mostly on Palhinha, as the other defensive midfielder (DM) isn’t an asset for this possession-based football.
- Attack: Last season, Mitrović was their key player: a traditional striker with lots of goals and quality. He left last summer and the team hasn’t been able to find a proper replacement. Muniz is filling that position but he’s only scored 7 goals this season, roughly the same number as Jiménez, Iwobi, Bobby Reid, or Willian.
What about the team’s transfer history? We could end this analysis by proposing Haaland and Rodri for the striker (ST) and defensive midfielder (DM) positions, but remember the 10% we left for common sense? This is where we apply it.
We want to find players that fit within the team’s budget, philosophy, and status. Seeing Haaland in a Fulham jersey is almost impossible. So, we have to familiarize ourselves with the team’s past expenses:
![Player expenses by season for Fulham, as per Transfermarkt[2] - Image by the author](https://towardsdatascience.com/wp-content/uploads/2024/04/1fAjE25uiQCgB_C2aYM2nHA.png)
We see a huge shift in the summer of 2022. The team can now afford to spend around 70M and has invested in more expensive players. Therefore, a 60M budget fits well within the team’s position, and spending 30M or 40M on a single player would not be disproportionate.
Player Clustering
Let machine learning enter the conversation.
My goal was to create clusters of players with similar characteristics and find the best possible targets from those clusters that had what we needed.
So, with already cleaned data, I scaled it:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# `data` is the already-cleaned DataFrame with one row per player
num_vars = ['minutes_played', 'goals_cnt', 'shots_total_cnt',
            'shots_on_target_cnt', 'off_deff_aerial_duels_won_cnt',
            'fouls_suffered_cnt', 'offsides_cnt', 'tackles_total_cnt',
            'interceptions_total_cnt', 'clearances_total_cnt',
            'fouls_committed_cnt', 'passes_total_cnt', 'passes_accurate_cnt',
            'long_passes_total_cnt', 'long_passes_accurate_cnt']
onehot_vars = ['AM', 'D', 'DM', 'FW', 'GK', 'M', 'Sub']

num_x = data[num_vars].to_numpy()
onehot_x = data[onehot_vars].to_numpy()

# Scale numerical vars
scaler = StandardScaler()
scaled_x = scaler.fit_transform(num_x)

# Concatenate scaled_x and onehot_x
x = np.concatenate((scaled_x, onehot_x), axis=1)
```
Here’s where the lack of time and resources probably hurt the most: I used the numerical variables that were available. I’d have liked to use more and better features.
However, what I did was pretty simple: I separated those numerical variables from the one-hot encodings and scaled the first group using scikit-learn’s StandardScaler[4]. Then, I created the x array by concatenating the scaled numerical features and the one-hot encoded ones.
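To make the scaling step concrete, here is a minimal sketch on made-up numbers (not the hackathon data). It only illustrates what StandardScaler does to columns that live on very different scales, such as minutes played and goals.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales,
# e.g. minutes played and goals scored.
toy = np.array([
    [2700.0, 20.0],
    [1800.0, 5.0],
    [900.0, 2.0],
    [3000.0, 1.0],
])

toy_scaled = StandardScaler().fit_transform(toy)

# Each column now has mean ~0 and standard deviation ~1,
# so no single feature dominates the distances K-Means relies on.
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0).round(6))
```

Without this step, a feature measured in thousands (minutes, passes) would outweigh one measured in units (goals) when distances between players are computed.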
Those that I one-hot encoded were simply player-position data. I simplified them to have only goalkeepers (GK), defenders (D), defensive midfielders (DM), central midfielders (M), attacking midfielders (AM), forwards (FW), and Subs (Sub).
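For readers who haven’t built one-hot columns before, this is roughly how they could be produced with pandas. The `players` table and its `position` column are hypothetical, and I’m assuming one simplified label per player, which may not match how the original dataset was structured.

```python
import pandas as pd

# Hypothetical raw data: one simplified position label per player.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "position": ["GK", "DM", "FW", "DM"],
})

# Fix the category list so every position gets a column,
# even if it is absent from this small sample.
positions = ["AM", "D", "DM", "FW", "GK", "M", "Sub"]
players["position"] = pd.Categorical(players["position"], categories=positions)

# One 0/1 column per position.
onehot = pd.get_dummies(players["position"], dtype=int)
encoded = pd.concat([players[["player"]], onehot], axis=1)
print(encoded)
```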
Once the data was prepared, it was time to create the clusters. I didn’t have any labels so I had to go with an unsupervised learning algorithm. The selected one was K-Means.
For those who don’t know, K-Means is a method that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (that’s called the centroid).
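The definition above can be sketched in a few lines of NumPy. This is a bare-bones version of the classic assign-then-average loop (Lloyd’s algorithm) on synthetic points, written only for illustration. In the project itself I used scikit-learn’s KMeans, and `kmeans_sketch` is a made-up helper name.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of its members (keep the old one if a cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
labels, centroids = kmeans_sketch(np.vstack([blob_a, blob_b]), k=2)
print(centroids.round(2))
```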
We know n is the number of players in our x array (2306), but what about the number of clusters (k)? The most common way to choose k is the Elbow method: plot the sum of squared errors (SSE) against k and look for the point where the curve bends like an elbow. That bend is where we take our k from.
"Silhouette coefficients" would have been another option to use, but I kept it simple. Here’s how I plotted it:
```python
import seaborn as sns
from sklearn.cluster import KMeans

sse = []
clusters = 50

# Fit K-Means for k = 1..49 and store the SSE (inertia) of each fit
for k in range(1, clusters):
    kmeans = KMeans(
        init="random",
        n_clusters=k,
        n_init=10,
        max_iter=300,
        random_state=42
    )
    kmeans.fit(x)
    sse.append(kmeans.inertia_)

ax = sns.lineplot(x=range(1, clusters), y=sse)
ax.set_xlabel("Number of Clusters")
ax.set_ylabel("SSE")
```

The optimal value isn’t really clear at first sight… It could be anywhere from around 8 to 15. Another way to read the elbow is as "the point at which the value begins to decrease almost linearly", so 15 made sense.
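I picked the elbow by eye, but the choice can also be made less subjective. The sketch below is not what I did in the hackathon: it is one simple heuristic (take the point that sits farthest below the straight line joining the first and last SSE values), shown on synthetic blobs. `elbow_by_chord` is a hypothetical helper, and it assumes the SSE curve starts at its maximum and ends at its minimum.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def elbow_by_chord(ks, sse):
    """Return the k whose (k, SSE) point lies farthest below the first-to-last chord."""
    ks = np.asarray(ks, dtype=float)
    sse = np.asarray(sse, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates.
    x_n = (ks - ks.min()) / (ks.max() - ks.min())
    y_n = (sse - sse.min()) / (sse.max() - sse.min())
    # For a decreasing curve the chord is y = 1 - x; measure the gap below it.
    gap = (1.0 - x_n) - y_n
    return int(ks[gap.argmax()])

# Synthetic data with 4 obvious groups (a stand-in for the player matrix).
x_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=42,
)

ks = list(range(1, 11))
sse_demo = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(x_demo).inertia_
    for k in ks
]
elbow_k = elbow_by_chord(ks, sse_demo)
print(elbow_k)
```

On real, noisy data like the player matrix, the bend is much softer than on these blobs, so a heuristic like this is better treated as a second opinion than as the answer.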
```python
# I'll use 15 clusters
k = 15
kmeans = KMeans(
    init="random",
    n_clusters=k,
    random_state=42
)
kmeans.fit(x)
data['cluster'] = kmeans.predict(x)
```
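As a cross-check on the choice of k, the silhouette coefficient mentioned earlier could be computed for a range of values. I didn’t run this during the hackathon, so the snippet below is only a sketch on synthetic data, using scikit-learn’s `silhouette_score`. Scores range from -1 to 1, and higher means tighter, better-separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the player matrix: 4 clear groups.
x_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=42,
)

scores = {}
for k in range(2, 9):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(x_demo)
    scores[k] = silhouette_score(x_demo, labels)

# Keep the k with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```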
To analyze those clusters, I’ll plot the parallel coordinates of the cluster centroids:
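```python
import matplotlib.pyplot as plt
import pandas as pd

# One row per cluster centroid, one column per feature
centroids_df = pd.DataFrame(kmeans.cluster_centers_, columns=num_vars + onehot_vars)
centroids_df['cluster'] = centroids_df.index

fig, ax = plt.subplots(figsize=(12, 6))
pd.plotting.parallel_coordinates(centroids_df, 'cluster', ax=ax, colormap='tab20')
fig.autofmt_xdate()  # rotate the feature names on the x-axis
```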

This plot alone tells us a lot about each cluster. For example:
- Cluster 1 has players who score and shoot a lot. They also produce a lot of offsides… So most of these are probably strikers.
- Cluster 14 has players who play a lot of minutes but "underperform" in all stats except for long passes, in which they are way above the rest. These are probably goalkeepers.
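One caveat when reading the plot: the centroids live in scaled space, so the values are standard deviations from the mean and not raw counts. A possible way to read them in original units is to undo the scaling, as sketched here on synthetic data. The `demo_*` names and the numbers are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the numeric columns used in the post:
# 40 "scorers" (many goals, few passes) and 60 "passers".
rng = np.random.default_rng(0)
demo_vars = ["goals_cnt", "passes_total_cnt"]
demo = pd.DataFrame({
    "goals_cnt": np.concatenate([rng.poisson(15, 40), rng.poisson(1, 60)]),
    "passes_total_cnt": np.concatenate([rng.poisson(400, 40), rng.poisson(1500, 60)]),
})

demo_scaler = StandardScaler()
demo_x = demo_scaler.fit_transform(demo[demo_vars].to_numpy())
demo_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(demo_x)

# Undo the scaling so each centroid reads as raw counts.
centroids_raw = pd.DataFrame(
    demo_scaler.inverse_transform(demo_km.cluster_centers_),
    columns=demo_vars,
).round(1)
print(centroids_raw)
```

With the real x array, only the first `len(num_vars)` columns of `kmeans.cluster_centers_` were scaled, so only that slice should go through `inverse_transform`.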
Now that we have our clusters, the goal is to find those in which we want to look for our players. And the approach I used is fairly simple: I just looked where some of the top STs and top DMs were.
- For strikers, I looked for Lewandowski, Haaland, and Kane among others. They mostly belonged to cluster 1 (as we had already guessed).
- For defensive midfielders, I looked for Rodri, Kimmich, and Zubimendi (among others). They mostly belonged to cluster 5.
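The lookup described above boils down to counting cluster labels for a handful of reference names. Here is a minimal sketch with placeholder players. The `player_name` column and the `dominant_cluster` helper are assumptions for illustration, not the actual schema or code I used.

```python
import pandas as pd

# Hypothetical slice of the clustered table; "player_name" is an assumed column.
clustered = pd.DataFrame({
    "player_name": ["Striker A", "Striker B", "Striker C", "Midfielder A", "Midfielder B"],
    "cluster": [1, 1, 7, 5, 5],
})

def dominant_cluster(df, names):
    """Return the cluster that holds most of the given reference players."""
    counts = df.loc[df["player_name"].isin(names), "cluster"].value_counts()
    return int(counts.idxmax()), counts

st_cluster, st_counts = dominant_cluster(
    clustered, ["Striker A", "Striker B", "Striker C"]
)
dm_cluster, dm_counts = dominant_cluster(
    clustered, ["Midfielder A", "Midfielder B"]
)
print(st_cluster, dm_cluster)
```

Looking at the full counts, and not only the winner, shows how cleanly the reference players group together. "Mostly belonged" is the honest description when a few of them land elsewhere.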
So all that was left was to find players within these two clusters whose combined market value was below 60M and whose individual value wasn’t above 40M. Luckily for me, my dataset also had an "Average rating" column, and I used it to sort those player combinations and pick the best ones.
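The budget filter can be expressed as a cross join between the two candidate lists. The sketch below uses placeholder players and values, and it assumes pairs are ranked by the mean of the two average ratings, which is one reasonable reading of the sorting step and not necessarily the exact rule I applied. Column names are hypothetical.

```python
import pandas as pd

# Hypothetical candidates; names, values (in millions) and ratings are placeholders.
dms = pd.DataFrame({
    "player_name": ["DM 1", "DM 2", "DM 3"],
    "market_value": [15, 45, 25],
    "avg_rating": [7.1, 7.4, 6.9],
})
sts = pd.DataFrame({
    "player_name": ["ST 1", "ST 2", "ST 3"],
    "market_value": [40, 30, 50],
    "avg_rating": [7.3, 7.0, 7.5],
})

TOTAL_BUDGET = 60  # combined cap from the hackathon brief
PLAYER_CAP = 40    # per-player cap used in the post

# Drop anyone above the individual cap, then build every DM/ST pair.
dms_ok = dms[dms["market_value"] <= PLAYER_CAP]
sts_ok = sts[sts["market_value"] <= PLAYER_CAP]
pairs = dms_ok.merge(sts_ok, how="cross", suffixes=("_dm", "_st"))

pairs["total_value"] = pairs["market_value_dm"] + pairs["market_value_st"]
pairs["mean_rating"] = pairs[["avg_rating_dm", "avg_rating_st"]].mean(axis=1)

# "Below 60M" is read here as strictly less than the cap.
shortlist = (
    pairs[pairs["total_value"] < TOTAL_BUDGET]
    .sort_values("mean_rating", ascending=False)
    .reset_index(drop=True)
)
print(shortlist[["player_name_dm", "player_name_st", "total_value", "mean_rating"]])
```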
Picking the Defensive Midfielder
Considering these filters and removing older players, here are the 5 best DMs based on the cluster:
- João Neves (Benfica)
- Johnny Cardoso (Betis)
- Pierre Lees-Melou (Brest)
- Anton Stach (Hoffenheim)
- Stephen Eustaquio (Porto)
While João Neves’ level is probably beyond the others, he’s not a realistic target for Fulham. Neves is likely to be signed by a top European club this summer, and our team cannot compete against Man United, for example.
So, after applying that 10% of common sense and briefly analyzing their player profiles, the chosen one was Anton Stach. His playing style is extremely balanced, with most stats way above his peers in the same position, and with amazing possession characteristics (which is what we’re looking for).
He even offers added value compared to Palhinha, namely his final-third involvement:
![Anton Stach's final third vs box involvement - Image from Cube[5] with permission](https://towardsdatascience.com/wp-content/uploads/2024/04/1uSMMlf1aiAdrHNQ6BwmmwA.png)
Picking the Striker
As for the striker, here are our preferred options after searching in the desired cluster:
- Serhou Guirassy (Stuttgart)
- Ivan Toney (Brentford)
- Victor Boniface (Leverkusen)
- Jonas Wind (Wolfsburg)
- Evanilson (Porto)
Here, the decision wasn’t as easy. Boniface would probably have been the best target but, again, he might be a top-team target this summer (FYI, he’s the top scorer of the best team in the Bundesliga).
So the chosen striker was Serhou Guirassy, from Stuttgart. He’s a very complete striker, with not only good attacking characteristics but also off-the-charts possession stats compared to other strikers.
What’s even more impressive is the number of shots he produces and their quality, making him a player who can score from almost anywhere (and has done so). The two plots below visually explain what I just said:
![Shooting plots for Serhou Guirassy - Images from Cube[5] with permission](https://towardsdatascience.com/wp-content/uploads/2024/04/13P2p1VJEKFpNWpdIJIrJSg.png)
Conclusion
In this post, I shared the project I built for the Football Analytics Hackathon organized by xfb Analytics, Transfermarkt, and Football Forum Hungary.
I leveraged my data science skills to build a very simple K-Means model that groups players into clusters, found the clusters containing world-class players in the positions I needed to fill, and picked other, more realistic targets for Fulham from those clusters.
We ran K-Means without a predetermined number of clusters and used the Elbow method to determine k. With 15 clusters, we analyzed each of them, found the ones that had players with top-class numbers, and searched there for our potential signings.
I ended up suggesting Anton Stach (15M) for the defensive midfielder position to play alongside Palhinha, to be able to build from the back, prioritizing possession. As a striker, I went for Serhou Guirassy (40M), who has been having an amazing season and who I believe could be a great alternative after losing Mitrović.
While this approach might seem extremely niche, it’s not. K-means can be used in several situations outside of football or sports, when trying to understand an audience, your clients or animal species. This unsupervised algorithm can be a powerful tool for data professionals for its simplicity and potential (there are more complex and better approaches too).
Thanks for reading the post!
I really hope you enjoyed it and found it insightful. There's a lot more to come, especially more AI-based posts I'm preparing.
@polmarin
Resources
[1] xfb Analytics
[2] Football transfers, rumours, market values and news | Transfermarkt
[3] Football Forum Hungary 2024