
Embedding the Language of Football Using NLP

Using state-of-the-art NLP algorithms to build a representation for future machine learning solutions in the sports analytics domain

Hands-on Tutorials

Image by Arseny Togulev on Unsplash.

There are 1.3 billion Chinese speakers in the world, making Chinese the most widely spoken language. Interestingly, the most popular sport, football, has 4 billion fans – more than three times as many. Any sports game has strict rules and a format, as grammar does, and a defined set of actions, as vocabularies do. Apparently, if football were a formal language, it would be the most popular one in the world.

Inspired by this insight, in this work I try to apply current state-of-the-art methods for representing human language, aka Natural Language Processing (NLP), to represent the global language of football.

I will mainly focus on the motivation for creating such a representation, on how to create it, and, to some extent, on validating the results. My next posts will deal with developing explainers and a UI on top of the representation and demonstrating its use for a variety of use cases in the football domain.

This article is quite long. For your convenience, I added a table of contents. In case you are familiar with the NLP concepts mentioned, feel free to skip right to Action2Vec – embedding on-the-ball actions.

All the code used for this work is available in the Football2Vec library on GitHub.

Table of contents

Motivation

First, any complex task involving mathematical models requires a representation, that is, a way of feeding data to algorithms. Specifically, data representation is at the core of machine learning modeling. For example, if we want to determine whether a player may fit a specific club, we first need to define how to represent football players and football clubs. The representation should be as generic and descriptive as possible to fit a variety of use cases. Textual embeddings meet these requirements.

Second, and most importantly, I have always found football fascinating. I was curious to see where it could lead me. As data became more accessible, I knew it could be great fun, which, by itself, is a justified cause to do anything, isn’t it?

Prior knowledge

This work includes some advanced concepts in Natural Language Processing (NLP) such as word embeddings, Word2Vec, and Doc2Vec. For those of you who are not familiar with word embeddings – they are basically a dense, low-dimension representation of words, which follows the idea that the higher the semantic similarity between words, the closer they should be in space. To get a deeper sense of it, I highly recommend this review of word embeddings and this simple Word2Vec introduction by Zafar Ali.

On the practical side, if you want to know how to implement Word2Vec using Python, follow this guide for coding Word2Vec using Gensim (of course, many other packages may apply, such as Keras / TF / Pytorch / etc.). Finally, to plot the embeddings on 2D-space we used UMAP for dimensionality reduction, which is considered best practice.

The dataset

The data for this work is based on the StatsBomb open dataset. It contains ~900 matches and ~4K players (male and female) from various tournaments, with seasons ranging between 2003 and 2020. It is important to note that for many seasons, only very few matches are available. Each match in the dataset consists of team metadata (e.g., country, lineups, etc.), competition metadata (e.g., stage, stadium, etc.), and, most importantly, manually collected and labeled event data.

Match event data essentially describe on-the-ball actions using rich attributes: action name, location, player name, and time, as well as optional attributes such as the action result and the body part used to execute the action. Detailed documentation is available on the dataset’s GitHub repository. Overall, the dataset contains about 3M event records. For more information about football datasets, see this great review by Christian Kotitschke.

Building the language of football

We define two objectives regarding our representation:

  1. The representation has to embed knowledge about the game.
  2. The representation has to be as intuitive as possible.

The following fundamental principles will guide us in designing the language to describe only what matters, while mimicking the human way of speech, keeping it as simple and intuitive as possible.

Following the human way of speech

People tend to use rich professional language to describe players. When scouting for new players, for instance, we may say: "We need a target man who scores within the box, with excellent finishing on both legs and good heading. He also needs to successfully receive long balls under pressure…". For this profile, we could have in mind players like Eto’o, Ibrahimovic, Suarez, etc.

Interestingly, such complex descriptions may be replaced with the name of a representative player. For example, one may say "find me another Suarez". To this end, we should have a similar representation for players with similar human-level descriptions. So, what aspects do we need to consider when modeling the football language?

  • What a player does – since we use event data, we deal only with on-the-ball actions, such as pass, dribble, etc.
  • Where he does it on the pitch – center, flanks, defensive positions, forward positions, etc.
  • How he does it – using which body part, for how long, etc.

We conclude that we need a representation that should capture on-the-ball actions, space, and context.

Why word embeddings?

Before we start modeling, we should always consider a few possible approaches. Three factors have led me to choose word embeddings:

  1. Word embeddings intuitively encode semantic contexts.
  2. Word embeddings are a powerful mathematical tool, producing state-of-the-art results. Yet, they are quite efficient to use.
  3. Many of these models are open source, simple as that.

Can we use Word2Vec & Doc2Vec? Using Word2Vec and Doc2Vec requires validating three main assumptions: (1) sentence ordinality, (2) ability to predict the identity of a missing word given its surroundings, and (3) a well-defined vocabulary. Next, we will address these requirements.

Event data as textual data

Think of the way football is broadcast on the radio – a verbal and concise way to consume it in real time. This can serve as great inspiration for how to use event data to describe on-the-ball actions.

Event data describe which actions occurred, when, where on the pitch, and by whom. Since these events are ordered in time, we can use them to build a sequence of events, i.e., the sentences in our language.

The next question to address – given a sequence of real events, can we deduce which event is missing? To answer this, consider the following series of actions:

*Pass to flank, cross to box, ?, goalkeeper saves.*

What can be put in the blank space? Any type of shot or header from within the box may fit. While most of you will find this riddle quite easy, it raises two important insights:

  1. Knowing the answer is, in a sense, understanding how football works.
  2. All actions that fit this missing word can be considered semantically similar – the action of taking the ball and directly trying to score a goal from within the box.

Defining a vocabulary

So, how can we encode the events as words in our language? While the options are endless, I decided on the following scheme:

"[]"

  • Actions: pass, dribble, shot, goalkeeper action, interception, ball receipt, carry, dispossession, duel, block, foul, offside, and clearance.
  • Location – I split the pitch into five bins horizontally from left to right, and five vertically from top to bottom, as follows:
Figure 1: Action location notation. Both x and y axes are split into five bins. Angles notation is available on the bottom right. Image by Author.
  • Additional arguments – body part, pass height, pass direction, backheel pass, shot technique type, action-outcome, etc.
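
As a rough sketch of this encoding (the field names, binning helper, and exact token format here are my own simplifications for illustration, not the exact Football2Vec scheme), an event record can be turned into a word as follows:

```python
def to_bin(value, pitch_max, n_bins=5):
    """Map a pitch coordinate to one of n_bins equal-width bins (1-based)."""
    return min(int(value / pitch_max * n_bins) + 1, n_bins)

def event_to_word(event):
    """Build a vocabulary token from an event dict (hypothetical fields).

    Format sketch: <action>:(<x_bin>/5,<y_bin>/5)|<extra args>
    """
    x_bin = to_bin(event["x"], pitch_max=120)  # StatsBomb pitch is 120 x 80
    y_bin = to_bin(event["y"], pitch_max=80)
    word = f'{event["action"]}:({x_bin}/5,{y_bin}/5)'
    # Optional arguments (pass height, body part, etc.) are appended when present.
    extras = [event[k] for k in ("height", "body_part") if k in event]
    if extras:
        word += "|" + "|".join(extras)
    return word

print(event_to_word({"action": "pass", "x": 100, "y": 40,
                     "height": "high", "body_part": "leftfoot"}))
```

Every extra argument multiplies the vocabulary size, which is exactly why argument selection matters so much.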

It is important to understand that argument selection highly affects the results. For example, I didn’t consider any parameters relating to the duration of an action, meaning the language will not capture this factor when describing players.

To recap, we will use match-event data to build tokens describing the game of football. Using these words, we will construct sentences, on which we can train the Word2Vec model to understand football actions. Then, we will use the Doc2Vec model to represent football players.

Figure 2: Research plan. Image by Author.

An important disclaimer – the representation is meant to describe what, where, and how players perform actions, and not how well they perform them. Until the next post, skill level is a missing piece in our representation. For example, our model can indicate that a player performs the same actions as Messi, but it doesn’t mean he is as good as Messi. Unfortunately, I don’t believe event data have what is required for a valid measurement of player skill levels.

Action2Vec – embedding on-the-ball actions

After we have a vocabulary, we can start building sentences of on-the-ball actions. We group actions by ball possessions, meaning each sentence is a series of team actions until the ball is lost to the other team. We allow concatenating subsequent possessions in case they are shorter than the model’s sampling window parameter.
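
A minimal sketch of this sentence building, assuming each event carries a possession identifier (as StatsBomb events do; the field names here are simplified):

```python
from itertools import groupby

def build_sentences(events, min_len=3):
    """Group time-ordered match events into possession 'sentences'.

    Assumes events of the same possession are consecutive (they are,
    since events are ordered in time). Possessions shorter than the
    model's sampling window are concatenated with the next one.
    """
    sentences = []
    carry = []  # tokens carried over from too-short possessions
    for _, possession in groupby(events, key=lambda e: e["possession"]):
        tokens = carry + [e["word"] for e in possession]
        if len(tokens) < min_len:
            carry = tokens  # too short: merge into the next possession
        else:
            sentences.append(tokens)
            carry = []
    if carry:  # flush whatever remains at the end of the match
        sentences.append(carry)
    return sentences
```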

We use Gensim Word2Vec with the following hyperparameters: a window size of 3, an embedding size of 32, and a minimum count of 10 appearances per word in the dataset. The model is fed with integer encodings of the words we created. Overall, we have ~19K words in our vocabulary. Let’s see two examples of words in our language, their representations, and the closest (semantically) word in the football space.

Example 1: Word ID 3274

  • Word description: _(4/5,3/5):( ^ < )|high-long|leftfoot
  • Meaning: long diagonal high pass from a front midfield position. The diagonal direction follows the notation explained in Figure 1.
  • Word embedding = [0.55 -0.57 … -0.13 0.13] (32 x 1 vector)
Figure 3: Word ID: 3274, word description: _(4/5,3/5):( ^ < )|high-long|leftfoot. Image by Author.
  • The most similar word is 742: _(4/5,3/5):( ^ < )|ground-long|leftfoot (the same pass, but on the ground). The second most similar word corresponds to the same pass but with the other (right) foot. The high similarity between the same passes with different body parts or heights shows that the model learned the semantics behind the word IDs, capturing game logic and flow.

Example 2: Word ID 315

  • Word description: (1/5,1/5): incomplete.
  • Meaning: unsuccessful dribble on the left side of the defense.
  • Representation = [-0.45 -0.82 … 1.1 0.97] (32 x 1 vector).
Figure 4: Word ID 315, word description: (1/5,1/5): incomplete. Image by Author.
  • Most similar word: (1/5,1/5), meaning the model learned that an unsuccessful dribble and a dispossession are similar, since both shift possession from their own team to the rival team as a result of an action performed by the ball carrier.

Now, let’s see how it looks for the entire vocabulary. To this end, we use UMAP to reduce the representation from 32 dimensions to just two.

Figure 5: UMAP projections of the complete 19K words Action2Vec vocabulary. Image by Author.

We can notice the prevalence of passes, colored in purple-blue, which make up the majority of the actions. Visually, passes are shaped like a giant octopus spreading its arms everywhere, as they connect all other actions, players, and positions. Shot words (in pink) are surrounded by crosses, in-box passes, etc. In addition, defensive passes are far away from attacking passes, so in a sense, location was inferred from the data as well. Having said that, it is important to remember this is merely a 2D reduction of the representation.

Player2Vec

PlayerMatch2Vec – using words to describe players

If actions are words and a sentence is a series of ordered actions, then a document is usually a collection of subsequent sentences. In our case, a document is a list of subsequent possessions within a specific timeframe: a complete match, 10 minutes of play, etc. However, this common definition of documents describes all team players at once, rather than a single player.

There are several possible approaches to building documents that describe a single player. I chose a simple solution of defining a player document as an ordered ‘bag-of-actions’ of the player during the defined scope, i.e., all the player’s actions within a match. Accordingly, we may have multiple documents describing a player in the data.

To produce a single vector for each player, we can average the vectors over the granularity level, e.g., over all his matches. However, by doing so we eliminate the player’s span in space which may be meaningful as well. To represent players, we used Gensim’s Doc2Vec model with similar parameters as Action2Vec, except for a smaller sampling window size.

Figure 6: UMAP projection of PlayerMatch2vec. Each dot represents a player within a specific match. Players are colored by position. Image by Author.

Figure 6 covers all players’ matches in the dataset, colored by player position. Position acts as a very reasonable ‘explainer’, since players in the same position are usually required to perform the same actions in similar locations. For example, the goalkeeper cluster is well distinguished and very homogeneous. Interestingly, attackers are the closest in space to the goalkeepers, as their actions trigger goalkeeper saves.

Much more heterogeneous clusters are the right-backs and left-backs clusters. Although the two seem disjoint, a deeper dive reveals two main clusters of native defenders, such as Jordi Alba and Benjamin Mendy, and a ‘bridge’ area including defenders like Antonio Valencia and Daniel Wass. All the other positions are located between these three clusters.

I guess you are wondering what the ‘small arm’ on the right is, which seems to be a one-colored cluster. Well, this extraordinary shape belongs to an extraordinary player – Lionel Messi. You are welcome to explore which players are included in Messi’s cluster yourself, but no worries – I will dig into it myself in my next post.

Player2vec – representing each player with a single vector

To produce a single vector for each player, we average the representations of all of his matches. The fascinating result is available for you to explore using the following interactive chart:
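
The averaging itself is a couple of lines of NumPy:

```python
import numpy as np

def player2vec(match_vectors):
    """Average a player's PlayerMatch2Vec vectors into one Player2Vec vector.

    match_vectors: (n_matches, 32) array of the player's per-match embeddings.
    Note: averaging discards the player's span in space, as mentioned above.
    """
    return np.asarray(match_vectors).mean(axis=0)
```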

Again, we can play with it for hours and find many fancy insights (as well as some I can’t explain), but I would rather focus on a single representative example – the inspection of right-wingers:

Figure 8: UMAP projection Player2vec for all right-wingers in the dataset. Image by Author.

Cluster #1 is a kind of outlier, which makes sense as it contains players who are not natural wingers. The highest point on the ‘umap_1’ axis is Lionel Messi. In his cluster, Cluster #2, we can find wingers who tend to cut to the middle, like Mohamed Salah, Xherdan Shaqiri, and Hakim Ziyech. Cluster #3 includes players like Rebic, Feghouli, and Joaquín (Real Betis), who stick more to the flanks and pass the ball into the box.

How can we use it?

Finding similar players

Remember the example of finding new players to sign from the beginning of this article? For me, it matched players like Eto’o, Zlatan Ibrahimovic, and Luis Suarez. So what can we expect when we look for the players most similar to Suarez by cosine similarity, according to our representation?

  1. Francesca Kirby
  2. Zlatan Ibrahimovic
  3. Vivianne Miedema
  4. Samuel Eto’o
  5. Thomas Muller
  6. Katie Stengel
  7. David Villa
  8. Cristiano Ronaldo
  9. Alexis Sanchez
  10. Diego Forlan
  11. Nikita Parris

Interestingly, there are two female players in the top three matches; both Kirby and Miedema are successful forwards. When examining only the male players, we get the exact players I originally listed. These results were surprising, since I made this list before checking the results (skepticism is allowed). Just imagine what could be achieved if, instead of a sparse dataset of 4K top players, we had a larger, full, up-to-date database of all active players!
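
Under the hood, such a query is a nearest-neighbour search by cosine similarity over the Player2Vec vectors. A minimal NumPy sketch (the player names and 2D vectors here are illustrative only):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar_players(query_name, player_vectors, topn=3):
    """Rank all other players by cosine similarity to the query player."""
    query = player_vectors[query_name]
    scores = [
        (name, cosine_similarity(query, vec))
        for name, vec in player_vectors.items()
        if name != query_name
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

# Illustrative 2D stand-ins for real 32D Player2Vec vectors.
players = {
    "suarez": np.array([1.0, 0.0]),
    "ibrahimovic": np.array([0.9, 0.1]),
    "busquets": np.array([0.0, 1.0]),
}
ranking = most_similar_players("suarez", players, topn=2)
```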

We are about to finish this post, but you just can’t write a post about word embeddings without showing some entertaining semantic analogies. To this end, I used regex to mine our vocabulary and create artificial documents. For example, the dribbling doc __ is a model representation for a document that holds one copy of every successful dribble token.

As a "disclaimer-first" kind of guy, it is important to note that such artificial document representations are stochastic. Additionally, to be represented appropriately, the docs’ design and structure should resemble a native document, which is not a trivial task at all. I will cover it in my next post. Last, I use the approximation sign as a reminder that these are not equations, but rather the most similar document, by cosine similarity, to the query vector.

So, with the right balance between skepticism and anticipation, let the fun begin!

Football analogies

Most similar player to Andres Iniesta: Arthur Melo (Juventus, former Barcelona player)

  • Andres Iniesta + outbox scoring + inbox scoring ~ Kevin De Bruyne
  • Andres Iniesta + outbox scoring – dribbling ~ Toni Kroos
  • Andres Iniesta + inbox scoring ~ Eden Hazard

Most similar player to Neymar: Ronaldinho

  • Neymar – dribbling (all locations) ~ Thierry Henry (in Barcelona)
  • Neymar – flank dribbling ~ Philippe Coutinho

Most similar player to Griezmann: Carlos Vela

  • Griezmann + dribble (all locations) ~ Arjen Robben
  • Griezmann + flank dribble ~ Mikel Oyarzabal

Most similar player to Busquets: Yaya Touré

  • Busquets + dribble ~ Thiago Alcântara
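
Under the hood, each analogy is simple vector arithmetic followed by the same cosine-similarity lookup (a NumPy sketch; the names and 2D vectors below are made up for illustration, not real model outputs):

```python
import numpy as np

def analogy(query_vec, doc_vectors):
    """Return the document whose vector is closest (by cosine) to query_vec."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(doc_vectors, key=lambda name: cos(query_vec, doc_vectors[name]))

# Made-up 2D vectors standing in for 32D Player2Vec / artificial-doc vectors.
vectors = {
    "iniesta": np.array([1.0, 0.2]),
    "kroos": np.array([1.2, -0.5]),
    "hazard": np.array([0.3, 1.0]),
}
scoring = np.array([0.1, 0.9])  # hypothetical 'inbox scoring' document vector

query = vectors["iniesta"] + scoring  # "Iniesta + inbox scoring ~ ?"
```

In practice one would also exclude the query player itself from the candidates, exactly as in the similar-players search above.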

To summarize, in this post we created action embeddings – a representation of football actions; and player embeddings – a representation of on-the-ball actions of football players. In terms of immediate value, we can use the model to find players with a specific style of play. We remember (!) that this is a descriptive representation, not a quality indicator.

Although it may be cool to use it to produce great GIFs, what can we actually achieve by using all this? What is it good for? So, if you have read this far, you are definitely motivated enough, but just to keep everyone sane, we will stop here, and continue in the next posts.

What’s Next?

In the next post, we will deal with explaining the embeddings and players’ similarities. To this end, I will share some practices for embedding explainability and cover the tip of the iceberg of skill evaluation. Last, we will use the wonderful Streamlit & Plotly Python packages to create a stunning UI and interactive visualizations. All code is available via the Football2Vec library on GitHub. Here is another fancy GIF as a teaser:

I’ll finish with a sincere call to all data-collection companies to release more public datasets. Tracking data can also capture off-ball movements, marking, pressure, velocities, tactics, and much more. We are just experiencing the beginning of the data revolution in football and we ask you to let us enjoy it as well. After all, it is a team sport.

