
Football Coaches Analysis using Python

Here's how I used Data Science to extract insightful information about coaches from the best international championships.

Photo by Nguyen Thu Hoai on Unsplash

DISCLAIMER: I’m not a football analyst! I don’t do football analysis for a living. The assumptions that are made in this article are made by a passionate football fan and lover. Don’t make war, make love. ❤

In Italy, football is almost a religion.

I’ve been following my team (A.S. Roma ❤) since I was a child, and I’ve loved football ever since.

Another (apparently unrelated) fun fact is that I work with data. In particular, I work in the Aerospace Engineering and Engineering Mechanics department and my main task is to analyze data and write Machine Learning algorithms every day.

The thing I love the most about statistics is that it can be applied to virtually any phenomenon that repeats itself. Football is made of seasons, matches and recurring events during those matches. For this reason, it is a perfect example of a "statistical study", and football analytics is a rising field.

In this article, I will perform a statistical study of coaches in the 5 most important leagues in Europe (Bundesliga, Premier League, Serie A, La Liga, Ligue 1). I’ll be doing that in the simplest and most explainable way, using Python and some basic libraries. Nonetheless, even for a football fan like me, some discoveries were quite interesting and insightful. I hope you’ll find them fun as well. 🙂 Let’s start!

1. Libraries and Data

These are the libraries you need to import:

Optional: If you want to make the plots exactly like they look in the article, import this:

Ok, let’s "talk data" now.

1.1 Source and Usability

The dataset is from Kaggle and can be downloaded from this source. The data is released under the CC0 1.0 Universal license, so it can be downloaded and used for free. The dataset is quite new (updated 6 months ago) and downloads in seconds because it is very small (about 100 rows and 13 columns).

1.2 Overview

Let’s give it a look:
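
The original cell is not reproduced here, so this is only a minimal sketch of a first look at the data. It reads a tiny invented CSV from memory instead of the Kaggle file. The columns `appointment_date`, `contract_until_date` and `avg_term_as_a_coach` are named in this article; `name`, `league`, `club` and `age` are assumed names for illustration.

```python
import io

import pandas as pd

# Toy stand-in for the Kaggle file: every value below is invented.
# With the real download you would call pd.read_csv on the CSV path instead.
toy_csv = io.StringIO(
    "name,league,club,age,appointment_date,contract_until_date,avg_term_as_a_coach\n"
    "Coach A,Premier League,Club 1,51,2019-07-01,2024-06-30,2.4\n"
    "Coach B,Serie A,Club 2,46,2021-07-01,2023-06-30,1.1\n"
    "Coach C,Bundesliga,Club 3,38,2020-07-01,2025-06-30,1.8\n"
)
data = pd.read_csv(toy_csv)

print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first rows of the table
data.info()         # column dtypes and non-null counts
```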

For example, let’s print all the coaches of the Premier League:
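
A filter of this kind can be sketched with a boolean mask. The column names `league` and `name` and all rows are assumptions made up for the example, not the real dataset.

```python
import pandas as pd

# Invented rows; "league" and "name" are assumed column names.
data = pd.DataFrame({
    "name": ["Coach A", "Coach B", "Coach C", "Coach D"],
    "league": ["Premier League", "Serie A", "Premier League", "La Liga"],
})

# Keep only the rows whose league is the Premier League.
premier_league = data[data["league"] == "Premier League"]
print(premier_league["name"].tolist())
```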

The appointment_date and contract_until_date columns are not really helpful. Let’s get rid of them. Plus, let’s get rid of the duplicates and of the NaN rows:
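
One possible way to do this cleaning in pandas is shown below on invented rows. Only the two date columns are named in the article; `name` and `age` are placeholders.

```python
import numpy as np
import pandas as pd

# Invented rows: one exact duplicate (Coach A) and one row with a missing age.
data = pd.DataFrame({
    "name": ["Coach A", "Coach A", "Coach B", "Coach C"],
    "age": [51, 51, 46, np.nan],
    "appointment_date": ["2019-07-01", "2019-07-01", "2021-07-01", "2020-07-01"],
    "contract_until_date": ["2024-06-30", "2024-06-30", "2023-06-30", "2025-06-30"],
})

data = (
    data.drop(columns=["appointment_date", "contract_until_date"])  # unhelpful columns
        .drop_duplicates()       # remove repeated rows
        .dropna()                # remove rows with any missing value
        .reset_index(drop=True)  # tidy index after dropping rows
)
print(data)
```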

Nice, now that we are ready, let’s start with some exploration.

2. Coaches’ Place of Birth

First of all, let’s see how many coaches we have from each League:
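
Counting rows per league is a one-liner with `value_counts`. This sketch assumes a column called `league` and uses invented rows.

```python
import pandas as pd

# Invented rows; "league" is an assumed column name.
data = pd.DataFrame({
    "league": ["Serie A", "Serie A", "La Liga", "Bundesliga", "Serie A", "La Liga"],
})

# Number of coaches in each league, most frequent first.
coaches_per_league = data["league"].value_counts()
print(coaches_per_league)
```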

Ok, fair. We have almost 20 coaches per league, which is expected, but where do they come from?

Let me report the image alone for clarity:

Image by author, generated using the code above

As we can see, most of them come from Europe, which is predictable. We have a couple of outliers in the USA and in Oceania, and 6 of them in South America.

One of the two coaches that we see in Oceania is Antoine Kombouaré.

One of the 6 coaches in South America is Manuel Pellegrini, coach of Real Betis:

Jesse Marsch is a coach from the USA.

3. Coaches’ Terms

We have seen that Jesse Marsch doesn’t really have a team… how many coaches don’t have a club?
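
A count like this can be sketched as follows. How the dataset marks a coach without a team is not shown in this text, so the `club` column and the `"Without Club"` label are assumptions, and the rows are invented.

```python
import pandas as pd

# Invented rows; the "club" column and the "Without Club" label are assumptions.
data = pd.DataFrame({
    "name": ["Coach A", "Coach B", "Coach C", "Coach D", "Coach E"],
    "club": ["Club 1", "Without Club", "Club 2", "Club 3", "Without Club"],
})

no_club = data["club"].eq("Without Club")  # True where the coach has no team
print(int(no_club.sum()))                  # how many coaches have no club
print(round(100 * no_club.mean(), 1))      # the same as a percentage of all rows
```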

So a considerable part of the dataset is made of coaches that have been sacked. After how many years is a coach eventually sacked?

To do this analysis, let’s use the "avg_term_as_a_coach" column. Let’s convert it to float and see how many rows have a "problem" with the conversion:
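
One way to do the conversion, not necessarily the one used originally, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparsable into NaN. The values below are invented.

```python
import pandas as pd

# Invented values: three parse as numbers, one ("-") does not.
data = pd.DataFrame({"avg_term_as_a_coach": ["2.4", "1.1", "-", "3.0"]})

# Unparsable entries become NaN instead of raising an error.
term = pd.to_numeric(data["avg_term_as_a_coach"], errors="coerce")

# Rows whose value could not be converted.
problem_rows = data[term.isna()]
print(len(problem_rows))
print(problem_rows)
```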

Only one. So it is reasonable to replace the problematic value with the mean, which has little to no impact on the study. Let’s find the mean.

Let’s replace it:
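
The mean imputation described here can be sketched with `Series.mean` (which skips NaN by default) and `fillna`. The numbers are invented.

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks the row that failed the float conversion.
term = pd.Series([2.0, 1.0, np.nan, 3.0], name="avg_term_as_a_coach")

mean_term = term.mean()               # NaN is ignored when averaging
term_filled = term.fillna(mean_term)  # replace the missing value with the mean
print(mean_term)
print(term_filled.tolist())
```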

And let’s plot it:

Image by author, generated using the code above

So the typical average term in Europe is between 1 and 3 years. We have some outliers around 5–6 years and some beyond 8 years. An interesting thing to consider is the "Average Term as a Coach" per league. In other words, the question is the following:

"What is the average term of a coach who coaches in Italy/Germany/England/Spain/France?"

Except for a couple of international coaches, a coach tends to move within his league (e.g. Inzaghi, who went from Lazio to Inter, Juric, who went from Verona to Torino, or Italiano, who went from La Spezia to Fiorentina). This means that the average length of a coach’s term in a league is basically the average length of a club’s project there. Of course, a team that changes its project too often is probably a team that is not winning as many games as it expects, so we can see that as a "bad sign". On the other hand, if a coach has a longer average term, it means that the clubs where he stays trust him, which is a good sign.

In a few words:

"A long average coach term for a league is one factor that increases the league’s "wellness" and makes it more challenging"

Let’s plot it using boxplots:

Image by author, generated using the code above
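
A boxplot of this kind can be sketched with pandas and Matplotlib. This is not the code behind the figure above: the numbers are invented, the `country` column name is an assumption, and the styling is left at the defaults.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values; "country" is an assumed column name.
data = pd.DataFrame({
    "country": ["Italy"] * 4 + ["England"] * 4 + ["Germany"] * 4,
    "avg_term_as_a_coach": [0.8, 1.2, 1.5, 2.0, 1.5, 2.2, 3.0, 4.1, 1.0, 1.6, 2.1, 8.5],
})

fig, ax = plt.subplots(figsize=(8, 4))
# One box per country, drawn on the axes created above.
data.boxplot(column="avg_term_as_a_coach", by="country", ax=ax)
ax.set_xlabel("League country")
ax.set_ylabel("Average term as a coach (years)")
ax.set_title("Average term per league (toy data)")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
```

In a notebook the figure is displayed automatically; in a script, `fig.savefig(...)` writes it to disk.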

The outliers in the "Germany" league (Bundesliga) make the plot very difficult to interpret. We know that in the Bundesliga there is an outlier who stayed at the club for more than 8 years, but for more clarity, let’s cut it off:
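
Cutting the outlier off amounts to a simple filter. In this sketch the threshold of 8 years comes from the text, while the rows and the `country` column name are invented.

```python
import pandas as pd

# Invented rows; "country" is an assumed column name.
data = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Italy"],
    "avg_term_as_a_coach": [1.2, 2.0, 8.5, 1.6],
})

# Keep only the coaches whose average term is below 8 years.
trimmed = data[data["avg_term_as_a_coach"] < 8]
print(trimmed)
```

The trimmed frame can then be passed to the same plotting call as before.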

So in Italy, the projects are usually shorter. In England, even if the distribution varies a lot, the projects are, on average, longer, and the same is true for France.

Another efficient way to plot the statistics is the following:

Image made by author, using the code above

Except for the outliers, English football (which is notoriously the most challenging) has longer projects in almost all the quartiles. This kind of confirms our hypothesis.
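
The quartile comparison can also be read from a table rather than a plot. This is a sketch on invented numbers, with `country` as an assumed column name; it is not the code behind the figure.

```python
import pandas as pd

# Invented values; "country" is an assumed column name.
data = pd.DataFrame({
    "country": ["Italy"] * 4 + ["England"] * 4,
    "avg_term_as_a_coach": [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0],
})

# First quartile, median and third quartile of the term, one row per country.
quartiles = (
    data.groupby("country")["avg_term_as_a_coach"]
        .quantile([0.25, 0.5, 0.75])
        .unstack()
)
print(quartiles)
```

Each row is a league and each column a quartile, so "longer in almost all the quartiles" means one row dominates another column by column.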

4. Coaches’ Age

NOTE: The following analysis uses code that is very similar to the code above. For this reason, it has not been fully reported.

Another very interesting factor is the age of the coaches. I am not a fan of "revolutionary ideas": I think football is, in the end, a very simple game that doesn’t need any revolution. It is beautiful the way it is. For this reason, I believe that experience makes a big difference, and to be experienced you need to be old enough. It can’t be a coincidence that the best coaches in modern history (Ancelotti, Mourinho or Guardiola) are all over fifty.

Of course, there are counterexamples like Nagelsmann and Arteta (34 and 40 years old) but, as the following plot shows, they are in the left tail of the distribution 🙂

Image belonging to author

But it is still interesting to notice that there are some leagues that do prefer to have young coaches, like Germany:

Image belonging to author

France is the country with the oldest coaches, even if its maximum value is lower than in Italy, England, and Spain. The average value for Spain is still pretty high.

England is in the middle of the trend. In general, it prefers younger coaches (median), but its first quartile is higher than in Italy and Germany.

It is important to highlight that the only league that is clearly distinguishable from the others is Germany, which has the youngest coaches in all the quartiles.

5. Coaches’ Ideas!

Ok, but what are their actual ways of playing football? How do they place their players on the pitch? Let’s show it:

Image made by author using the code shown above

So:

  1. A lot of them prefer 4 defenders
  2. A smaller part use 3 defenders
  3. Very few of them (one or two) use 5 defenders
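
The defender counts above can be derived from the formation strings, since the first number of a formation is the number of defenders. The sketch below assumes a column called `preferred_formation` with values such as "4-3-3 Attacking"; both the column name and the rows are made up.

```python
import pandas as pd

# Invented rows; "preferred_formation" is an assumed column name.
data = pd.DataFrame({
    "preferred_formation": [
        "4-3-3 Attacking", "4-2-3-1", "3-5-2", "4-4-2", "3-4-3", "5-3-2",
    ],
})

# The leading digit of the formation is the number of defenders.
defenders = (
    data["preferred_formation"]
        .str.extract(r"^(\d)", expand=False)
        .astype(int)
)

# How many coaches line up with 3, 4 or 5 defenders.
defender_counts = defenders.value_counts().sort_index()
print(defender_counts)
```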

But a coach’s formation is nothing less than his idea of how football should be played. So how do they "see" football in different leagues?

Image made by author using the code above

So the Premier League has the most heterogeneous way of playing football: there are a lot of different formations there. Ligue 1 is the most homogeneous one (only 5 different formations). In Italy, 6 coaches use 3 defenders instead of 4 (the highest number of any league). In La Liga, which is famous for its attacking football, the most frequent formation is 4–3–3 Attacking. Data Science is awesome, guys 🙂
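
Comparisons like these boil down to a few group-by operations. This is a sketch on invented rows, with `league` and `preferred_formation` as assumed column names, so the numbers it prints do not match the ones quoted above.

```python
import pandas as pd

# Invented rows; both column names are assumptions.
data = pd.DataFrame({
    "league": ["Premier League"] * 3 + ["Ligue 1"] * 3 + ["La Liga"] * 3,
    "preferred_formation": [
        "4-3-3 Attacking", "3-4-3", "4-2-3-1",
        "4-2-3-1", "4-2-3-1", "3-5-2",
        "4-3-3 Attacking", "4-3-3 Attacking", "4-4-2",
    ],
})

by_league = data.groupby("league")["preferred_formation"]

# Heterogeneity: number of distinct formations used in each league.
distinct = by_league.nunique()

# Most frequent formation per league (ties resolved alphabetically by mode()).
most_common = by_league.agg(lambda s: s.mode().iloc[0])

# Full league-by-formation count table.
table = pd.crosstab(data["league"], data["preferred_formation"])

print(distinct)
print(most_common)
print(table)
```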

6. Summary

Let’s summarize some of the points we touched.

  1. Most of the European coaches actually come from Europe, some of them from South America, and only a couple of them come from the USA.
  2. 15.3% of European coaches can’t (immediately) find another team when they are sacked. In Italy the projects tend to be way shorter than everywhere else, while England has the longest ones.
  3. The distribution of the age of football coaches is very similar all over Europe (see statistics in Section 4) except in Germany, where coaches are considerably younger.
  4. The Premier League has the most heterogeneous way of playing football, while Ligue 1 has the most homogeneous one. La Liga coaches prefer a more attacking formation.

7. Conclusions

If you liked the article and you want to know more about Machine Learning, or you just want to ask me something, you can:

A. Follow me on LinkedIn, where I publish all my stories.
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to message me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.

