FIFA World Cup 2018: A Data-Driven Approach to Ideal Team Line-Ups

James Le
Towards Data Science
9 min read · May 14, 2018

With the 2018 World Cup coming up this summer in Russia, soccer fans around the world are eager to predict which team will win. Another looming question for fans is how their favorite national teams should line up: What formation should be used? Which players should be chosen? Which ones should be left on the bench or left out of the squad entirely?

As an enthusiastic soccer fan myself, I started thinking: Why shouldn’t I build my own dream formation for my favorite teams at the World Cup? As someone who loves data science and grew up playing FIFA, I realized that I could use the data from EA Sports’ extremely popular FIFA18 video game, released last year, for my analysis.

In this blog post, I’ll walk you through the step-by-step approach I used to build the most formidable line-up for 10 of the strongest teams in this tournament: France, Germany, Spain, England, Brazil, Argentina, Belgium, Portugal, Uruguay, and Croatia.

The FIFA18 Dataset

I found the FIFA18 dataset on Kaggle. It contains 17,000+ players featured in FIFA18, each with more than 70 attributes. It was scraped from the website SoFiFa by extracting each player’s personal data, ID, and playing and style statistics. It includes several interesting features, such as player value, wage, age, and performance rating, that I really wanted to dig into.

After loading the data, I kept only the columns I wanted to analyze:

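import pandas as pd

# FIFA18 is assumed to already hold the DataFrame loaded from the Kaggle CSV
interesting_columns = [
    'Name',
    'Age',
    'Nationality',
    'Overall',
    'Potential',
    'Club',
    'Value',
    'Wage',
    'Preferred Positions'
]
# Keep only the columns of interest
FIFA18 = pd.DataFrame(FIFA18, columns=interesting_columns)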

Here’s a quick look at the top 5 players based on overall rating from the dataset:
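A view like this can be produced by sorting on the Overall column. Here is a minimal sketch, using a tiny frame of placeholder rows rather than the real dataset (the names and ratings are illustrative only):

```python
import pandas as pd

# Placeholder rows for illustration; these are not real FIFA18 records.
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E', 'Player F'],
    'Overall': [81, 94, 77, 90, 88, 85],
})

# Sort by rating, highest first, and keep the first five rows.
top5 = toy.sort_values('Overall', ascending=False).head(5)
print(top5)
```

On the real data, the same two calls would be applied to the FIFA18 frame instead of toy.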

Data Visualization

To give you a good sense of this dataset, I made several visualizations of the players’ age, overall rating, preferred position, nationality, value, and wage. Let’s check each of them out:

As you can see, the majority of the players are between 20 and 26 years old, with the peak at age 25.

The overall ratings follow a roughly normal distribution, with an average of 66.

Here, the 4 most common preferred positions are Center Back, Striker, Goalkeeper, and Center Midfielder, in that order.
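As a rough sketch of how such a ranking could be computed, the snippet below counts position codes in a toy frame. It assumes the 'Preferred Positions' column holds space-separated codes (for example 'ST CF') and counts every listed code; both are assumptions on my part, and the rows are placeholders.

```python
import pandas as pd

# Placeholder rows; the space-separated format is an assumption.
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'Preferred Positions': ['CB', 'ST CF', 'CB CDM', 'GK'],
})

# Split each string into a list of codes, then give every code its own row.
codes = toy['Preferred Positions'].str.split().explode()

# Count how often each position code appears, most frequent first.
counts = codes.value_counts()
print(counts)
```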

I used the Plotly package to plot a geographic visualization of the players’ nationalities. As you can see, the players are heavily concentrated in Europe, most of all in England, Germany, Spain, and France.

For value, I made a scatter chart of the players’ value against their age and overall rating. The highest values seem to fall in the 28–33 age range and at overall ratings of 90+.

For wage, I made a similar scatter chart of the players’ wage against their age and overall rating. The highest wages seem to fall in the 30–33 age range and at overall ratings of 90+.

Best Squad Analysis

Alright, let’s build some optimal formations for the national teams. To keep this analysis simple, I pull in only the columns I am interested in:

FIFA18 = FIFA18[['Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Position', 'Value', 'Wage']]
FIFA18.head(10)
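Note that the selection above uses a 'Position' column, while the earlier column list has 'Preferred Positions'. One plausible way to bridge the two, sketched here as an assumption rather than as the method actually used, is to treat the first listed code as the player's main position:

```python
import pandas as pd

# Placeholder rows; the space-separated format is an assumption.
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B'],
    'Preferred Positions': ['ST CF ', 'CDM CM'],
})

# Take the first whitespace-separated code as the single 'Position'.
toy['Position'] = toy['Preferred Positions'].str.split().str[0]
print(toy[['Name', 'Position']])
```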

I wrote 2 very important functions. get_best_squad_n takes a squad formation and a nationality and returns the squad with the best player in each position, based purely on the overall rating. get_summary_n takes a list of formations and a national team and compares the formations by the average overall rating of the players selected for each.
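The real implementations live in the GitHub repo linked at the end of this post. To make the idea concrete, here is a minimal sketch of how two such functions might work. The names pick_best_squad and summarize_formations, the explicit players argument, the greedy slot-by-slot selection, and the toy data are all my own assumptions for illustration. The four values per summary row mirror the reshape(-1, 4) and the 'Potential' column that appear in the Uruguay and Croatia tables.

```python
import pandas as pd

def pick_best_squad(players, formation, nationality, measurement='Overall'):
    """Fill each slot in turn with the best not-yet-picked eligible player."""
    pool = players[players['Nationality'] == nationality]
    picked = []  # index labels of the chosen players, in slot order
    for slot in formation:
        # 'ST|CF' means the slot accepts either an 'ST' or a 'CF'.
        eligible = pool[pool['Position'].isin(slot.split('|'))
                        & ~pool.index.isin(picked)]
        if eligible.empty:
            continue  # nobody left for this slot; leave it unfilled
        picked.append(eligible[measurement].idxmax())
    squad = pool.loc[picked, ['Position', 'Name', measurement]]
    average = round(float(squad[measurement].mean()), 2)
    return average, squad.reset_index(drop=True)

def summarize_formations(players, formations, names, nationalities):
    """One row per (nation, formation): average Overall and Potential."""
    rows = []
    for nation in nationalities:
        for name, formation in zip(names, formations):
            overall, _ = pick_best_squad(players, formation, nation, 'Overall')
            potential, _ = pick_best_squad(players, formation, nation, 'Potential')
            rows.append([nation, name, overall, potential])
    return rows

# Placeholder players and shortened 3-slot "formations", for illustration only.
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'Nationality': ['Atlantis'] * 4 + ['Elsewhere'],
    'Position': ['GK', 'CB', 'ST', 'CF', 'ST'],
    'Overall': [80, 84, 86, 90, 99],
    'Potential': [82, 85, 88, 90, 99],
})
mini = ['GK', 'CB', 'ST|CF']
two_up = ['GK', 'ST|CF', 'ST|CF']

rating, squad = pick_best_squad(toy, mini, 'Atlantis')
rows = summarize_formations(toy, [mini, two_up], ['mini', 'two-up'], ['Atlantis'])
print(rating)
print(squad)
print(rows)
```

The sketch requires an exact match on the position code, so a winger slot such as 'RM|RW' cannot accidentally pick up a wing-back listed as 'RWB'. It also assumes the frame has a unique index, since chosen players are tracked by index label.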

I also made the squad choices more strict:

squad_343_strict = ['GK', 'CB', 'CB', 'CB', 'RB|RWB', 'CM|CDM', 'CM|CDM', 'LB|LWB', 'RM|RW', 'ST|CF', 'LM|LW']
squad_442_strict = ['GK', 'RB|RWB', 'CB', 'CB', 'LB|LWB', 'RM', 'CM|CDM', 'CM|CAM', 'LM', 'ST|CF', 'ST|CF']
squad_4312_strict = ['GK', 'RB|RWB', 'CB', 'CB', 'LB|LWB', 'CM|CDM', 'CM|CAM|CDM', 'CM|CAM|CDM', 'CAM|CF', 'ST|CF', 'ST|CF']
squad_433_strict = ['GK', 'RB|RWB', 'CB', 'CB', 'LB|LWB', 'CM|CDM', 'CM|CAM|CDM', 'CM|CAM|CDM', 'RM|RW', 'ST|CF', 'LM|LW']
squad_4231_strict = ['GK', 'RB|RWB', 'CB', 'CB', 'LB|LWB', 'CM|CDM', 'CM|CDM', 'RM|RW', 'CAM', 'LM|LW', 'ST|CF']

squad_list = [squad_343_strict, squad_442_strict, squad_4312_strict, squad_433_strict, squad_4231_strict]
squad_name = ['3-4-3', '4-4-2', '4-3-1-2', '4-3-3', '4-2-3-1']
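Each entry names one slot on the pitch, and a | separates the position codes allowed to fill it. A quick sanity check, sketched below for two of the formations, is to confirm that every list has exactly 11 slots and to expand each slot into its allowed codes. The helper name allowed_codes is hypothetical.

```python
squad_433_strict = ['GK', 'RB|RWB', 'CB', 'CB', 'LB|LWB', 'CM|CDM',
                    'CM|CAM|CDM', 'CM|CAM|CDM', 'RM|RW', 'ST|CF', 'LM|LW']
squad_4231_strict = ['GK', 'RB|RWB', 'CB', 'CB', 'LB|LWB', 'CM|CDM',
                     'CM|CDM', 'RM|RW', 'CAM', 'LM|LW', 'ST|CF']

def allowed_codes(slot):
    """Expand a slot such as 'CM|CAM|CDM' into its list of position codes."""
    return slot.split('|')

formations = {'4-3-3': squad_433_strict, '4-2-3-1': squad_4231_strict}
for name, slots in formations.items():
    # A line-up is 11 players: one goalkeeper plus ten outfield slots.
    assert len(slots) == 11, name
    assert slots.count('GK') == 1, name

print(allowed_codes(squad_433_strict[6]))
```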

1 — France

Let’s explore France’s squad options and how each formation affects the ratings.

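# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
France = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['France'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
France.set_index('Nationality', inplace = True)
France[['Overall', 'Potential']] = France[['Overall', 'Potential']].astype(float)
print(France)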

Let’s check out the best 11 French players in a 4–3–3 lineup.

rating_433_FR_Overall, best_list_433_FR_Overall = get_best_squad_n(squad_433_strict, 'France', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_433_FR_Overall))
print(best_list_433_FR_Overall)

[Image: Antoine Griezmann]

2 — Germany

The defending champion is certainly a strong candidate to win again this year.

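# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
Germany = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Germany'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Germany.set_index('Nationality', inplace = True)
Germany[['Overall', 'Potential']] = Germany[['Overall', 'Potential']].astype(float)
print(Germany)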

As you can see, Germany’s current ratings peak with either the 3–4–3 or the 4–3–3 formation. I’ll go ahead with the 4–3–3 option.

rating_433_GER_Overall, best_list_433_GER_Overall = get_best_squad_n(squad_433_strict, 'Germany', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_433_GER_Overall))
print(best_list_433_GER_Overall)

[Image: Toni Kroos]
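Reading off which formations peak in a summary table can also be done in code. Here is a minimal sketch that keeps ties, using placeholder numbers and a made-up nation rather than any real results:

```python
import pandas as pd

# Placeholder summary rows; not actual output from the analysis.
summary = pd.DataFrame(
    [['Atlantis', '3-4-3', 84.0],
     ['Atlantis', '4-4-2', 82.5],
     ['Atlantis', '4-3-3', 84.0]],
    columns=['Nationality', 'Squad', 'Overall'],
).set_index('Nationality')

# Keep every formation whose average rating equals the maximum (ties included).
best = summary[summary['Overall'] == summary['Overall'].max()]
print(best['Squad'].tolist())
```

When several formations tie, as 3–4–3 and 4–3–3 do here, the final choice is still a judgment call.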

3 — Spain

How about the 2010 winner?

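# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
Spain = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Spain'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Spain.set_index('Nationality', inplace = True)
Spain[['Overall', 'Potential']] = Spain[['Overall', 'Potential']].astype(float)
print(Spain)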

Well, Spain does best with either 4–3–3 or 4–2–3–1. For the sake of diversity, I’ll choose the 4–2–3–1 formation.

rating_4231_ESP_Overall, best_list_4231_ESP_Overall = get_best_squad_n(squad_4231_strict, 'Spain', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_4231_ESP_Overall))
print(best_list_4231_ESP_Overall)

[Image: Sergio Ramos]

4 — England

Despite having the best soccer league in Europe, England does not seem to do that well at the international level. Let’s figure out their options for the upcoming World Cup:

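# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
England = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['England'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
England.set_index('Nationality', inplace = True)
England[['Overall', 'Potential']] = England[['Overall', 'Potential']].astype(float)
print(England)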

England should stick to 4–3–3 then.

rating_433_ENG_Overall, best_list_433_ENG_Overall = get_best_squad_n(squad_433_strict, 'England', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_433_ENG_Overall))
print(best_list_433_ENG_Overall)

[Image: Harry Kane]

5 — Brazil

Having won the World Cup the most times in history, the Samba team will no doubt be one of the top candidates for this summer in Russia.

# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
Brazil = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Brazil'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Brazil.set_index('Nationality', inplace = True)
Brazil[['Overall', 'Potential']] = Brazil[['Overall', 'Potential']].astype(float)
print(Brazil)

As you can see, Brazil’s options are similar to England’s. 4–3–3 all the way.

rating_433_BRA_Overall, best_list_433_BRA_Overall = get_best_squad_n(squad_433_strict, 'Brazil', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_433_BRA_Overall))
print(best_list_433_BRA_Overall)

[Image: Neymar]

6 — Argentina

Lionel Messi is still waiting for the World Cup trophy that has eluded him throughout his career. Can he carry Argentina to the top after falling short in the final 4 years ago?

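# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
Argentina = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Argentina'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Argentina.set_index('Nationality', inplace = True)
Argentina[['Overall', 'Potential']] = Argentina[['Overall', 'Potential']].astype(float)
print(Argentina)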

Both 3–4–3 and 4–3–3 fare very well for the Argentine players. I’ll choose 3–4–3.

rating_343_ARG_Overall, best_list_343_ARG_Overall = get_best_squad_n(squad_343_strict, 'Argentina', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_343_ARG_Overall))
print(best_list_343_ARG_Overall)

[Image: Lionel Messi]

7 — Belgium

The Red Devils have some of the best players in the English Premier League, but never seem to make it far at the international level. Can Hazard and De Bruyne drive them far this time?

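# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
Belgium = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Belgium'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Belgium.set_index('Nationality', inplace = True)
Belgium[['Overall', 'Potential']] = Belgium[['Overall', 'Potential']].astype(float)
print(Belgium)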

Again, 4–3–3 is the best formation for Belgium.

rating_433_BEL_Overall, best_list_433_BEL_Overall = get_best_squad_n(squad_433_strict, 'Belgium', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_433_BEL_Overall))
print(best_list_433_BEL_Overall)

[Image: Kevin De Bruyne]

8 — Portugal

Portugal, the winner of Euro 2016, is led by Cristiano Ronaldo, a five-time Ballon d’Or winner, and has a real chance in this tournament as well.

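# reshape(-1, 4) produces four columns, so four names are needed ('Potential' as in the Uruguay and Croatia tables)
Portugal = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Portugal'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Portugal.set_index('Nationality', inplace = True)
Portugal[['Overall', 'Potential']] = Portugal[['Overall', 'Potential']].astype(float)
print(Portugal)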

OK, I’ll go with 4–2–3–1 for Portugal.

rating_4231_POR_Overall, best_list_4231_POR_Overall = get_best_squad_n(squad_4231_strict, 'Portugal', 'Overall')
print('-Overall-')
print('Average rating: {:.1f}'.format(rating_4231_POR_Overall))
print(best_list_4231_POR_Overall)

[Image: Cristiano Ronaldo]

9 — Uruguay

Uruguay has two of the best strikers playing in Europe: Suarez and Cavani. Can they perform this time?

Uruguay = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Uruguay'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Uruguay.set_index('Nationality', inplace = True)
Uruguay[['Overall', 'Potential']] = Uruguay[['Overall', 'Potential']].astype(float)

print(Uruguay)

Fantastic, Uruguay does best with a 4–3–1–2.

[Image: Edinson Cavani]

10 — Croatia

Well, I’m a big fan of Modric and Rakitic. Need I say anything about their winning habits?

Croatia = pd.DataFrame(np.array(get_summary_n(squad_list, squad_name, ['Croatia'])).reshape(-1,4), columns = ['Nationality', 'Squad', 'Overall', 'Potential'])
Croatia.set_index('Nationality', inplace = True)
Croatia[['Overall', 'Potential']] = Croatia[['Overall', 'Potential']].astype(float)

print(Croatia)

Dope, Croatia is superior with a 4–2–3–1.

[Image: Luka Modric]

Final Comparison

OK, let’s compare these 10 line-ups, using the players’ current ratings, for the strongest contenders at World Cup 2018.

So based purely on the FIFA 18 Data:

  • Spain has the highest average overall rating, followed by Germany and Brazil.
  • Germany has the highest total value, followed by Spain and France.
  • Spain has the highest average wage, followed by Germany and Brazil.
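The three metrics in this list (average overall rating, total value, average wage) are simple aggregates over each 11-player line-up. Here is a minimal sketch of the arithmetic on a shortened placeholder line-up. It assumes that Value and Wage are stored as strings such as '€95.5M' or '€565K'; that format, the helper name parse_euro, and all the numbers are my assumptions for illustration.

```python
from statistics import mean

def parse_euro(text):
    """Convert strings such as '€95.5M' or '€565K' to a number of euros."""
    multipliers = {'M': 1_000_000, 'K': 1_000}
    text = text.strip().lstrip('€')
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)

# Placeholder line-up (3 players instead of 11), not real FIFA18 figures.
lineup = [
    {'Name': 'Player A', 'Overall': 90, 'Value': '€10M', 'Wage': '€100K'},
    {'Name': 'Player B', 'Overall': 85, 'Value': '€2.5M', 'Wage': '€50K'},
    {'Name': 'Player C', 'Overall': 80, 'Value': '€500K', 'Wage': '€30K'},
]

average_overall = mean(p['Overall'] for p in lineup)
total_value = sum(parse_euro(p['Value']) for p in lineup)
average_wage = mean(parse_euro(p['Wage']) for p in lineup)

print(average_overall, total_value, average_wage)
```

Computing the same three numbers for each nation’s chosen line-up gives the rankings compared above.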

My bet is on Spain vs. France in the final and Brazil vs. Germany for 3rd place. And Les Bleus will win it all! What are your thoughts?

You can view all the source code in my GitHub repo at this link (https://github.com/khanhnamle1994/fifa18). Let me know if you have any questions or suggestions on improvement!
