The World Cup is reaching a new stage, and few could have anticipated the group stage outcomes. Now comes a much more thrilling phase, where the best teams in the world face each other. The goal of this article is to use the power of Data Science with Python to try to uncover some of the statistics these games will present.
In this post, we will:
- Create a crawler and pull team stats data from the web
- Pre-process, explore and visualize that data
- Download another type of data with match results
- Combine those two datasets and build one model to predict match results
- Create a simple Monte Carlo Simulation and get winner odds for the 2018 World Cup – Knockout Stage
Some libraries we’re going to use:
- Pandas
- Numpy
- Sklearn
- Plotly
The idea here is to build a Machine Learning algorithm that predicts the winner of a single match, and from there build a Monte Carlo simulation that can infer the odds of each knockout game winner and, subsequently, the probability of each team becoming world champion.
This article will present some graphs and code, but feel free to skip them if you’d like; I’ll try to make it as intuitive as possible.
Our Strategy
Most game simulators tend to use a single overall number that represents a team’s performance. Here we try a different approach, working not only with the overall rating but with three other values (Attack, Defense, Midfield) together in a more sophisticated manner, to avoid collapsing all features into a single factor that dictates the team’s strength.
The model will be built on top of the Sklearn library, using Pandas dataframes to manipulate the data in tables and Plotly to visualize some interesting features.
Getting The Data
So, the first step in obtaining the data is to build a little crawler and grab information from Fifa Index, a great source of international team stats going back to 2004. Here’s how the tables are laid out on the website:

With tables displayed that way, it’s quite easy to scrape a website like this. For that I used the Beautiful Soup library to access the HTML code and Pandas’ read_html function to transform it into a readable dataframe.
I must admit this crawler ended up a bit lazy, and it might pull some duplicates into our dataset. No concerns, though, since we can just drop those dupes with Pandas later on (I’ll also provide links to the raw datasets throughout this article).
I won’t get into the finer details of how this scraper was built, but the code is below if you’d like to check it out. If you have further interest, don’t hesitate to reach out to me.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# get_soup wasn't shown in the original post; this is a likely implementation.
def get_soup(address):
    return BeautifulSoup(requests.get(address).content, 'html.parser')

web_address = 'https://www.fifaindex.com/teams/fifa'
df = pd.DataFrame()
for day in range(1, 260):
    for pag in range(1, 30):
        source_address = web_address + '05_' + str(day) + '/' + str(pag) + '/' + '?type=1'
        print('Day:', str(day))
        print(pag)
        try:
            soup = get_soup(source_address)
            result_list = soup.find('div', {'id': 'no-more-tables'})
        except Exception:
            print('Page not found.')
            break
        date = str(soup.find('ol', {'class': 'breadcrumb'}))
        if df.empty:
            df = pd.read_html(str(result_list))[0]
            df['date'] = date
        else:
            temp_df = pd.read_html(str(result_list))[0]
            temp_df['date'] = date
            # DataFrame.append was removed in pandas 2.0; concat is the modern equivalent.
            df = pd.concat([df, temp_df], ignore_index=True)
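As mentioned, the lazy crawler may collect duplicate rows. Dropping them is a pandas one-liner; a minimal sketch on toy data (the column names are assumptions, not the real scraped schema):

```python
import pandas as pd

# Hypothetical duplicated scrape: the same team/date row appears twice.
df = pd.DataFrame({
    'name': ['brazil', 'brazil', 'spain'],
    'date': ['2018-06-01', '2018-06-01', '2018-06-01'],
    'ovr': [84, 84, 85],
})

# Keep only the first occurrence of each (name, date) pair.
df = df.drop_duplicates(subset=['name', 'date']).reset_index(drop=True)
print(len(df))
```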
After cleaning up the raw dataset, I saved a tidier version of it (using Pickle), and we’ll use that dataset as our starting point.
df = pd.read_pickle('team_stats.pickle')
df.head()

Exploring And Visualizing
This table contains Fifa ratings for many international teams from 2004 to 2018. The original data includes months and days as well, but to keep it simple I’ve averaged each team’s performance by year, so that we have fewer data points to handle. Let’s check out the teams’ overall scores by year in a scatter chart using Plotly:
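The yearly averaging mentioned above isn’t shown in the post; here is a minimal sketch of how it could be done with a groupby, on toy data with assumed column names:

```python
import pandas as pd

# Toy ratings: two snapshots per team within 2010.
df = pd.DataFrame({
    'name': ['spain', 'spain', 'brazil', 'brazil'],
    'date': ['2010-02-01', '2010-08-01', '2010-02-01', '2010-08-01'],
    'ovr': [86, 88, 84, 86],
})

# Average each team's ratings within a calendar year.
df['year'] = pd.DatetimeIndex(df['date']).year
yearly = df.groupby(['name', 'year'], as_index=False).mean(numeric_only=True)
```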

Plotly has amazing interactive charts that display information when you hover over them. This one was not very informative, though. Let’s check the best-performing teams for each date in a bar chart instead and see how they changed over the years. The plot below shows the best-rated teams by year and the mean performance of all teams (the white line).

That’s a nice chart! Spain has been on top for a long period. Let’s now open the CSV file containing match results for international teams:
results = pd.read_csv('match_results.csv')
results.head()

The results dataset was pulled from GitHub and consists of football match results dating back to 1872, with teams, scores and some other information. Let’s clean it up and keep just the features useful for our purpose.
results = results.drop(['city', 'tournament', 'country'], axis=1)
results.home_team = results.home_team.apply(text_norm)  # text_norm lower-cases team names
results.away_team = results.away_team.apply(text_norm)
results.index = pd.DatetimeIndex(results.date).year
results = results.drop('date', axis=1)
results.head()

Three steps will be important for us to make use of this data:
- First, since we only have stats data from 2004 onward, we must get rid of all the earlier years (unfortunately).
- Second, we should not be working with teams that aren’t in our main dataframe, since we don’t have a rating score for them.
- Finally we have to work around this home-away team situation, since for the World Cup predictions we are going to consider all stadiums being neutral.
results = results.loc[2004:2017]
df_teams = list(df.name.unique())
results = results.reset_index()
for index, row in results.iterrows():
    if row.home_team not in df_teams:
        results.loc[index, 'home_team'] = None
    if row.away_team not in df_teams:
        results.loc[index, 'away_team'] = None
results = results.dropna()
Let’s now feed the team stats into the results dataframe. We’ll do this by creating 8 new columns, representing the 4 skills of each team (home and away). This is what we end up with:
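The post doesn’t show the merge itself; here is a hedged sketch of one way to attach the eight columns, assuming the stats frame has `name`, `year` and the four skill columns (toy data, not the real ratings):

```python
import pandas as pd

stats = pd.DataFrame({
    'name': ['brazil', 'germany'],
    'year': [2014, 2014],
    'att': [85, 84], 'def': [83, 85], 'mid': [84, 86], 'ovr': [84, 85],
})
results = pd.DataFrame({
    'year': [2014],
    'home_team': ['brazil'], 'away_team': ['germany'],
    'home_score': [1], 'away_score': [7],
})

# Suffix the stats columns with 1 (home) and 2 (away), then join on team and year.
merged = results.merge(
    stats.add_suffix('1'), left_on=['home_team', 'year'], right_on=['name1', 'year1']
).merge(
    stats.add_suffix('2'), left_on=['away_team', 'year'], right_on=['name2', 'year2']
)
```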

Now we need to subtract home_score from away_score, so that we have only one factor representing which team wins (negative or positive number of goals). Notice that I’m using the terms "home" and "away" simply because this is the way the original dataset came, but this will not be of any influence in our analysis, since we’ll work only with neutral games in World Cup matches. From now on I’ll just call them Team 1 and Team 2 (being Team 1 always the first to appear in our dataset from left to right).
So let’s compact the number of goals of each team into a single feature and create a winner column, which will be the target to forecast. If winner > 0, Team 1 wins; if < 0, Team 2 wins. We’ll also discard data points with draws, because there is always a winner in knockout matches.
results['score'] = results.home_score - results.away_score
results = results.drop(['home_score', 'away_score', 'home_team', 'away_team'], axis=1)
# Use .loc to avoid pandas' chained-assignment warning.
results['winner'] = 0
results.loc[results.score > 0, 'winner'] = 1
results.loc[results.score < 0, 'winner'] = -1
results = results[results.winner != 0]
To simplify the problem even more, I’m going to "help" the model extract information from the data by providing the difference between skills instead of the raw performances. So, for instance, attack (att) is going to be Team 1’s attack minus Team 2’s attack.
results['att'] = results['att1'] - results['att2']
results['def'] = results['def1'] - results['def2']
results['mid'] = results['mid1'] - results['mid2']
results['ovr'] = results['ovr1'] - results['ovr2']
# Drop a random sample of Team 1 wins to balance the classes
# (Team 1, the original "home" team, wins more often).
to_drop = results[results.winner == 1].sample(247)
results = results.drop(labels=to_drop.index, axis=0)
Here’s how our data looks so far:

Let’s check the correlation between Team 1’s overall and score:

That’s good for us. You can see a positive correlation between these features, which indicates that our data makes sense (the higher a team’s overall, the more goals we should expect it to score).
Creating The Model
To apply Machine Learning, I built a little function that prepares the data: it removes unnecessary attributes, converts the data to array format using Numpy and splits it into training and testing sets, so that we can evaluate the model’s precision.
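That helper isn’t reproduced in the post; here is a hedged sketch of what it likely does, using sklearn’s train_test_split (the function name and defaults are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_data(results, test_size=0.2, seed=42):
    """Split the skill-difference columns into train/test arrays."""
    x = results[['att', 'def', 'mid', 'ovr']].values
    y = results['winner'].astype(int).values
    return train_test_split(x, y, test_size=test_size, random_state=seed)

# Toy frame with the same columns as our results dataframe.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10, 4)), columns=['att', 'def', 'mid', 'ovr'])
toy['winner'] = [1, -1] * 5
x_train, x_test, y_train, y_test = prepare_data(toy)
```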
In this example, I treated the context as a classification problem, in which our targets are:
- 1: Team 1 wins
- -1: Team 2 wins
The classifier algorithms used were a Logistic Regression, a Random Forest Classifier and a Linear Support Vector Classifier:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

lr = LogisticRegression()
lr.fit(x_train, y_train)
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
svc = SVC(kernel='linear')
svc.fit(x_train, y_train)
And here are the accuracy scores for each of these models:
LR → 68.4%
RF → 67.9%
SVC → 70.1%
Here, SVC seems to perform better than the others right out of the box. Let’s check its performance for each class independently:
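Per-class metrics like the ones in the chart can be computed with sklearn’s classification_report; a sketch on dummy labels standing in for y_test and svc.predict(x_test):

```python
from sklearn.metrics import classification_report

# Dummy ground truth and predictions, for illustration only.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
report = classification_report(y_true, y_pred)
print(report)
```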

We can observe slightly better accuracy for Team 1. That may be related to the fact that in the original dataset this team is the home team, which may have had more victories and affected the model’s behavior somehow. But it could just as well be random, so I’ll move on.
Ok. We have a predictor that can guess with 70% accuracy which team is going to win. Enough for building a simulation. Shall we?
Simulating Matches
The first thing we need is performance data for the World Cup teams that passed the group stage. We’ll build a scraper similar to the one we built for the other years, but now using 2018 World Cup stats from Fifa Index.
Unfortunately, this data does not seem to differ much from the regular (non World Cup) 2018 data, since Germany still occupies second place in the ranking while in reality it is already out of the competition. Anyway, we’ll stick to this source to gather our data.
wc = pd.read_pickle('world_cup_teams.pickle')
wc.head()

That’s great. We’re only a few steps from having a simulator now. We’ll use our Machine Learning model as the rule of a Monte Carlo simulation. If you’ve had little or no contact with Monte Carlo methods, I recommend this course from MIT OpenCourseWare to begin.
Let’s build a function that compares the teams’ abilities, evaluates the winner (1 or -1 for Team 1 or Team 2, respectively) with our SVC model, and returns the winner’s name.
def match(wc, team1, team2, model):
    match = pd.DataFrame(columns=['att1','def1','mid1','ovr1','att2','def2','mid2','ovr2'], index=[0])
    match['att1'] = wc[wc.name == team1]['att'].iloc[0]
    match['def1'] = wc[wc.name == team1]['def'].iloc[0]
    match['mid1'] = wc[wc.name == team1]['mid'].iloc[0]
    match['ovr1'] = wc[wc.name == team1]['ovr'].iloc[0]
    match['att2'] = wc[wc.name == team2]['att'].iloc[0]
    match['def2'] = wc[wc.name == team2]['def'].iloc[0]
    match['mid2'] = wc[wc.name == team2]['mid'].iloc[0]
    match['ovr2'] = wc[wc.name == team2]['ovr'].iloc[0]
    match['att'] = match['att1'] - match['att2']
    match['def'] = match['def1'] - match['def2']
    match['mid'] = match['mid1'] - match['mid2']
    match['ovr'] = match['ovr1'] - match['ovr2']
    match = match[['att', 'def', 'mid', 'ovr']]
    match_array = match.values
    prediction = model.predict(match_array)
    winner = None
    if prediction == 1:
        winner = team1
    elif prediction == -1:
        winner = team2
    return winner
Hmm… How will Brazil perform against Spain?
match(wc, 'brazil', 'spain', svc)
>>> 'spain'
Oh no, I didn’t expect that sad ending for us! But there’s a major problem here: team performance varies a lot, and the first stage of this World Cup proved it. So let’s add some randomness, to avoid getting the same result every single time we run a simulation.
def match(wc, team1, team2, model, random_scale=5):
    match = pd.DataFrame(columns=['att1','def1','mid1','ovr1','att2','def2','mid2','ovr2'], index=[0])
    att1 = wc[wc.name == team1]['att'].iloc[0]
    def1 = wc[wc.name == team1]['def'].iloc[0]
    mid1 = wc[wc.name == team1]['mid'].iloc[0]
    ovr1 = wc[wc.name == team1]['ovr'].iloc[0]
    att2 = wc[wc.name == team2]['att'].iloc[0]
    def2 = wc[wc.name == team2]['def'].iloc[0]
    mid2 = wc[wc.name == team2]['mid'].iloc[0]
    ovr2 = wc[wc.name == team2]['ovr'].iloc[0]
    match['att1'] = np.random.normal(att1, scale=random_scale)
    match['def1'] = np.random.normal(def1, scale=random_scale)
    match['mid1'] = np.random.normal(mid1, scale=random_scale)
    match['ovr1'] = np.random.normal(ovr1, scale=random_scale)
    match['att2'] = np.random.normal(att2, scale=random_scale)
    match['def2'] = np.random.normal(def2, scale=random_scale)
    match['mid2'] = np.random.normal(mid2, scale=random_scale)
    match['ovr2'] = np.random.normal(ovr2, scale=random_scale)
    match['att'] = match['att1'] - match['att2']
    match['def'] = match['def1'] - match['def2']
    match['mid'] = match['mid1'] - match['mid2']
    match['ovr'] = match['ovr1'] - match['ovr2']
    match = match[['att', 'def', 'mid', 'ovr']]
    match_array = match.values
    prediction = model.predict(match_array)
    winner = None
    if prediction == 1:
        winner = team1
    elif prediction == -1:
        winner = team2
    return winner
Here, random_scale is the factor that determines how much randomness we want to apply to a team’s performance.
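To see what random_scale does, here’s a quick sketch of how a team’s sampled rating spreads around its nominal value (the nominal rating of 85 is made up):

```python
import numpy as np

np.random.seed(0)
nominal_att = 85
# Draw the rating many times with the same scale used in match().
samples = np.random.normal(nominal_att, scale=5, size=10_000)
print(samples.mean(), samples.std())
```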
The next and last step is to create a function that runs the match function many times and computes each team’s probability of victory.
def simulate_matches(team1, team2, n_matches=10000):
    match_results = []
    for i in range(n_matches):
        match_results.append(match(wc, team1, team2, svc, random_scale=5))
    team1_proba = match_results.count(team1) / len(match_results) * 100
    team2_proba = match_results.count(team2) / len(match_results) * 100
    print(team1, str(round(team1_proba, 2)) + '%')
    print(team2, str(round(team2_proba, 2)) + '%')
    print('-------------------------')
    print()
    if team1_proba > team2_proba:
        overall_winner = team1
    else:
        overall_winner = team2
    return {'team1': team1,
            'team2': team2,
            'team1_proba': team1_proba,
            'team2_proba': team2_proba,
            'overall_winner': overall_winner,
            'match_results': match_results}
Let’s check how hard it will be for Croatia to beat Denmark next Sunday:
simulation_test = simulate_matches('croatia', 'denmark', n_matches=10000)
Croatia: 40.62% Denmark: 59.38%
Ok, here you see the model estimating a higher probability of Denmark beating Croatia, and that may be because the Fifa Index dataset is probably not accounting for Croatia’s performance since the World Cup began.
Let’s measure the difference between both teams probabilities as we run more and more simulations:
p_list = []
for i in range(len(simulation_test['match_results'])):
    # Running win percentage over the first i+1 simulated matches.
    denmark = simulation_test['match_results'][:i + 1].count('denmark') / (i + 1) * 100
    croatia = simulation_test['match_results'][:i + 1].count('croatia') / (i + 1) * 100
    p_list.append(denmark - croatia)

We can see the winner probabilities stabilizing at around 8 thousand simulated matches, so that’s the value we’re going to use. Let’s build the championship tree:
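The tree traversal itself isn’t shown as code in the post; here is a structural sketch, with a random stub standing in for simulate_matches (the pairings follow the actual 2018 round of 16):

```python
import random

random.seed(42)

def fake_simulate(team1, team2):
    """Stand-in for simulate_matches: picks a winner at random."""
    return {'overall_winner': random.choice([team1, team2])}

# Round-of-16 bracket: adjacent teams play each other.
round_of_16 = ['france', 'argentina', 'uruguay', 'portugal',
               'spain', 'russia', 'croatia', 'denmark',
               'brazil', 'mexico', 'belgium', 'japan',
               'sweden', 'switzerland', 'colombia', 'england']

# Each round pairs adjacent winners until one champion remains.
teams = list(round_of_16)
while len(teams) > 1:
    teams = [fake_simulate(teams[i], teams[i + 1])['overall_winner']
             for i in range(0, len(teams), 2)]
print('Champion:', teams[0])
```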

And that’s it. A full simulation of the 2018 World Cup – Knockout Stage!
Considering that our model has an error of about 30%, let’s calculate the chance of Spain beating all four opponents and actually becoming champion:
spain_proba = 0.817 * 0.862 * 0.718 * 0.593 * 100 * (0.7 ** 4)
print('Chance of Spain winning:', str(round(spain_proba,2)) + '%')
Chance of Spain winning: 7.2%
Well, that seems quite fair to me. I’m still hoping for Brazil to prove it wrong, though!
That’s all for this post. This article was written to show how Machine Learning can be used to calculate probabilities in a simulation; it does not attempt to actually get the results right, since the data used is not enough for that (or maybe the event itself is simply not predictable). Please take this just as a tutorial that uses World Cup matches because they’re a cool and topical subject. A much deeper approach would be needed to sharpen the results and make them meaningful at any level!
Some steps that could be taken to improve the model are:
- Trying more complex Machine Learning algorithms and fine-tuning hyperparameters
- Gathering more data for training, not only from international teams but also national ones
- Going even further and building a model based on player statistics; this article has a great introduction to that approach
This is my first post, so thank you if you’ve read this far. I’ll try to post nice material related to data science and machine learning every month. Feel free to leave any comments or concerns, and if you enjoyed it, don’t forget to clap! Thank you, and see you in the next post.