The next Milan derby will kick off next week, to provide a bit more information and visualisation to my football fan husband as well as building my python learning, I grabbed the data from Kaggle regarding Serie A and Wikipedia, and tried to develop an evidence-based visualisation.

Prerequisite:
Python packages for this analysis:
import Pandas as pd
import glob
import matplotlib.pyplot as plt
import numpy as np
import Plotly.express as px
Step 1: Get the Data
The data used in the current analysis was from Kaggle (https://www.kaggle.com/massibelloni/serie-a-19932017), from where I downloaded all Serie A data from 1993 to 2017 and saved to local folders.
Step 2: Merge the Data
The Serie A data downloaded from Kaggle were presented by year. To get an overview of Serie A, I merged all data first before filtering AC Milan / Inter Milan results.
Here I used the function glob:
##set up the path
path = r'.....SerieA'
all_files = glob.glob(path + "/*.csv")
##read all files
tb = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
tb.append(df)
##combine all files
frame = pd.concat(tb, axis=0, ignore_index=True)
Step 3: Filter out all matches between AC Milan and Inter Milan
According to the dataframe, it had HomeTeam and Away Team (see Figure 1 below). We need to get AC Milan as the home team and Inter Milan as the away team, and also the other way round.

First, I was trying to get the matches where AC Milan was the home team and Inter Milan was the away team:
df1= frame[(frame['HomeTeam'] == 'Milan') & (frame['AwayTeam'] == 'Inter')]
Then, Inter Milan was the home team and AC Milan was the away team:
df2 = frame[(frame['HomeTeam'] == 'Inter') & (frame['AwayTeam'] == 'Milan')]
To combine these two dataframes into one:
ddff = [df2,df1]
result = pd.concat(ddff)
We are nearly there with the data clean! As I only wanted to know the first few columns of the data, I dropped out the rest columns:
data = result.iloc[:, 0:7]
Step 4: Complete the dataset for visualisation
A few more things I did to make the visualisation easier: 1) add the match data from 1997 to 2020 as those data were missing from the downloaded file; 2) get the "year" of the matches date; 3) add two columns – one as the "winner team", and the other as the "total goals" for each game:
## add data from 1997 to 2020 - a bit messy method was used here:
new_row = {'Date': '15/10/17', 'HomeTeam':'Inter', 'AwayTeam':'Milan', 'FTHG':'3','FTAG':'2', 'FTR': 'H'}
data = data.append(new_row, ignore_index = True)
new_row = {'Date': '27/12/17', 'HomeTeam':'Milan', 'AwayTeam':'Inter', 'FTHG':'1','FTAG':'0', 'FTR': 'H'}
data = data.append(new_row, ignore_index = True)
new_row = {'Date': '04/04/18', 'HomeTeam':'Milan', 'AwayTeam':'Inter', 'FTHG':'0','FTAG':'0', 'FTR': 'D'}
data = data.append(new_row, ignore_index = True)
new_row = {'Date': '21/10/18', 'HomeTeam':'Inter', 'AwayTeam':'Milan', 'FTHG':'1','FTAG':'0', 'FTR': 'H'}
data = data.append(new_row, ignore_index = True)
new_row = {'Date': '17/03/19', 'HomeTeam':'Milan', 'AwayTeam':'Inter', 'FTHG':'2','FTAG':'3', 'FTR': 'A'}
data = data.append(new_row, ignore_index = True)
new_row = {'Date': '09/02/20', 'HomeTeam':'Inter', 'AwayTeam':'Milan', 'FTHG':'4','FTAG':'2', 'FTR': 'H'}
data = data.append(new_row, ignore_index = True)
new_row = {'Date': '18/10/20', 'HomeTeam':'Inter', 'AwayTeam':'Milan', 'FTHG':'1','FTAG':'2', 'FTR': 'A'}
data = data.append(new_row, ignore_index = True)
Prepare the columns for visualisation:
## Add the column - "Year" of each match
data['year'] = pd.DatetimeIndex(data['Date']).year
data = data.sort_values('year')
## Add the column - "Winner" of each match
data['Winner'] = np.where(data['FTR'] == 'A', data['AwayTeam'], (np.where(data['FTR'] == 'H', data['HomeTeam'], 'Draw')))
## Add the column - "TotalGoal" of each match
data["FTHG"] = pd.to_numeric(data["FTHG"])
data["FTAG"] = pd.to_numeric(data["FTAG"])
data['TotalGoal'] = data['FTHG'] + data['FTAG']
Step 5: Visualisation with Plotly
Here I created two charts:
Bubble Chart:
The chart showing the relationship between "match date", "total goals in each match", and the "winner of the game":
fig = px.scatter(data, x="year", y="TotalGoal",size="TotalGoal", color="Winner",
color_discrete_map={'Milan':'red',
'Inter':'royalblue',
'Draw':'lightgreen'},
hover_name="Winner", log_x=True, size_max=20)
fig.show()

Pie Chart:
The other chart is about the proportion of the winning in the past years. To get the pie chart, the dataframe needed to be re-structured with groupby:
## Restructure the dataframe with groupby
counts = data['TotalGoal'].groupby(data['Winner']).count().reset_index(name="count")
## Pie chart
fig1 = px.pie(counts, values ='count', names ='Winner', title='Milan Derby', color = 'Winner',
color_discrete_map={'Milan':'red',
'Inter':'royalblue',
'Draw':'lightgreen'})
fig1.show()

Thoughts
By reading these two charts, here are some thoughts on Milan Derby.
- It is clear that there aren’t big differences between AC Milan and Inter Milan in winning the matches. Inter Milan won slightly more games than AC Milan;
- AC Milan was more likely to win in the matches before 2010 than Inter Milan, while Inter Milan got more chances to win after 2010;
- When the total goals of the matches over 4, Inter Milan seems more likely to win the match.
Those thoughts seem aligned with the teams’ performance in those years. I cannot wait to watch the next derby!