
Going for Gold: Predicting medal outcomes in the Olympics using Generalized Linear Modeling

A fun introduction to GLMs using an Olympics dataset

Can we build a generalized linear model of an athlete’s chances of winning a medal?

Photo by Kyle Dias on Unsplash

Like many, I’ve been enjoying the Tokyo Olympics from the comfort of my home. Being the data scientist I am, I spent some of the tennis matches fiddling with a Kaggle dataset of Olympians, which lists features such as each athlete’s age, sex, and nationality, along with which medal (if any) they won. I wanted to see whether I could build a Generalized Linear Model of an athlete’s chances of winning a medal based on these features.

Motivation

Based on pertinent features, such as age, height, and nationality, can we predict the likelihood of an individual athlete winning a medal in an Olympic sport?

Dataset

We will use a comprehensive dataset from Kaggle on the Olympic games – both summer and winter – spanning from Athens 1896 to Rio 2016. The following features are measured for each athlete:

  • ID (unique identifier for each athlete to account for duplicate names)
  • Name
  • Sex (F for female, M for male)
  • Age
  • Height (in cm)
  • Weight (in kg)
  • Team
  • NOC – three-letter National Olympic Committee code for each country (e.g., USA: United States of America)
  • Games (e.g., 2008 Summer, 2010 Winter)
  • Year
  • Season (Summer or Winter)
  • City (e.g., Tokyo)
  • Sport (e.g., Tennis)
  • Event (e.g., Tennis Women’s Singles)
  • Medal (Gold, Silver, Bronze, NA)

Initial Data Exploration

First let’s load in our dataset:

Image by Author
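The loading step (shown as a screenshot in the original) is a single `pd.read_csv` call. A minimal sketch, assuming the Kaggle file is named `athlete_events.csv`; here we read a tiny inline stand-in so the snippet runs anywhere:

```python
import io

import pandas as pd

# With the real Kaggle download you would simply call:
#   df = pd.read_csv("athlete_events.csv")
# Below, a two-row stand-in with the same 15 columns keeps the snippet self-contained.
csv = io.StringIO(
    "ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal\n"
    "1,A Doe,M,23,180,80,United States,USA,2016 Summer,2016,Summer,Rio de Janeiro,"
    "Swimming,Swimming Men's 100m Freestyle,Gold\n"
    "2,B Roe,F,21,170,60,Norway,NOR,2014 Winter,2014,Winter,Sochi,"
    "Cross Country Skiing,Cross Country Skiing Women's 10km,\n"
)
df = pd.read_csv(csv)
print(df.shape)                  # (2, 15)
print(df.columns.tolist()[:5])   # ['ID', 'Name', 'Sex', 'Age', 'Height']
```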

Let’s see which country has accumulated the most gold medals:
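The query behind this output is a filter plus `value_counts`; a minimal sketch on a toy DataFrame (the real `df` comes from the Kaggle CSV):

```python
import pandas as pd

# Toy stand-in: one row per athlete-event, as in the Kaggle data.
df = pd.DataFrame({
    "NOC":   ["USA", "USA", "NOR", "USA", "NOR", "GER"],
    "Medal": ["Gold", "Gold", "Gold", None, "Silver", "Gold"],
})

# Keep gold-medal rows, then count how many each country has (descending).
gold_by_noc = df[df["Medal"] == "Gold"]["NOC"].value_counts()
print(gold_by_noc)   # USA 2, NOR 1, GER 1
```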

USA    1131
URS     471
GER     319
GBR     289
FRA     264
       ... 
HAI       1
FIJ       1
PER       1
HKG       1
SYR       1
Name: NOC, Length: 109, dtype: int64

We see that the USA has taken home the most gold, with a whopping 1131 gold medals as of 2016, which makes sense, as it has competed in the Olympics since the Games’ inception in 1896.

Let’s see which sport the USA has earned the most medals in:
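This breakdown chains two filters before counting; a hedged sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "NOC":   ["USA", "USA", "USA", "NOR"],
    "Sport": ["Athletics", "Swimming", "Athletics", "Biathlon"],
    "Medal": ["Gold", "Gold", "Gold", "Gold"],
})

# American gold medalists only, then count by sport.
usa_gold = df[(df["NOC"] == "USA") & (df["Medal"] == "Gold")]
print(usa_gold["Sport"].value_counts())   # Athletics 2, Swimming 1
```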

Athletics                    344
Swimming                     246
Shooting                      54
Wrestling                     52
Boxing                        50
Diving                        48
Gymnastics                    35
Rowing                        33
Speed Skating                 29
Basketball                    23
Tennis                        21
Sailing                       19
Weightlifting                 16
Cycling                       16
Alpine Skiing                 16
Figure Skating                15
Archery                       14
Equestrianism                 11
Snowboarding                  10
Freestyle Skiing               8
Bobsleigh                      7
Beach Volleyball               6
Synchronized Swimming          5
Canoeing                       5
Short Track Speed Skating      4
Fencing                        4
Football                       4
Art Competitions               4
Skeleton                       3
Golf                           3
Softball                       3
Ice Hockey                     3
Water Polo                     3
Rugby                          3
Volleyball                     3
Judo                           2
Taekwondo                      2
Nordic Combined                1
Jeu De Paume                   1
Baseball                       1
Triathlon                      1
Roque                          1
Tug-Of-War                     1
Polo                           1
Name: Sport, dtype: int64

So it looks like the USA has dominated in athletics with 344 gold medals, followed by swimming with 246 gold medals. However, I should note that the actual number of gold medals in athletics is 335 as of the Rio games in 2016. Most of the remaining sports have the correct tallies as of 2018, when this data was compiled, but there are some slight discrepancies in how medals have been counted in the past. For example, in fencing, where the data indicate the USA has won 4 gold medals, some sources report 3 instead. This stems from a 1904 dispute in which the IOC counted American Albertson Van Zo Post as representing Cuba rather than the United States; this dataset assumes the American representation. I presume similar disputes in the early years of the Olympics explain the slightly inflated athletics count, so I would encourage you to cross-check against official Olympic sources before drawing conclusions from this dataset in your own analyses. Nevertheless, the majority of the statistics above, such as those for swimming and tennis, are correct.

We can divide the sports by which season they occurred in: winter or summer. Since athletics and swimming are summer sports, let’s see which winter sport the USA has amassed the most gold medals in:
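Adding the Season column to the mask restricts the count to the winter games; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "NOC":    ["USA", "USA", "USA"],
    "Season": ["Winter", "Summer", "Winter"],
    "Sport":  ["Speed Skating", "Swimming", "Speed Skating"],
    "Medal":  ["Gold", "Gold", "Gold"],
})

# Chain a third condition to keep winter results only.
mask = (df["NOC"] == "USA") & (df["Medal"] == "Gold") & (df["Season"] == "Winter")
winter_gold = df[mask]["Sport"].value_counts()
print(winter_gold)   # Speed Skating 2
```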

Speed Skating                29
Alpine Skiing                16
Figure Skating               14
Snowboarding                 10
Freestyle Skiing              8
Short Track Speed Skating     4
Skeleton                      3
Ice Hockey                    3
Nordic Combined               1
Name: Sport, dtype: int64

The USA takes home the most gold in speed skating, with 29 medals, but it seems like summer is where the USA really shines. Let’s see which country has accrued the most gold in the winter games.

NOR    111
USA     96
GER     86
URS     77
CAN     62
AUT     59
SUI     50
SWE     50
RUS     49
FIN     42
GDR     39
ITA     37
NED     37
FRA     31
KOR     26
CHN     12
FRG     11
GBR     11
JPN     10
EUN      9
CZE      7
BLR      6
AUS      6
POL      6
CRO      4
EST      4
SVK      2
SLO      2
UKR      2
LIE      2
TCH      2
ESP      1
KAZ      1
IND      1
BEL      1
NEP      1
UZB      1
BUL      1
Name: NOC, dtype: int64

NOR is the NOC code for Norway, which leads the winter games with 111 gold medals according to this dataset. However, the official tally is 118 as of the Sochi 2014 Games (the most recent Winter Games in this dataset); the dataset is missing some results from a few years, such as 1992 and 1994. Nevertheless, as with athletics, this isn’t a huge deviation.

We can then see which sport Norway has dominated in the winter games:

Cross Country Skiing    33
Speed Skating           25
Biathlon                15
Nordic Combined         13
Alpine Skiing           10
Ski Jumping              9
Figure Skating           3
Freestyle Skiing         2
Curling                  1
Name: Sport, dtype: int64

So it turns out cross country skiing is where Norway excels, with 33 gold medals.

We can visualize these medal distributions using the crosstab function from pandas:
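A minimal crosstab sketch on toy data; the charts in the original are essentially `table.plot(kind="bar")` over a table like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Sport": ["Speed Skating", "Speed Skating", "Alpine Skiing"],
    "Medal": ["Gold", "Silver", "Gold"],
})

# Rows are sports, columns are medal types, cells are counts.
table = pd.crosstab(df["Sport"], df["Medal"])
print(table)
# For a quick chart: table.plot(kind="bar")
```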

Image by Author

And compare to the USA

Image by Author

Note that not every winter sport is represented, as the USA hasn’t won a medal in every winter sport. For example, Biathlon does not appear in the USA chart, and the Norwegian chart similarly omits a few sports.

Generalized Linear Modeling

Now that we’ve done some basic summary statistics, let’s see if we can build a model to predict the chances of an individual athlete (not a team/country) winning a medal. There are various ways to do this, such as training a neural network, but in this article, we will focus on generalized linear models (GLM).

Recall that linear regression models a continuous variable y ~ N(μ, σ²) as a linear combination of some explanatory variables X. For example, we could predict the weight of an Olympic swimmer based on their height and sex as follows:

Where the β‘s are coefficients for each feature of the i-th athlete – in this case, height and sex – and ϵ is the residual error term. This can be described more generally as:
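The two equations (rendered as images in the original) take roughly this form:

```latex
% Swimmer example: weight modeled from height and sex
\mathrm{weight}_i = \beta_0 + \beta_1\,\mathrm{height}_i + \beta_2\,\mathrm{sex}_i + \epsilon_i,
\qquad \epsilon_i \sim N(0, \sigma^2)

% General form with p explanatory variables
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i
```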

For a more in-depth discussion of linear regression, see [here](https://medium.com/analytics-vidhya/understanding-the-linear-regression-808c1f6941c0).

Linear regression models like these satisfy three key properties:

  1. They have a random component y – our dependent variable – which follows a normal distribution.
  2. A linear predictor, defined as above, that is more generally called η.
  3. A link function g that connects the linear predictor to the mean μ. In the case of the linear model, g is the identity link function, so g(μᵢ) = μᵢ = ηᵢ.

Now suppose we want to predict the number of medals each country has won since the Olympics began. Because our y variable counts the number of medals, it is a discrete variable, which violates linear regression’s assumption of a continuous, normally distributed response. In this circumstance, a more appropriate model would be a Poisson regression model. This is an example of a generalized linear model (GLM).

GLMs are generalized in the sense that, while there is still a linear model at the root of it all, the random component y can be any member of the exponential family of distributions (e.g., Normal, Gamma, Poisson), and the link function can be any function so long as it is smooth and invertible.

Image by Author

Suppose we want to model whether or not an athlete won a medal: Bronze, Silver, or Gold. Because this is a binary outcome, our response variable y should follow a Bernoulli distribution y ~ Bin(1,p)=Ber(p), and the link function of our linear predictor is defined as follows:

where

The inverse logit function is called the logistic function, which models the probability of winning a medal based on the linear predictor. Since usually n = 1 (hence giving us μᵢ= pᵢ), this model is called logistic regression.
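In LaTeX, the logit link and its inverse (shown as images in the original) are:

```latex
% Logit link: maps the mean p_i in (0, 1) onto the real line
g(p_i) = \log\frac{p_i}{1 - p_i} = \eta_i

% Inverse link (the logistic function): recovers the probability
p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}}
```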

We will use swimming as an example for our GLM with a logistic link function:

Note that some of the height and weight entries are NaNs, so we will remove those rows, as well as any missing ages for good measure:

We will also recode our response variable, Medal, as 0 for NA and 1 for Gold, Silver, or Bronze:

Lastly we will codify Sex as 0 for Male and 1 for Female:
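The three cleaning steps above can be sketched in a few lines (toy data standing in for the swimming slice of the Kaggle CSV):

```python
import pandas as pd

# Toy slice of the swimming data with one missing height.
swim = pd.DataFrame({
    "Sex":    ["M", "F", "M"],
    "Age":    [23.0, 21.0, 25.0],
    "Height": [180.0, None, 175.0],
    "Weight": [80.0, 60.0, 72.0],
    "Medal":  ["Gold", None, None],
})

# 1. Drop rows missing height, weight, or age.
swim = swim.dropna(subset=["Height", "Weight", "Age"])
# 2. Medal -> 1 for Gold/Silver/Bronze, 0 for NA.
swim["Medal"] = swim["Medal"].notna().astype(int)
# 3. Sex -> 0 for male, 1 for female.
swim["Sex"] = swim["Sex"].map({"M": 0, "F": 1})
print(swim)
```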

Now we’re ready to build our model. We will use the statsmodels and patsy packages, which you can install using pip. statsmodels contains the functions we will need to fit our GLM, while patsy is a user-friendly library for specifying statistical models that we can then fit with the former package. Let’s start off with something simple and model the relationship between age and medal outcome. We’ll first install the packages and run the following code:

pip install statsmodels
pip install patsy

Running this code, we obtain the following output:

Image by Author

There’s a lot to take in with this output, but the main items of interest I’d like you to focus on are the two rows, Intercept and Age, as well as Deviance. For each of the former two rows, they list the estimated coefficient beta for that parameter, the standard error, the z-score, the p-value indicating the significance of this predictor in the model, and a confidence interval of the estimated beta coefficient.

Deviance is a generalization of the residual sum of squares that quantifies the goodness of fit for GLMs. Ideally, we want this metric to be smaller than n − p, where n is the number of observations (18776) and p is the number of parameters in our model (2). Our deviance is thus 14645 < 18776 − 2 = 18774. So the model passes this rough goodness-of-fit check, but let’s try to visualize it. statsmodels doesn’t have an intuitive plotting library for GLMs, but we can easily code it with the matplotlib plotting library, plotting the fitted probabilities against age.

Image by Author

This doesn’t really give us a good fit, which makes sense, as age alone wouldn’t necessarily be associated with better performance. Let’s try adding in additional features, like height, weight, and sex.

Plotting against height, we obtain a slightly better fit, as evidenced by the drop in deviance from 14645 to 14202, but it’s still nowhere near a robust fit to our data. Age is not a significant predictor (its p-value is high), which isn’t surprising given our earlier age-only model.

Recall that the USA’s second-best sport was swimming, so we can add being on Team USA as an explanatory variable. Unlike the binary coding of sex in this dataset (which doesn’t mention any nonbinary athletes), country (NOC) is a categorical variable that spans more than two classes, so we need to modify it so that we have a column specifying whether an individual athlete represents the USA or not. pandas has a function get_dummies() that creates a binary dummy variable for each category of an input column.
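A minimal get_dummies sketch:

```python
import pandas as pd

df = pd.DataFrame({"NOC": ["USA", "CAN", "USA"], "Medal": [1, 0, 1]})

# Replaces the NOC column with one 0/1 column per country, prefixed "NOC_".
df = pd.get_dummies(df, columns=["NOC"])
print(df.columns.tolist())   # ['Medal', 'NOC_CAN', 'NOC_USA']
```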

Image by Author

This removes the NOC column and, in its stead, adds binary columns for each country prefixed with "NOC_", indicating whether an individual athlete represented that country. Thus, ‘NOC_USA’ tells us whether a competing athlete represented the United States, and ‘NOC_CAN’ tells us whether an athlete represented Canada.

Let’s now add this new variable to our model and see if it produces a better fit to our swimming data. We’ll remove age due to its poor performance and plot by height again.

This gives us a much better fit than the previous models, as shown both visually and by the notable drop in deviance from 14202 to 11922. Given the United States’ reputation in swimming, it makes sense that being American would boost one’s chances of success based on the data. If we establish a cutoff probability of 0.6 or 0.7 as a threshold for predicting medal outcome, we would obtain a close fit to the binary data.

Conclusion

In this article, we learned about GLMs using a comprehensive Olympics dataset as an example in the case of logistic regression, picking up some data formatting/cleaning techniques along the way. While GLMs can be a powerful tool for statistical modeling, they do have their drawbacks. If you have a dataset with a large number of features, you may end up overfitting your model, or you may hit computational bottlenecks depending on the size of your dataset. In addition, there are features not in this dataset that could have improved our model, such as athletic performance in non-Olympic venues (e.g., Wimbledon for tennis). There are also other modeling options that may be more effective, such as support vector machines or neural networks. Finally, historic datasets like these are likely to contain missing values or errors arising from disputed records, so keep that in mind as you draw conclusions from this and similar datasets.

I hope you enjoyed this article. I encourage you to play around with this dataset and do your own analysis. Let me know about your experience in the comments below, or reach out if you have any questions. Happy coding!

