The world’s leading publication for data science, AI, and ML professionals.

Analyzing 120 Years of Olympics History with SQL

Hands-on SQL tutorial

Photo by Matt Lee on Unsplash
Photo by Matt Lee on Unsplash

I have come across a great dataset on Kaggle. It was shared under creative commons license so we can use and share it publicly. I thought it would be exciting to learn about the history of olympics.

There are many alternative tools for performing an analysis on this dataset. I have chosen Sql for no particular reason. In this article, we will be learning about the Olympics while practicing SQL.


Note: SQL is used by many relational database management systems such as MySQL, SQL Server, PostgreSQL, and so on. Although they mostly adopt the same SQL syntax, there might be small differences. In this article, we will be using PostgreSQL.


I have selected some of the columns and uploaded the data to an SQL table. Let’s start by taking a look at the olympics table.

SELECT * FROM olympics LIMIT 5
(image by author)
(image by author)

We have the name of the athletes along with their gender, age, and the sport they play. The table also contains details about the games. The original dataset contains 271116 rows but I only uploaded 200000 to the olympics table. If you do the same analysis on the entire dataset, the results might slightly be different.

I would like to see how many different athletes have participated in the olympics. We can count the distinct number of names.

SELECT COUNT(DISTINCT(name)) FROM olympics
99875

The table contains almost 100k different athletes. I wonder how many of them won a medal.

SELECT COUNT(DISTINCT(name)) FROM olympics
WHERE medal IS NOT NULL
20560

We have just added a condition using the where statement.

Let’s also find the top 3 countries in terms of the total number of medals.

SELECT team, COUNT(medal) AS medal_won
FROM olympics
WHERE medal IS NOT NULL
GROUP BY team
ORDER BY medal_won DESC
LIMIT 3
(image by author)
(image by author)

The query above filters the rows in which the medal column does not have a null value. Then, the rows are grouped by the team column and the rows are counted which gives us the number of medals won. Finally, the result set is ordered by the medal count in descending order and the first three are selected.

It would be interesting to know the average age of female and male athletes who won a medal.

SELECT sex, AVG(age) AS avg_age
FROM olympics
WHERE medal IS NOT NULL
GROUP BY sex
(image by author)
(image by author)

The female athletes are two years younger than the male athletes.

We can also check the top 5 sports in which female athletes have won the most medals.

SELECT sport, COUNT(medal) AS medal_count 
FROM olympics
WHERE sex = 'F'
GROUP BY sport
ORDER BY medal_count DESC
LIMIT 5
(image by author)
(image by author)

Consider we are interested in the year of the games. The table does not have a separate year column but it can be extracted from the games column.

SELECT LEFT(games, 4) AS year, games
FROM olympics
LIMIT 5
(image by author)
(image by author)

The left function allows for selecting parts from strings. It takes the column name and the number of characters to be selected. The right function does the same but counts from the right.

The next example is more of an SQL practice than exploring the dataset. Consider we only need the data after 1950. We can either delete the rows that belong to the games in or before 1950 or create a new table with the data we are interested in.

The following query creates a new table using data in the olympics table.

CREATE TABLE olympics_after_1950 AS
SELECT * FROM olympics
WHERE CAST(LEFT(games, 4) AS INTEGER) > 1950

Let’s check if the query has worked correctly.

SELECT MIN(LEFT(games, 4)) FROM olympics_after_1950
1952

The create table statement is used for creating a table, surprisingly 😃 . We can populate the new table by selecting data from a different table using the select statement along with the create table statement.

We could also update the olympics table by deleting the rows that belong to the games in or before 1950. I want to keep the original table as is. We can practice the update statement on the new table.

Let’s delete the rows before 1970 in the new table and check the minimum date afterwards.

DELETE FROM olympics_after_1950
WHERE CAST(LEFT(games, 4) AS INTEGER) < 1970
SELECT MIN(LEFT(games, 4)) FROM olympics_after_1950
1972

We have done several examples to gain insight into the history of olympics. We have also practiced basic SQL statements and functions.

Thank you for reading. Please let me know if you have any feedback.


Related Articles