The world’s leading publication for data science, AI, and ML professionals.

Navigating Time Series With SQL

Consolidate your SQL knowledge by analyzing sequential data

Photo by Lorenzo from Pexels
Photo by Lorenzo from Pexels

One of my biggest mistakes starting out with SQL was that I attempted to learn everything through conceptualization.

I mean, if you have a programming background, you should be fine as long as you understand the logic behind SQL keywords and functions, right?

Wrong.

Learning to write SQL by studying the syntax alone is like learning to speak English by reading a dictionary; it is simply insane.

Your time would be better spent applying your knowledge in projects as soon as you think you have reached a certain milestone.

That being said, I find that many people shoot themselves in the foot by only picking relatively simple datasets for their projects. With simpler datasets, you are limited to the use of the more basic SQL keywords and functions.

This is a great place to start, but it denies you the chance to apply more advanced techniques, such as CTEs and window functions.

Understand that real data is messy and unorganized; it isn’t as friendly as the prime and proper tables you see in tutorials.

So, if you want to exhibit your analytical prowess and SQL know-how, it would be best to take on datasets that showcase those traits.

Times Series With SQL

Working with a time series dataset can be conducive to your SQL learning for many reasons.

Time series data, by nature, store records that are not independent of each other. Analyzing such data will require conducting more complex calculations between columns and between rows.

This will require you to demonstrate your understanding of both simple and advanced SQL techniques.

Thus, I strongly advocate performing a time series analysis for those looking to consolidate their understanding of SQL principles.

Case Study

Let’s try writing an SQL script for a dataset showing a time series of vaccinations in each country.

The dataset that will be used for this demonstration can be accessed here (copyright-free). The Vaccinations table shows various vaccination statistics of different countries collected on a daily basis.

For the sake of reducing the scope of the analysis, let’s limit our study by focusing on 3 features: the country, date, and number of daily vaccinations.

Note: The following code snippets are run in SQL server, so the syntax might slightly deviate from the SQL dialect you’re most familiar with.

Preview Of Code Output (Created By Author)
Preview Of Code Output (Created By Author)

What if we were not interested in this amount of information or this level of granularity?

Let’s say that we are only interested in observing vaccinations in each country on a monthly basis.

Also, instead of every country in the world, let’s also say that we are only interested in vaccinations in the UK (i.e. England, Scotland, Wales, and Northern Ireland).

We can filter out unwanted records and convert the daily vaccination records into monthly records with the following query.

Preview Of Code Output (Created By Author)
Preview Of Code Output (Created By Author)

Any analysis of the vaccinations in the UK on a monthly level will require the use of this generated table, so let’s store this data into a temp table called "#Monthly_Vaccinations" for convenience.

With this temp table, we can start extracting insights from the data to learn more about the vaccination programs in each country.

What is the 3-month moving average of the vaccinations in each country?

Moving averages are frequently used in time series analysis since it helps reduce noise by limiting the influence of erratic changes in the observed variable.

To compute the 3-month moving average, we need to:

  1. Find the vaccination values for the current month and the two preceding months.
  2. Find the average number of vaccinations within the 3 months for each row.

Fortunately, we can accomplish all of this in one line of code using a window function.

Preview Of Code Output (Created By Author)
Preview Of Code Output (Created By Author)

What is the percent change in monthly vaccinations in each country?

Another important variable to consider is the percent change in vaccinations. After all, any dramatic rise or decline in vaccinations will be of interest to public health officials.

To find the percent change in monthly vaccinations, we need to:

  1. Find the number of vaccinations in the previous month for each row
  2. Compute the percent change in vaccinations for each row given the current and previous vaccination values

We can find the previous vaccination value for each record with the LAG function. Note that the previous vaccination value for the first row is 0.

Preview Of Code Output (Created By Author)
Preview Of Code Output (Created By Author)

This table gives us all that we need to compute the percent change in vaccinations for each row. After storing it in a CTE, we can compute the percent change in vaccinations in each month for each country.

Preview Of Code Output (Created By Author)
Preview Of Code Output (Created By Author)

Which months registered the lowest vaccinations for each country?

To identify the months with the worst performance for each country, we need to:

  1. Rank the records in terms of the number of vaccinations (lowest to highest)
  2. Select the rows that have the highest rank (i.e. rank=1)

We can use the RANK function to order the records of each country in terms of the number of vaccination records.

Preview of Code Output (Created By Author)
Preview of Code Output (Created By Author)

After storing this data with a CTE, we can follow up with a query to select the records in each country that rank the highest.

Code Output (Created By Author)
Code Output (Created By Author)

All four countries register the least number of vaccinations in month 9 (September).

The RANK function allows us to conduct a more in-depth search of the months with the poorest performance.

For example, if we wanted to find the 3 worst months for each country, we would simply need to tweak the WHERE statement.

Code Output (Created By Author)
Code Output (Created By Author)

Which country is currently the most successful in administering vaccines?

One good approach for directly comparing countries would be to find the ratio of total vaccinations to the population for each country.

Considering the countries’ populations is logical. It would be unfair to directly compare each country in terms of total vaccinations since the countries have a large disparity in population sizes.

To find the country with the best vaccination to population ratio, we need to:

  1. Find the total vaccines (cumulative) administered in each country
  2. Find the population of each country
  3. Calculate the vaccination to population ratio for each record
  4. Keep only the records from the latest month

Firstly, we need to know how many total vaccines have been administered in each country. Currently, we are only provided with the number of vaccines administered each month.

Finding the cumulative vaccination count will require the calculation of a running total. This can be achieved with a window function.

Preview of Code Output (Created By Author)
Preview of Code Output (Created By Author)

Next, we need to know the population of each country.

Let’s create a table that contains this information.

Code Output (Created By Author)
Code Output (Created By Author)

Let’s use the number of vaccinations per 100k people as our evaluation metric of choice.

We can directly compare the countries by merging the CTE named "running_total" and the Population table with a join and then computing the number of vaccinations per 100k people.

Preview Of Code Output (Created By Author)
Preview Of Code Output (Created By Author)

The final step is to store this information with a second CTE and filter out the unwanted records. The result will show the total vaccinations in the latest month for each country.

Code Output (Created By Author)
Code Output (Created By Author)

Based on the evaluation metric, Scotland is doing the best job vaccinating its people (so far).

Conclusion

Photo by Prateek Katyal on Unsplash
Photo by Prateek Katyal on Unsplash

Now you have gained some insight into how SQL key terms and functions can be applied to time series data.

That being said, even if you have learned a lot from this demonstration, you will make even better progress by carrying out a similar analysis yourself.

This case study has only included countries in the UK. I invite you to take this dataset and perform a similar analysis for other countries.

I wish you the best of luck in your Data Science endeavors!

References

Preda, G. (2021). COVID-19 World Vaccination Progress, Version 221. Retrieved November 27, 2021 from https://www.kaggle.com/gpreda/covid-world-vaccination-progress.


Related Articles