Hands-on Tutorials

How to collect data from Statistics Canada using Python

A step by step tutorial on Statistics Canada API and python library

Rui Zhang
Towards Data Science
5 min readJan 13, 2021

--

Photo by Alex Shutin on Unsplash

Statistics Canada provides a wide range of data for public use, including data for Canada’s economy, society and environment. In addition to manually browsing the webpage to access the data, Statistics Canada also provides Web Data Service, i.e. API access to their database. What’s even better, there is a python library stats_can that wraps up the API commands and allows users to write simple codes to read data automatically. This article will provide a step by step tutorial on how to achieve that.

Identify desired data item and its vectorID

Go to Statistics Canada subpage Data, input keywords, for instance, GDP, and click on the search icon.

Locate the desired item in returned results.

On the GDP page, click on Add/Remove data.

Click on the tab Customize layout.

Scroll down and tick the option Display Vector identifier and coordinate, then click the button Apply.

Now the table will be reorganized and you will find a new column Vector, here we will select GDP for all industries, so the targeted vectorID is v65201210. This will be used later in the Python codes.

Install stats_can in Python

The library can either be installed with conda or pip:

conda install -c conda-forge stats_can

Or:

pip install stats-can

Collect the data

We will use the function stats_can.sc.vector_to_df to read data from Statistics Canada. The syntax of the function is as below:

df = stats_can.sc.vectors_to_df(vectors, periods = n, start_release_date = None, end_release_date = None)

It returns DataFrame of vectors with n periods data or over range of release dates. The function wraps up two API commands, which are get_bulk_vector_data_by_rangeand get_data_from_vectors_and_latest_n_periods to turn the resulting list of JSONs into a DataFrame.

For instance, with the vectorID v65201210 for GDP, you can run following codes to collect GDP for the last 6 months:

import stats_candf = stats_can.sc.vectors_to_df('v65201210', periods = 6)

You can rename the index and column to make the table look nicer:

df.columns = ['GDP']
df.index.names = ['Date']

the returned table looks like below:

Date        GDP        
2020-05-01 1726385.0
2020-06-01 1833032.0
2020-07-01 1879281.0
2020-08-01 1896592.0
2020-09-01 1912085.0
2020-10-01 1920056.0

An example that collects and visualizes multiple economic indexes

In a recent project on forecasting sales with synthetic data, I collected multiple eco indexes such as GDP, CPI, unemployment rate, etc. to build a multi-variant regression model for the purpose of forecasting financial transaction counts.

Firstly, I figured out vectorIDs for all eco indexes I plan to use, including GDP, CPI, TSX, unemployment rate and exchange rate (USD to CAD), and put them into a dictionary:

eco_vec_map = {'CPI':'v41690973',
'Exchange_Rate_USD':'v111666275',
'GDP':'v65201210',
'Unemployment_Rate':'v91506256',
'TSX':'v122620'}

Secondly, I converted values from the dictionary into a list and run the function, with an input of periods = 36 (i.e. the last 3 years), the function automatically recognized the vector list and returned tables with multiple columns:

vectors = list(eco_vec_map.values())
df = stats_can.sc.vectors_to_df(vectors, periods = 36)

I then renamed the columns with their eco index name, note that a mapping trick is used here, so that you won’t have to worry about the order of the columns:

inv_map = {v: k for k, v in eco_vec_map.items()}
df.columns = df.columns.to_series().map(inv_map)
df.index.names = ['Date']

The returned table looked like below:

Date        TSX       CPI    GDP       Unemployment_Rate  Exchange_Rate_USD
2017-09-01 NaN NaN 1905358.0 NaN NaN
2017-10-01 16025.59 130.9 1907384.0 NaN NaN
2017-11-01 16067.48 131.3 1917258.0 5.4 1.2769
2017-12-01 16209.13 130.8 1922552.0 5.3 1.2771
...
2020-08-01 16514.44 137.0 1904249.0 11.0 1.3222
2020-09-01 16121.38 136.9 NaN 8.4 1.3228
2020-10-01 NaN NaN NaN 8.1 1.3215

Finally, I visualized all columns of the table:

df.plot(subplots = True, figsize = (14,8), layout = (3,2));

It showed that except CPI, other eco indexes experienced a sharp decline or increase near March-April 2020, which is likely impacted by the pandemic.

Summary

This article provided detailed instructions on how to collect data using Statistics Canada’s library in Python. The major steps are:

  • Identify vectorIDs for targeted data items.
  • Install stats_can library and call functionstats_can.sc.vectors_to_df to collect data and store as a DataFrame table.

I had a difficult time finding good tutorials online so I wrote this article hoping to help others. For any questions or suggestions, don’t hesitate to reach out to me via Linkedin. I’d be happy to connect :)

References:

[1] Web Data Service User Guide, Statistics Canada. https://www.statcan.gc.ca/eng/developers/wds/user-guide#a12-3

[2] A python library for reading data from Statistics Canada.
https://stats-can.readthedocs.io/en/latest/index.html#

--

--

Data scientist with a background of oceanography, rock climbing enthusiast, immigrant in Canada. Linkedin: https://www.linkedin.com/in/rui-zhang-ocean/