data:image/s3,"s3://crabby-images/6ef7d/6ef7d02357949aaa747aaa7cd4a2e624f80fd6d0" alt="Photo by Carrie Beth Williams on Unsplash"
Last spring, in a course I took, I was tasked by a client to visualise some yearly historical data as line graphs. As there was very little data available per year, the client wanted to group the years into periods. Furthermore, if the year range would not split evenly with the period length selected, the last period should be left shorter.
Let me walk you through my solution
For starters, we need some data that we can divide into periods based on year column. I’ve chosen to demonstrate today with Canadian climate history data that can be found from Kaggle with Creative Commons licence.
We’ll be using Pandas, NumPy and Plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Let’s start by reading the data into a DataFrame. Then, let’s narrow our DataFrame down to only include the columns ‘LOCAL_DATE’ and ‘MEAN_TEMPERATURE_TORONTO’.
df = pd.read_csv('Canadian_climate_history.csv')
df = df[['LOCAL_DATE', 'MEAN_TEMPERATURE_TORONTO']]
To create a year column, let’s first change the ‘LOCAL_DATE’ column to datetime, its initial type is object. From a datetime type column, we can extract the year information as follows.
df['LOCAL_DATE'] = pd.to_datetime(df['LOCAL_DATE'])
df['YEAR'] = df['LOCAL_DATE'].dt.year
The resulting column is of type integer, as was in the data I had in the spring.
0 1940
1 1940
2 1940
3 1940
4 1940
...
29216 2019
29217 2019
29218 2019
29219 2019
29220 2020
Name: YEAR, Length: 29221, dtype: int64
For sake of the results, we’ll be investigating the range 1940–2019 and period length 20. With the full year range we would get a nice even split but as there is only one measurement from 2020 in the data, it’s not meaningful.
period_length = 20
start_year = 1940
end_year = 2019
df = df[(df['YEAR'] >= 1940) & (df['YEAR'] <= 2019)]
For checking if the split is even, and for determining the length of this last period, we’ll take the modulo (= how much is left after taking the last even split).
year_range = end_year - start_year
modulo = year_range % period_length
# For 1940–2019 the modulo is 19
Next, let’s find the starting and ending years for our last period. The addition of one is done to include the last year as well. Otherwise, 2019 would be left out, as it would be the starting year for the next period if there was one.
if modulo == 0:
final_start = end_year - period_length
else:
final_start = end_year - modulo
final_end = end_year + 1
Then I continued, with NumPy, to create a list of all the earlier starting years for the year range.
starts = np.arange(start_year, final_start, period_length).tolist()
Let’s then create lists of tuples, where each tuple is like (period_start, period_end). From these tuples we can finally create our bins as Pandas IntervalIndex.
tuples = [(start, start+period_length) for start in starts]
# We'll add the last period calculated earlier
tuples.append(tuple([final_start, final_end]))
bins = pd.IntervalIndex.from_tuples(tuples, closed='left')
These bins then convert to labels nicely by converting them to string. For example, ‘[1940, 1960)’ would be the first label when period length is 20. Here we see how the bin shows that the period is closed on left and open on right. For clarity, I then agreed with the client to represent them as ‘1940–1959’. I created a dictionary for easy replacement in the DataFrame.
original_labels = list(bins.astype(str))
new_labels = ['{} - {}'.format(b.strip('[)').split(', ')[0], int(b.strip('[)').split(', ')[1])-1) for b in original_labels]
label_dict = dict(zip(original_labels, new_labels))
# The label dict when year range is 1940-2019
{'[1940, 1960)': '1940 - 1959',
'[1960, 1980)': '1960 - 1979',
'[1980, 2000)': '1980 - 1999',
'[2000, 2020)': '2000 - 2019'}
Then with Pandas cut(), we can easily place the content of year column into those bins and create a new column ‘PERIOD’. Finally, the bin labels are replaced with help of the label_dict.
# Assign each row to a period
df['PERIOD'] = pd.cut(df['YEAR'], bins=bins, include_lowest=True, precision=0)
df['PERIOD'] = df['PERIOD'].astype("str")
df = df.replace(label_dict)
Lastly, we group the data by the PERIOD column. I drop the YEAR column just to make the df neater.
df = df.groupby('PERIOD').mean().reset_index().drop('YEAR', axis=1)
And now we can visualise it with Matplotlib
# Styling of the figure
fig = plt.figure()
fig.set_figwidth(12)
fig.set_figheight(5)
fig.patch.set_facecolor('white')
plt.plot(df['PERIOD'], df['MEAN_TEMPERATURE_TORONTO'], color='#004cff')
plt.title(f"20-year average of daily mean temperatures in Toronto between {start_year} and {end_year}")
plt.xlabel("Period")
plt.ylabel("Mean temperature")
# Styling of the plot
plt.grid(color = 'white', linewidth = 1)
plt.locator_params(axis='y', nbins=6)
# Styling of the axes
ax = plt.gca()
ax.set_facecolor('#e6ecf7')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(axis=u'both', which=u'both',length=0)
plt.show()
data:image/s3,"s3://crabby-images/05003/0500343e30110c27e61e74a26669508ac97f4393" alt=""
Conclusions
Creating periods for yearly data is beneficial when there is only a small number of data points available per year, or when the number of data point varies a lot between years. For clarity, I encourage labelling the periods with exact years (e.g. ‘1940–1959’) rather than as mathematical intervals with information whether limits are closed or open (e.g. ‘[1940–1960)’).
Thank you for reading! If you learned something new or enjoyed this article, follow me on Medium. I am currently working on future articles about NLP and data engineering. You can also find me on LinkedIn.
Do you have other ideas for completing this task? It would be awesome if you let me know in the comments 😊