A Series on Time

Introduction to Time Series Analysis — Data Wrangling and Transformation with Python

How to Prepare and Analyze Time Series Data

Tonichi Edeza
Towards Data Science
6 min read · May 19, 2021

Photo by Thought Catalog on Unsplash

Oftentimes Time Series lessons focus mainly on the application of statistical tests and the creation of forecasting models. However, I find this assumes a lot about the competency level of readers. If you are like me, then you probably only know the concept of Time Series from your undergraduate Statistics classes and YouTube videos.

In this article we shall go over how you would go about wrangling the data and performing exploratory data analysis.

Let’s Begin!

As always let us import the required Python libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr

Excellent, now let us load the data. For this article we shall make use of the Avocado Prices data by Justin Kiggins, which is available on Kaggle.

Let us load the data via Pandas.

df_av = pd.read_csv('avocado.csv')
Fig 1. The Raw Data

We can see that the data is not a simple, clean column of values; in fact, there are multiple time series variables. This will eventually allow us to engage in a richer analysis; however, let us focus first on cleaning the data.

We can see that the data has a Date column. For convenience let us assign this column to the index and get rid of the Unnamed column.

df_av = pd.read_csv('avocado.csv', index_col='Date')
df_av.drop(df_av.columns[0],axis=1,inplace=True)
df_av.head(5)
Fig 2. Cleaner Data

This data is now more usable. Let us now plot a line chart (a common visualization choice for time series data). To keep things simple, let us focus only on Total Volume.

ax = df_av['Total Volume'].plot(figsize=(20,5))
ax.set_title('Total Volume of Avocados', fontsize = 22)
ax.set_xlabel('Year', fontsize = 15)
ax.set_ylabel('Volume', fontsize = 15)
plt.grid(True)
plt.show()
Fig 3. Total Volume of Avocados over Time

We can see that the graph nicely plots the Total Volume of Avocados over the time period; however, some of you reading may think the data looks a little strange. If you thought that, then congratulations, you seem to have a pretty good intuition for the nature of Time Series data.

If we look at the data again we can actually see what is going on.
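
One quick way to confirm the suspicion is to check whether dates repeat in the index. The snippet below is a minimal sketch on a small made-up frame (the values and the `toy` name are hypothetical stand-ins, not the real avocado data); on the real data the same two calls could be run on `df_av`.

```python
import pandas as pd

# Toy stand-in for the avocado data (hypothetical values):
# the same dates repeat across regions and avocado types.
toy = pd.DataFrame(
    {
        "Total Volume": [100.0, 10.0, 250.0, 20.0, 110.0, 12.0],
        "type": ["conventional", "organic"] * 3,
        "region": ["Albany", "Albany", "SanFrancisco",
                   "SanFrancisco", "Albany", "Albany"],
    },
    index=pd.Index(
        ["2015-01-04", "2015-01-04", "2015-01-04",
         "2015-01-04", "2015-01-11", "2015-01-11"],
        name="Date",
    ),
)

# A pure time series would have exactly one row per date.
rows_per_date = toy.groupby(level="Date").size()
has_repeats = toy.index.duplicated().any()

print(rows_per_date)
print("Repeated dates:", has_repeats)
```

If any date appears more than once, a single line chart of the whole column mixes several series together.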

Fig 4. Looking at the Data Again

We can see that the data itself is not purely Time Series; there is a geographical indicator tagged as region. The data that we have is actually what Statisticians and Economists refer to as Panel Data. The data contains 54 different regions, all of them having an equal number of rows (with the noticeable exception of New Mexico).
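
A count like this can be produced with `value_counts`. The following is a minimal sketch on made-up rows (the region labels and counts are hypothetical); on the real data the equivalent call would be `df_av['region'].value_counts()`.

```python
import pandas as pd

# Toy stand-in with hypothetical region labels and row counts.
toy = pd.DataFrame(
    {"region": ["Albany"] * 3 + ["SanFrancisco"] * 3 + ["WestTexNewMexico"] * 2}
)

# Number of rows per region, largest first.
region_counts = toy["region"].value_counts()

# Number of distinct regions.
n_regions = toy["region"].nunique()

print(region_counts)
print("Distinct regions:", n_regions)
```

A region with fewer rows than the others stands out immediately at the bottom of the count.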

Fig 5. Count of all Region Values

Panel Data is a slightly different animal altogether, so for the sake of continuing let us choose a single region. In my case I chose to focus on San Francisco.
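
The filtering step can be sketched with a boolean mask. This is a minimal example on made-up rows; the label `'SanFrancisco'` is an assumption about how the region is spelled in the CSV, so check the output of `value_counts` before relying on it.

```python
import pandas as pd

# Toy stand-in for df_av (hypothetical values).
toy = pd.DataFrame(
    {
        "Total Volume": [100.0, 250.0, 110.0, 260.0],
        "region": ["Albany", "SanFrancisco", "Albany", "SanFrancisco"],
    },
    index=pd.Index(
        ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"],
        name="Date",
    ),
)

# Keep only the rows for one region.
# 'SanFrancisco' is an assumed label, not confirmed by the article.
df_sfo = toy[toy["region"] == "SanFrancisco"]

print(df_sfo)
```

The resulting frame can then be plotted with the same line chart code used earlier.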

Fig 6. Volume of Avocados in San Francisco

We can see that the data is still rather strange. This is due to there being two kinds of avocados listed in the data, Conventional and Organic. Let us plot those two on separate graphs.
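
One way to draw the two types on separate graphs is to loop over the type labels and give each its own subplot. This is a sketch on made-up numbers; `df_sfo_toy` is a hypothetical stand-in for the San Francisco subset, and the styling only loosely follows the earlier chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the San Francisco subset (hypothetical values).
df_sfo_toy = pd.DataFrame(
    {
        "Total Volume": [250.0, 20.0, 260.0, 22.0, 270.0, 25.0],
        "type": ["conventional", "organic"] * 3,
    },
    index=pd.Index(
        ["2015-01-04", "2015-01-04", "2015-01-11",
         "2015-01-11", "2015-01-18", "2015-01-18"],
        name="Date",
    ),
)

# One subplot per avocado type, sharing the x-axis.
fig, axes = plt.subplots(2, 1, figsize=(20, 10), sharex=True)
for ax, avocado_type in zip(axes, ["conventional", "organic"]):
    subset = df_sfo_toy[df_sfo_toy["type"] == avocado_type]
    subset["Total Volume"].plot(ax=ax)
    ax.set_title(f"Volume of {avocado_type.capitalize()} Avocados", fontsize=22)
    ax.set_ylabel("Volume", fontsize=15)
    ax.grid(True)
axes[-1].set_xlabel("Date", fontsize=15)
plt.tight_layout()
plt.show()
```

Separate y-axes matter here because the two types differ so much in scale that the smaller one would look flat on a shared axis.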

Fig 7. Disaggregate Avocado Volume in San Francisco

We can see that the volume of Conventional avocados is far larger than that of Organic Avocados; however, both seem to exhibit a similar general increasing trend. Let us move forward by focusing only on the volume of Conventional Avocados.

Instead of using each individual dated observation, let us look at the average volume per month. To do that we must first find a way to extract the month. So far the only date data we have is the Index data.

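# df_sfo holds the San Francisco rows; .copy() avoids a SettingWithCopyWarning when a column is added later
df_sfo_co = df_sfo[df_sfo['type'] == 'conventional'].copy()
df_sfo_co.index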
Fig 8. Index Data

We can see that the index contains the month data. To extract it let us employ a list comprehension. Note that for the purposes of the article we will extract both the year and the month. The reason will become apparent later.

[f"{i.split('-')[0]} - {i.split('-')[1]}" for i in df_sfo_co.index]
Fig 9. Extract Year-Month Label

The data can then be placed into the original dataframe.

df_sfo_co['year_month'] = [f"{i.split('-')[0]} - {i.split('-')[1]}"
                           for i in df_sfo_co.index]
df_sfo_co.head()
Fig 10. Year-Month Column Added

Now that we have attached the year_month data, let us now collapse the data by getting the monthly mean of the volume rather than the actual number.

df_sfo_co = df_sfo_co.groupby('year_month', as_index=False).agg({'Total Volume':'mean', 'year' : 'mean'})
df_sfo_co
Fig 11. Year-Month Aggregate
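
As an aside, the same monthly average can be reached without string splitting by converting the index to real timestamps. This is an alternative sketch, not the approach used in this article, shown on made-up values (`toy` and its numbers are hypothetical).

```python
import pandas as pd

# Toy stand-in with a string date index (hypothetical values).
toy = pd.DataFrame(
    {"Total Volume": [100.0, 200.0, 300.0, 500.0]},
    index=pd.Index(
        ["2015-01-04", "2015-01-11", "2015-02-01", "2015-02-08"],
        name="Date",
    ),
)

# Convert the string index to real timestamps.
toy.index = pd.to_datetime(toy.index)

# Group by calendar month (a Period such as 2015-01) and average the volume.
monthly_mean = toy.groupby(toy.index.to_period("M"))["Total Volume"].mean()

print(monthly_mean)
```

The trade-off is that the result is indexed by pandas `Period` objects rather than plain string labels, so the later steps in this article would need small adjustments to use it.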

Excellent, this looks a lot more like your typical time series data. Let us plot it.

ax = df_sfo_co['Total Volume'].plot(figsize=(20,5))
ax.set_title('Volume of Avocados', fontsize = 22)
ax.set_xlabel('Month', fontsize = 15)
ax.set_ylabel('Volume', fontsize = 15)
ax.set_xticks(np.arange(0, len(df_sfo_co['Total Volume'])+1, 2))
ax.get_yaxis().set_major_formatter(tkr.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.grid(True)
plt.show()
Fig 12. Monthly Aggregate Line Plot

Lastly, it is best to always remember that time series data does not need to be presented in the above format. We can also set the chart up so that the x-axis contains only the months, with a separate line for each year.

The wrangling for this is rather extensive but bear with me.

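df_sfo_pivot = df_sfo_co.pivot(index='year_month',
                               columns='year',
                               values='Total Volume')
# Shift each later year up so its months line up with the 2015 rows
shift_value = -12
year_list = [2016, 2017, 2018]
for i in year_list:
    df_sfo_pivot[i] = df_sfo_pivot[i].shift(shift_value)
    shift_value -= 12
# Drop the rows that are now empty for every year
df_sfo_pivot.dropna(how = 'all', inplace = True)
# Keep only the month part of the label (strip the spaces around it)
df_sfo_pivot.index = [i.split('-')[1].strip() for i in df_sfo_pivot.index]
df_sfo_pivot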
Fig 13. Wrangled and Pivoted Data
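
For comparison, a similar month-by-year table can be built without any shifting by pivoting directly on a month column. This is an alternative sketch rather than the method used above, shown on made-up values; the `monthly` frame and the `month` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the monthly aggregate (hypothetical values).
monthly = pd.DataFrame(
    {
        "year_month": ["2015 - 01", "2015 - 02", "2016 - 01", "2016 - 02"],
        "Total Volume": [100.0, 120.0, 150.0, 170.0],
        "year": [2015, 2015, 2016, 2016],
    }
)

# Pull the month out of the "YYYY - MM" label.
monthly["month"] = monthly["year_month"].str.split("-").str[1].str.strip()

# One row per month, one column per year.
by_month = monthly.pivot(index="month", columns="year", values="Total Volume")

print(by_month)
```

Because each (month, year) pair appears only once, `pivot` places every value in the right cell directly, and years with missing months simply show `NaN` there.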

Excellent, now the only task left to do is to plot it.
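
A plot of this kind can be sketched by calling `plot` on the pivoted frame, which draws one line per column. The example below uses made-up values (`pivot_toy` is a hypothetical stand-in for the pivoted data), and the styling is an assumption modeled on the earlier charts.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the pivoted data: months as rows, years as columns.
pivot_toy = pd.DataFrame(
    {2015: [100.0, 140.0, 110.0], 2016: [120.0, 170.0, 130.0]},
    index=["01", "02", "03"],
)

# DataFrame.plot draws one line per column, so each year gets its own line.
ax = pivot_toy.plot(figsize=(20, 5), marker="o")
ax.set_title("Volume of Avocados by Month", fontsize=22)
ax.set_xlabel("Month", fontsize=15)
ax.set_ylabel("Volume", fontsize=15)

# Show every month label on the x-axis.
ax.set_xticks(range(len(pivot_toy.index)))
ax.set_xticklabels(pivot_toy.index)
ax.legend(title="Year")
plt.grid(True)
plt.show()
```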

Fig 14. Stacked Line Plots

The above graph allows us to better see the differences between the years, as well as whether seasonality exists within years. From the chart above we can clearly see the presence of a spike in demand every February followed by a noticeable drop in March. There then seems to be no noticeable seasonal component until the rise in demand in December of 2016 and 2017.

In Conclusion

Time series analysis is a powerful tool for any data scientist who is involved with the spotting of trends and the analysis of markets (particularly the equities market). Though this article does not go into specific Time Series analysis techniques, I found it prudent to write about the Data Wrangling and Transformation aspects of the practice. Such a skill is crucial not just for Time Series analysis, but for everyone who works with data.
