
A thorough guide to Time Series Analysis

Understand the components of Time-series data. Apply machine learning & statistical models to real-life data.

Photo by Aron Visuals from Unsplash

This article will guide you through the following parts:

  1. What is time-series data?
  2. The components of time-series data.
  3. What is time series analysis used for?
  4. The most used time series forecasting methods (statistical and machine learning).
  5. An end-to-end example using a machine learning model to predict climate data.

Without further ado, let’s get started!


1. What is time-series data?

_A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time._ In plain language, time-series data is a dataset that tracks a sample over time and is collected at regular intervals. Examples include commodity prices, stock prices, house prices over time, weather records, company sales data, and patient health metrics such as ECGs. Time-series data is everywhere in our lives, so the ability to analyze it is a crucial skill for a data scientist. It’s also genuinely fun to work with.

Patients’ ECG data (Image from the MIMIC-III Waveform Database)

2. The components of time-series data

Most time-series data can be decomposed into three components: trend, seasonality and noise.

Trend – The data shows a long-term movement, whether upward or downward. It may be caused by population growth, inflation, environmental change, or the adoption of new technology. Examples include the long-term rise of the US stock market over the past ten years, the growth of real estate markets in most parts of the world over the past year, and the steady increase in human life expectancy.

The trend in avocado price in the US in the past five years (Image generated from the Prophet model by the author)

Seasonality – The data shows calendar-related effects, whether weekly, monthly, or yearly, and these patterns are domain-specific. For example, most e-commerce platforms see sales rise around December because of Christmas. In contrast, in Canadian real estate, more houses are sold in the summer than in the winter, because people are reluctant to move during the cold season.

Temperature data in Delhi with a strong seasonality (Image by the author)

Noise – Noise is also known as the residual or irregular component. It’s what remains after the trend and seasonality are removed: short-term fluctuation that is not predictable. Sometimes noise dominates the trend and seasonality, which makes that kind of time-series data much harder to forecast. Stock prices are a clear example.

White noise is the extreme case: a series with no trend or seasonality at all. It is therefore nearly impossible to predict, and it is a kind of stationary time series.

White noise (Image by Morn from Wiki Media Commons)
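
If you want to see these components for yourself, a classical decomposition takes only a few lines of code. Below is a minimal sketch (not from the original article) using seasonal_decompose from statsmodels on a synthetic monthly series; the toy data and the additive-model choice are assumptions for illustration.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Toy monthly series standing in for real data: linear trend + yearly seasonality + noise
idx = pd.date_range('2015-01-01', periods=48, freq='MS')
values = np.linspace(10, 20, 48) + 5 * np.sin(2 * np.pi * np.arange(48) / 12) + np.random.randn(48)
series = pd.Series(values, index=idx)

# Additive decomposition; the yearly period (12) is inferred from the monthly index
result = seasonal_decompose(series, model='additive')
result.plot()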

3. What is time series analysis used for?

Time series analysis has different use cases across multiple industries. As a general rule of thumb, it can be used in the following scenarios:

  • Predicting future values based on historical data, such as housing prices, sales, and stock prices.
  • Identifying outliers or fluctuations in economic, business, or health metrics, also known as anomaly detection. Examples include detecting changepoints when an economy is affected by geopolitical events, or spotting irregularities in patients’ vital signs.
  • Pattern recognition, signal processing, weather forecasting, earthquake prediction, etc.

4. The most used time series forecasting methods

There are a handful of time series forecasting models in the literature. In this article, I will introduce the most widely used ones: Facebook Prophet, a deep neural network model called LSTM, and ARIMA.

Facebook Prophet

As stated in the documentation, Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. In other words, Prophet accounts for all the components mentioned above, trend, seasonality, and noise, plus holiday effects, and combines them in an additive model.
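
Concretely, the additive model expresses a forecast as the sum of interpretable components:

y(t) = g(t) + s(t) + h(t) + ε(t)

where g(t) is the trend, s(t) the periodic seasonality, h(t) the holiday effects, and ε(t) the noise term.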

Prophet has both a Python API and an R API, and it’s pretty easy to implement and make forecasts with.

We are going to apply this model to predict the temperature in India in the next section.

LSTM (Long Short-Term Memory)

LSTM is a kind of Recurrent Neural Network (RNN) that is good at handling sequence data; it is widely used in machine translation and speech recognition. If you are already familiar with the structure of an RNN: compared with a vanilla RNN, which is bad at remembering long-term dependencies, an LSTM adds three special gates to each of its cells so it can retain both long-term and short-term memory.

The structure of each cell of LSTM (source)
Notation in the previous image (source)

You can also refer to [this blog](https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/) to better understand LSTM, and this blog to get your hands dirty forecasting time-series data with an LSTM in Python on an international airline passengers prediction problem.
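
To make the idea concrete, here is a minimal sketch (not from the original article) of framing a univariate series for a small Keras LSTM; the toy sine-wave data, window size, and layer sizes are all illustrative assumptions.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, window=3):
    # Turn [x0, x1, x2, ...] into (samples, window, 1) inputs with next-step targets
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X.reshape(-1, window, 1), y

series = np.sin(np.linspace(0, 20, 200))  # toy data standing in for real observations
X, y = make_windows(series)

model = Sequential([LSTM(16, input_shape=(3, 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, verbose=0)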

ARIMA

ARIMA is a statistical method whose name is short for AutoRegressive Integrated Moving Average. AutoRegressive means the model uses the dependent relationship between an observation and some number of lagged observations. Integrated means differencing the raw observations (e.g. subtracting an observation from the observation at the previous time step) to make the time series stationary. Moving Average means the model uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations. It may sound a bit confusing; we will not dive deep into this method in this article, but you can turn to this blog for a better understanding. It gives a comprehensive introduction to ARIMA and shows how to implement it in Python on a shampoo sales dataset.
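
As a quick taste, here is a minimal sketch (not from the original article) of fitting an ARIMA model with statsmodels; the toy random-walk data and the (p, d, q) order are illustrative assumptions rather than tuned values.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.randn(200))  # toy random-walk data
model = ARIMA(series, order=(1, 1, 1))    # p=1 AR lag, d=1 difference, q=1 MA term
fitted = model.fit()
print(fitted.forecast(steps=5))           # forecast the next five values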

5. Using Facebook Prophet to predict the daily mean temperature in India

Great! Now you have a better understanding of what time-series data is, what it is composed of, what it is used for, and the most commonly used forecasting models. It’s time to play around with some real-life data and start predicting! You can access the notebook through this git repo.

The dataset we are using provides training and testing climate data from 1st January 2013 to 24th April 2017 in Delhi, India, with four features: meantemp, humidity, wind_speed, and meanpressure. It was collected from the Weather Underground API and published on Kaggle.

We will be using Google Colab as our Python notebook environment. Let’s first install the packages we need.

pip install pystan==2.19.1.1 prophet

We are also installing an unfamiliar package named pystan. Under the hood, Prophet uses Stan for optimization (and sampling, if the user desires). Stan is a platform for statistical modelling and high-performance statistical computation, and pystan is the Python interface to it.

Import necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet
from datetime import datetime
from prophet.plot import plot_plotly, plot_components_plotly

Then let’s download the data from Kaggle API. You can refer to my other post on how to download data from Kaggle to Google Colab. I will post the code here for reference:

!pip install kaggle
from google.colab import drive
drive.mount('/content/gdrive')
import os
# The folder TimeSeries is where your kaggle.json file is
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/TimeSeries"
!kaggle datasets download -d sumanthvrao/daily-climate-time-series-data
!unzip *.zip && rm *.zip

After unzipping the downloaded archive, we find a training dataset along with a testing dataset. We will use the training set to fit our Prophet model and the testing set to measure the model’s accuracy.

df = pd.read_csv('/content/DailyDelhiClimateTrain.csv')
df_test = pd.read_csv('/content/DailyDelhiClimateTest.csv')

Both the training and testing datasets have five columns, and we can use any of the features as the label for forecasting; as an example, we’ll use meantemp for this tutorial. Prophet just needs two columns with specific names: ‘ds’ and ‘y’. So the next step is to select those columns and rename them. Be wary of one duplicate: the date 2017-01-01 exists in both sets. You can delete it from either dataset; in my case, I deleted it from the training set and treated it as a day to be forecast.

# Keep only the date and mean temperature, renamed to Prophet's expected 'ds' and 'y'
df = df[['date', 'meantemp']]
df = df.rename(columns={'date': 'ds', 'meantemp': 'y'})
df['ds'] = pd.to_datetime(df.ds)
# Delete the last row (2017-01-01), which also appears in the test set
df = df[:-1]
df.head()

This is what the data looks like now, ready to be fed into the Prophet model.

The first five rows of the training data for Prophet (Image by the author)

Let’s initialize the model and forecast! We will forecast 114 days into the future, since the testing data contains the true values for exactly those days.

# Initialize model
m = Prophet(interval_width=0.95, yearly_seasonality=True, weekly_seasonality=True, daily_seasonality=True)
# Add monthly seasonality to the model
m.add_seasonality(name='monthly', period=30.5, fourier_order=5, prior_scale=0.02)
# Fit the model with training data and make prediction
m.fit(df)
future = m.make_future_dataframe(periods=114)
forecast = m.predict(future)
fig = m.plot_components(forecast)

Components of the temperature time-series data (Image by the author)

figure = m.plot(forecast)

Prophet forecasted figure (Image by the author)

Let me explain the graph above a bit. The black dots are the historical data, which we pulled from the Kaggle API. The dark blue line is the forecast. The light blue upper and lower bounds form the 95% uncertainty interval (the interval_width=0.95 we set when initializing the model). You can see the blue line extending beyond the black dots: that extension is the 114-day forecast.

You can also use Plotly to generate an interactive graph for a better understanding of the data points.

plot_plotly(m, forecast)

Interactive graph generated by Plotly (Image by the author)

All set! Let’s see how accurate Prophet is!

We will check the accuracy using the R-squared metric. Luckily, the sklearn library provides it out of the box. Let’s use it directly!
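
For reference, R-squared measures the proportion of variance in the true values that the predictions explain:

R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²

where yᵢ are the true values, ŷᵢ the predictions, and ȳ the mean of the true values. A score of 1 is a perfect fit, while a score near 0 means the model does no better than always predicting the mean.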

from sklearn.metrics import r2_score
y_true = df_test.meantemp
y_pred = forecast.yhat.tail(114)
r2_score(y_true, y_pred)

R-squared between the prediction and the real values (Image by the author)

The accuracy is not bad! You should be able to achieve an even better result by tuning the Prophet model’s parameters: for instance, adding changepoints and holidays, or adjusting the seasonality settings.
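
As a rough sketch of what such tuning might look like (the holiday dates and parameter values below are illustrative assumptions, not tuned choices from this article):

# Hypothetical example: add Diwali as a holiday effect and loosen the trend
holidays = pd.DataFrame({
    'holiday': 'diwali',
    'ds': pd.to_datetime(['2015-11-11', '2016-10-30']),  # example dates, verify before use
    'lower_window': -1,   # include the day before each holiday
    'upper_window': 1,    # and the day after
})
m = Prophet(interval_width=0.95,
            changepoint_prior_scale=0.1,   # more flexible trend changepoints
            seasonality_prior_scale=5.0,   # regularize seasonality strength
            holidays=holidays)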

The next thing we could do is build a web application and deploy the model to the cloud. But we have already covered a lot today, so I won’t elaborate on deployment here. If you are interested, please leave a comment below, and I will write another post about deploying a machine learning model as a dynamic web application using the Django + jQuery + Heroku + VSCode stack.

Thanks for reading. Let me know if you have any thoughts or advice, and I’m happy to connect on LinkedIn.

