Time Series Forecasting

Build Foundation for Time Series Forecasting

A Tutorial on Time Series EDA and Data Preparation using Python

Ajay Tiwari
Towards Data Science
14 min read · Jul 12, 2020


Background

Time series forecasting is an important area of machine learning. There are many use cases across industries that involve a time component, such as a retailer forecasting demand for next week, call-volume forecasting for workforce optimization, predicting energy consumption, and weather forecasting for proactive disaster management. Time series analysis and forecasting can also be used for anomaly detection.

What is time series forecasting, and how is it different from other machine learning problems?

In both types of problems, time plays a role: historical data is used to train a model that predicts the future, and both kinds of datasets are collections of observations. So, what is the difference?

In a normal machine learning dataset, all historical observations are treated equally, whereas a time series is a sequence of observations captured sequentially in time. Time therefore adds an order dependence between observations.

Let’s understand some common notations.

  • t-n: A prior time, or lag; e.g., t-1 is the previous time step, also known as a lag of 1
  • t: The current time
  • t+n: A future time; e.g., t+1 is the next time step to be forecast

Time Series Components

In any machine learning problem, we start with an exploratory analysis to understand the data better, which helps in choosing appropriate algorithms. Similarly, in time series analysis we decompose the series into four constituent parts: level, trend, seasonality, and noise.

Level, trend, and seasonality are categorized as systematic components, as they characterize the underlying data with consistent patterns and can be modeled, whereas noise is a non-systematic component: random variation that cannot be modeled directly.

Let’s have a look at these four components.

  1. Level: The average value of the series.
  2. Trend: The change in the series between two adjacent periods. This component is optional and not necessarily present in every series.
  3. Seasonality: A short-term cyclic behavior over time. This component is also optional.
  4. Noise: The random variation that cannot be explained by the model, present in every series to some extent.

A series is considered to be a combination of these four components, and they can be combined either additively or multiplicatively. Let’s understand this with the following two models.

(Figure: examples of common trend and seasonality patterns.)
  • Additive model: yₜ = Level + Trend + Seasonality + Noise
  • Multiplicative model: yₜ = Level x Trend x Seasonality x Noise

How do forecasting methods work?

These methods attempt to isolate the systematic part and quantify the non-systematic part, i.e., noise. The systematic part helps in generating point forecasts, whereas the level of noise helps in estimating associated uncertainties.

We have gone through a brief introduction to time series. In the next sections, we will discuss various techniques to dissect a time series into its systematic and non-systematic parts and to detect initial patterns.

Time series data preparation and analysis

The first and most important step in forecasting is characterizing the nature of the time series and investigating potential problems that must be addressed before applying any forecasting method.

We are going to discuss all these steps one by one through some open-source datasets and some randomly generated datasets. We will be using the Jupyter notebook in Google Colab.

  1. Load and Explore Time Series Data
  2. Feature Engineering
  3. Data Visualization
  4. Resampling
  5. Power Transforms
  6. Exploring temporal structure

To begin with, let’s download the datasets, open a Jupyter notebook, and import the Python libraries shown below.
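A minimal setup sketch: the examples in this tutorial use the Minimum Daily Temperatures, monthly Shampoo Sales, and monthly Airline Passengers datasets, assumed here to be the copies hosted in Jason Brownlee's public Datasets repository on GitHub (substitute your own paths if needed). All later sketches assume these imports.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```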

1. Load and Explore Time Series Data

Load a Time Series Data

Often we load datasets as pandas data frames; here we can use the read_csv() function to load the time series data as a Series object, a one-dimensional array with a time label for each row.

We should not forget to specify a few parameters to ensure the data is loaded correctly as a Series. Let’s look at these parameters below; a loading sketch follows the list.

  • header: 0 tells pandas that the header information is in the first row.
  • parse_dates: True tells pandas that the first column contains dates that need to be parsed. There are always weird formats that need to be handled manually; in such cases, supplying a function via the date_parser parameter is the better approach.
  • index_col: 0 tells pandas that the first column, the time column, contains our index information.
  • squeeze: True tells pandas that we only have one data column and want it as a Series.
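A minimal loading sketch with these parameters; the URL is an assumption (the copy in Jason Brownlee's Datasets repository). Note that the squeeze=True argument was removed from read_csv() in pandas 2.0; calling .squeeze("columns") on the loaded frame is the equivalent.

```python
url = ("https://raw.githubusercontent.com/jbrownlee/Datasets/"
       "master/daily-min-temperatures.csv")
series = pd.read_csv(url, header=0, index_col=0, parse_dates=True)
series = series.squeeze("columns")  # single data column -> Series
print(type(series))  # <class 'pandas.core.series.Series'>
```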

Explore a Time Series Data

After loading, it’s recommended to have a glance at the data. We can use the head() function to look at the first five records, or specify the first n records to view.

It’s always a good idea to validate the number of observations in the given series to avoid errors.

We can slice and dice the time series by querying different time intervals. For example, let’s take a look at all the observations from January 1981.

As in other machine learning problems, calculating and reviewing summary statistics is an important step in time series data exploration as well; it gives us an idea about the distribution and spread of the values. The describe() function calculates these statistics for us.
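A sketch of these exploration steps on the temperature Series loaded above:

```python
print(series.head())      # first five records
print(series.head(10))    # or the first n records
print(len(series))        # validate the number of observations
print(series["1981-01"])  # all observations from January 1981
print(series.describe())  # count, mean, std, min, quartiles, max
```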

2. Feature Engineering

Time series data must be transformed into a supervised learning dataset before it can be modeled with standard machine learning algorithms. As we have seen above, a time series has no explicit dependent and independent variables; we have to construct a dataset with a target variable to be predicted and input variables to be used for prediction.

Creating lag Features

This is the classical approach for transforming a time series forecasting problem into a supervised learning problem: the value at time (t) is used to predict the value at the next time (t+1).

In pandas, lag features can be created with the shift() function: shifting the dataset by 1 creates the column t, while the original, unshifted series represents t+1.
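A sketch of the lag-1 construction, following the column naming above:

```python
df = pd.concat([series.shift(1), series], axis=1)
df.columns = ["t", "t+1"]  # shifted column = input, original = output
print(df.head())
```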

We can see that the previous time step (t) is the input (X) and the next time step (t+1) is the output (y) in our supervised learning problem. This concept is known as the sliding window method. Note that there is no input for the first observation, and the last observation has no known output (y); these rows have to be discarded.

Similarly, we can create multiple lag features as below.
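A sketch with three lag features, using the same labeling convention:

```python
df = pd.concat(
    [series.shift(3), series.shift(2), series.shift(1), series], axis=1
)
df.columns = ["t-2", "t-1", "t", "t+1"]
print(df.head())
```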

Using summary statistics as features

Additionally, we can use summary statistics of a few lagged values as features, such as the mean of the previous few values. This can be achieved with the rolling() function.
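A sketch using a rolling window of size 2 over the shifted series, so the feature is the mean of the two previous values:

```python
shifted = series.shift(1)
df = pd.concat([shifted.rolling(window=2).mean(), series], axis=1)
df.columns = ["mean(t-1,t)", "t+1"]
print(df.head())
```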

Alternatively, we can use summary statistics of all prior values at each time step as features in our forecasting model. These features can be created with the expanding() function.
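A sketch of expanding-window features (min, mean, and max over all prior values):

```python
window = series.shift(1).expanding()
df = pd.concat([window.min(), window.mean(), window.max(), series], axis=1)
df.columns = ["min", "mean", "max", "t+1"]
print(df.head())
```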

Date and Time as features

As in any other supervised learning dataset, many date-related features can be derived from the timestamps, such as the hour of the day, day of the week, month, quarter, weekday/weekend flags, public holidays, etc., which we often find very useful.
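A sketch deriving a few calendar features from the DatetimeIndex; the weekend flag is an illustrative choice, not from the original:

```python
df = pd.DataFrame({"temperature": series})
df["month"] = df.index.month
df["dayofweek"] = df.index.dayofweek
df["quarter"] = df.index.quarter
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
print(df.head())
```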

3. Data Visualization

Time series naturally lend themselves to popular visualizations; we have all seen line plots of stock market movements even before being introduced to data science. But there are many more visualizations beyond the line plot, and we will learn these tools in this section.

Line plot

A line plot is perhaps the most popular visualization for time series. In this plot, time is shown on the x-axis and observation values along the y-axis.
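For the temperature Series loaded earlier, this is a single call:

```python
series.plot()
plt.show()
```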

It is often helpful to compare line plots of the same variable over different time intervals, such as day-to-day, month-to-month, or year-to-year. In the example below, we compare the minimum daily temperature across 10 years, 365 days at a time: let’s group the data by year and create a line plot for each year.
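A sketch of the year-by-year comparison; truncating leap years to 365 days so the columns align is an implementation choice, not from the original:

```python
groups = series.groupby(series.index.year)
years = pd.DataFrame({year: group.values[:365] for year, group in groups})
years.plot(subplots=True, legend=False)
plt.show()
```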

Histogram and Density Plot

Another important visualization is the distribution of the observations. Some linear time series forecasting methods assume normally distributed observations. We can check this by plotting histograms and density plots of the raw observations and, if required, repeating the check on transformed variants such as the log of the original values.
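Both plots are one-liners in pandas (the density plot needs SciPy installed):

```python
series.hist()
plt.show()
series.plot(kind="kde")
plt.show()
```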

Box and Whisker Plot

Another type of plot that is useful to summarize the distribution of observations is the box and whisker plot. This plot helps in detecting any outliers in the series.

We can place these box plots side by side, one per year, to compare each time interval in the series.
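A sketch reusing the per-year frame built for the yearly line plots above:

```python
years.boxplot()  # one box-and-whisker plot per year, side by side
plt.show()
```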

Heat map

A heat map is an amazing visualization tool; we can use it to visualize the distribution of temperatures as colors. It’s quite self-explanatory: larger values are displayed with warmer colors (yellow) and smaller values with cooler colors (green).
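A sketch plotting the per-year matrix as an image, with years as rows and days of the year as columns:

```python
plt.matshow(years.T, interpolation=None, aspect="auto")
plt.show()
```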

Lag Scatter plot

Time series modeling assumes a relationship between an observation and the previous observation, also called a lag of 1. We can show this relationship with a scatter plot.

Often it’s helpful to analyze the relationship at multiple lag values. We can run the same plot in a loop for several lags, as below.
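A sketch of a lag-1 scatter plot, followed by a loop over the first seven lags:

```python
from pandas.plotting import lag_plot

lag_plot(series)  # current value vs. value at lag 1
plt.show()

for k in range(1, 8):
    plt.figure()
    lag_plot(series, lag=k)
    plt.title(f"lag={k}")
plt.show()
```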

Autocorrelation Plot

The strength of the correlation between observations and their lags can be quantified with an autocorrelation plot. In time series this is also called self-correlation, since we compute the correlation of the series against its own lagged values.
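pandas ships a ready-made autocorrelation plot:

```python
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(series)
plt.show()
```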

4. Resampling time series

Sometimes our observations are not at the right frequency; they may be at a higher or lower frequency than the desired forecast frequency. For example, a business may need a daily forecast but have hourly or monthly observations. In such scenarios, the following two techniques can help us correct the frequency to match the business goal.

Upsampling

This is the process of increasing the frequency of the samples, such as from weekly to daily. In the following example, we interpolate monthly data to daily. The pandas Series object provides an interpolate() function with a nice selection of simple and more complex methods; here we use the linear method.

Let’s visualize the linearly interpolated data.
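A sketch assuming the monthly Shampoo Sales dataset; the URL and the date-fixing step are assumptions (the file stores months as "1-01", "1-02", ..., mapped here to 1901-1903):

```python
url = ("https://raw.githubusercontent.com/jbrownlee/Datasets/"
       "master/shampoo.csv")
sales = pd.read_csv(url, header=0, index_col=0).squeeze("columns")
sales.index = pd.to_datetime(["190" + s for s in sales.index],
                             format="%Y-%m")

# Upsample from monthly to daily and interpolate linearly.
upsampled = sales.resample("D").interpolate(method="linear")
upsampled.plot()
plt.show()
```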

We can try another popular interpolation method, a polynomial or a spline, to connect the values. We have to specify the order of the polynomial.
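The same resample with a second-order spline (needs SciPy installed):

```python
upsampled = sales.resample("D").interpolate(method="spline", order=2)
upsampled.plot()
plt.show()
```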

We can visualize the new plot; it has more curves and looks more natural.

Downsampling

Downsampling decreases the frequency of the samples, such as from daily to weekly. Here the sales data is monthly, but the business needs quarterly forecasts; let’s resample the data according to business needs.

Visualize the resampled data.
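A sketch of the quarterly aggregation:

```python
quarterly = sales.resample("Q").sum()  # recent pandas also accepts "QE"
quarterly.plot()
plt.show()
```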

5. Power Transforms

Data transformation is a commonly used approach in machine learning problems to refine raw features to improve their significance. Similarly, in time series forecasting we remove noise and improve signals using various mathematical transforms.

In this section first, we will discuss these techniques using randomly generated data and then will apply the technique on open-source data.

Polynomial Transform

A time series that has a quadratic or cubic growth trend can be made linear by transforming the raw data to its square root or cube root.

Let’s randomly generate a series with a cubic function to check the transformation effect.

Now, transform this data by taking its cube root; we can observe that the trend becomes linear and the distribution more uniform.
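A sketch of both steps: generate a cubic-growth series, then take the cube root:

```python
cubic = pd.Series([i ** 3 for i in range(1, 100)], dtype=float)
cubic.plot()
plt.show()

transformed = cubic ** (1.0 / 3.0)
transformed.plot()  # the trend is now a straight line
plt.show()
transformed.hist()  # and the histogram roughly uniform
plt.show()
```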

Logarithmic Transform

Sometimes we come across data with a more extreme, exponential trend; such time series can be made linear by taking the log of the raw values. This is called the log transform.

Let’s generate a series with exponential growth using the following code.
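A sketch of the exponential series:

```python
exponential = pd.Series([np.exp(i) for i in np.linspace(0, 5, 100)])
exponential.plot()
plt.show()
```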

Once again, we can transform this data back to linear by taking the natural logarithm of the raw values.
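```python
log_transformed = np.log(exponential)
log_transformed.plot()  # the trend is linear again
plt.show()
```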

Box-Cox

In real life we often come across observations without a clear trend, and different transformations have to be tested before settling on an appropriate one. Luckily, the Box-Cox transform is a statistical technique that analyses the given series and automatically performs the most appropriate power transformation according to a parameter lambda. These are some common values for lambda and their respective transformations.

Common Box-Cox Transformations
  • lambda = -1.0: reciprocal transform, 1/y
  • lambda = -0.5: reciprocal square root transform, 1/sqrt(y)
  • lambda = 0.0: log transform, log(y)
  • lambda = 0.5: square root transform, sqrt(y)
  • lambda = 1.0: no transform

First, we will transform manually based on our intuition; next, we will use the Box-Cox transformation and compare the difference.

Manual transformation

After visualizing the raw line plot, we assume a log transform is appropriate and specify lambda = 0. We can then take a look at the log-transformed time series plot.
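A sketch using SciPy's boxcox() with lmbda=0.0 (the log transform), assuming the Airline Passengers dataset; the URL is an assumption:

```python
from scipy.stats import boxcox

url = ("https://raw.githubusercontent.com/jbrownlee/Datasets/"
       "master/airline-passengers.csv")
passengers = pd.read_csv(url, header=0, index_col=0, parse_dates=True)
passengers = passengers.squeeze("columns")

manual = pd.Series(boxcox(passengers.values, lmbda=0.0),
                   index=passengers.index)
manual.plot()
plt.show()
manual.hist()
plt.show()
```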

Box-Cox Transformation

Now, we will rely on Box-Cox to select an ideal lambda and transform the series accordingly.
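Leaving lmbda unset makes boxcox() estimate it by maximum likelihood and return it alongside the transformed data:

```python
transformed, lam = boxcox(passengers.values)
print(f"Chosen lambda: {lam:.3f}")
auto = pd.Series(transformed, index=passengers.index)
auto.plot()
plt.show()
auto.hist()
plt.show()
```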

We can see that the chosen lambda value is close to 0 and the plot looks similar to the manual transformation, but looking closely at the histogram, the new distribution appears more normal.

6. Exploring Temporal Structure

We have explored and prepared our time series data and performed the necessary transformations; now it’s time to investigate the temporal structure, including the predictive power of the data.

White Noise

This is an important concept in time series forecasting. If a series is white noise, it is just a sequence of random numbers and cannot be predicted.

The following conditions have to be investigated to detect white noise.

  • Check whether the series has a zero mean
  • Check whether the variance is constant over time
  • Check whether the correlation with lagged values is zero

Let’s explore the statistical tools which can help us in detecting white noise.

  1. Summary statistics: Check and compare the mean and variance of the entire time series against different time intervals.
  2. Line plot: Line plot will give us a basic idea about inconsistent mean and variance over time.
  3. Autocorrelation plot: Check the strength of the correlation between lagged observations.

We will discuss the above tools using a synthetically created white noise series, generated from a Gaussian distribution with mean 0 and standard deviation 1.
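A sketch that generates the series and produces all three diagnostics discussed below (the random seed is arbitrary):

```python
rng = np.random.default_rng(1)
noise = pd.Series(rng.normal(loc=0.0, scale=1.0, size=1000))

print(noise.describe())      # mean ~0, std ~1
noise.plot()                 # line plot: no visible structure
plt.show()
noise.hist()                 # bell-shaped histogram
plt.show()
autocorrelation_plot(noise)  # correlations hover around zero
plt.show()
```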

Now, look at the summary statistics: the mean is nearly 0 and the standard deviation is nearly 1, as expected in this example. In real life, after this summary check, the series can be split into multiple sub-series and their statistics compared for any inconsistency in mean or variance over time.

Next, let’s look at the line plot of this data. It doesn’t look like a typical time series; it is just a collection of random numbers.

The histogram’s bell-shaped curve confirms a Gaussian distribution with mean 0 and variance 1. Again, we can validate this distribution over different time intervals.

Finally, we can investigate the strength of the correlation between lagged observations using the autocorrelation plot.

We can see there is no correlation between lagged observations. Therefore, we conclude that this series is white noise and cannot be predicted.

Random Walk

Let’s discuss another concept that can help us understand the predictability of a time series. A random walk is a series in which the changes from one time step (t) to the next (t+1) are random. There is a common misconception that a random walk is simply a sequence of random numbers like white noise; let’s see how it differs.

The process that generates a random walk forces a dependency from one time step to the next. We can see this dependency in the following equation, where X(t) is the next value in the series, X(t-1) is the value at the previous time step, and e(t) is white noise at the next time step.

X(t) = X(t-1) + e(t)

Let’s understand this on a number line (courtesy of the MIT website).

The simplest random walk to understand is a 1-dimensional walk. Suppose that the black dot below is sitting on a number line. The black dot starts in the center.

https://www.mit.edu/~kardar/teaching/projects/chemotaxis(AndreaSchmidt)/random.htm

Then, it takes a step, either forward or backward, with equal probability. It keeps taking steps either forward or backward each time. Let’s call the 1st step a₁, the second step a₂, the third step a₃, and so on. Each “a” is either equal to +1 (if the step is forward) or -1 (if the step is backward). The picture below shows a black dot that has taken 5 steps and ended up at -1 on the number line.

https://www.mit.edu/~kardar/teaching/projects/chemotaxis(AndreaSchmidt)/random.htm

Now, we will simulate a random walk using the same approach and plot these observations on a line.
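A sketch of the simulation: draw +/-1 steps with equal probability and accumulate them:

```python
rng = np.random.default_rng(1)
steps = rng.choice([-1, 1], size=1000)
walk = pd.Series(steps.cumsum())
walk.plot()
plt.show()
```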

We can see that this shape looks like the movement of a real stock index.

Now, let’s explore the tools that can help us identify a random walk within a time series.

Autocorrelation plot

We know how the random walk is created, so by design we expect a very high correlation with the previous observation that decreases gradually at higher lags.
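The autocorrelation plot makes this decay visible:

```python
autocorrelation_plot(walk)  # near 1 at short lags, decaying slowly
plt.show()
```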

Augmented Dickey-Fuller test to confirm non-stationarity in the series

Given the way the random walk is constructed, the series is expected to be non-stationary, i.e., its mean and variance are inconsistent over time.
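A sketch of the test with statsmodels:

```python
from statsmodels.tsa.stattools import adfuller

result = adfuller(walk)
print(f"ADF statistic: {result[0]:.3f}")
print(f"p-value: {result[1]:.3f}")
for key, value in result[4].items():
    print(f"Critical value ({key}): {value:.3f}")
```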

The null hypothesis of this test is that the time series is non-stationary. The test statistic comes out well above the critical values, meaning we fail to reject the null hypothesis: the time series is non-stationary.

We can make the time series stationary by taking the first difference, and then analyze its characteristics again.
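First differencing recovers the +/-1 step series:

```python
diff = walk.diff().dropna()
diff.plot()
plt.show()
```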

The differenced line plot also suggests that there is no information to work with other than a series of random numbers.

Finally, we can see the pattern through an autocorrelation plot and confirm no relationship between the lagged observations.
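```python
autocorrelation_plot(diff)  # no significant correlation at any lag
plt.show()
```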

We have seen that a random walk, like white noise, is unpredictable; the best prediction we can make is to use the value at the previous time step to predict the next one, since we know the next step is a function of the prior step. This forecasting approach is also known as the naive forecast.

Decompose Time Series

We discussed the four components of a time series (level, trend, seasonality, and noise) at the beginning of this article; in this section, we will explore an automatic decomposition tool.

As we know, a time series is a combination of these four components; through decomposition we break the series into its individual components to understand the data better and choose the right forecasting approach.

The statsmodels library provides an implementation of the naive, or classical, decomposition method in a function called seasonal_decompose(). It requires us to specify whether the model is additive or multiplicative, which we should decide after a careful examination of the line plot.

Additive decomposition

We can randomly generate a time series with a linearly increasing trend and decompose it as an additive model. In the resulting plots, the trend component is the most dominant and there is no seasonality; the residual plot shows almost zero noise. Our simple method may not have separated the noise, so a more advanced approach such as STL decomposition could be tested.
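A minimal sketch of such a decomposition; the daily index, uniform noise, and 7-day period are illustrative assumptions:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=100, freq="D")
rng = np.random.default_rng(1)
trend_series = pd.Series(np.arange(100, dtype=float)
                         + rng.uniform(0, 10, size=100), index=idx)

result = seasonal_decompose(trend_series, model="additive", period=7)
result.plot()
plt.show()
```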

Multiplicative decomposition

The example below decomposes the airline passenger dataset as a multiplicative model.
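A sketch reusing the passengers Series loaded in the Box-Cox section:

```python
result = seasonal_decompose(passengers, model="multiplicative", period=12)
result.plot()
plt.show()
```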

We can see that the trend and seasonality information extracted from the series seems reasonable, and noise is present in the data: periods of high variability can be observed in the early and later years of the series.

Summary

We have tried to build a strong foundation for time series forecasting problems. We learned the basics of time series, discovered insights and discrepancies through visualizations, corrected some of those discrepancies, and refined the series through feature engineering and power transforms for a more accurate forecast.

Though we demonstrated all the techniques on a univariate time series, they can easily be applied to multivariate time series forecasting problems.

Thanks for reading; I hope you found the article informative. In the next few articles, I will discuss different forecasting techniques, starting with the classical ones.

References

[1] Galit Shmueli and Kenneth Lichtendahl, Practical Time Series Forecasting with R: A Hands-On Guide, 2016.

[2] Jason Brownlee, Machine Learning Mastery, https://machinelearningmastery.com/.
