
Fast Time Series Forecasting with StatsForecast

Forecast univariate time series lightning-fast with Nixtla's StatsForecast package

Photo by Pixabay.

StatsForecast is a package that comes with a collection of statistical and econometric models for forecasting univariate time series. It works well even with large numbers of time series and claims to be not only 20x faster than the well-known pmdarima package but also 500x faster than Facebook's Prophet.

This article gives you a first overview of the StatsForecast package and how to use it. To demonstrate both its speed and its usage with single and multiple time series, we will work with two data sets.

We’ll use the Australian total wine sales data set for the single time series; it will serve to compare the auto_arima function of StatsForecast with the one from the well-known pmdarima package. The second data set is an excerpt of the M4 data set, containing 1,476 time series. The idea behind the second example is to show you how to prepare your data to predict multiple time series with the package.

Why is it so fast?

Before we start, you might be skeptical and wonder what its secret for being so fast is. There are two main reasons.

First, StatsForecast uses Numba. Numba is a Just-In-Time (JIT) compiler for Python that works particularly well with NumPy code, translating constructs such as array operations and algebraic functions into fast machine code.

Second, it also uses parallel computing, which shows its advantages when dealing with multiple time series.

Getting started and prerequisites

To install StatsForecast with pip, run:

pip install statsforecast

To install it with conda run:

conda install -c nixtla statsforecast

Data structure

StatsForecast needs to have the time series data in a particular structure:

+-----------+----------+-------+
| unique_id |    ds    |   y   |
+-----------+----------+-------+
|     0     | Jan 2021 |  100  |
|     0     | Feb 2021 |  200  |
|     0     | Mar 2021 |  150  |
+-----------+----------+-------+
  • An index column called unique_id
  • A ds column that contains the date or a numerical value
  • The y column which is the target variable of our univariate time series

The index column represents the index of the respective time series. If you only work with a single time series, the index is always 0 or some other constant. If you have multiple time series in one data frame, the index differentiates between them and enables parallel computing. For example, the first time series gets the index 0, the second the index 1, and so on.
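As a minimal sketch, the table above can be built in pandas like this:

```python
import pandas as pd

# The structure StatsForecast expects: an index column "unique_id",
# a "ds" column (date or numeric), and the target column "y".
df = pd.DataFrame({
    "unique_id": [0, 0, 0],
    "ds": pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"]),
    "y": [100, 200, 150],
}).set_index("unique_id")

print(df)
```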

Forecasting with StatsForecast

Now that we are familiar with the needed data structure, let us start with modeling and forecasting.

Australian total wine sales (single time series)

First of all, we have to bring the data into the right shape. The sales data comes as a series with the date (month and year) as its index.

Since it is a single time series, we set the index for the whole data frame to 0 and create a ds column for the date and a y column for the sales values.

After this step, we convert the ds column to datetime. The total wine data is on a monthly level; when we parse the dates, we get a year-month-day format that always starts with the first day of the month. This would later cause a problem with the forecasting method (whose output dates always fall on the end of the month). That’s why we add MonthEnd(1) here.

Last but not least, we split our data into training and test sets. The final result of our data looks like this:

Figure 1. Excerpt of the transformed Australian total wine sales data set.
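The wrangling described above can be sketched as follows. This is a hedged sketch using a small synthetic stand-in for the wine sales series, not the original notebook code:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Synthetic stand-in for the wine sales series: date (month year) as index.
sales = pd.Series(
    [15136, 16733, 20016, 17708],
    index=["Jan 1980", "Feb 1980", "Mar 1980", "Apr 1980"],
)

# Build the required structure: constant unique_id, ds, and y columns.
df = pd.DataFrame({"ds": sales.index, "y": sales.values})
df.insert(0, "unique_id", 0)  # single series -> constant index
df = df.set_index("unique_id")

# Parsing "Jan 1980" yields the first day of the month, so we shift
# each date to the month's end with MonthEnd(1).
df["ds"] = pd.to_datetime(df["ds"]) + MonthEnd(1)

# Train/test split: hold out the last observations for evaluation.
train, test = df.iloc[:-2], df.iloc[-2:]
```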

After bringing our data into the right shape, we can start initializing StatsForecast.

StatsForecast needs the following parameters:

  • The (training) data frame
  • The models you want to use, passed as tuples of the model name and its parameters, e.g. (auto_arima, 12)
  • The frequency (‘M’ for months, ‘Y’ for years, etc.)
  • Optionally n_jobs, which can be used to enable parallel computing when dealing with multiple time series

As mentioned at the beginning, StatsForecast comes with a number of other statistical and econometric models. You can pass multiple models to the models parameter, each as a tuple with its particular parameters. A full list of all provided models can be found here.

In this example, we use only the auto_arima model and set the season length parameter to 12. The n_jobs parameter defaults to 1; it only makes sense to increase it if we have multiple time series in our data set.

Now that we have configured our forecaster, we can predict the upcoming 26 months.

The computation took just 2.78 s.

StatsForecast’s predictions are returned as a data frame (figure 2). Each column (except the ds for the time) shows the forecasts produced by the defined model(s).

Figure 2. Excerpt of our StatsForecast prediction using auto_arima.

Now that we have our results, let us calculate the Mean Absolute Error (MAE). A full overview of various time series error metrics can be found here.
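The MAE is simply the average absolute difference between actuals and forecasts; a minimal implementation:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - prediction|."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))
```

Applied to the test set values and the model's forecast column, this yields the value reported below.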

We get an MAE of 1592.865459735577.

Comparison to the pmdarima package

Let us now compare the performance and accuracy with the auto_arima function of the pmdarima package.

With pmdarima's auto_arima approach we get a computation time of 86 seconds and an MAE of 1951.2410193026085.

Comparing the performance of both packages and plotting the forecast results in a graph (figure 3), we can see that StatsForecast’s auto_arima performs about 30 times faster and is more accurate than pmdarima's.

Figure 3. 26 months forecast results.

M4 data (multiple time series)

Since we are now familiar with forecasting a single time series, let us focus on a more advanced example. The M4 data can be found here. For the sake of simplicity and demonstration, we focus only on the daily data of the micro-economic time series (1,476 different series). As mentioned above, the motivation of this section is to show you how to bring multiple time series into the right shape for StatsForecast.

We start by loading parts of the M4 data set and converting its columns to rows:
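A hedged sketch of that step (the column names V1, V2, … follow the raw M4 CSV layout, but the values here are made up):

```python
import pandas as pd

# The raw M4 file is wide: one row per series, one column per day.
wide = pd.DataFrame({
    "V1": ["D1", "D2"],          # series ids
    "V2": [1017.1, 2331.0],      # day 1 values
    "V3": [1019.3, 2340.5],      # day 2 values
})

# Melt it so that each row holds a single observation.
long = wide.melt(id_vars="V1", var_name="day", value_name="y")
long = long.dropna(subset=["y"]).rename(columns={"V1": "id"})
```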

The raw data (figure 4) is converted into the following shape (figure 5).

Figure 4. Excerpt of loaded M4 daily data set.
Figure 5. Excerpt of converted M4 daily data set.

After this step we add information about the data set (mainly its category) to our test and train data frames. We also create a unique_id index that combines the category and the id of each time series.
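A sketch of these steps with toy data (the frame and column names are assumptions, not the original code):

```python
import pandas as pd

# Long-format observations and the M4 info table with each series' category.
train = pd.DataFrame({"id": ["D1", "D1", "D2"], "y": [1.0, 2.0, 3.0]})
info = pd.DataFrame({"id": ["D1", "D2"], "category": ["Micro", "Micro"]})

# Attach the category and build a unique_id of category plus series id.
train = train.merge(info, on="id", how="left")
train["unique_id"] = train["category"] + "_" + train["id"]
train = train.set_index("unique_id")
```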

The outcome after these wrangling steps looks like this:

Figure 6. Excerpt of the training data set after manipulation.

Unlike in our first example, our index now has multiple values. Each value represents one time series, which later allows the package to run the forecasts in parallel.

Only one last wrangling step is missing: we have to create a ds column. We do this by running the following code:
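A hedged sketch of that step using groupby/cumcount (not the original gist):

```python
import pandas as pd

# Toy frame with two series identified by the unique_id index.
df = pd.DataFrame(
    {"y": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.Index(["Micro_D1"] * 3 + ["Micro_D2"] * 2, name="unique_id"),
)

# Number each series' observations from 1..n; these day indices
# serve as the ds column (numeric ds values are allowed).
df["ds"] = df.groupby(level="unique_id").cumcount() + 1
```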

The code adds to each time series a sequence of numbers (starting from 1) representing the respective days. Remember: the ds column can contain numerical values or dates. The outcome looks like this:

Figure 7. Excerpt of the training data set with ds-column.

Now that we have all the data we need, we can go to our final modeling step.

We set the season length to 7 and the frequency to ‘D’ because we have daily data. Unlike in our first example, we set n_jobs not to 1 but to the number of different time series we want to forecast, capped at the number of available cores: if there are more series than cores, the min function returns the number of available cores.
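The n_jobs rule just described can be expressed as a one-liner (a sketch; n_series stands for the number of series in the data frame):

```python
import os

n_series = 1476  # number of micro-economic daily series in our excerpt

# One job per series, capped at the number of available cores.
n_jobs = min(os.cpu_count() or 1, n_series)
```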

Executing the forecast took 3 min 11 s, with an MAE of 3541.514063940959.


Conclusion

StatsForecast’s performance is really impressive. The team is working hard on fixing bugs and implementing new features, as you can see on their GitHub page. Even though the current documentation is limited (but in the making), you can find a lot of examples on their GitHub page and a well-written, detailed article on Towards Data Science.

If I could make three wishes, I would love to have confidence and prediction intervals, model summary functions (providing more statistical information about the models), and the integration of StatsForecast into other packages.

Further links

👉 StatsForecast’s github page

👉 TDS Article from the co-founder Federico Garza Ramírez

👉 Pmdarima’s github page

👉 Overview of time series error metrics

