
A Guide to Live Inference with a Forecasting Model

Beyond offline training and testing predictions

Photo by Fringer Cat on Unsplash

There are many online resources about using machine learning for forecasting. Yet, most of them focus on offline training and testing predictions.

Here, you’ll learn how to create a model and use it to forecast actual future observations.


Introduction

Forecasting resources often overlook the application of models for live predictions.

There is plenty of information on applying machine learning to forecasting. But most of it focuses on a specific stage of the forecasting life-cycle, such as data pre-processing or model building.

These resources often lack information on how a model is applied in practice. That is, how to extend it from an offline setting to a live one.

This aspect is especially important in time series because observations are correlated and their order matters.

Let’s see how we can build a model and use it to make live forecasts.


Case Study – Forecasting Ocean Wave Height

Photo by Silas Baisch on Unsplash

We’ll use a time series about the height of ocean waves. Forecasting this kind of data is important for managing maritime operations.

The dataset is captured by a smart buoy placed off the coast of Ireland. You can check reference [1] for details.

This time series is continually updated with new observations. So, it’s a perfect example to develop a forecasting model for live predictions.

We’ll cover the following steps:

  1. Getting historical data;
  2. Pre-processing and feature engineering;
  3. Selecting and building a forecasting model;
  4. Getting the latest observations and making forecasts.

Getting historical data

Let’s start by getting the data.

You can read it directly from the ERDDAP server as follows:

import pandas as pd

START_DATE = '2022-01-01'
URL = f'https://erddap.marine.ie/erddap/tabledap/IWaveBNetwork.csv?time%2CSignificantWaveHeight&time%3E={START_DATE}T00%3A00%3A00Z&station_id=%22AMETS%20Berth%20B%20Wave%20Buoy%22'

def reading_data(url: str) -> pd.Series:
    """
    Reading ERDAP data

    :param url: ERDAP url as string
    :return: hourly wave height time series as pd.Series
    """

    # reading data directly from erdap
    data = pd.read_csv(url, skiprows=[1], parse_dates=['time'])

    # setting time to index and getting the target series
    series = data.set_index('time')['SignificantWaveHeight']

    # transforming data to hourly and from centimeters to meters
    series_hourly = series.resample('H').mean() / 100

    return series_hourly

series = reading_data(URL)

Here’s the time series plot:

Hourly ocean wave height time series. Source in reference [1]. Image by author.

Pre-processing and feature engineering

Before modeling, you may need to do some pre-processing.

For illustration purposes, we’ll do two things:

  • Take the log of the data, which helps stabilize the variance;
  • Extract features using summary statistics.

Here’s the code for both of these operations:

import numpy as np

class LogTransformation:
    """
    Log transformation and inverse transformation

    Taking the log helps stabilize the variance
    """

    @staticmethod
    def transform(x):
        xt = np.sign(x) * np.log(np.abs(x) + 1)

        return xt

    @staticmethod
    def inverse_transform(xt):
        x = np.sign(xt) * (np.exp(np.abs(xt)) - 1)

        return x

The class above also includes an inverse_transform method to revert the log transformation. This is important for converting the log-scale forecasts back to their original scale.
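As a quick sanity check, you can confirm that the transformation is reversible on a small synthetic array (the values below are purely illustrative):

# round-trip check of the log transformation on synthetic values
x = np.array([0.0, 0.5, 1.2, 3.4])

x_log = LogTransformation.transform(x)
x_back = LogTransformation.inverse_transform(x_log)

# x_back should match x up to floating-point precision
print(np.allclose(x, x_back))  # True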

def feature_engineering(X: pd.DataFrame) -> pd.DataFrame:
    """
    param X: lagged observations (explanatory variables)

    :return: new features
    """

    summary_stats = {'mean': np.mean, 'sdev': np.std}

    features = {}
    for f in summary_stats:
        features[f] = X.apply(lambda x: summary_stats[f](x), axis=1)

    features_df = pd.concat(features, axis=1)
    X_feats = pd.concat([X, features_df], axis=1)

    return X_feats

The function feature_engineering computes the average and standard deviation of each row of lagged values, which works like a rolling average and a rolling standard deviation. These are two examples of features you can compute to improve forecasting models.
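As a quick illustration, here’s what the function produces on a tiny synthetic lag matrix (the column names below are hypothetical and only used for this example):

# small synthetic example of the feature engineering step
lags_example = pd.DataFrame({'lag_2': [1.0, 2.0], 'lag_1': [2.0, 4.0]})

print(feature_engineering(lags_example))
#    lag_2  lag_1  mean  sdev
# 0    1.0    2.0   1.5   0.5
# 1    2.0    4.0   3.0   1.0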

Our goal is to show how different preprocessing steps fit into a pipeline for live inference. In your own case, you should check whether these or other transformations are actually needed, for example using cross-validation or statistical tests.

Building a forecasting model

The next step is to build a model and estimate its performance.

We start by splitting the data into training and testing sets. Then, we transform each part for auto-regression using a sliding window. You can check a previous article to learn more about supervised learning with time series.

from sklearn.model_selection import train_test_split
# https://github.com/vcerqueira/blog
from src.tde import time_delay_embedding

# using last 24 observations as lags, 
# and next 24 observations as the forecasting horizon
N_LAGS, HORIZON = 24, 24

train, test = train_test_split(series, test_size=0.2, shuffle=False)

X_train, Y_train = time_delay_embedding(train, n_lags=N_LAGS, horizon=HORIZON, return_Xy=True)
X_test, Y_test = time_delay_embedding(test, n_lags=N_LAGS, horizon=HORIZON, return_Xy=True)
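The time_delay_embedding function comes from the repository linked above. If you don’t have it at hand, here’s a minimal sketch of what such a function could look like; the column names and exact details are assumptions for illustration, not the repository’s implementation:

def time_delay_embedding_sketch(series: pd.Series,
                                n_lags: int,
                                horizon: int,
                                return_Xy: bool = True):
    """
    Minimal sketch of a time-delay embedding:
    each row holds n_lags past values (X) and horizon future values (Y)
    """

    cols = {}
    # lagged values, from the oldest lag up to the current observation
    for i in range(n_lags - 1, -1, -1):
        cols[f't-{i}'] = series.shift(i)
    # future values over the forecasting horizon
    for j in range(1, horizon + 1):
        cols[f't+{j}'] = series.shift(-j)

    # dropping rows with incomplete windows
    df = pd.concat(cols, axis=1).dropna()

    if not return_Xy:
        return df

    return df.iloc[:, :n_lags], df.iloc[:, n_lags:]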

Using this setup, we’ll build a model to forecast the next 24 hours based on the previous 24 lags.

There are several techniques for building forecasting models. Here, we’ll focus on a Random Forest, tuning its hyperparameters with cross-validation. Here’s how you can do it:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

## apply preprocessing steps
# log transformation
X_train = LogTransformation.transform(X_train)
Y_train = LogTransformation.transform(Y_train)
# feature engineering
X_train_ext = feature_engineering(X_train)

# time series cv procedure
tscv = TimeSeriesSplit(n_splits=5, gap=50)

# defining the search space
# a simple search over a few Random Forest hyperparameters
model = RandomForestRegressor()
param_search = {'n_estimators': [10, 50, 100, 200],
                'criterion': ['squared_error', 'absolute_error'],
                'max_depth': [None, 2, 5, 10],
                'max_features': ['log2', 'sqrt']}

# applying CV with random search on the training data
gs = RandomizedSearchCV(estimator=model,
                        cv=tscv,
                        refit=True,
                        param_distributions=param_search,
                        n_iter=10, n_jobs=1)

gs.fit(X_train_ext, Y_train)

After optimizing the parameters, you can apply the model to test data. This provides a reliable performance estimate.

from sklearn.metrics import r2_score

# applying preprocessing steps to test data
X_test = LogTransformation.transform(X_test)
X_test_ext = feature_engineering(X_test)

# inference on the test set (forecasts on the log scale)
preds_log = gs.predict(X_test_ext)

# reverting the log transformation
preds = LogTransformation.inverse_transform(preds_log)

# estimating performance using r-squared
estimated_performance = r2_score(Y_test, preds)

Note that hyperparameter optimization and performance estimation are carried out on two separate parts of the data.

The chosen model is re-trained with all available data. You can also store it in a file using joblib:

from joblib import dump

# preparing all available data for auto-regression
X, Y = time_delay_embedding(series, n_lags=N_LAGS, horizon=HORIZON, return_Xy=True)

# applying preprocessing steps
X = LogTransformation.transform(X)
Y = LogTransformation.transform(Y)
X_ext = feature_engineering(X)

# model fitting
final_model = RandomForestRegressor(**gs.best_params_)
final_model.fit(X_ext, Y)

dump(final_model, 'random_forest_v1.joblib')

Applying the model

At this stage, we accomplished several things:

  • Read and pre-processed the time series;
  • Built and optimized a forecasting model using cross-validation;
  • Estimated its performance using a test set.

Now, we’re ready to apply this model in the wild.

Let’s start by getting the latest observations from the buoy.

import datetime

# setting the max history to yesterday
yesterday = datetime.date.today() - datetime.timedelta(days=1)
yesterday = yesterday.strftime('%Y-%m-%d')

LIVE_URL = f'https://erddap.marine.ie/erddap/tabledap/IWaveBNetwork.csv?time%2CSignificantWaveHeight&time%3E={yesterday}T00%3A00%3A00Z&station_id=%22AMETS%20Berth%20B%20Wave%20Buoy%22'

# reading the data from the ERDDAP server
new_series = reading_data(LIVE_URL)

# getting the last 24 observations needed for the model
lags = new_series.tail(N_LAGS)

Why do we need these observations?

Recall that our model is based on auto-regression: it uses recent past observations to predict future ones. In our example, the input to the model is the latest 24 observations, which are used to predict the next 24 hours of data.

Before applying the model, we need to restructure this data and apply the same transformations used to train the final model. Then, we load the model and apply it to this sample.

from joblib import load

# structuring the lags as a DataFrame
lags_df = pd.DataFrame(lags).T.reset_index(drop=True)
lags_df.columns = X.columns

# applying preprocessing steps
lags_df = LogTransformation.transform(lags_df)
lags_feats = feature_engineering(lags_df)

# loading the model from disk
final_model = load('random_forest_v1.joblib')

# applying the model
log_forecasts = final_model.predict(lags_feats)

# reverting the log transformation
forecasts = LogTransformation.inverse_transform(log_forecasts)

Here’s what the forecasts look like:

Forecasts of a Random Forest (in navy blue) for the next 24 hours. Image by author.

The forecasts of individual trees are also included to convey forecasting uncertainty.
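The per-tree forecasts aren’t shown in the code above. One way to obtain them is through the estimators_ attribute of the fitted Random Forest; the snippet below is a sketch, not the exact code behind the figure:

# forecasts of each individual tree, useful to convey uncertainty
tree_preds_log = np.stack([tree.predict(lags_feats.values)
                           for tree in final_model.estimators_])

# reverting the log transformation for each tree's forecast
tree_preds = LogTransformation.inverse_transform(tree_preds_log)

# tree_preds has shape (number of trees, 1, HORIZON)
print(tree_preds.shape)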

Since these are actual forecasts of future values, we’ll have to wait before we can check how the model performed.
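When the forecast horizon has passed, you can fetch the observations that cover it and compare them with the stored forecasts. Here’s a rough sketch of how that could look, reusing the functions defined above; the alignment logic is an assumption for illustration:

from sklearn.metrics import r2_score

# timestamps covered by the forecasts: the 24 hours after the last observed lag
forecast_index = pd.date_range(start=lags.index[-1],
                               periods=HORIZON + 1, freq='H')[1:]

forecast_series = pd.Series(forecasts.flatten(), index=forecast_index)

# later, once the forecast horizon has elapsed,
# re-read the data and align it with the forecasts
observed = reading_data(LIVE_URL).reindex(forecast_index).dropna()

live_performance = r2_score(observed, forecast_series.loc[observed.index])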


Key Takeaways

In this article, we built a model and used it to make forecasts in a real-world scenario.

We explored:

  • how to get the latest observations and structure them for an auto-regression model;
  • how to store and load a model;
  • how to apply and revert transformations to get forecasts on the original data scale.

Thanks for reading, and see you in the next story!


References

[1] Irish Wave Buoys from Marine Institute (Dataset ID: IWaveBNetwork). License: CC BY 4.0

