
One thing you might not have known about linear regression

How to train a single model with multiple outputs

Photo by NeONBRAND on Unsplash

In the vast majority of linear regression use cases (at least those covered in mainstream tutorials), we have a set of features (X) and a single column of targets (y) that we want to predict. However, linear regression is one of the algorithms that can handle multiple outputs at the same time. Neural networks can do this as well, as can scikit-learn's decision trees and random forests; gradient-boosted tree libraries such as XGBoost or LightGBM, on the other hand, typically need one model per target (or a wrapper such as scikit-learn's MultiOutputRegressor).

In this short article, we show how to use multi-output linear regression in a simple time series forecasting example – we predict multiple steps ahead using the same set of features. However, multi-output regression can also be used for other tasks, for example, predicting two series of economic indicators from one set of features that we believe is relevant to both of them.

Example in Python

Setup

As always, we start by importing the required libraries.
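The original import cell is not reproduced in this copy of the article; a minimal set covering the rest of the walkthrough would be:

```python
# Core libraries used throughout the example
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# scikit-learn pieces for splitting the data and fitting the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```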

Data

Then, we load the famous airline passengers data set, which is readily available in the seaborn library. This is a very popular data set, so we will just mention that we have 12 years of monthly data at our disposal.
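Loading the data takes a single call (note that seaborn fetches the data set from its online repository on first use and caches it locally):

```python
import seaborn as sns

# The classic airline passengers data set: monthly totals, 1949-1960
flights = sns.load_dataset("flights")
print(flights.shape)             # (144, 3): 12 years of monthly observations
print(flights.columns.tolist())  # ['year', 'month', 'passengers']
```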

Image by author

We can already see that there is a clear, increasing trend (probably multiplicative, as the variation seems to increase over time) and some strong yearly seasonality.

Preparing the features and targets

We need to remember that this is a very simplified example, so we will not spend too much time on feature engineering. At this point, we have to define the features and the targets:

  • given that there seems to be a yearly seasonal pattern present in the data, we will use 12 separate lags as features,
  • we will predict the number of airline passengers over the next 3 months.
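The feature-engineering code is not shown in this copy of the article; the lag/lead construction can be sketched with `pandas.Series.shift` (the toy series below stands in for the real passengers column, and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy monthly series standing in for the passengers data
idx = pd.period_range("1949-01", periods=24, freq="M")
df = pd.DataFrame({"y": np.arange(100.0, 124.0)}, index=idx)

# 12 lagged values as features, matching the yearly seasonality
for lag in range(1, 13):
    df[f"lag_{lag}"] = df["y"].shift(lag)

# 3 targets: the values 1, 2 and 3 months after the most recent lag
for h in range(1, 4):
    df[f"horizon_{h}"] = df["y"].shift(-(h - 1))

print(df.shape)  # (24, 16): the original column, 12 lags, 3 targets
```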

The data looks as follows:

Image by author

Naturally, there are a lot of missing values, given that we cannot create lags/leads for unavailable time periods. That is why we drop the rows containing missing values.

Then, we separate the features from the targets. This is the step that differs from the most common setup – the target actually consists of 3 columns. Having done so, we split the data into a training and a test set, using 20% of the observations for the test set. Remember that when working with time series, we should not shuffle the observations while splitting the data.
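With the missing rows dropped, the split might look like this (a random frame stands in for the real lag/lead table; the key details are the 3-column target and `shuffle=False`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the cleaned lag/lead table built earlier
rng = np.random.default_rng(42)
cols = [f"lag_{i}" for i in range(1, 13)] + [f"horizon_{h}" for h in range(1, 4)]
data = pd.DataFrame(rng.normal(size=(120, 15)), columns=cols)

X = data.filter(like="lag_")      # 12 feature columns
Y = data.filter(like="horizon_")  # the target has 3 columns

# time series: keep chronological order, do not shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=False
)
print(X_train.shape, y_train.shape)  # (96, 12) (96, 3)
```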

Training the model

Training the model looks exactly the same as in the case of a single-output model.
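With scikit-learn, it really is the same `fit`/`predict` pair of calls; the only difference is that `y` has multiple columns (synthetic data below for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 12 features, an exactly linear 3-column target
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 12))
y_train = X_train @ rng.normal(size=(12, 3))

model = LinearRegression()
model.fit(X_train, y_train)      # identical to the single-output call
y_pred = model.predict(X_train)

print(y_pred.shape)       # (100, 3): one column per forecast horizon
print(model.coef_.shape)  # (3, 12): one row of coefficients per target
```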

Inspecting the results

We will not evaluate the accuracy of our predictions, as we did not put too much effort into building an accurate model. Instead, we focus on investigating what the predictions look like in a plot. First, we need to create a simple helper function for visualizing multi-output series.
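The helper itself is missing from this copy of the article; a minimal sketch matching the calls below (assuming column i of the input holds the i-steps-ahead values, so each column is shifted to line up with the date it refers to) could be:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_multistep_forecast(df, title, freq="M"):
    """Plot a multi-output series, one line per forecast horizon.

    `df` is indexed by forecast origin, with one column per horizon;
    column i is shifted i periods so each value is drawn at the date
    it actually refers to.
    """
    df = pd.DataFrame(df).copy()
    if isinstance(df.index, pd.DatetimeIndex):
        df.index = df.index.to_period(freq)
    fig, ax = plt.subplots(figsize=(10, 4))
    for i, col in enumerate(df.columns):
        df[col].shift(i).plot(ax=ax, label=str(col))
    ax.set_title(title)
    ax.legend()
    return ax
```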

We first look at the ground truth.

plot_multistep_forecast(y_test, "Actuals", freq="M");
Image by author

As expected, there is a lot of overlap across consecutive periods. That is because the actual value for a given month is the same regardless of whether it is treated as a 1-step- or a 3-step-ahead target. Now, let’s take a look at the forecasts, which will most likely show more variation.

plot_multistep_forecast(y_pred, "Forecasts", freq="M");
Image by author

To be honest, the forecasts are not that bad! We can clearly see the same patterns as in the actuals above, though with some variation – for pretty much all dates, the forecast changes (sometimes by quite a lot) depending on which horizon we are looking at. But for this simple exercise, the results are definitely good enough.

Takeaways

  • some algorithms can predict multiple outputs simultaneously, and linear regression is one of them,
  • for the simplest possible implementation, we just need more columns in the target.

You can find the code used for this article on my GitHub. Also, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.

Liked the article? Become a Medium member to continue learning by reading without limits. If you use this link to become a member, you will support me at no extra cost to you. Thanks in advance and see you around!

You might also be interested in one of the following:

Verifying the Assumptions of Linear Regression in Python and R

Interpreting the coefficients of linear regression

Phik (𝜙k) – get familiar with the latest correlation coefficient, which is also consistent between categorical, ordinal, and interval variables!

