MARS: Multivariate Adaptive Regression Splines – How to Improve on Linear Regression?

A visual explanation of the MARS algorithm with Python examples and comparison to linear regression

Hands-on Tutorials, Machine Learning

Model prediction comparison between MARS and Linear Regression. Image by author.

Intro

Machine Learning is making huge leaps forward, with an increasing number of algorithms enabling us to solve complex real-world problems.

This story is part of a deep dive series explaining the mechanics of Machine Learning algorithms. In addition to giving you an understanding of how ML algorithms work, it also provides you with Python examples to build your own ML models.

Before we dive into the specifics of MARS, I assume that you are already familiar with Linear Regression. If you would like a refresher on the topic, feel free to explore my linear regression story:

Linear regression made easy. How does it work and how to use it in Python?

This story covers the following topics:

  • What category of algorithms does MARS belong to?
  • How does the MARS algorithm work, and how does it differ from linear regression?
  • How can I use MARS to build a prediction model in Python?

What category of algorithms does MARS belong to?

Looking at the algorithm’s full name – Multivariate Adaptive Regression Splines – you would be correct to guess that MARS belongs to the group of regression algorithms used to predict continuous (numerical) target variables.

Regression itself is part of the supervised Machine Learning category that uses labeled data to model the relationship between data inputs (independent variables) and outputs (dependent variables).

You can use multivariate adaptive regression splines to tackle the same problems that you would use linear regression for, given they both belong to the same group of algorithms. A few examples of such problems would be:

  • Estimating the price of an asset based on its characteristics
  • Predicting home energy consumption based on time of day and outside temperature
  • Estimating inflation based on interest rates, money supply, and other macroeconomic indicators

While the list could go on forever, remember: regression algorithms are there to help you whenever you have a numerical target variable.

How does the MARS algorithm work, and how does it differ from linear regression?

The basics

The beauty of linear regression is its simplicity, as it assumes a linear relationship between inputs and outputs (except for polynomial regression, a special case of multiple linear regression, used to model non-linear relationships).

However, relationships between variables in the real world are often non-linear, which means that simple linear regression cannot always give us a good approximation of outputs given the inputs. This is where MARS comes to the rescue.

The simplest way to think about MARS is to imagine it as an ensemble of linear functions joined together by one or more hinge functions.

Hinge function: 
h(x-c) = max(0, x-c) = {x-c, if x > c; 0, if x ≤ c}, 
where c is a constant, also known as a knot

The result of combining linear hinge functions can be seen in the example below, where black dots are the observations, and the red line is a prediction given by the MARS model:

Example - using MARS to predict y values given x. Image by author.

It is clear from this example that simple linear regression would fail to give us a meaningful prediction, as we would not be able to draw one straight line across the entire set of observations. At the same time, polynomial regression would also struggle with this task because of the "sharp angles" (abrupt slope changes) seen in the data plot.

However, the MARS algorithm does very well since it can combine a few linear functions using "hinges."

The equation for the above example:
y = -14.3953 + 1.99032 * max(0, 4.33545 - x) + 2.00966 * max(0, x + 9.95293)
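To make this concrete, below is a minimal NumPy sketch that evaluates the equation above by hand; the helper h is our own illustration of the hinge function, not part of any library.

import numpy as np

def h(z):
    # Hinge function: max(0, z), applied element-wise
    return np.maximum(0, z)

# The fitted MARS model from the example above, written out by hand
def mars_predict(x):
    return -14.3953 + 1.99032 * h(4.33545 - x) + 2.00966 * h(x + 9.95293)

print(mars_predict(np.array([-12.0, 0.0, 10.0])))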

The process

The algorithm has two stages: the forward stage and the backward stage.

In the forward stage, it generates many candidate basis functions, which are always produced in pairs, i.e., h(x-c) and h(c-x). A generated pair of functions is only added to the model if it reduces the overall model's error. You can typically control the maximum number of basis functions the model generates with a hyperparameter (max_terms in py-earth).

The backward stage, a.k.a. the pruning stage, goes through the basis functions one at a time and deletes the ones that add no material value to the model's performance. It does this using a generalized cross-validation (GCV) score. Note that the GCV score is not actually computed with cross-validation; it is an approximation of the true cross-validation score, designed to penalize model complexity.
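For intuition, here is a minimal sketch of the GCV calculation. The effective-parameter count below follows a common simplification (a penalty of 3 per knot matches py-earth's default penalty parameter), so treat it as illustrative rather than py-earth's exact internals:

def gcv(rss, n_samples, n_terms, penalty=3.0):
    # GCV is the mean squared error inflated by a complexity factor;
    # effective parameters grow with the number of terms and knots, so a
    # larger model must reduce RSS by enough to justify its extra size
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    return (rss / n_samples) / (1.0 - effective_params / n_samples) ** 2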

The result is several linear functions that can be written down in a simple equation like in the example used above.

How can I use MARS to build a prediction model in Python?

Since you now have a general understanding of how the algorithm works, it is time to have some fun and build a couple of prediction models in Python.

We will use the following:

  • House price data from Kaggle
  • Scikit-learn library to build linear regression models (so we can compare its predictions to MARS)
  • py-earth library to build MARS models
  • Plotly library for visualizations
  • Pandas and Numpy

Setup

Note that the py-earth package is only compatible with Python 3.6 or below at the time of writing. If you are using Python 3.7 or above, I suggest you create a virtual environment with Python v.3.6 to install py-earth.

Let us start by importing the required libraries.

import pandas as pd # for data manipulation
import numpy as np # for data manipulation
from sklearn.linear_model import LinearRegression # for building a linear regression model
from pyearth import Earth # for building a MARS model
import plotly.graph_objects as go # for data visualization
import plotly.express as px # for data visualization

Next, we download and ingest the data that we will use to build our MARS and linear regression models. (source: https://www.kaggle.com/quantbruce/real-estate-price-prediction?select=Real+estate.csv)

# Read in data
df = pd.read_csv('Real estate.csv', encoding='utf-8')
# Print dataframe
df
House price data from Kaggle. Image by author.

MARS vs. simple linear regression – 1 independent variable

Let us take ‘X3 distance to the nearest MRT station’ as our input (independent) variable and ‘Y house price of unit area’ as our output (dependent, a.k.a. target) variable.

Before we build the models, however, we will create a scatter plot to visualize the data.

# Create a scatter plot
fig = px.scatter(df, x=df['X3 distance to the nearest MRT station'], y=df['Y house price of unit area'], 
                 opacity=0.8, color_discrete_sequence=['black'])

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey', 
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey', 
                 showline=True, linewidth=1, linecolor='black')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey', 
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey', 
                 showline=True, linewidth=1, linecolor='black')

# Set figure title
fig.update_layout(title_text="Scatter Plot")

# Update marker size
fig.update_traces(marker=dict(size=3))

fig.show()
Scatterplot of X and Y. Image by author.

Looking at the graph above, we can clearly see the relationship between the two variables. The price of a house unit area decreases as the distance from the nearest MRT station increases.

Now, let’s build multivariate adaptive regression splines and simple linear regression models and compare their predictions.

# --- Select variables to use in the two models --- 
# Note, we need X to be a 2D array, hence reshape
X=df['X3 distance to the nearest MRT station'].values.reshape(-1,1)
y=df['Y house price of unit area'].values

# --- Define and fit the two models ---
model1 = LinearRegression() 
model2 = Earth(max_terms=500, max_degree=1) # max_terms caps the basis functions generated in the forward stage; max_degree=1 disallows interaction terms

LR = model1.fit(X, y)
MARS = model2.fit(X, y)

# --- Print model summary ---
# LR
print("Simple Linear Regression Model")
print("--------------------------------------")
print("Intercept: ", LR.intercept_)
print("Slope: ", LR.coef_)

print("")
print("<><><><><><><><><><><><><><><><><><><>")
print("")

# MARS
print(MARS.summary())
Summary statistics for MARS and linear regression models. Image by author.

As you can see, the MARS model added two hinge functions in the forward stage, but it then pruned h(x0-1146.33) from the model in the backward stage. Hence, the final equations for the two models are:

Simple linear regression model:
y = 45.8514 - 0.00726205 * x

MARS model:
y = 31.4145 + 0.0184597 * h(1146.33 - x) - 0.00269698 * x
  = 31.4145 + 0.0184597 * max(0, 1146.33 - x) - 0.00269698 * x
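As a quick sanity check, you can evaluate the MARS equation by hand and compare it against the fitted model's output (a sketch using the coefficients printed above; the two sets of numbers should closely match):

# Evaluate the MARS equation manually for a few x values
x_check = np.array([500.0, 1146.33, 3000.0])
manual = 31.4145 + 0.0184597 * np.maximum(1146.33 - x_check, 0) - 0.00269698 * x_check
print(manual)

# Compare with the fitted model's predictions
print(model2.predict(x_check.reshape(-1, 1)))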

Let us now plot them both on one graph so we can see how they differ.

# ----- Create data for Linear Regression line -----
# Create 20 evenly spaced points from smallest X to largest X
x_range = np.linspace(X.min(), X.max(), 20) 

# Predict y values for our set of X values
y_range = model1.predict(x_range.reshape(-1, 1))

# ----- Create data for MARS model line -----
# Select a few points including the 3 major ones: min, max and hinge
x_mars = np.array([X.min(), 1000, 1146.33, 2000, 3000, 4000, 5000, 6000, X.max()])

# Predict y values for our set of X values
y_mars = model2.predict(x_mars.reshape(-1, 1))

# ----- Create a scatter plot -----
fig = px.scatter(df, x=df['X3 distance to the nearest MRT station'], y=df['Y house price of unit area'], 
                 opacity=0.8, color_discrete_sequence=['black'])

# Add a best-fit line
fig.add_traces(go.Scatter(x=x_range, y=y_range, name='Linear Regression', line=dict(color='limegreen')))
fig.add_traces(go.Scatter(x=x_mars, y=y_mars, name='MARS', line=dict(color='red')))

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey', 
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey', 
                 showline=True, linewidth=1, linecolor='black')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey', 
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey', 
                 showline=True, linewidth=1, linecolor='black')

# Set figure title
fig.update_layout(title_text="Linear Regression vs. MARS")

# Update marker size
fig.update_traces(marker=dict(size=3))

fig.show()
Linear regression and MARS model comparison. Image by author.

Note the kink at x = 1146.33. This is where the hinge function h(1146.33 - x) becomes zero and the line changes its slope. The graph makes it very intuitive to understand how MARS can fit the data better than a straight line by using hinge functions.
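You can confirm the slope change numerically with a simple central-difference sketch on either side of the knot (purely illustrative):

# Estimate the fitted curve's slope just below and just above the knot
eps = 1e-3
for x0 in (1000.0, 2000.0):  # one point on each side of the knot at 1146.33
    lo = model2.predict(np.array([[x0 - eps]]))
    hi = model2.predict(np.array([[x0 + eps]]))
    print(f"Slope near x={x0}: {(hi - lo)[0] / (2 * eps):.6f}")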

MARS vs. multiple linear regression – 2 independent variables

Let us now go up in dimensions and build and compare models using 2 independent variables.

We start by creating a 3D scatterplot with our data. Note, we use the same data as before but add one more independent variable – ‘X2 house age’.

# Create a 3D scatter plot
fig = px.scatter_3d(df, x=df['X3 distance to the nearest MRT station'], y=df['X2 house age'], z=df['Y house price of unit area'], 
                 opacity=0.8, color_discrete_sequence=['black'])

# Set figure title
fig.update_layout(title_text="Scatter 3D Plot",
                  scene = dict(xaxis=dict(backgroundcolor='white',
                                          color='black',
                                          gridcolor='lightgrey'),
                               yaxis=dict(backgroundcolor='white',
                                          color='black',
                                          gridcolor='lightgrey'
                                          ),
                               zaxis=dict(backgroundcolor='white',
                                          color='black', 
                                          gridcolor='lightgrey')))

# Update marker size
fig.update_traces(marker=dict(size=3))

fig.show()
Observations visualized with Plotly 3D scatterplot. Image by author.

We can see that, while somewhat weaker, there is also a relationship between X2 and Y: the price tends to increase as the house age decreases.

Let us now fit multivariate adaptive regression splines and linear regression models.

# ----- Select variables that we want to use in a model -----
# Note, X in this case is already a 2D array, hence no reshape
X=df[['X3 distance to the nearest MRT station','X2 house age']]
y=df['Y house price of unit area'].values

# ----- Define and fit the two models -----
model1 = LinearRegression()
model2 = Earth()

LR = model1.fit(X, y)
MARS = model2.fit(X, y)

# ----- Print model summary -----
# LR
print("Multiple Linear Regression Model")
print("--------------------------------------")
print("Intercept: ", LR.intercept_)
print("Slope: ", LR.coef_)

print("")
print("<><><><><><><><><><><><><><><><><><><>")
print("")

# MARS
print(MARS.summary())
Summary statistics for MARS and linear regression models. Image by author.

Since we increased the number of dimensions, we now have two slope parameters in the linear regression model (one for each input variable). Four hinge functions were added to the MARS model, making use of both independent variables.
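Before plotting, it is worth comparing the two fits numerically. Here is a minimal sketch using scikit-learn's metrics; note these are in-sample numbers, so treat them as indicative rather than a proper evaluation on held-out data:

from sklearn.metrics import mean_squared_error, r2_score

# In-sample fit comparison: lower MSE and higher R^2 indicate a better fit
for name, model in [("Linear Regression", model1), ("MARS", model2)]:
    pred = model.predict(X)
    print(f"{name}: MSE={mean_squared_error(y, pred):.2f}, R^2={r2_score(y, pred):.3f}")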

Let us plot two graphs to visualize the results, one for multiple linear regression and another for multivariate adaptive regression splines. But before that, we need to generate a mesh with a range of input values and predict output values. This will give us the data for our two graphs.

# Increments between points in a meshgrid
mesh_size = 5

# Identify min and max values for input variables
x_min, x_max = X['X3 distance to the nearest MRT station'].min(), X['X3 distance to the nearest MRT station'].max()
y_min, y_max = X['X2 house age'].min(), X['X2 house age'].max()

# Return evenly spaced values based on a range between min and max
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)

# Create a meshgrid
xx, yy = np.meshgrid(xrange, yrange)

# Predict using LR model
pred_LR = model1.predict(np.c_[xx.ravel(), yy.ravel()])
pred_LR = pred_LR.reshape(xx.shape)

# Predict using MARS model
pred_MARS = model2.predict(np.c_[xx.ravel(), yy.ravel()])
pred_MARS = pred_MARS.reshape(xx.shape)

# Note: .ravel() flattens each 2D array into a 1D array, and np.c_
# stacks the flattened xx and yy arrays column-wise, producing the
# (n_samples, 2) shape required as model input.

# The prediction returned by the model is a 1D array; we reshape it
# to match the shape of xx / yy so it can be displayed as a surface.

Now that we have the data ready, let us draw the graphs.

# Create a 3D scatter plot with predictions
fig = px.scatter_3d(df, x=df['X3 distance to the nearest MRT station'], y=df['X2 house age'], z=df['Y house price of unit area'], 
                 opacity=0.8, color_discrete_sequence=['black'])

# Set figure title and colors
fig.update_layout(title_text="Scatter 3D Plot with Linear Regression Prediction Surface",
                  scene = dict(xaxis=dict(backgroundcolor='white',
                                          color='black',
                                          gridcolor='lightgrey'),
                               yaxis=dict(backgroundcolor='white',
                                          color='black',
                                          gridcolor='lightgrey'
                                          ),
                               zaxis=dict(backgroundcolor='white',
                                          color='black', 
                                          gridcolor='lightgrey')))
# Update marker size
fig.update_traces(marker=dict(size=3))

# Add prediction plane
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_LR, name='LR'))

fig.show()
Multiple linear regression with 2 independent variables. Image by author.

# Create a 3D scatter plot with predictions
fig = px.scatter_3d(df, x=df['X3 distance to the nearest MRT station'], y=df['X2 house age'], z=df['Y house price of unit area'], 
                 opacity=0.8, color_discrete_sequence=['black'])

# Set figure title and colors
fig.update_layout(title_text="Scatter 3D Plot with MARS Prediction Surface",
                  scene = dict(xaxis=dict(backgroundcolor='white',
                                          color='black',
                                          gridcolor='lightgrey'),
                               yaxis=dict(backgroundcolor='white',
                                          color='black',
                                          gridcolor='lightgrey'
                                          ),
                               zaxis=dict(backgroundcolor='white',
                                          color='black', 
                                          gridcolor='lightgrey')))
# Update marker size
fig.update_traces(marker=dict(size=3))

# Add prediction plane
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_MARS, name='MARS'))

fig.show()
Multivariate adaptive regression splines with 2 independent variables. Image by author.

It is easy to see the difference between the two models. Multiple linear regression creates a prediction plane that looks like a flat sheet of paper. Meanwhile, MARS takes that sheet of paper and folds it in a few places using hinge functions, enabling a better fit to the data.

If you were wondering what the feature image at the beginning of the story represented, you should now be able to see that it overlays the predictions from the linear regression and MARS models, making it easy to see how the prediction outputs of the two models differ.

Conclusion

The multivariate adaptive regression splines algorithm is best summarized as an improved version of linear regression that can model non-linear relationships between variables. While I demonstrated examples using one and two independent variables, remember that you can add as many variables as you like.
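As a starting point for your own experiments, here is a hedged sketch that fits MARS on all the numeric features of the same dataset; the column names follow the Kaggle file used above, so adjust them if your copy differs:

# Fit MARS using every available input feature at once
features = ['X1 transaction date', 'X2 house age',
            'X3 distance to the nearest MRT station',
            'X4 number of convenience stores', 'X5 latitude', 'X6 longitude']
model_all = Earth(max_degree=1)
model_all.fit(df[features], df['Y house price of unit area'].values)
print(model_all.summary())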

I hope you found this story useful and that you will put what you learned into practice by building and improving your own regression models.

Feel free to reach out if you have any feedback or questions.

Cheers! 👏 Saul Dobilas


Related stories you may like:

LOWESS Regression in Python: How to Discover Clear Patterns in Your Data?

Support Vector Regression (SVR) – One of the Most Flexible Yet Robust Prediction Algorithms

