Unraveling Spline Regression in R

Published in Towards Data Science, Aug 15, 2019

When we talk about regression, the first things that come to mind are linear and logistic regression, with polynomial regression somewhere in the back of the mind. Linear and logistic regression are two of the most popular regression methods, but many other methods can prove useful in different scenarios. Today we will look at Spline Regression using Step Functions.

Spline Regression is often described as a non-parametric regression technique. It divides the dataset into bins at points called knots, and each bin gets its own fit. Let’s look at a simple implementation of spline regression using a step function in R.

Visualizing the dataset:

Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity,Sales)
data

library(plotly)

plot_ly(data,x=~Quantity,
        y=~Sales,
        type="scatter"
)

Let’s fit a linear regression on this data and see how it works:

fit <- lm(Sales ~ Quantity, data=data)
summary(fit)

plot_ly(data,x=~Quantity,
        y=~Sales,
        type="scatter") %>% add_lines(x =  ~Quantity, y = fitted(fit))

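The equation here takes the form of:

$$y = \beta_0 + \beta_1 x$$

In this case, $y$ is Sales and $x$ is Quantity; the estimates of $\beta_0$ and $\beta_1$ are the coefficients reported by summary(fit).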

Linear regression produces a poor fit in this case, as the plot above and the low R-squared value show.
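The R-squared value mentioned above can also be pulled out of the model object instead of being read off the printed summary. Here is a minimal sketch that rebuilds the same data so it runs on its own; the names r2_linear and adj_r2_linear are introduced only for illustration.

```r
# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

fit <- lm(Sales ~ Quantity, data = data)

# summary() stores the goodness-of-fit numbers as list elements
r2_linear <- summary(fit)$r.squared
adj_r2_linear <- summary(fit)$adj.r.squared
print(round(c(r2 = r2_linear, adj_r2 = adj_r2_linear), 3))

# Residuals of a straight line on curved data change sign in runs
# (negative, then positive, then negative) instead of scattering around zero
print(round(residuals(fit)))
```

A run pattern like this in the residuals is a quick hint that the relationship bends and a single straight line is missing it.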

Let’s now introduce a polynomial term (quadratic here) to the equation and analyze the performance of the model.

# poly(Quantity, 2) already contains the linear term, so adding Quantity again is redundant
fit2 <- lm(Sales ~ poly(Quantity,2), data=data)
summary(fit2)

plot_ly(data,x=~Quantity,
        y=~Sales,
        type="scatter") %>% add_lines(x =  ~Quantity, y = fitted(fit2))

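The equation here takes the form of:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$

In this case, $y$ is Sales and $x$ is Quantity. Note that poly() builds orthogonal polynomial terms by default, so the coefficients in summary(fit2) refer to that orthogonal basis; the fitted curve is still a quadratic in Quantity.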

We can see that it’s not a bad fit, but not a great one either. The predicted apex is somewhat far from the actual apex. Polynomial regression also has disadvantages: it tends to overfit as the degree grows, and the number of terms, and with it the model’s complexity, increases quickly as more features are added.
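To put a number on how far the predicted apex is from the actual one, the quadratic can be refit with raw (non-orthogonal) terms, whose coefficients give the vertex directly as -b1 / (2 * b2). This is only a sketch; fit2_raw, apex_x and actual_apex_x are names introduced here for illustration.

```r
# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

# Orthogonal-polynomial fit, as produced by poly()
fit2 <- lm(Sales ~ poly(Quantity, 2), data = data)

# Same quadratic written with raw terms: b0 + b1*x + b2*x^2
fit2_raw <- lm(Sales ~ Quantity + I(Quantity^2), data = data)

# Both parameterisations describe the same curve
print(all.equal(unname(fitted(fit2)), unname(fitted(fit2_raw))))

# Vertex of the parabola: x = -b1 / (2 * b2)
b <- coef(fit2_raw)
apex_x <- unname(-b["Quantity"] / (2 * b["I(Quantity^2)"]))

# Quantity at which the observed Sales peak
actual_apex_x <- data$Quantity[which.max(data$Sales)]
print(c(predicted = apex_x, actual = actual_apex_x))
```

Because a parabola is symmetric around its vertex, it cannot reproduce a steep rise followed by a slow decline, which is why the predicted apex drifts away from the observed one.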

The disadvantages of polynomial regression and the inadequacy of the linear model can be overcome by using Spline Regression.

Let us visualize the dataset by dividing it into two bins: one to the left of the peak that occurs at Quantity = 89, and the other to its right.

Now let’s combine the two bins into one equation and perform piecewise regression, or spline regression, using a step function.

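The equation would take the form of:

$$y = \beta_0 + \beta_1 x + \beta_2 (x - \bar{x}) x_k$$

In this case:

$$x_k = \begin{cases} 0, & x \le \bar{x} \\ 1, & x > \bar{x} \end{cases}$$

$\bar{x}$ (xbar) here is called the knot value, 89 in this example, and $x_k$ (Xk) is the step function that switches on the extra slope to the right of the knot.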

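# Step indicator (Xk in the equation): 1 to the right of the knot at 89, 0 otherwise
data$Xbar <- ifelse(data$Quantity>89,1,0)
# Distance of each observation from the knot
data$diff <- data$Quantity - 89
# Hinge term (x-xbar)*Xk: zero up to the knot, then grows linearly
data$X <- data$diff*data$Xbar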

data

After performing the above manipulation the data would look like this:

Let us now fit the equation we saw above:

The X in the model formula below is the hinge term (x-xbar)*Xk:

reg <- lm(Sales ~ Quantity + X, data = data)


plot_ly(data,x=~Quantity,
        y=~Sales,
        type="scatter") %>% add_lines(x =  ~Quantity, y = fitted(reg))

summary(reg)

As the plot and the R-squared values above show, spline regression produces a much better result in this scenario.
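The coefficients of this model have a direct reading: the coefficient on Quantity is the slope to the left of the knot, and adding the coefficient on X gives the slope to the right. The sketch below rebuilds the data so it runs on its own and builds the same hinge term in one step with pmax(); the names slope_left, slope_right and r2 are introduced here for illustration.

```r
# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

fit <- lm(Sales ~ Quantity, data = data)

# Hinge term: 0 up to the knot at 89, (Quantity - 89) beyond it
data$X <- pmax(data$Quantity - 89, 0)
reg <- lm(Sales ~ Quantity + X, data = data)

# Slope before the knot, and slope after it (base slope plus the change at the knot)
slope_left <- unname(coef(reg)["Quantity"])
slope_right <- unname(coef(reg)["Quantity"] + coef(reg)["X"])
print(c(left = slope_left, right = slope_right))

# R-squared of the straight line vs. the piecewise fit
r2 <- c(linear = summary(fit)$r.squared, piecewise = summary(reg)$r.squared)
print(round(r2, 3))
```

Because the two pieces share the same intercept and the hinge term is zero at the knot, the fitted line is continuous at Quantity = 89; only its slope changes there.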

The above results can also be obtained using the segmented package in R:

library(segmented)

fit_seg <- segmented(fit, seg.Z = ~Quantity, psi = list(Quantity=89))

plot_ly(data,x=~Quantity,
        y=~Sales,
        type="scatter") %>% add_lines(x =  ~Quantity, y = fitted(fit_seg))

Note: psi is only a starting value for the breakpoint, which segmented() then estimates from the data. If you have no starting value in mind (Quantity = 89, here), use “psi = NA”.

summary(fit_seg)

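Both methods produce essentially the same fit here. Small differences can arise because segmented() estimates the breakpoint instead of fixing it at 89.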
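To see what estimating a breakpoint from the data involves, the sketch below swaps in a plain grid search in base R: it tries each candidate knot, refits the hinge model, and keeps the knot with the smallest residual sum of squares. This is not how segmented works internally (it uses an iterative algorithm), only an illustration of the idea; the candidate grid and the names rss_at, candidates and best_knot are assumptions made for the example.

```r
# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

# Residual sum of squares of the piecewise-linear fit for a given knot k
rss_at <- function(k) {
  fit_k <- lm(Sales ~ Quantity + pmax(Quantity - k, 0), data = data)
  sum(residuals(fit_k)^2)
}

# Candidate knots, kept away from the edges so each side keeps at least two observations
candidates <- seq(45, 137, by = 1)
rss <- sapply(candidates, rss_at)

# Knot with the smallest residual sum of squares
best_knot <- candidates[which.min(rss)]
print(best_knot)

# Compare against the hand-picked knot at 89
print(c(rss_best = min(rss), rss_at_89 = rss_at(89)))
```

Since 89 is one of the candidates, the best knot found this way can never fit worse than the hand-picked one; how much better it fits shows how sensitive the result is to the choice of knot.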

This was one simple example of spline regression. Splines can also be fitted using polynomial functions, in which case they are called polynomial splines: instead of fitting a high-degree polynomial over the entire range of X, piecewise polynomial regression fits lower-degree polynomials in separate regions of X.
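As a sketch of a polynomial spline on the same data, the bs() function from the splines package (shipped with R) builds a piecewise-polynomial basis that can go straight into lm(). The degree of 2 and the knot at 89 are choices made for this example, and the names fit_bs1, fit_bs2 and grid are introduced here for illustration.

```r
library(splines)

# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

# Piecewise quadratic with one interior knot at 89
fit_bs2 <- lm(Sales ~ bs(Quantity, knots = 89, degree = 2), data = data)

# With degree = 1, bs() spans the same piecewise-linear functions as the hinge model
fit_bs1 <- lm(Sales ~ bs(Quantity, knots = 89, degree = 1), data = data)
reg <- lm(Sales ~ Quantity + pmax(Quantity - 89, 0), data = data)
print(all.equal(unname(fitted(fit_bs1)), unname(fitted(reg))))

# Evaluate the quadratic spline on a fine grid, e.g. for plotting a smooth curve
grid <- data.frame(Quantity = seq(25, 177, by = 1))
grid$Sales_hat <- predict(fit_bs2, newdata = grid)
head(grid)
```

The degree-1 comparison shows that the hand-built hinge model is simply the lowest-degree member of this family; raising the degree lets the pieces curve while still joining smoothly at the knot.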

CHOOSING THE LOCATION AND NUMBER OF THE KNOTS

Splines can be made more flexible by adding more knots. In general, placing K knots leads to the fitting of K + 1 functions. Where to place a knot depends on various factors. Since the regression is most flexible in regions with more knots, it is intuitive to place knots where the data varies more or where the function changes more rapidly. Regions that seem comparatively stable can use fewer knots.
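To make the relationship between K knots and K + 1 functions concrete, the sketch below adds a second knot to the piecewise-linear model. The knot locations (57 and 89) are hypothetical, picked only for illustration; each knot contributes one hinge term, and the slope of each segment is the running sum of the slope coefficients.

```r
# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

knots <- c(57, 89)  # hypothetical knot locations, K = 2

# One hinge column per knot: 0 up to the knot, then (Quantity - knot)
hinges <- sapply(knots, function(k) pmax(data$Quantity - k, 0))
colnames(hinges) <- paste0("hinge_", knots)
data2 <- cbind(data, hinges)

reg2 <- lm(Sales ~ Quantity + hinge_57 + hinge_89, data = data2)

# K = 2 knots give K + 1 = 3 linear segments; each hinge coefficient
# is the change in slope at its knot, so cumulative sums give segment slopes
segment_slopes <- cumsum(coef(reg2)[-1])
names(segment_slopes) <- c("below 57", "57 to 89", "above 89")
print(round(segment_slopes, 2))
```

Each extra knot costs one more coefficient, so with only 13 observations a couple of knots is about as far as this dataset can reasonably be pushed.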

CONCLUSION:

We learned about spline regression using a step function in this article. Other kinds of polynomial functions can be applied too. A common one is the cubic spline, which uses third-order polynomials. Yet another way of implementing splines is smoothing splines. Splines often provide better results than polynomial regression: flexibility can be increased by adding knots, without increasing the degree of the polynomial. They also generally produce more stable results than polynomial regression.
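For readers who want to try the two variants named above, here is a minimal sketch on the same data: a cubic regression spline via bs() from the splines package, and a smoothing spline via smooth.spline() from base R. The single knot at 89 is an assumption carried over from the example; smooth.spline() needs no hand-picked knots and, by default, chooses its smoothness by generalized cross-validation.

```r
library(splines)

# Same data as above, rebuilt so this snippet runs on its own
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
data <- data.frame(Quantity, Sales)

# Cubic regression spline: bs() defaults to degree 3
fit_cubic <- lm(Sales ~ bs(Quantity, knots = 89), data = data)

# Smoothing spline: no knots to choose, smoothness set by a roughness penalty
fit_smooth <- smooth.spline(data$Quantity, data$Sales)

# Predictions from both fits on a common grid
grid <- data.frame(Quantity = seq(25, 177, by = 1))
grid$cubic <- predict(fit_cubic, newdata = grid)
grid$smooth <- predict(fit_smooth, x = grid$Quantity)$y
head(grid)

# Effective degrees of freedom selected for the smoothing spline
print(fit_smooth$df)
```

The effective degrees of freedom give a rough sense of how wiggly the smoothing spline is: values near 2 mean it is close to a straight line, larger values mean it follows the data more closely.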

I hope this article was useful in grasping the idea of spline and piecewise regression and getting started with it.

REFERENCES:

[1] Andrew Wheeler. Splines in Regression. http://utdallas.edu/~Andrew.Wheeler/Splines.html

[2] Shokoufeh Mirzaei. How to Develop a Piecewise Linear Regression Model in R. https://www.youtube.com/watch?v=onfXC1qe7LI

[3] Breakpoint analysis, segmented regression. https://rpubs.com/MarkusLoew/12164

[4] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013.

Written by Trisha Chandra