
Splines in Python for Feature Selection and Data Smoothing

Describing and showing how to use Splines for dimensionality reduction and removing noise from datasets

(Image by author)

So this week I ended up doing some work with Splines in Python and was surprised by how little practical guidance exists for newcomers to Splines in Python. There is plenty of information on the math already. The goal of this article is to break down how to apply that theory with B-Splines and Smoothing Splines.

Please consider giving me a follow to support future articles if this is helpful.


WHAT IS A SPLINE?

A Spline is essentially a piecewise regression line. Trying to fit one regression line over a very dynamic set of data can lead to a lot of compromise: you can tailor your line to fit one area well, but it will then often fit other areas poorly as a consequence. Instead, we break the observation up at division points called "knots" and fit a separate regression line on each segment divvied up by these knots.

HOW ARE SPLINES USED?

Generally, the input for a spline is a 1D list or array. For example, we might record one patient's heart rate every second for 30 seconds. This creates one observation with 30 features: the heart rate recorded at each of those 30 seconds. We then fit a Spline to this one observation, specifying our knots (t) and order (k), which returns a line of best fit with its own coefficients that we can use as is or to make predictions.

If our goal was simply to smooth the data and remove noise, we could (see the sketch after this list):

  1. Fit/train the spline on the observation
  2. Extract the heart rate by feeding in the exact times we would like to predict the heart rate for (in this example, a list/array of range 0:30)
  3. Save this output as the new data for each of our 30 features
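Here is a minimal sketch of those three steps using scipy's splrep/splev. The heart_rate array is made-up stand-in data, and the smoothing value s is something you would normally tune:

import numpy as np
from scipy.interpolate import splrep, splev

# hypothetical observation: heart rate sampled once per second for 30 seconds
seconds = np.arange(30)
heart_rate = 70 + 5 * np.sin(seconds / 4) + np.random.normal(0, 1, 30)

# 1. fit/train the spline on the observation (s controls the amount of smoothing)
spl = splrep(seconds, heart_rate, s=30)

# 2. predict the heart rate at the exact times we care about (0 through 29)
smoothed = splev(seconds, spl)

# 3. save this output as the new, de-noised values for the 30 features
new_features = smoothed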

If, instead, our goal was to reduce the dimensionality of our data and do feature selection, we could (sketched after the list below):

  1. Fit/train the spline on the observation
  2. Extract the coefficients for our spline (Beta values across each knot) and save these as our new feature data (number of coefficients depends on the complexity of your spline, but you could reduce your features from 30 to 8 just as an example)
  3. Repeat steps 1 & 2 for multiple observations (multiple patients we have tracked the heart for over 30 seconds)
  4. Run a new model such as a regression or Random Forest on this new data frame with n patients as the rows and m extracted coefficients as the columns
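A comparable sketch of the feature-selection path, again on made-up heart rate data. Here I use scipy's make_lsq_spline with one fixed knot vector so every patient yields the same number of coefficients, and RandomForestRegressor is just one example of a downstream model:

import numpy as np
from scipy.interpolate import make_lsq_spline
from sklearn.ensemble import RandomForestRegressor

seconds = np.arange(30.0)
n_patients = 20
heart_rates = 70 + np.random.normal(0, 5, (n_patients, 30))  # hypothetical observations
outcomes = np.random.normal(0, 1, n_patients)                # hypothetical labels

# one fixed knot vector (cubic, boundary knots repeated) so each fit returns 8 coefficients
k = 3
interior = np.linspace(seconds[0], seconds[-1], 6)[1:-1]
t = np.r_[[seconds[0]] * (k + 1), interior, [seconds[-1]] * (k + 1)]

coef_rows = []
for hr in heart_rates:
    spline = make_lsq_spline(seconds, hr, t, k=k)  # steps 1-2: fit, then keep the Betas
    coef_rows.append(spline.c)

X_new = np.vstack(coef_rows)                           # step 3: 20 rows x 8 coefficient columns
model = RandomForestRegressor().fit(X_new, outcomes)   # step 4: model on the reduced features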

That’s enough mumbo jumbo, so let’s see it in action…

WALKTHROUGH 1: B-SPLINES FROM SCRATCH FOR THE MOST FLEXIBLE BUT MATHEMATICALLY HANDS-ON APPROACH

First, this is our function to evenly distribute the locations of our knots (and account for buffer knots, depending on the degree chosen) before we set up the basis for our splines.

import numpy as np
import pandas as pd
from scipy.interpolate import BSpline, splrep, splev, make_lsq_spline
from sklearn.model_selection import LeaveOneOut

def knot_points(nKnots, x, degree):
    # create evenly spaced knot locations across the range of x
    knots = np.linspace(x[0], x[-1], nKnots)

    # pad the knot vector with repeated boundary values (one extra copy per degree)
    lo = min(x[0], knots[0])
    hi = max(x[-1], knots[-1])
    augmented_knots = np.append(np.append([lo] * degree, knots), [hi] * degree)
    return augmented_knots

loo = LeaveOneOut()  # leave-one-out splitter used later for cross validation

The data we are using is a series of radio signals. Each observation is a distinct radio signal with each feature being the amplitude of the signal at a given point in time.

x1 = pd.read_csv("data.csv", header=None)
y = np.array(x1.values[0, :])   # one observation: the first row of the data
x = np.arange(x1.shape[1])      # one index per feature/column

Both x and y should have a shape of (51,), equal to the number of features/columns. Array x holds the feature indices (the time points), which we will use to specify the output of our splines and for graphing later on. Array y holds one observation with its respective amplitude measurements.

So let’s get started!

nknots = 8
degree = 3
k = degree
DOF = nknots + degree + 1
augmented_t = knot_points(nknots, x, degree)

We set a couple of variables we will use (DOF = Degrees of Freedom). Below we start to fit our Spline manually.

bs2 = BSpline(augmented_t, np.eye(DOF), degree, extrapolate=False)  # set up the basis functions
B = bs2(x)[:, :-2]  # evaluate the basis at x and drop the trailing zero columns left by the padding

# Least squares estimation
x_proj_matrix = B @ np.linalg.inv(B.T @ B) @ B.T   # the projection ("hat") matrix
coeff = np.linalg.lstsq(B, y.T, rcond=-1)[0].T
yhat = B @ coeff

# Pointwise confidence band around the fitted line
n = x.shape[0]
K = np.trace(x_proj_matrix)                         # effective degrees of freedom
sigma2 = (1 / (n - K)) * (y - yhat).T @ (y - yhat)  # SSE/(n - p)
width = np.diag(sigma2 * x_proj_matrix) ** 0.5
y_neg_ci = yhat - 4 * width
y_pos_ci = yhat + 4 * width
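In equation form, with B the basis matrix evaluated at x, the code above is computing roughly:

$$\hat{\beta} = (B^\top B)^{-1} B^\top y, \qquad \hat{y} = B\hat{\beta} = Hy, \qquad H = B (B^\top B)^{-1} B^\top$$

$$\hat{\sigma}^2 = \frac{(y - \hat{y})^\top (y - \hat{y})}{n - \operatorname{tr}(H)}, \qquad \text{width}_i = \sqrt{\hat{\sigma}^2\, H_{ii}}$$

and the plotted band is then $\hat{y}_i \pm 4\,\text{width}_i$.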

We fit the basis to our x and y to calculate the coefficients, which we can then use to make predictions. We can also use the projection (hat) matrix to calculate the covariances directly, which gives us the confidence intervals we see in the image below.

(Image by author)

So that is some of the math used to create these splines. Calculating this by hand may not be the most fun, but it lets you easily extract other powerful information, such as the covariances used to build these confidence intervals around our trend line.

WALKTHROUGH 2: SPLREP FROM SCIPY FOR EASY SMOOTHING B-SPLINES

Let’s look at Example 2 now, walking through an example using Smoothing Splines. These are a little more involved, as they include a smoothing hyperparameter that balances bias and variance. We can use leave-one-out cross validation to estimate the MSE for each candidate value and choose the optimal smoothing parameter. The benefit is that this function will automatically set up the knot array for us and return it too.

lambs = np.array([0.001, 0.005, 0.1, 0.25, 0.5])
for i in lambs:
    error = []
    # leave-one-out CV: hold out one point, fit on the rest, score the held-out point
    for trg, tst in loo.split(x):
        spl = splrep(x[trg], y[trg], s=i)
        pred = splev(x[tst], spl)[0]
        true = y[tst][0]
        error.append((pred - true) ** 2)
    mse = np.mean(error)
    print(i, mse)  # compare the cross-validated MSE for each smoothing value

Our best lambda (smoothing parameter) turned out to be 0.005.

spl = splrep(x, y, s=0.005)
y_hat = splev(x, spl)
(Image by author)

Something that is also useful is printing the fitted splrep output, as below, which gives us a clean view of the auto-generated knot array, the coefficients, and the degree of the spline.

spl
(Image by author)
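If you prefer named pieces, that tuple unpacks directly:

t, c, k = spl  # knot vector, spline coefficients, degree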

Now, if we wanted to apply all of the above to smooth our data, we could loop through each observation/row, fit a spline on the resulting x and y arrays for that iteration of the loop, and then predict the new feature values, as sketched below.
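A rough sketch of that loop, assuming x1 is the dataframe loaded earlier and reusing the s value found above (in practice you might re-tune s per row or on a sample of rows):

smoothed_rows = []
for _, row in x1.iterrows():
    y_i = row.values.astype(float)          # one observation (all 51 amplitudes)
    spl_i = splrep(x, y_i, s=0.005)         # fit a smoothing spline to this row
    smoothed_rows.append(splev(x, spl_i))   # predict the de-noised feature values

x1_smoothed = pd.DataFrame(smoothed_rows)   # same shape as x1, with the noise reduced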

Summarizing Splines for Feature Extraction

Now that ends the examples for smoothing. I will briefly summarize feature extraction here and go more in depth in my next article. Essentially, instead of predicting new feature values, we will just output the Betas/coefficients produced by the spline. For example, with the B-Splines above we can get the following output:

bs = make_lsq_spline(x, y, augmented_t, k=degree)
bs.c
(Image by author)

Now we can loop through our data frame, training a spline on each row and returning, instead of the original 51 features, just the 9 features made up of the Betas from our spline above. This can help us combat high dimensionality and overfitting on smaller data sets while still maintaining high explained variance. Overall, this spline feature selection approach can often match or beat even FPCA/PCA on model performance while requiring fewer features.
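A rough sketch of that loop, reusing x, augmented_t, and degree from Walkthrough 1 (the exact number of coefficients you get back depends on the knots and degree you chose):

coef_rows = []
for _, row in x1.iterrows():
    y_i = row.values.astype(float)
    bs_i = make_lsq_spline(x, y_i, augmented_t, k=degree)  # fit a B-Spline to this row
    coef_rows.append(bs_i.c)                               # keep only the coefficients (Betas)

X_features = pd.DataFrame(coef_rows)  # n observations as rows, spline coefficients as columns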

Thanks for reading and please give me a follow if you liked the article!

