Kernel Regression from Scratch in Python

Everyone knows Linear Regression, but do you know Kernel Regression?

Kunj Mehta
Towards Data Science


Every beginner in Machine Learning starts by studying what regression means and how the linear regression algorithm works. In fact, the ease of understanding, explainability and the vast effective real-world use cases of linear regression is what makes the algorithm so famous. However, there are some situations to which linear regression is not suited. In this article, we will see what these situations are, what the kernel regression algorithm is and how it fits into the scenario. Finally, we will code the kernel regression algorithm with a Gaussian kernel from scratch. Basic knowledge of Python and numpy is required to follow the article.

Photo by Clarisse Croset on Unsplash

Brief Recap on Linear Regression

Given data in the form of N feature vectors x = [x₁, x₂, …, xₙ] consisting of n features and the corresponding label vector y, linear regression tries to fit a line that best describes the data. For this, it tries to find the optimal coefficients cᵢ, i ∈ {0, …, n}, of the line equation y = c₀ + c₁x₁ + c₂x₂ + … + cₙxₙ, usually by gradient descent, with model accuracy measured using the RMSE metric. The equation obtained is then used to predict the target yₜ for a new, unseen input vector xₜ.
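As a quick illustration (this snippet and its coefficient values are made up, not from the original article), the fitted equation is used for prediction simply by taking a dot product of the learned coefficients with the new feature vector:

import numpy as np

# hypothetical learned coefficients [c0, c1, c2, c3] for n = 3 features
c = np.array([1.5, 0.8, -0.2, 0.05])
# a new, unseen input vector x_t
x_t = np.array([2.0, 3.0, 7.0])

# y_t = c0 + c1*x1 + c2*x2 + c3*x3
y_t = c[0] + np.dot(c[1:], x_t)
print(y_t)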

Linear regression is a simple algorithm that cannot model very complex relationships between the features. Mathematically, this is because it is linear: the degree of the equation is 1, which means linear regression will always model a straight line. Indeed, this linearity is the weakness of the linear regression algorithm. Why?

Well, let’s consider a situation where our data doesn’t have the form of a straight line: take data generated using the function f(x) = x³. If we use linear regression to fit a model to this data, we will never get anywhere close to the true cubic function, because the equation whose coefficients we are finding has no cubic term! So, for any data not generated using a linear function, linear regression is very likely to underfit. What do we do then?
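To make this concrete, here is a minimal sketch (not part of the original article) that generates data from f(x) = x³ and fits a straight line to it with numpy; the large RMSE shows the underfit:

import numpy as np

# data generated from f(x) = x^3
x = np.linspace(-10, 10, 50)
y = x ** 3

# fit a degree-1 polynomial, i.e. a straight line
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x

# the line cannot capture the cubic shape, so the error stays large
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print(rmse)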

We can use another type of regression called polynomial regression, which (as the name suggests) tries to find the optimal coefficients of a polynomial equation of degree n, with n > 1. However, polynomial regression introduces another problem: as a data analyst, you cannot know in advance what the degree of the equation should be for the resulting model to fit the data best. This can only be determined by trial and error, which is made harder by the fact that, above degree 3, the model built using polynomial regression is difficult to visualize.
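One common way to run this trial and error (again, a sketch on assumed noisy data, not from the original article) is to compare the validation RMSE across candidate degrees:

import numpy as np

# assumed noisy data; in practice this would be your real dataset
x = np.linspace(-10, 10, 100)
y = x ** 3 + np.random.normal(0, 50, size=x.shape)

# simple train/validation split
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

# try increasing degrees and keep the one with the lowest validation RMSE
for degree in range(1, 6):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    val_pred = np.polyval(coeffs, x_val)
    rmse = np.sqrt(np.mean((y_val - val_pred) ** 2))
    print(degree, rmse)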

This is where kernel regression can come to the rescue!

What is Kernel Regression?

Seeing the name, you may ask: if ‘linear’ in linear regression meant a linear function and ‘polynomial’ in polynomial regression meant a polynomial function, what does ‘kernel’ mean? It turns out it means a kernel function! So, what is a kernel function? Simply put, it is a similarity function: it takes two inputs and spits out how similar they are. We will see shortly how a kernel function is used in kernel regression.

Now, about kernel regression. Unlike linear and polynomial regression, in which an optimal parameter vector c = [c₀, c₁, …, cₙ] needs to be learnt, kernel regression is non-parametric: there are no fixed coefficients to learn, and the target yₜ is calculated by performing computations directly on the training data and the input xₜ.

How?

Given data points (xᵢ, yᵢ), kernel regression makes a prediction by first constructing a kernel k for each data point xᵢ. Then, for a given new input xₜ, it computes a similarity score with each xᵢ (based on the difference xᵢ − xₜ) using the kernel; the similarity score acts as a weight wᵢ that represents the importance of that kernel (and of the corresponding label yᵢ) in predicting the target yₜ. The prediction is then obtained by taking the dot product of the weight vector w = [w₁, w₂, …, wₙ] with the label vector y = [y₁, y₂, …, yₙ].

Image by Author: Kernel Regression in Equations
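For readers who cannot see the image, a sketch of these equations in the standard Nadaraya-Watson form, which is also what the code below implements (b is the bandwidth hyperparameter introduced in the next paragraph):

$$ w_i = \frac{k\left(\frac{x_i - x_t}{b}\right)}{\sum_{j=1}^{N} k\left(\frac{x_j - x_t}{b}\right)}, \qquad y_t = \sum_{i=1}^{N} w_i \, y_i $$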

Now, there can be different kernel functions, which give rise to different types of kernel regression. One such type is Gaussian kernel regression, in which the shape of the constructed kernel is the Gaussian curve, also known as the bell-shaped curve. In the context of Gaussian kernel regression, each constructed kernel can also be viewed as a normal distribution with mean xᵢ and standard deviation b. Here, b is a hyperparameter that controls the shape of the curve, in particular the width of the Gaussian bell. The equation for the Gaussian kernel k is given below. Notice the similarity between this equation and that of the Gaussian (also called normal) distribution.

Image by Author: Gaussian Kernel Equation
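Written out as a formula, and matching what the code below computes (the constant factor in front cancels when the weights are normalized), the kernel under these assumptions is:

$$ k(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right), \qquad z = \frac{x_i - x_t}{b} $$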

We will code this type of kernel regression next.

Coding Gaussian Kernel Regression

We will first look at the case of a one-dimensional feature vector and then extend it to n dimensions.

from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

class GKR:

    def __init__(self, x, y, b):
        self.x = x
        self.y = y
        self.b = b

    '''Implement the Gaussian Kernel'''
    def gaussian_kernel(self, z):
        return (1/math.sqrt(2*math.pi))*math.exp(-0.5*z**2)

    '''Calculate weights and return prediction'''
    def predict(self, X):
        kernels = [self.gaussian_kernel((xi-X)/self.b) for xi in self.x]
        weights = [len(self.x) * (kernel/np.sum(kernels)) for kernel in kernels]
        return np.dot(weights, self.y)/len(self.x)

We define a class for Gaussian kernel regression which takes in the feature vector x, the label vector y and the hyperparameter b during initialization. Inside the class, we define a function gaussian_kernel() that implements the Gaussian kernel. You can see that we just write out the mathematical equation as code. Next, we define the function predict() that takes in the feature vector xₜ (referred to in code as X) whose target value has to be predicted. Inside the function, we construct kernels for each xᵢ, calculate the weights and return the prediction, again by translating the mathematical equations into code as-is.

Image by Author: Visualizing the Different Constructed Kernels

Now, let’s pass in some dummy data and see the prediction that is output. We predict the value for xₜ = 50 (ignoring, for demonstration purposes, that it is already present in the training data).

gkr = GKR([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120], [2337, 2750, 2301, 2500, 1700, 2100, 1100, 1750, 1000, 1642, 2000, 1932], 10)
gkr.predict(50)

This gives us an output of 1995.285.

Image by Author: Graphically, we can observe that the weights wᵢ for xₜ = 50 are the points where perpendiculars dropped from the intersections of the different kernels with the dotted line meet the y-axis
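As a small experiment not included in the original article, you could re-run the prediction with a few different values of b; a small bandwidth lets the nearest training points dominate, while a large one smooths the prediction towards the overall average of y:

# illustrative sweep over the bandwidth hyperparameter b
x_train = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
y_train = [2337, 2750, 2301, 2500, 1700, 2100, 1100, 1750, 1000, 1642, 2000, 1932]

for b in [1, 5, 10, 25, 100]:
    gkr = GKR(x_train, y_train, b)
    print(b, gkr.predict(50))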

Now, let’s extend the code to the case of n-dimensional feature vectors. The only modification we need to make is in the similarity score calculation. Instead of taking the difference between xᵢ and xₜ, we calculate the similarity score in the n-dimensional case using the Euclidean distance ||xᵢ − xₜ|| between them. Note that, for the purposes of handling n-dimensional vectors, we use numpy wherever needed.

from scipy.stats import multivariate_normal

'''Class for Gaussian Kernel Regression'''
class GKR:

    def __init__(self, x, y, b):
        self.x = np.array(x)
        self.y = np.array(y)
        self.b = b

    '''Implement the Gaussian Kernel'''
    def gaussian_kernel(self, z):
        return (1/np.sqrt(2*np.pi))*np.exp(-0.5*z**2)

    '''Calculate weights and return prediction'''
    def predict(self, X):
        kernels = np.array([self.gaussian_kernel((np.linalg.norm(xi-X))/self.b) for xi in self.x])
        weights = np.array([len(self.x) * (kernel/np.sum(kernels)) for kernel in kernels])
        return np.dot(weights.T, self.y)/len(self.x)

Image by Author: Visualizing the Constructed 3D Gaussian Kernels

Again, let’s pass in some 2D dummy data and predict for xₜ = [20, 40].

gkr = GKR([[11, 15], [22, 30], [33, 45], [44, 60], [50, 52], [67, 92], [78, 107], [89, 123], [100, 137]], [2337, 2750, 2301, 2500, 1700, 1100, 1000, 1642, 1932], 10)
gkr.predict([20, 40])

We get yₜ = 2563.086.

The extended code (including for visualizations) for this article can be found on GitHub and Kaggle.

Conclusion

We saw where and why linear regression and polynomial regression cannot be used, and with that background understood the intuition behind kernel regression, how it works, and how it can be used as an alternative. We then went into the details of Gaussian kernel regression and coded it from scratch in Python by simply translating the mathematical equations into code.

I would love to connect with you on LinkedIn!

