
Gaussian Process Models

Simple Machine Learning Models Capable of Modelling Complex Behaviours

Gaussian process models are perhaps one of the less well-known machine learning algorithms compared to more popular ones such as linear regression models, tree-based models, or perceptron-based models. This is unfortunate, as Gaussian process models are among the few machine learning models that can be solved analytically while still being able to model relatively complex systems.

Despite their current relative lack of popularity in the wider machine learning and data science community, Gaussian process models have found various uses in areas such as geostatistics and the optimization of expensive-to-evaluate functions, for example the hyperparameter searches for deep learning neural networks, or the optimization of lasers. Today, I will show you how Gaussian process models work, and teach you how to create your own Gaussian process regression model using Python!

Arizona National Park. Photo by Andrew Coelho on Unsplash.

Gaussian Process models assume that the value of an observed target yₙ has the form:

yₙ = f(xₙ) + eₙ,

where f(xₙ) is some function giving rise to the observed targets, xₙ is the nth row of a set of φ inputs X = [x₁, x₂, … xᵩ]ᵀ, and eₙ is independent Gaussian noise. The conditional probability of observing yₙ given f(xₙ) is the normal distribution:

p(yₙ|f(xₙ)) = N(yₙ|f(xₙ), σ²),

where σ² is the variance of the noise eₙ. As the noise is assumed to be independent for each sample, the joint probability distribution of φ observed target values y = [y₁, y₂, … yᵩ]ᵀ conditioned on φ values f(X) = [f(x₁), f(x₂), … f(xᵩ)]ᵀ is defined to be the normal distribution:

p(y|f(X)) = N(y|f(X), Σ),

where Σ = σ²I is a diagonal matrix of size φ×φ.

In order to make predictions on y, we need to determine the marginal probability distribution p(y). This distribution can be obtained by marginalizing the conditional distribution p(y|f(X)) over the distribution p(f(X)) using the integral:

p(y) = ∫ p(y|f(X))·p(f(X)) df(X).

The distribution p(f(X)) is defined to be a Gaussian distribution with a mean of 0 and a covariance kernel matrix K of size φ×φ:

p(f(X)) = N(f(X)|0, K).

The covariance matrix K is built from the distances between pairs of rows in X, and encodes the assumption that similar inputs should give rise to similar target values in y. Each element of the matrix K is computed as:

K[n, m] = k(xₙ, xₘ),

where k is some kernel function to be defined later. Using the expression for p(f(X)) above, we can perform the integral involved in p(y) to obtain the solution:

p(y) = ∫ p(y|f(X))·p(f(X)) df(X) = ∫ N(y|f(X), Σ)·N(f(X)|0, K) df(X) = N(y|0, C),

where the resulting covariance matrix has the form C = K + Σ = K + σ²I. Therefore, each element of C can be written as: C[n, m] = k(xₙ, xₘ) + σ²δₙₘ.

The Quadratic Exponential Kernel

Various covariance kernel functions can be used for k(xₙ, xₘ), such as the constant kernel (which, as the name implies, is basically a constant), the widely used quadratic exponential kernel (also known as the radial basis function, or RBF, kernel), or the periodic kernel, which is normally used to model periodic data.

We will use the quadratic exponential kernel for the rest of this article. This kernel is calculated from pairs of samples (xₙ, xₘ) in X:

k(xₙ, xₘ) = exp(-||xₙ - xₘ||²/(2L²)),

where L is a kernel hyperparameter which we will set to 1 for convenience's sake.
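
As a quick worked example: for two one-dimensional inputs xₙ = 1 and xₘ = 3 with L = 1, the kernel evaluates to k(1, 3) = exp(-(1 - 3)²/2) = exp(-2) ≈ 0.135. We will meet this value again in the kernel matrix computed later in this article.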

Probability Distributions for New Observations

For φ observations of the target y = [y₁, y₂, … yᵩ]ᵀ corresponding to a set of φ inputs X = [x₁, x₂, … xᵩ]ᵀ, we want to predict the value of yᵩ₊₁ corresponding to some new input xᵩ₊₁. This step requires us to determine the parameters (i.e. the mean and covariance) of the probability distribution p(yᵩ₊₁|y).

In order to determine the parameters of p(yᵩ₊₁|y), we start from the distribution p(y′), where y′ = [y₁, y₂, … yᵩ, yᵩ₊₁]ᵀ is a vector of length φ+1. From the solution for p(y) above, we obtain for p(y′) the corresponding solution:

p(y′) = N(y′|0, C′),

where the new covariance matrix C′ of size (φ+1)×(φ+1) has the structure:

C′ = [[C , k],
      [kᵀ, c]],

where C = K + σ²I is the original φ×φ covariance matrix from above, k is a vector of length φ with elements given by k[n] = k(xₙ, xᵩ₊₁), and c is a scalar containing the covariance of xᵩ₊₁ with itself: c = k(xᵩ₊₁, xᵩ₊₁) + σ².
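
To make this structure concrete, here is a minimal sketch (my own illustration, not part of the derivation) that assembles C′ with NumPy for three one-dimensional inputs, assuming for simplicity that σ = 0 and using the quadratic exponential kernel with L = 1:

import numpy as np
# Quadratic exponential kernel for one-dimensional inputs, with L = 1.
def k(a, b):
    return np.exp(-(a - b) ** 2 / 2)

X = np.array([1.0, 3.0, 5.0])   # phi = 3 existing inputs
x_new = 5.5                     # new input x_{phi+1}
# Original phi x phi covariance matrix C (= K here, since sigma = 0).
C = np.array([[k(a, b) for b in X] for a in X])
# Vector of covariances between the old inputs and the new input.
k_vec = np.array([k(a, x_new) for a in X])
# Scalar covariance of the new input with itself (equals 1 here).
c = k(x_new, x_new)
# Bordered covariance matrix C' of size (phi+1) x (phi+1).
C_prime = np.block([[C, k_vec.reshape(-1, 1)],
                    [k_vec.reshape(1, -1), np.array([[c]])]])
print(C_prime.shape)   # (4, 4)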

Gaussian Process Predictions

As mentioned earlier, Gaussian processes are one of the few machine learning models that have an analytical solution, which is obtained from conditional probability as follows.

If we have some vector r of normally distributed N(r|μ, Σ) random variables, partitioned into two sub-vectors of arbitrary lengths, r = [rᵤ, rᵥ]ᵀ, then the mean μ(rᵤ|rᵥ) and covariance Σ(rᵤ|rᵥ) of the conditional distribution p(rᵤ|rᵥ) are given by:

μ(rᵤ|rᵥ) = μᵤ + ΣᵤᵥΣᵥᵥ⁻¹(rᵥ - μᵥ),

Σ(rᵤ|rᵥ) = Σᵤᵤ - ΣᵤᵥΣᵥᵥ⁻¹Σᵥᵤ,

where μᵤ/μᵥ are the vectors containing the means of the elements of rᵤ/rᵥ, and Σᵤᵥ is the covariance between the elements of rᵤ and rᵥ.
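
To make these two formulas concrete, here is a minimal numerical sketch (my own illustration, not part of the derivation) that conditions one component of a hand-picked two-dimensional Gaussian on the other:

import numpy as np
# Hand-picked 2D Gaussian with zero mean and correlated components.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
# Condition the first variable (r_u) on an observed value of the second (r_v).
r_v_observed = 1.2
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (r_v_observed - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_cond)    # 0.72 = 0.6 * 1.2
print(var_cond)   # 0.64 = 1 - 0.6 * 0.6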

For our case, rᵤ corresponds to the new observation yᵩ₊₁ while rᵥ corresponds to the old set of φ observations. Hence the covariance between the old and new observations Σᵥᵤ is the vector k with elements k[n] = k(xₙ, xᵩ₊₁), the covariance of the old observations Σᵥᵥ is the matrix C with elements C[n, m] = k(xₙ, xₘ) + σ²δₙₘ, the covariance of the new observation Σᵤᵤ is the scalar c = k(xᵩ₊₁, xᵩ₊₁) + σ², and rᵥ is the set of φ old observations y. According to our definitions above, μᵤ = μᵥ = 0.

Putting everything together, the conditional probability distribution p(yᵩ₊₁|y) is a Gaussian distribution with mean:

μ = kᵀC⁻¹y,

and variance:

s² = c - kᵀC⁻¹k.

Implementing Gaussian Process Models in Python

The analytical solutions above can be easily implemented in Python! First, we create a function which calculates the quadratic exponential kernel matrix for φ input features X = [x₁, x₂, … xᵩ]ᵀ:

K[n, m] = k(xₙ, xₘ) = exp(-||xₙ - xₘ||²/(2L²)).

Furthermore, to simplify things we assume that σ = 0 such that C = K.

import numpy as np
def RBF_kernel(xn, xm, l = 1):
    """
    Inputs:
        xn: row n of x
        xm: row m of x
        l:  kernel hyperparameter, set to 1 by default
    Outputs:
        K:  kernel matrix element: K[n, m] = k(xn, xm)
    """
    K = np.exp(-np.linalg.norm(xn - xm)**2 / (2 * l**2))
    return K
def make_RBF_kernel(X, l = 1, sigma = 0):
    """
    Inputs:
        X: set of φ rows of inputs
        l: kernel hyperparameter, set to 1 by default
        sigma: Gaussian noise std dev, set to 0 by default
    Outputs:
        K:  Covariance matrix 
    """
    K = np.zeros([len(X), len(X)])
    for i in range(len(X)):
        for j in range(len(X)):
            K[i, j] = RBF_kernel(X[i], X[j], l)
    return K + sigma**2 * np.eye(len(K))

The new prediction mean μ = kᵀC⁻¹y and variance s² = c - kᵀC⁻¹k can be calculated using the following functions.

def gaussian_process_predict_mean(X, y, X_new):
    """
    Inputs:
        X: set of φ rows of inputs
        y: set of φ observations 
        X_new: new input 
    Outputs:
        y_new: predicted target corresponding to X_new
    """
    rbf_kernel = make_RBF_kernel(np.vstack([X, X_new]))
    K = rbf_kernel[:len(X), :len(X)]
    k = rbf_kernel[:len(X), -1]
    return  np.dot(np.dot(k, np.linalg.inv(K)), y)
def gaussian_process_predict_std(X, X_new):
    """
    Inputs:
        X: set of φ rows of inputs
        X_new: new input
    Outputs:
        y_var: predicted variance corresponding to X_new
    """
    rbf_kernel = make_RBF_kernel(np.vstack([X, X_new]))
    K = rbf_kernel[:len(X), :len(X)]
    k = rbf_kernel[:len(X), -1]
    return rbf_kernel[-1,-1] - np.dot(np.dot(k,np.linalg.inv(K)),k)
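
As a side note, explicitly inverting C with np.linalg.inv is fine for a small example like ours, but for larger datasets it is generally more stable and efficient to solve the corresponding linear system instead. Here is a minimal alternative sketch of the mean prediction (my own variation, not part of the original code):

def gaussian_process_predict_mean_solve(X, y, X_new):
    """
    Same prediction as gaussian_process_predict_mean above, but using
    np.linalg.solve(K, y) instead of an explicit matrix inverse.
    """
    rbf_kernel = make_RBF_kernel(np.vstack([X, X_new]))
    K = rbf_kernel[:len(X), :len(X)]
    k = rbf_kernel[:len(X), -1]
    return np.dot(k, np.linalg.solve(K, y))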

Making Predictions with Gaussian Process Models

Now all we need to do is to test the algorithm out with some data! We use the simple function y = (x - 5)² evaluated at 5 training points, x = [1, 3, 5, 7, 9], and predict the value of y for the new input value of 5.5.

def f(x):
    return (x-5) ** 2
# Training data x and y:
X = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = f(X)
X = X.reshape(-1, 1)
# New input to predict:
X_new = np.array([5.5])
# Calculate and print the new predicted value of y:
mean_pred = gaussian_process_predict_mean(X, y, X_new)
print("mean predict :{}".format(mean_pred))
# Calculate and print the corresponding standard deviation:
sigma_pred = np.sqrt(gaussian_process_predict_std(X, X_new))
print("std predict :{}".format(sigma_pred))
mean predict :0.277673949912025
std predict :0.4150417380004999
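
For reference, the true value at the new input is f(5.5) = (5.5 - 5)² = 0.25, so the predicted mean of roughly 0.278 lies well within one predicted standard deviation (roughly 0.415) of the true value.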

We can also check the values of K used in our Gaussian process model.

for i in make_RBF_kernel(X):
    for j in i:
        print("{:.3f}".format(j), end = ", ")
    print()
1.000, 0.135, 0.000, 0.000, 0.000, 
0.135, 1.000, 0.135, 0.000, 0.000, 
0.000, 0.135, 1.000, 0.135, 0.000, 
0.000, 0.000, 0.135, 1.000, 0.135, 
0.000, 0.000, 0.000, 0.135, 1.000,

As a sanity check, we can use GaussianProcessRegressor from sklearn to check that our algorithm works!

from sklearn.gaussian_process import GaussianProcessRegressor
gpr = GaussianProcessRegressor()
gpr.fit(X, y)
gpr_mean, gpr_std = gpr.predict(X_new.reshape(-1, 1), 
                                return_std = True)
print("sklearn pred: {}".format(gpr_mean))
print("sklearn std: {}".format(gpr_std))
sklearn pred: [0.27767395]
sklearn std: [0.41504174]

Likewise, we can also check the values of the kernel matrix used by sklearn. Note that sklearn uses the Cholesky decomposition of the original kernel matrix in order to optimize the calculations using the equation: K = LLᵀ.

for i in np.dot(gpr.L_, gpr.L_.T):
    for j in i:
        print("{:.3f}".format(j), end = ", ")
    print()
1.000, 0.135, 0.000, 0.000, 0.000, 
0.135, 1.000, 0.135, 0.000, 0.000, 
0.000, 0.135, 1.000, 0.135, 0.000, 
0.000, 0.000, 0.135, 1.000, 0.135, 
0.000, 0.000, 0.000, 0.135, 1.000,

Both our model and sklearn's model share the same prediction of 0.278 and standard deviation of 0.415 for the new input x = 5.5! The kernel matrices are the same too!
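
We can also verify the K = LLᵀ relationship directly on our own kernel matrix; a quick sanity check (a minimal sketch) using NumPy's Cholesky routine:

# Cholesky factor of our own kernel matrix (sigma = 0, so C = K).
L_chol = np.linalg.cholesky(make_RBF_kernel(X))
# Multiplying the factor by its transpose should reproduce K
# (up to floating point error).
print(np.allclose(np.dot(L_chol, L_chol.T), make_RBF_kernel(X)))   # True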

Making Predictions Outside the Training Data

Now let us see what happens if we use a value of x outside the range of the training data. For example, what happens if x = 15?

X_new = np.array([15])
mean_pred = gaussian_process_predict_mean(X, y, X_new)
print("mean predict :{}".format(mean_pred))
sigma_pred = np.sqrt(gaussian_process_predict_std(X, X_new))
print("std predict :{}".format(sigma_pred))
mean predict :2.396794716305008e-07
std predict :0.9999999999999999

We find that the standard deviation of the prediction is now much larger, and the predicted value is almost 0. This happens because the kernel vector k is essentially zero when the new input is far from every training point, so the predicted mean kᵀC⁻¹y collapses to the prior mean of 0 while the variance approaches the prior value c = k(xᵩ₊₁, xᵩ₊₁) = 1. While Gaussian process models are very good at interpolating data within the range of the training set, they are not good at extrapolating outside of that range.

Plotting the Model’s 95% Confidence Intervals

Using the predicted means and the corresponding standard deviations, we can create and plot the model’s 95% confidence intervals for an entire range of input values of x.

import matplotlib.pyplot as plt
# Range of x to obtain the confidence intervals.
x = np.linspace(0, 10, 1000)
# Obtain the corresponding mean and standard deviations.
y_pred = []
y_std = []
for i in range(len(x)):
    X_new = np.array([x[i]])
    y_pred.append(gaussian_process_predict_mean(X, y, X_new))
    y_std.append(np.sqrt(gaussian_process_predict_std(X, X_new)))

y_pred = np.array(y_pred)
y_std = np.array(y_std)
plt.figure(figsize = (15, 5))
plt.plot(x, f(x), "r")
plt.plot(X, y, "ro")
plt.plot(x, y_pred, "b-")
plt.fill(np.hstack([x, x[::-1]]),
         np.hstack([y_pred - 1.9600 * y_std, 
                   (y_pred + 1.9600 * y_std)[::-1]]),
         alpha = 0.5, fc = "b")
plt.xlabel("$x$", fontsize = 14)
plt.ylabel("$f(x)$", fontsize = 14)
plt.legend(["$y = (x-5)^2$", "Observations", "Predictions", "95% Confidence Interval"], fontsize = 14)
plt.grid(True)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()
95% confidence intervals for the Gaussian process model.

Instead of checking the standard deviation of each new input individually, using a graphical plot like this allows us to easily view where the model is most and least confident about its predictions!

More Advanced Covariance Kernels

In this article we used the quadratic exponential kernel to calculate our covariance kernel matrix K. However, many other kernel functions exist which might lead to better performance for certain types of data. For example, there is a periodic kernel which performs very well on periodic data! If you would like to find out more about these more advanced kernels, please read my article on Gaussian process kernels!
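
As a small taste, here is a minimal sketch of one common form of the periodic kernel (the exponential-sine-squared kernel); the exact parameterization varies between libraries, so treat this as illustrative rather than definitive:

def periodic_kernel(xn, xm, p = 1, l = 1):
    """
    One common form of the periodic (exp-sine-squared) kernel.
    Inputs:
        xn, xm: two input samples
        p:      period of the data
        l:      kernel length scale
    """
    d = np.linalg.norm(xn - xm)
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / l ** 2)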

Closing Remarks

Today in this post we explored how Gaussian processes work, and created our own Gaussian process regression model using Python! Gaussian process models are extremely powerful and are widely used in both academia and industry. As an example of an industrial application, in a future post, I will show you how to use a Gaussian process based optimizer to determine the best lasing parameters in order to obtain a specific power output of a laser!


