Gaussian process models are perhaps among the less well known machine learning algorithms compared to more popular ones such as linear regression, tree-based models, or perceptron-based models. This is unfortunate, as Gaussian process models are among the few machine learning models that can be solved analytically while still being able to model relatively complex systems.
Despite this relative lack of popularity in the wider machine learning and data science community, Gaussian process models have found various uses in areas such as geostatistics and the optimization of expensive-to-evaluate functions, for example hyperparameter searches for deep neural networks, or the optimization of lasers. Today, I will show you how Gaussian process models work, and teach you how to create your own Gaussian process regression model using Python!

Gaussian Process models assume that the value of an observed target yₙ has the form:
yₙ = f(xₙ) + eₙ,
where f(xₙ) is some function giving rise to the observed targets, xₙ is the nth row of a set of φ inputs x = [x₁, x₂, … xᵩ]ᵀ, and eₙ is independent Gaussian noise. The conditional probability of observing yₙ given f(xₙ) is the normal distribution:
p(yₙ|f(xₙ)) = N(yₙ|f(xₙ), σ²),
where σ is the standard deviation of eₙ. As the noise is assumed to be independent for each sample, the joint probability distribution of φ observed target values y = [y₁, y₂, … yᵩ]ᵀ conditioned on φ values of f(x) = [f(x₁), f(x₂), … f(xᵩ)]ᵀ is defined to be the normal distribution:
p(y|f(x)) = N(y|f(x), σ²I),
where σ²I is a diagonal matrix of size φ×φ.
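To make this generative assumption concrete, here is a minimal sketch (the function f and the noise level below are illustrative choices only, not part of the derivation) that draws noisy targets yₙ = f(xₙ) + eₙ with numpy:
import numpy as np

# Illustrative generative model: each target is the function value plus
# independent Gaussian noise with standard deviation sigma.
rng = np.random.default_rng(0)
f = lambda x: np.sin(x)       # placeholder for the unknown function f
x = np.array([0.5, 1.0, 1.5, 2.0])
sigma = 0.1                   # noise standard deviation
y = f(x) + rng.normal(0, sigma, size=len(x))   # y_n = f(x_n) + e_n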
In order to make predictions on y, we need to determine the marginal probability distribution p(y). This probability distribution can be obtained by marginalizing the conditional distribution p(y|f(x)) over the distribution p(f(x)) using the integral:
p(y) = ∫ p(y|f(x))·p(f(x)) df(x).
The distribution p(f(x)) is defined to be a Gaussian distribution with a mean of 0 and covariance kernel matrix K of size φ×φ:
p(f(x)) = N(f(x)|0, K).
The covariance matrix K is computed from pairs of rows in x, and encodes the assumption that similar inputs should give rise to similar target values in y. Each element in the matrix K is computed as:
K[n, m] = k(xₙ, xₘ),
where k is some function to be defined later. Using the equation for p(f(x)) above, we can perform the integral involved in p(y) to obtain the solution:
p(y) = ∫ p(y|f(x))·p(f(x)) df(x) = ∫ N(y|f(x), σ²I)·N(f(x)|0, K) df(x) = N(y|0, C),
where the resulting covariance matrix has the form C = K + σ²I. Therefore, each element in C can be written as: C[n, m] = k(xₙ, xₘ) + σ²δₙₘ.
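Before choosing a specific kernel, it helps to see what p(y) = N(y|0, C) looks like in practice. The short sketch below draws one sample of y from this prior, using the quadratic exponential kernel introduced in the next section (with a length scale of 1) and an illustrative small noise variance; both choices are assumptions made purely for this example:
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
sigma2 = 1e-4                                    # illustrative noise variance
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)    # k(x_n, x_m) with L = 1
C = K + sigma2 * np.eye(len(x))
y_sample = rng.multivariate_normal(np.zeros(len(x)), C)   # one draw from p(y)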
The Quadratic Exponential Kernel
Various covariance kernel functions k(xₙ, xₘ) can be used, such as the constant kernel (which, as the name implies, is basically a constant), the widely used quadratic exponential kernel (also known as the radial basis function (RBF) kernel), or the periodic kernel, which is normally used to model periodic data.
We will use the quadratic exponential kernel for the rest of this article. This kernel is calculated from pairs of samples (xₙ, xₘ) in x:
k(xₙ, xₘ) = exp(-||xₙ – xₘ||² / 2L²),
where L is a kernel hyperparameter which we will set to 1 for convenience’s sake.
Probability Distributions for New Observations
For φ observations of the target y = [y₁, y₂, … yᵩ]ᵀ corresponding to a set of φ inputs x = [x₁, x₂, … xᵩ]ᵀ, we want to predict the value of yᵩ₊₁ corresponding to some new input xᵩ₊₁. This step requires us to determine the parameters (i.e. the mean and covariance) of the probability distribution p(yᵩ₊₁|y).
In order to determine the parameters of p(yᵩ₊₁|y), we start from the distribution p(y′), where y′ = [y₁, y₂, … yᵩ, yᵩ₊₁]ᵀ is a vector of length φ+1. From the solution for p(y) above, we obtain for p(y′) the corresponding solution:
p(y′) = N(y′|0, C′),
where the new covariance matrix C′ of size (φ+1)×(φ+1) has the block structure:
C′ = [[C , k],
      [kᵀ, c]],
where C = K + σ²I is the original φ×φ covariance matrix from above, k is a vector of length φ with elements given by k[n] = k(xₙ, xᵩ₊₁), and c is a scalar containing the covariance of xᵩ₊₁ with itself: c = k(xᵩ₊₁, xᵩ₊₁) + σ².
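As a small illustration of this block structure (using placeholder values for C, k and c rather than anything computed above), the matrix C′ can be assembled with numpy as follows:
import numpy as np

phi = 3
C = np.eye(phi)           # placeholder for K + σ²I
k = np.full(phi, 0.1)     # placeholder for k[n] = k(x_n, x_{φ+1})
c = 1.0                   # placeholder for k(x_{φ+1}, x_{φ+1}) + σ²
C_prime = np.block([[C,           k[:, None]],
                    [k[None, :],  np.array([[c]])]])
print(C_prime.shape)      # (phi + 1, phi + 1)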
Gaussian Process Predictions
As mentioned earlier, Gaussian processes are one of the few machine learning models with an analytical solution, which we obtain from the properties of conditional Gaussian distributions as follows.
If we have some vector r of normally distributed N(r|μ, Σ) random variables, partitioned into two sub-vectors of arbitrary lengths r = [rᵤ, rᵥ]ᵀ, then for the conditional distribution p(rᵤ|rᵥ) the mean μ(rᵤ|rᵥ) and covariance Σ(rᵤ|rᵥ) are given by:
μ(rᵤ|rᵥ) = μᵤ + ΣᵤᵥΣᵥᵥ⁻¹(rᵥ – μᵥ),
Σ(rᵤ|rᵥ) = Σᵤᵤ – ΣᵤᵥΣᵥᵥ⁻¹Σᵥᵤ,
where μᵤ and μᵥ are the vectors containing the means of the elements of rᵤ and rᵥ respectively, and Σᵤᵥ is the covariance between the elements of rᵤ and rᵥ.
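As a quick numerical illustration of these two formulas (with arbitrary example numbers that have nothing to do with the model above), the conditioning can be written out directly in numpy:
import numpy as np

mu_u, mu_v = np.array([0.0]), np.array([0.0, 0.0])
S_uu = np.array([[1.0]])                       # covariance of r_u
S_uv = np.array([[0.5, 0.3]])                  # covariance between r_u and r_v
S_vv = np.array([[1.0, 0.2], [0.2, 1.0]])      # covariance of r_v
r_v = np.array([0.7, -0.1])                    # observed values of r_v
S_vv_inv = np.linalg.inv(S_vv)
mu_cond = mu_u + S_uv @ S_vv_inv @ (r_v - mu_v)     # mean of p(r_u|r_v)
S_cond = S_uu - S_uv @ S_vv_inv @ S_uv.T            # covariance of p(r_u|r_v)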
For our case, rᵤ corresponds to the new observation while rᵥ corresponds to the old set of φ observations. Hence the covariance between the old and new observations Σᵥᵤ is the vector k with elements k[n] = k(xₙ, xᵩ₊₁), the covariance of the old observations Σᵥᵥ is the matrix C with elements C[n, m] = k(xₙ, xₘ) + σ²δₙₘ, the covariance of the new observation Σᵤᵤ is the scalar c = k(xᵩ₊₁, xᵩ₊₁) + σ², and rᵥ is the set of φ old observations y. According to our definitions above, μᵤ = μᵥ = 0.
Putting everything together, the conditional probability distribution p(yᵩ₊₁|y) is a Gaussian distribution with mean:
μ = kᵀC⁻¹y,
and variance:
s² = c – kᵀC⁻¹k.
Implementing Gaussian Process Models in Python
The analytical solutions above can be easily implemented in Python! First, we create a function which calculates the quadratic exponential kernel matrix for φ input features x = [x₁, x₂, … xᵩ]ᵀ:
K[n, m] = k(xₙ, xₘ) = exp(-||xₙ – xₘ||² / 2L²).
Furthermore, to simplify things we assume that the noise variance σ² = 0, such that C = K.
import numpy as np

def RBF_kernel(xn, xm, l = 1):
    """
    Inputs:
        xn: row n of x
        xm: row m of x
        l: kernel hyperparameter, set to 1 by default
    Outputs:
        K: kernel matrix element: K[n, m] = k(xn, xm)
    """
    K = np.exp(-np.linalg.norm(xn - xm)**2 / (2 * l**2))
    return K

def make_RBF_kernel(X, l = 1, sigma = 0):
    """
    Inputs:
        X: set of φ rows of inputs
        l: kernel hyperparameter, set to 1 by default
        sigma: Gaussian noise variance (σ²), set to 0 by default
    Outputs:
        K: covariance matrix of size φ×φ
    """
    K = np.zeros([len(X), len(X)])
    for i in range(len(X)):
        for j in range(len(X)):
            K[i, j] = RBF_kernel(X[i], X[j], l)
    # Add the noise variance to the diagonal: C = K + σ²I
    return K + sigma * np.eye(len(K))
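As a quick check of these functions, two identical inputs should give a kernel value of exactly 1, while two inputs a distance of 2 apart (like the neighbouring training points used later) should give exp(-2) ≈ 0.135:
# Sanity check of the kernel: k(x, x) = 1 and k decays with distance.
print(RBF_kernel(np.array([1.0]), np.array([1.0])))   # 1.0
print(RBF_kernel(np.array([1.0]), np.array([3.0])))   # exp(-2) ≈ 0.135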
The new prediction mean μ = kᵀC⁻¹y and variance s² = c – kᵀC⁻¹k can be calculated using the following functions.
def gaussian_process_predict_mean(X, y, X_new):
    """
    Inputs:
        X: set of φ rows of inputs
        y: set of φ observations
        X_new: new input
    Outputs:
        y_new: predicted mean corresponding to X_new
    """
    rbf_kernel = make_RBF_kernel(np.vstack([X, X_new]))
    K = rbf_kernel[:len(X), :len(X)]
    k = rbf_kernel[:len(X), -1]
    return np.dot(np.dot(k, np.linalg.inv(K)), y)

def gaussian_process_predict_std(X, X_new):
    """
    Inputs:
        X: set of φ rows of inputs
        X_new: new input
    Outputs:
        y_var: predictive variance s² corresponding to X_new
               (take the square root to obtain the standard deviation)
    """
    rbf_kernel = make_RBF_kernel(np.vstack([X, X_new]))
    K = rbf_kernel[:len(X), :len(X)]
    k = rbf_kernel[:len(X), -1]
    return rbf_kernel[-1, -1] - np.dot(np.dot(k, np.linalg.inv(K)), k)
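These functions invert C directly with np.linalg.inv, which is perfectly fine for the handful of points used in this article. For larger or noisier datasets, a more numerically stable option is to solve a linear system instead of forming the inverse explicitly. A minimal sketch of the mean prediction written this way (a variant for illustration, not the version used in the rest of this article) could look like:
def gaussian_process_predict_mean_solve(X, y, X_new):
    # Same prediction as above, but solving C a = y instead of inverting C.
    rbf_kernel = make_RBF_kernel(np.vstack([X, X_new]))
    K = rbf_kernel[:len(X), :len(X)]
    k = rbf_kernel[:len(X), -1]
    return np.dot(k, np.linalg.solve(K, y))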
Making Predictions with Gaussian Process Models
Now all we need to do is to test the algorithm out with some data! We use a simple equation y = (x – 5)² with 5 data points: [1, 3, 5, 7, 9]. We predict the value of y for the new input value of 5.5.
def f(x):
    return (x - 5) ** 2
# Training data x and y:
X = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = f(X)
X = X.reshape(-1, 1)
# New input to predict:
X_new = np.array([5.5])
# Calculate and print the new predicted value of y:
mean_pred = gaussian_process_predict_mean(X, y, X_new)
print("mean predict :{}".format(mean_pred))
# Calculate and print the corresponding standard deviation:
sigma_pred = np.sqrt(gaussian_process_predict_std(X, X_new))
print("std predict :{}".format(sigma_pred))
mean predict :0.277673949912025
std predict :0.4150417380004999
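For reference, the true function value at the new input is f(5.5) = (5.5 – 5)² = 0.25, so the predicted mean of roughly 0.278 lies well within one standard deviation of the truth:
print("true value :{}".format(f(5.5)))   # 0.25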
We can also check the values of K used in our Gaussian process model.
for i in make_RBF_kernel(X):
    for j in i:
        print("{:.3f}".format(j), end = ", ")
    print()
1.000, 0.135, 0.000, 0.000, 0.000,
0.135, 1.000, 0.135, 0.000, 0.000,
0.000, 0.135, 1.000, 0.135, 0.000,
0.000, 0.000, 0.135, 1.000, 0.135,
0.000, 0.000, 0.000, 0.135, 1.000,
As a sanity check, we can use GaussianProcessRegressor
from sklearn
to check that our algorithm works!
from sklearn.gaussian_process import GaussianProcessRegressor
gpr = GaussianProcessRegressor()
gpr.fit(X, y)
gpr_mean, gpr_std = gpr.predict(X_new.reshape(-1, 1),
                                return_std = True)
print("sklearn pred: {}".format(gpr_mean))
print("sklearn std: {}".format(gpr_std))
sklearn pred: [0.27767395]
sklearn std: [0.41504174]
Likewise, we can also check the values of the kernel matrix used by sklearn. Note that sklearn uses the Cholesky decomposition of the original kernel matrix in order to optimize the calculations, using the relation K = LLᵀ.
for i in np.dot(gpr.L_, gpr.L_.T):
    for j in i:
        print("{:.3f}".format(j), end = ", ")
    print()
1.000, 0.135, 0.000, 0.000, 0.000,
0.135, 1.000, 0.135, 0.000, 0.000,
0.000, 0.135, 1.000, 0.135, 0.000,
0.000, 0.000, 0.135, 1.000, 0.135,
0.000, 0.000, 0.000, 0.135, 1.000,
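A more direct check (a small extra step, assuming sklearn's default kernel here behaves as the printed matrix suggests, i.e. an RBF with length scale 1) is to reconstruct LLᵀ and compare it with our own kernel matrix:
# LLᵀ should match our kernel matrix up to sklearn's tiny default jitter
# (alpha = 1e-10) added to the diagonal.
print(np.allclose(np.dot(gpr.L_, gpr.L_.T), make_RBF_kernel(X)))   # True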
Both our model and sklearn's model share the same prediction of 0.278 and standard deviation of 0.415 for the new input x = 5.5! The kernel matrices are the same too!
Making Predictions Outside the Training Data
Now let us see what happens if we use a value of x outside the range of the training data. For example, what happens if x = 15?
X_new = np.array([15])
mean_pred = gaussian_process_predict_mean(X, y, X_new)
print("mean predict :{}".format(mean_pred))
sigma_pred = np.sqrt(gaussian_process_predict_std(X, X_new))
print("std predict :{}".format(sigma_pred))
mean predict :2.396794716305008e-07
std predict :0.9999999999999999
We find that the standard deviation of the prediction is now much larger, and the predicted value is almost 0, which is simply the prior mean. While Gaussian process models are very good at interpolating data within the range of the training set, they are not good at extrapolating outside that range.
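To see this reversion to the prior more explicitly, we can query a few more points far from the training data (a small extra check using the same functions as above):
# Far from the data, the mean falls back to the prior mean of 0 and the
# standard deviation saturates at the prior value of 1 (since k(x, x) = 1).
for x_far in [12.0, 15.0, 20.0]:
    X_new = np.array([x_far])
    m = gaussian_process_predict_mean(X, y, X_new)
    s = np.sqrt(gaussian_process_predict_std(X, X_new))
    print("x = {:>4}: mean = {:.4f}, std = {:.4f}".format(x_far, m, s))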
Plotting the Model’s 95% Confidence Intervals
Using the predicted means and the corresponding standard deviations, we can create and plot the model’s 95% confidence intervals for an entire range of input values of x.
import matplotlib.pyplot as plt
# Range of x to obtain the confidence intervals.
x = np.linspace(0, 10, 1000)
# Obtain the corresponding mean and standard deviations.
y_pred = []
y_std = []
for i in range(len(x)):
    X_new = np.array([x[i]])
    y_pred.append(gaussian_process_predict_mean(X, y, X_new))
    y_std.append(np.sqrt(gaussian_process_predict_std(X, X_new)))
y_pred = np.array(y_pred)
y_std = np.array(y_std)
plt.figure(figsize = (15, 5))
plt.plot(x, f(x), "r")
plt.plot(X, y, "ro")
plt.plot(x, y_pred, "b-")
plt.fill(np.hstack([x, x[::-1]]),
         np.hstack([y_pred - 1.9600 * y_std,
                    (y_pred + 1.9600 * y_std)[::-1]]),
         alpha = 0.5, fc = "b")
plt.xlabel("$x$", fontsize = 14)
plt.ylabel("$f(x)$", fontsize = 14)
plt.legend(["$y = (x-5)^2$", "Observations", "Predictions", "95% Confidence Interval"], fontsize = 14)
plt.grid(True)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()

Instead of checking the standard deviation of each new input individually, using a graphical plot like this allows us to easily view where the model is most and least confident about its predictions!
More Advanced Covariance Kernels
In this article we used the quadratic exponential kernel to calculate our covariance kernel matrix K. However, many other kernel functions exist, and some might lead to better performance for certain types of data. For example, there is a periodic kernel (sketched below) which performs very well for periodic data! If you would like to find out more about these more advanced kernels, please read my article on Gaussian process kernels!
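As a taste of what such a kernel can look like, here is a minimal sketch of one common periodic form (the exp-sine-squared kernel, with an assumed period p and length scale l); it could be swapped in for RBF_kernel inside make_RBF_kernel:
def periodic_kernel(xn, xm, p = 1.0, l = 1.0):
    # One common periodic kernel: k = exp(-2 sin²(π·d/p) / l²),
    # where d is the distance between the two inputs.
    d = np.linalg.norm(xn - xm)
    return np.exp(-2 * np.sin(np.pi * d / p)**2 / l**2)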
Closing Remarks
In this post we explored how Gaussian processes work, and created our own Gaussian process regression model using Python! Gaussian process models are extremely powerful and are widely used in both academia and industry. As an example of an industrial application, in a future post, I will show you how to use a Gaussian process based optimizer to determine the best lasing parameters in order to obtain a specific power output of a laser!