
Gaussian Process Kernels

More than just the radial basis function

In a [previous article](https://towardsdatascience.com/gaussian-process-models-7ebce1feb83d) we explored the fundamentals of how Gaussian process models work – we explored the mathematical details, and derived the analytical solutions for the Gaussian process regression model. We also demonstrated how to use Gaussian process regression models to model some simple data. In this article, we will go deeper into the topic of Gaussian process models by exploring the various covariance kernels which can be used to model real life data. This article assumes that you already have a good understanding of Gaussian process models – if you are not quite yet familiar with them, please read the previous article first!

Desert in Abu Dhabi. Photo by Sajimon Sahadevan on Unsplash.

A Quick Recap on Gaussian Process Models

For N observations of some target y = [y₁, y₂, … y_N]ᵀ corresponding to a set of N input features x = [x₁, x₂, … x_N]ᵀ, we want to use Gaussian process models to predict the value of y_{N+1} corresponding to some new input x_{N+1}.

The Gaussian process model’s prediction for the mean value of y_{N+1} is:

μ = kᵀC⁻¹y,

and the corresponding predicted variance is:

_s_² = c – kᵀC⁻¹k,

where the covariance matrix C has the elements:

C[n, m] = k(xₙ, xₘ) + _σ_²δₙₘ,

the covariance vector k has the elements:

k[n] = k(xₙ, x_{N+1}),

and the scalar c:

c = k(x_{N+1}, x_{N+1}) + _σ_²,

where σ is the standard deviation of the Gaussian noise, and k(xₙ, xₘ) is the kernel function, which we will explore next.
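As a concrete illustration, the prediction equations above can be sketched directly in NumPy. The sine-curve training data and the choice of an RBF kernel here are purely illustrative assumptions:

```python
import numpy as np

def rbf(a, b, L=1.0):
    # Radial basis function kernel matrix between two 1-D point sets.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * L**2))

# Toy training data: N noisy observations of a sine curve.
rng = np.random.default_rng(0)
sigma = 0.1                         # noise standard deviation
x = np.linspace(0, 5, 20)
y = np.sin(x) + rng.normal(0, sigma, x.size)

x_new = np.array([2.5])             # the new input x_{N+1}

# C[n, m] = k(x_n, x_m) + sigma^2 * delta_nm
C = rbf(x, x) + sigma**2 * np.eye(x.size)
k = rbf(x, x_new)[:, 0]             # k[n] = k(x_n, x_{N+1})
c = rbf(x_new, x_new)[0, 0] + sigma**2

mu = k @ np.linalg.solve(C, y)      # predicted mean
s2 = c - k @ np.linalg.solve(C, k)  # predicted variance
print(mu, s2)
```

Since x_new lies inside the training range, the predicted mean lands close to sin(2.5) and the predicted variance stays small.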

Kernel Functions

The kernel function k(xₙ, xₘ) used in a Gaussian process model is its very heart – the kernel function essentially tells the model how similar two data points (xₙ, xₘ) are. Several kernel functions are available for use with different types of data, and we will take a look at a few of them in this section.

The Radial Basis Function Kernel

The one dimensional Gaussian function. Image created by the author.

Perhaps the most widely used kernel is the radial basis function kernel (also called the squared exponential kernel, the exponentiated quadratic kernel or the Gaussian kernel):

k(xₙ, xₘ) = exp(-||xₙ – xₘ||²/2_L_²),

where L is the kernel length scale. This kernel is used by default in many machine learning libraries such as scikit-learn.
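As a minimal sketch, the radial basis function kernel can be written in a few lines of NumPy (the helper name rbf_kernel and the sample distances are illustrative):

```python
import numpy as np

def rbf_kernel(xn, xm, L=1.0):
    # k(x_n, x_m) = exp(-||x_n - x_m||^2 / (2 L^2))
    d2 = np.sum((np.atleast_1d(xn) - np.atleast_1d(xm)) ** 2)
    return np.exp(-d2 / (2 * L**2))

# The correlation decays rapidly once the distance between two
# points exceeds roughly one length scale L:
print(rbf_kernel(0.0, 0.0))  # 1.0
print(rbf_kernel(0.0, 1.0))  # exp(-0.5) ~ 0.61
print(rbf_kernel(0.0, 3.0))  # exp(-4.5) ~ 0.011
```

This rapid decay is exactly why the model cannot extrapolate far beyond the training data: new points more than a few length scales away are essentially uncorrelated with everything the model has seen.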

In general, a Gaussian process model will not be able to make extrapolations further than a distance of L away from the training data as shown in the figure below – the model extrapolations (blue curve) for f(x) = x fail very quickly as x increases away from the training data (black curve), even though the interpolations (red curve) within the training data are very accurate.

from sklearn.gaussian_process import GaussianProcessRegressor
import numpy as np
import matplotlib.pyplot as plt

# Create some toy data for the Gaussian process regressor model.
dx = 0.1
x = np.arange(-10, 10 + dx, dx)
X = x.reshape(-1, 1)
y = x.copy()

# Fit the model to the toy data.
gpr = GaussianProcessRegressor()
gpr.fit(X, y)

# Perform interpolation prediction.
y_pred_in = gpr.predict(X)

# Create some data for extrapolation prediction.
x_out = np.arange(10, 20 + dx, dx)
X_out = x_out.reshape(-1, 1)

# Perform extrapolation prediction. The model should
# not perform very well here.
y_pred_out = gpr.predict(X_out)

plt.plot(x, y, "k", linewidth=6)
plt.plot(x, y_pred_in, "r")
plt.plot(x_out, y_pred_out, "b")
plt.legend(["Training data", "Interpolation", "Extrapolation"])
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
Gaussian process model for the function (black curve): f(x) = x using the radial basis function kernel. The interpolations (red curve) are very good while the extrapolations (blue curve) fail very quickly. Image created by the author.

The Constant Kernel

The constant kernel is the most basic kernel, defined as:

k(xₙ, xₘ) = κ,

where κ is some constant value.

The White Noise Kernel

The white noise kernel is unsurprisingly used to model white noise in the data:

k(xₙ, xₘ) = _ν_δₙₘ,

where ν is some noise level value.
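Both of these simple kernels are available in scikit-learn. A minimal sketch (the values κ = 2.0 and ν = 0.5 are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.gaussian_process.kernels import ConstantKernel, WhiteKernel

# kappa = 2.0 and nu = 0.5 are arbitrary illustrative values.
const = ConstantKernel(constant_value=2.0)
noise = WhiteKernel(noise_level=0.5)

X = np.array([[0.0], [1.0]])
print(const(X))  # every entry equals kappa
print(noise(X))  # nu on the diagonal, zero elsewhere
```

Neither kernel is very useful on its own; their main role is as building blocks in sums and products with other kernels, as we will see below.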

The Exponential Sine Squared Kernel

The one dimensional exponential sine squared function. Image created by the author.

The exponential sine squared kernel is sinusoidal in nature, and therefore able to model periodic data:

k(xₙ, xₘ) = exp(-2/_L_²·(sin(π·||xₙ – xₘ||/p))²),

where L is the kernel length scale and p is the kernel periodicity.
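A minimal NumPy sketch of this kernel (the helper name and sample values are illustrative) shows that inputs separated by exactly one period are perfectly correlated, while inputs half a period apart are the least correlated:

```python
import numpy as np

def exp_sine_squared(xn, xm, L=1.0, p=1.0):
    # k(x_n, x_m) = exp(-2/L^2 * sin^2(pi * |x_n - x_m| / p))
    d = np.abs(xn - xm)
    return np.exp(-2.0 / L**2 * np.sin(np.pi * d / p) ** 2)

# One full period apart: treated as identical points.
print(exp_sine_squared(0.0, 1.0, p=1.0))  # ~ 1.0
# Half a period apart: minimum correlation.
print(exp_sine_squared(0.0, 0.5, p=1.0))  # exp(-2) ~ 0.135
```

This periodic structure is what lets a Gaussian process model repeat a seasonal pattern indefinitely, rather than forgetting it one length scale past the training data.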

The Rational Quadratic Kernel

The one dimensional rational quadratic kernel. Image created by the author.

The rational quadratic kernel is equivalent to the summation of multiple radial basis function kernels with various length scales:

k(xₙ, xₘ) = (1 + ||xₙ – xₘ||²/(2_aL_²))⁻ᵃ,

where L is the kernel length scale and a determines the weighting of large and small scale variations.
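A minimal NumPy sketch (helper name and sample values illustrative). A useful sanity check: as a grows very large, the rational quadratic converges to the radial basis function kernel with the same length scale:

```python
import numpy as np

def rational_quadratic(xn, xm, L=1.0, a=1.0):
    # k(x_n, x_m) = (1 + ||x_n - x_m||^2 / (2 a L^2))^(-a)
    d2 = np.sum((np.atleast_1d(xn) - np.atleast_1d(xm)) ** 2)
    return (1.0 + d2 / (2.0 * a * L**2)) ** (-a)

# For very large a, the rational quadratic approaches the
# radial basis function kernel with the same length scale:
d = 1.5
print(rational_quadratic(0.0, d, a=1e6))  # ~ exp(-d^2 / 2)
print(np.exp(-d**2 / 2.0))
```

Small values of a, by contrast, give heavier tails than the RBF kernel, which is what lets this kernel capture several length scales at once.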

Combinations of Kernels

Not only can we use the various kernel functions individually in a Gaussian process model, we can also combine them in order to make the model more powerful! As an example, we will use a Gaussian process model with a custom kernel to model the change in atmospheric carbon dioxide as measured at Mauna Loa. The data is provided by the NOAA and is freely available for public use. I extracted only the year, month and average columns from the downloaded .csv file using the pandas library.

As shown in the figure below, the measured atmospheric carbon dioxide concentration has been steadily increasing since 1958, with a distinctively periodic trend. We will try to extrapolate the trend 10 years (120 months) into the future from today.

Atmospheric carbon dioxide measurements (ppm) at Mauna Loa. Image created by the author.

If we naively use the default radial basis function in scikit-learn, we find that the model is able to interpolate the training data very well but completely fails to extrapolate into the future as shown in the figure below.

A Gaussian process model trained using the default settings (i.e. using a radial basis function kernel) completely fails to perform any sort of extrapolations into the future. Image created by the author.

In order to model the data effectively, we need to put more care and effort into crafting the model! We use equations (5.15)–(5.19) from Rasmussen and Williams (2006) to create a custom-made kernel function that models the data as accurately as possible.

Long Term Rising Trend

In order to model the long term rising trend, we use a radial basis function kernel with a large time scale:

_k_₁(xₙ, xₘ) = A·exp(-||xₙ – xₘ||²/2_L_₁²),

where A = 59.3² and _L_₁ = 390.

Using a large time scale ensures that the model is able to capture long term trends and changes in the time series.

Periodic Trend

We use an exponential sine squared kernel to model the yearly periodicity in the time series. Additionally, in order to account for drifts in the periodicity of the data, we add a radial basis function factor to the periodic kernel:

_k_₂(xₙ, xₘ) = B·exp(-||xₙ – xₘ||²/2_L_₂₁² – 2/_L_₂₂²·(sin(π·||xₙ – xₘ||/p))²),

where B = 2.33², _L_₂₁ = 2330, _L_₂₂ = 1.26 and p = 1.09.

Medium and Short Term Irregularities

Medium and short term irregularities in the time series data can be modelled using the rational quadratic kernel, which is able to capture a wide range of length scales:

_k_₃(xₙ, xₘ) = C·(1 + ||xₙ – xₘ||²/(2_aL_₃²))⁻ᵃ,

where C = 0.596², a = 0.145 and _L_₃ = 4.74.

Noise

Noise in the data is modelled using a combination of a radial basis function kernel and a white noise kernel:

_k_₄(xₙ, xₘ) = D·exp(-||xₙ – xₘ||²/2_L_₄²) + _ν_δₙₘ,

where D = 0.183², _L_₄ = 0.133 and ν = 0.0111.

Combining Kernels in a Gaussian Process Model

The custom kernel used to model the carbon dioxide time series is:

k(xₙ, xₘ) = _k_₁(xₙ, xₘ) + _k_₂(xₙ, xₘ) + _k_₃(xₙ, xₘ) + _k_₄(xₙ, xₘ).

# Kernel of the trained sklearn Gaussian process regressor: 
print(gpr.kernel_)
59.3**2 * RBF(length_scale=390) + 2.33**2 * RBF(length_scale=2.33e+03) * ExpSineSquared(length_scale=1.26, periodicity=1.09) + 0.596**2 * RationalQuadratic(alpha=0.145, length_scale=4.74) + 0.183**2 * RBF(length_scale=0.133) + WhiteKernel(noise_level=0.0111)
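Based on the fitted hyperparameters printed above, such a kernel could be assembled in scikit-learn as follows. This is a sketch: the data loading and fitting step are omitted, and normalize_y=True is an assumption rather than part of the original setup:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel,
)

# Long term rising trend (k1): RBF with a large length scale.
k1 = 59.3**2 * RBF(length_scale=390)
# Yearly periodicity with drift (k2): RBF times ExpSineSquared.
k2 = 2.33**2 * RBF(length_scale=2330) \
    * ExpSineSquared(length_scale=1.26, periodicity=1.09)
# Medium and short term irregularities (k3): rational quadratic.
k3 = 0.596**2 * RationalQuadratic(alpha=0.145, length_scale=4.74)
# Noise (k4): small-scale RBF plus white noise.
k4 = 0.183**2 * RBF(length_scale=0.133) + WhiteKernel(noise_level=0.0111)

co2_kernel = k1 + k2 + k3 + k4
gpr = GaussianProcessRegressor(kernel=co2_kernel, normalize_y=True)
# gpr.fit(X, y)  # X: time as a column vector, y: CO2 (ppm)

# The combined kernel is itself a valid covariance function
# that can be evaluated directly:
K = co2_kernel(np.array([[0.0], [1.0]]))
print(K.shape)  # (2, 2)
```

Multiplying a kernel by a scalar like 59.3**2 wraps it in a ConstantKernel product, so the sum above matches the structure of the fitted kernel printed earlier.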

The resulting trained Gaussian process model is able to make extrapolations on the atmospheric carbon dioxide concentrations about 10 years into the future as shown in the figure below. This is a huge improvement from the original model trained using the default radial basis function kernel.

Note that as the distance between the prediction date and the training data increases, the variance of the extrapolation becomes larger. Despite this, the model has been able to capture both the long term rising trend as well as the yearly periodicity in the training data!

A Gaussian process regression model trained using a custom kernel containing long term, medium term, short term length scales as well as periodicity is able to make extrapolations about the changes in atmospheric carbon dioxide into the near future. Image created by the author.
A Gaussian process regression model trained using a custom kernel containing long term, medium term, short term length scales as well as periodicity is able to make extrapolations about the changes in atmospheric carbon dioxide into the near future. Image created by the author.

Closing Remarks

In today’s article we explored in greater detail the various kernels that can be used with Gaussian process models. Although the radial basis function kernel is very widely used and works well in many situations, we showed that a carefully crafted kernel allows us to make effective near-future predictions for certain datasets.

A good understanding of how models work, and how to tune a model is indeed an essential skill for machine learning practitioners! If you would like to know more about Gaussian process models in greater detail, I strongly recommend reading both David K. Duvenaud (2014) and Rasmussen and Williams (2006). Thank you for reading!

References

[1] C. M. Bishop (2006), Pattern Recognition and Machine Learning, Springer.
[2] David K. Duvenaud (2014), Automatic Model Construction with Gaussian Processes, PhD thesis, University of Cambridge.
[3] https://scikit-learn.org/stable/modules/gaussian_process.html
[4] Carl E. Rasmussen and Christopher K. I. Williams (2006), Gaussian Processes for Machine Learning, MIT Press.
[5] Carbon dioxide data obtained from: Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography.

