In a [previous article](https://towardsdatascience.com/gaussian-process-models-7ebce1feb83d) we explored the fundamentals of how Gaussian process models work – we explored the mathematical details, and derived the analytical solutions for the Gaussian process regression model. We also demonstrated how to use Gaussian process regression models to model some simple data. In this article, we will go deeper into the topic of Gaussian process models by exploring the various covariance kernels which can be used to model real life data. This article assumes that you already have a good understanding of Gaussian process models – if you are not quite yet familiar with them, please read the previous article first!

A Quick Recap on Gaussian Process Models
For φ observations of some target y = [y₁, y₂, … yᵩ]ᵀ corresponding to a set of φ input features x = [x₁, x₂, … xᵩ]ᵀ, we want to use a Gaussian process model to predict the value of yᵩ₊₁ corresponding to some new input xᵩ₊₁.
The Gaussian process model’s prediction for the mean value of yᵩ₊₁ is:
μ = kᵀC⁻¹y,
and the corresponding predicted variance is:
s² = c – kᵀC⁻¹k,
where the covariance matrix C has the elements:
C[n, m] = k(xₙ, xₘ)+σδₙₘ,
the covariance vector k has the elements:
k[n] = k(xₙ, xᵩ₊₁),
and the scalar c:
c = k(xᵩ₊₁, xᵩ₊₁)+σ,
where σ is the variance of the Gaussian noise, and k(xₙ, xₘ) is the kernel function, which we will explore next.
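To make the recap concrete, here is a minimal NumPy sketch of the prediction equations (the helper names, the choice of a radial basis function kernel, and the toy hyperparameter values are my own for illustration):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    # k(x_n, x_m) = exp(-||x_n - x_m||^2 / (2 L^2)) -- introduced below.
    return np.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))

def gp_predict(x_train, y_train, x_new, noise=1e-4):
    # Covariance matrix C[n, m] = k(x_n, x_m) + sigma * delta_nm.
    C = rbf_kernel(x_train[:, None], x_train[None, :]) + noise * np.eye(len(x_train))
    # Covariance vector k[n] = k(x_n, x_new) and scalar c.
    k = rbf_kernel(x_train, x_new)
    c = rbf_kernel(x_new, x_new) + noise
    # Predicted mean mu = k^T C^-1 y and variance s^2 = c - k^T C^-1 k.
    mu = k @ np.linalg.solve(C, y_train)
    s2 = c - k @ np.linalg.solve(C, k)
    return mu, s2

# Interpolate a noiseless sine wave at x = 0.5; the predicted mean
# should be close to sin(pi) = 0, with a small predicted variance.
x = np.linspace(0.0, 1.0, 20)
mu, s2 = gp_predict(x, np.sin(2.0 * np.pi * x), 0.5)
```

In practice a library such as scikit-learn handles these linear solves (and the hyperparameter tuning) for us, as we will see below.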
Covariance Kernels
The kernel function k(xₙ, xₘ) lies at the very heart of a Gaussian process model – it tells the model how similar two data points xₙ and xₘ are. Several kernel functions are available for different types of data, and we will take a look at a few of them in this section.
The Radial Basis Function Kernel

The most widely used kernel is probably the radial basis function kernel (also called the exponentiated quadratic kernel, the squared exponential kernel, or the Gaussian kernel):
k(xₙ, xₘ) = exp(-||xₙ – xₘ||²/2L²),
where L is the kernel length scale. This kernel is used by default in many machine learning libraries such as scikit-learn.
In general, a Gaussian process model will not be able to make extrapolations further than a distance of L away from the training data as shown in the figure below – the model extrapolations (blue curve) for f(x) = x fail very quickly as x increases away from the training data (black curve), even though the interpolations (red curve) within the training data are very accurate.
from sklearn.gaussian_process import GaussianProcessRegressor
import numpy as np
import matplotlib.pyplot as plt
# Create some toy data for the Gaussian process regressor model.
dx = 0.1
x = np.arange(-10, 10 + dx, dx)
X = x.reshape(-1, 1)
y = x * 1
# Fit the model to the toy data.
gpr = GaussianProcessRegressor()
gpr.fit(X, y)
# Perform interpolation prediction.
y_pred_in = gpr.predict(X)
# Create some data for extrapolation prediction.
x_out = np.arange(10, 20 + dx, dx)
X_out = x_out.reshape(-1, 1)
# Perform extrapolation prediction. The model should
# not perform very well here.
y_pred_out = gpr.predict(X_out)
plt.plot(x, y, "k", linewidth = 6)
plt.plot(x, y_pred_in, "r")
plt.plot(x_out, y_pred_out, "b")
plt.legend(["Training data", "Interpolation", "Extrapolation"])
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()

The Constant Kernel
The constant kernel is the most basic kernel, defined as:
k(xₙ, xₘ) = κ,
where κ is some constant value.
The White Noise Kernel
The white noise kernel is unsurprisingly used to model white noise in the data:
k(xₙ, xₘ) = νδₙₘ,
where ν is some noise level value.
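In scikit-learn these two kernels are available as ConstantKernel and WhiteKernel. A quick sketch of evaluating them on a few points (the values 2.5 and 0.3 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.gaussian_process.kernels import ConstantKernel, WhiteKernel

X = np.array([[0.0], [1.0], [2.0]])
const = ConstantKernel(constant_value=2.5)
noise = WhiteKernel(noise_level=0.3)

# The constant kernel returns kappa for every pair of points ...
print(const(X))
# ... while the white noise kernel (nu * delta_nm) is nonzero only on
# the diagonal, i.e. only when a point is compared with itself.
print(noise(X))
```

On their own these two kernels are rarely useful; they typically appear as scaling factors or noise terms inside a larger combined kernel, as we will see below.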
The Exponential Sine Squared Kernel

The exponential sine squared kernel is sinusoidal in nature, and therefore able to model periodic data:
k(xₙ, xₘ) = exp(-2/L²·sin²(π·||xₙ – xₘ||/p)),
where L is the kernel length scale and p is the kernel periodicity.
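As a sketch of why this matters, a Gaussian process with this kernel (ExpSineSquared in scikit-learn) can keep extrapolating a sinusoid far beyond the training window – something the radial basis function kernel above could not do. Here the hyperparameters are fixed at their true values for simplicity (optimizer=None skips the hyperparameter search):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared

# Noiseless sinusoidal training data with period p = 1.
x = np.arange(0.0, 10.0, 0.1)
y = np.sin(2.0 * np.pi * x)

kernel = ExpSineSquared(length_scale=1.0, periodicity=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-4, optimizer=None)
gpr.fit(x.reshape(-1, 1), y)

# Unlike the RBF kernel, the periodic kernel keeps reproducing the
# oscillation several periods past the end of the training data.
x_out = np.arange(10.0, 15.0, 0.1)
y_out = gpr.predict(x_out.reshape(-1, 1))
```

In a real application we would of course let the optimizer tune the length scale and periodicity from the data.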
The Rational Quadratic Kernel

The rational quadratic kernel is equivalent to a sum of multiple radial basis function kernels with different length scales:
k(xₙ, xₘ) = (1 + ||xₙ – xₘ||²/(2aL²))⁻ᵃ,
where L is the kernel length scale and a determines the weighting of large and small scale variations.
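One way to see this interpretation numerically: as a → ∞ the rational quadratic kernel converges to a single radial basis function kernel, while a small a puts more weight on long-range correlations. A quick sketch of this check using scikit-learn (where the parameter a is called alpha):

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

X = np.linspace(0.0, 5.0, 50).reshape(-1, 1)
rbf = RBF(length_scale=1.0)

# As a -> infinity the rational quadratic approaches a single RBF ...
rq_large_a = RationalQuadratic(length_scale=1.0, alpha=1e6)
max_diff = np.max(np.abs(rq_large_a(X) - rbf(X)))  # close to 0

# ... while a small a gives heavier tails, i.e. stronger correlations
# between distant points than the RBF with the same length scale.
rq_small_a = RationalQuadratic(length_scale=1.0, alpha=0.1)
```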
Combinations of Kernels
Not only can we use the various kernel functions individually in a Gaussian process model, we can also combine them in order to make the model more powerful! As an example, we will use a Gaussian process model with a custom kernel to model the change in atmospheric carbon dioxide as measured at Mauna Loa. The data is provided by the NOAA and is freely available for public use. I extracted only the year, month and average columns from the downloaded .csv file using the pandas library.
As shown in the figure below, the measured atmospheric carbon dioxide concentration has been steadily increasing since 1958, with a distinctively periodic trend. We will try to extrapolate the trend 10 years (120 months) into the future from today.

If we naively use the default radial basis function kernel in scikit-learn, we find that the model interpolates the training data very well but completely fails to extrapolate into the future, as shown in the figure below.

In order to model the data effectively with a Gaussian process, we need to put more care and effort into crafting the kernel! We use equations (5.15)–(5.19) outlined in Rasmussen and Williams (2006) to create a custom-made kernel function in order to model the data as accurately as possible.
Long Term Rising Trend
In order to model the long term rising trend, we use a radial basis function kernel with a large time scale:
k₁(xₙ, xₘ) = A·exp(-||xₙ – xₘ||²/2L₁²),
where A = 59.3² and L₁ = 390.
Using a large time scale ensures that the model is able to capture long term trends and changes in the time series.
Periodic Trend
We use an exponential sine squared kernel to model the yearly periodicity in the time series. Additionally, in order to account for drifts in the periodicity of the data, we add a radial basis function factor to the periodic kernel:
k₂(xₙ, xₘ) = B·exp(-||xₙ – xₘ||²/2L₂₁² – 2/L₂₂²·sin²(π·||xₙ – xₘ||/p)),
where B = 2.33², L₂₁ = 2330, L₂₂ = 1.26 and p = 1.09.
Medium and Short Term Irregularities
Medium and short term irregularities in the time series data can be modelled using the rational quadratic kernel, which is able to capture a wide range of length scales:
k₃(xₙ, xₘ) = C·(1 + ||xₙ – xₘ||²/(2aL₃²))⁻ᵃ,
where C = 0.596², a = 0.145 and L₃ = 4.74.
Noise
Noise in the data is modelled using a combination of a radial basis function kernel and a white noise kernel:
k₄(xₙ, xₘ) = D·exp(-||xₙ – xₘ||²/2L₄²) + νδₙₘ,
where D = 0.183², L₄ = 0.133 and ν = 0.0111.
Combining Kernels in a Gaussian Process Model
The custom kernel used to model the carbon dioxide time series is:
k_CO₂(xₙ, xₘ) = k₁(xₙ, xₘ) + k₂(xₙ, xₘ) + k₃(xₙ, xₘ) + k₄(xₙ, xₘ).
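In scikit-learn, the four component kernels translate directly into kernel objects that can be added and multiplied. A sketch of the construction, using the hyperparameter values quoted above as starting points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel,
)

# k1: long term rising trend (RBF with a large length scale).
k1 = 59.3**2 * RBF(length_scale=390.0)
# k2: yearly periodicity, with an RBF factor that lets the periodic
# component drift slowly over time.
k2 = (2.33**2 * RBF(length_scale=2330.0)
      * ExpSineSquared(length_scale=1.26, periodicity=1.09))
# k3: medium and short term irregularities.
k3 = 0.596**2 * RationalQuadratic(alpha=0.145, length_scale=4.74)
# k4: correlated short-range noise plus white noise.
k4 = 0.183**2 * RBF(length_scale=0.133) + WhiteKernel(noise_level=0.0111)

co2_kernel = k1 + k2 + k3 + k4
gpr = GaussianProcessRegressor(kernel=co2_kernel)
# gpr.fit(X, y) would then fine-tune these hyperparameters on the data.
```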
# Kernel of the trained sklearn Gaussian process regressor:
print(gpr.kernel_)
59.3**2 * RBF(length_scale=390)
  + 2.33**2 * RBF(length_scale=2.33e+03) * ExpSineSquared(length_scale=1.26, periodicity=1.09)
  + 0.596**2 * RationalQuadratic(alpha=0.145, length_scale=4.74)
  + 0.183**2 * RBF(length_scale=0.133)
  + WhiteKernel(noise_level=0.0111)
The resulting trained Gaussian process model is able to make extrapolations on the atmospheric carbon dioxide concentrations about 10 years into the future as shown in the figure below. This is a huge improvement from the original model trained using the default radial basis function kernel.
Note that as the distance between the prediction date and the training data increases, the variance of the extrapolation becomes larger. Despite this, the model has been able to capture both the long term rising trend as well as the yearly periodicity in the training data!

Closing Remarks
In today’s article we explored in greater detail various kernels which can be used with Gaussian process models. Although the radial basis function kernel is very widely used and should work well in most situations, we showed that using a carefully crafted kernel will allow us to effectively make predictions in the near future for certain datasets.
A good understanding of how models work, and how to tune a model is indeed an essential skill for machine learning practitioners! If you would like to know more about Gaussian process models in greater detail, I strongly recommend reading both David K. Duvenaud (2014) and Rasmussen and Williams (2006). Thank you for reading!
References
[1] C. M. Bishop (2006). Pattern Recognition and Machine Learning, Springer.
[2] David K. Duvenaud (2014). Automatic Model Construction with Gaussian Processes, PhD thesis, University of Cambridge.
[3] https://scikit-learn.org/stable/modules/gaussian_process.html
[4] Carl E. Rasmussen and Christopher K. I. Williams (2006). Gaussian Processes for Machine Learning, MIT Press.
[5] Carbon dioxide data obtained from: Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography.