
The Gaussian Process (GP) is a powerful supervised machine learning method that is widely used in regression settings. This method is desirable in practice because:
- it performs well in the small-data regime;
- it is highly interpretable;
- it automatically estimates the prediction uncertainty.
This last point is what sets GP apart from many other machine learning techniques: for a GP model, its prediction f(x) at a location x is not a deterministic value, but rather a random variable following a normal distribution, i.e., f(x) ~ N(μ(x), σ²(x)). Here, μ(x) denotes the prediction mean and σ²(x) is the prediction variance, which serves as an indicator of the prediction uncertainty.
So why does estimating prediction uncertainty matter? There are two reasons: first, it enables reliable decision-making by telling us how much we can trust a specific prediction; second, it facilitates active learning, meaning that we can intelligently allocate training data where they contribute the most to model performance.
Despite its importance in practical applications, the Gaussian Process does not appear often in machine learning books and tutorials. This is partly because GP theory is filled with advanced statistics and linear algebra and is not exactly friendly to newcomers.
In this blog, we will adopt a learning-by-doing approach and implement a GP model in Python from scratch. In the end, we will put our prototype into action and approximate two analytical functions:

I’ve created a Jupyter Notebook for this article. Also, I’ve compiled a cheat sheet that summarizes GP-related formulas. You could use it as a quick reference when coding your own GP model.
So, let’s get started!
Table of Contents
· 1. Understanding Gaussian Process
· 2. Kernel Function
· 3. GaussianProcess class
· 4. Initializing A GP Model
· 5. Constructing Correlation Matrices
· 6. Training A GP Model (Theory)
· 7. Training A GP Model (Code)
· 8. GP Model Prediction (Theory)
· 9. GP Model Prediction (Code)
· 10. Benchmark
· 11. Takeaway
· About the Author
1. Understanding Gaussian Process
A common situation in which to employ the GP method is this: we have collected some training data D = {(xᵢ, yᵢ), i = 1,…,n}, with yᵢ being the real-valued label. We want to train a model to predict the function output y given the input x.
In a nutshell, GP works by modeling the underlying true function y(x) as a realization of a Gaussian random process. It is precisely this statistical view that enables GP to predict complex input-output patterns and estimate the prediction uncertainty along the way. We will elaborate more on this in the following sections.
1.1 The "Process" in Gaussian Process
The "Process" part of its name refers to the fact that GP is a random process. Simply put, a random process is a function f(.) with the following properties:
- At any location x, f(x) is a random variable;
- At different locations xᵢ and xⱼ, the random variables f(xᵢ) and f(xⱼ) are correlated;
- The correlation strength between f(xᵢ) and f(xⱼ) depends on the distance between xᵢ and xⱼ. In general, as xⱼ moves away from xᵢ, their correlation strength decays.
The key takeaway here is that we can interpret a random process as an infinite collection of correlated random variables.
So, what probability distribution should we use to describe those random variables?
1.2 The "Gaussian" in Gaussian Process
The "Gaussian" part of its name indicates that GP uses Gaussian distribution (or normal distribution) to characterize the random process.
To characterize a single random variable with a Gaussian distribution, a mean value and a variance value are all we need. However, to characterize a random process that contains an infinite number of random variables, we will need to upgrade a Gaussian distribution to a Gaussian random process.
Formally, a Gaussian random process f(·) is characterized by a mean function μ(x) and a covariance function σ²K(x, x′). Here, σ² denotes the overall process variance, and K(x, x′) is the correlation function, also known as the kernel function. When x = x′, K(x, x′) = 1; when x ≠ x′, K(x, x′) represents the correlation between f(x) and f(x′).
Equipped with the notation μ(x) and σ²K(x, x′), we can introduce the properties of a Gaussian random process f(·):
- For a single location x, f(x) follows a Gaussian distribution, i.e., f(x) ~ N(μ(x), σ²);
- For an arbitrary set of locations [x₁, x₂,…, xₙ], f = [f(x₁), f(x₂),…, f(xₙ)] follows a multivariate Gaussian distribution, i.e.,

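In a standard form consistent with the definitions above, this joint distribution reads:

```latex
\mathbf{f} =
\begin{bmatrix} f(x_1)\\ \vdots\\ f(x_n) \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mu(x_1)\\ \vdots\\ \mu(x_n) \end{bmatrix},\;
\sigma^2\mathbf{K}
\right),
\qquad
\mathbf{K}_{ij} = K(x_i, x_j).
```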
1.3 GP’s "worldview"
Recall that at the beginning of this section, we mentioned that GP works by modeling the underlying true function y(x) as a realization of a Gaussian random process. What this means is that the GP model sees the observed labels [y₁, y₂,…, yₙ] of the training data as one random draw from the above multivariate Gaussian distribution.
As a result, we can train a GP model by inferring the most likely μ(x), σ², and K(x, x′) that generated the observed labels. This is usually done via maximum likelihood estimation. Subsequently, we can make predictions at unseen sites with the trained GP model and estimate the associated prediction uncertainty.
At first sight, it may seem counterintuitive to add an extra layer of complexity by introducing a random process to model the deterministic function values (y₁, y₂,…, yₙ). In fact, this treatment has its roots in Bayesian statistics, where the random process assumption encodes our prior belief about the functional form of y(x). We won't go deeper in this direction; for a Bayesian interpretation of the GP method, take a look at Murphy's book, chapter 15.
2. Kernel Function
Before we dig into the coding, let's talk about a GP model's key ingredient: the kernel function. Previously, we only stated its general form K(xᵢ, xⱼ) without giving concrete examples. We will fill the gap in this section.
2.1 Kernel features
In GP modeling settings, a kernel function measures the "similarities" between two different predictions, i.e., a higher kernel function value implies that the two predictions are more similar, and vice versa.
A kernel function has to fulfill certain requirements. One of them is that the kernel function K(xᵢ, xⱼ) must be able to construct a correlation matrix that is symmetric and positive definite.
In practice, the kernel function is often assumed to be stationary. This means that the kernel function value depends solely on the distance between the inputs, i.e., K(xᵢ, xⱼ) = K(xᵢ − xⱼ).
In addition, when the input has two or more features, we often assume the multi-dimensional kernel function K(xᵢ, xⱼ) is a product of one-dimensional kernel functions, one for each input feature:

where m denotes the total number of features, i.e., the input dimension. This treatment is beneficial as it allows tailored correlation structures for individual input features.
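For concreteness, this product form can be written (in the notation just introduced) as:

```latex
K(\mathbf{x}_i, \mathbf{x}_j) \;=\; \prod_{k=1}^{m} K_k\!\left(x_{i,k},\, x_{j,k}\right).
```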
2.2 Gaussian kernel
Common kernel functions include the cubic kernel, exponential kernel, Gaussian kernel, and Matérn kernel. This article will use the Gaussian kernel, which is one of the most popular choices.
A one-dimensional Gaussian kernel is expressed as:

where θ is a kernel parameter that controls the correlation strength. An m-dimensional version of the Gaussian kernel is expressed as:

which is simply a product of one-dimensional Gaussian kernels. Here, the kernel parameters are θ = [θ₁, θ₂,…, θₘ].
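For reference, the one-dimensional and m-dimensional Gaussian kernels described above can be written as follows; this is also the convention assumed in the code sketches later in this article:

```latex
K(x_i, x_j) = \exp\!\left(-\theta\,(x_i - x_j)^2\right),
\qquad
K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\sum_{k=1}^{m}\theta_k\,(x_{i,k} - x_{j,k})^2\right).
```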
2.3 Interpretability of GP
The kernel parameters θ are the key to the GP model's interpretability, as they indicate feature importance in making predictions. Let's elaborate on that.
First, let’s focus on the one-dimensional case where we only have one feature and one associated θ. The figure below shows how the choice of θ affects the correlation.

When the θ value is low, predictions at distant sites maintain a high level of correlation. This implies that the underlying function yields similar outputs regardless of the x value. In other words, the feature x is not very predictive.
The same reasoning also applies to multi-feature settings: if the θ value of the kth feature is rather low, then the underlying function yields similar output values when moving along the kth dimension. Since the output is not very sensitive to the kth feature, we may conclude that the kth feature is not that important for making predictions.
Therefore, by sorting the values of (θ₁, θ₂,…, θₘ), we can rank the feature importance, which can help with sensitivity analysis or dimensionality reduction.
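As a quick illustration, the minimal snippet below (assuming the Gaussian kernel form given above) reproduces the kind of figure described in this subsection: correlation versus distance for a few θ values.

```python
import numpy as np
import matplotlib.pyplot as plt

# distance between two inputs along a single feature
h = np.linspace(0, 3, 200)

# small theta -> slow decay (weak feature); large theta -> fast decay (influential feature)
for theta in [0.1, 1, 10]:
    plt.plot(h, np.exp(-theta * h**2), label=f"theta = {theta}")

plt.xlabel("|x_i - x_j|")
plt.ylabel("K(x_i, x_j)")
plt.legend()
plt.show()
```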
So much for the basics of Gaussian Process. In the following, we will approach GP from an implementation point of view and introduce the missing theories along with the code.
3. GaussianProcess class
Let's start coding a GaussianProcess class from scratch! To help you navigate the equations associated with GP modeling, I've prepared this cheat sheet for you.
3.1 Class overview
Here is a summary of the methods and attributes of the GaussianProcess
class. This class mimics thescikit-learn
style by using .fit()
method to train the model and using .predict()
method to make predictions. The consistency makes it possible to further integrate the developed GaussianProcess
estimator with other scikit-learn
functions, such as Pipeline
, GridSearchCV
, etc.

GaussianProcess class overview. (Image by Author)
We will discuss each of these methods in the following sections.
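As a rough sketch of this structure, the skeleton below lists the methods we will build; each one is developed in the sections that follow.

```python
class GaussianProcess:
    """Gaussian Process regressor with a constant mean and a Gaussian kernel."""

    def __init__(self, n_restarts, optimizer, bounds):
        """Store the multi-start optimization settings."""

    def Corr(self, X1, X2):
        """Construct the correlation matrix between feature matrices X1 and X2."""

    def Neglikelihood(self, theta):
        """Negative (concentrated) log-likelihood of the training data."""

    def fit(self, X, y):
        """Estimate the model parameters via multi-start maximum likelihood."""

    def predict(self, X_test):
        """Return the GP mean and variance at the test locations."""

    def score(self, X_test, y_test):
        """Root-mean-squared error on a test set."""
```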
3.2 Packages
First, we load all the necessary packages.
We will mainly use numpy for matrix manipulations and matplotlib for data visualization. In addition, we need:
- [scipy.optimize.minimize](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html): to perform optimization;
- [scipy.optimize.Bounds](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.Bounds.html): to specify parameter bounds for the optimization;
- [pyDOE.lhs](https://pythonhosted.org/pyDOE/randomized.html): to generate random starting points for the optimizer. Here, we adopt a Latin Hypercube sampling approach because it is good at generating evenly distributed random samples.
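A minimal import cell consistent with this list might look like the following:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, Bounds
from pyDOE import lhs
```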
4. Initializing A GP Model
In this section, we develop GaussianProcess.__init__(self, n_restarts, optimizer, bounds).
To train a GP model, we use a multi-start optimization strategy to estimate the model parameters. As a result, we need to specify how many starting points we want the optimizer to try and which algorithm this optimizer should use.
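A minimal sketch of the constructor is shown below; treating bounds as a (lower, upper) tuple on a log10 scale is an assumption of this sketch rather than a requirement.

```python
def __init__(self, n_restarts, optimizer, bounds):
    """
    n_restarts : number of restarts for the multi-start optimizer
    optimizer  : algorithm name passed to scipy.optimize.minimize, e.g. 'L-BFGS-B'
    bounds     : (lower, upper) search range for the kernel parameters (log10 scale)
    """
    self.n_restarts = n_restarts
    self.optimizer = optimizer
    self.bounds = bounds
```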
5. Constructing Correlation Matrices
In this section, we develop GaussianProcess.Corr(self, X1, X2), which computes a correlation matrix between a pair of feature matrices X1 and X2. Since computing correlation matrices is involved in both training and prediction (which will be discussed shortly), it is beneficial to have a dedicated function for this purpose.
An illustration of how to construct a correlation matrix of X1 and X2 using a Gaussian kernel is given below:

The following code can achieve the desired functionality:
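Here is one way to implement it, as a sketch to be placed inside the GaussianProcess class; it assumes the kernel parameters are stored in self.theta (one value per feature).

```python
def Corr(self, X1, X2):
    """Gaussian-kernel correlation matrix between X1 (n1 x m) and X2 (n2 x m)."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i in range(X1.shape[0]):
        # theta-weighted squared distances, summed over the m features
        K[i, :] = np.exp(-np.sum(self.theta * (X1[i, :] - X2) ** 2, axis=1))
    return K
```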
6. Training A GP Model (Theory)
6.1 GP model parameters
A working GP model requires a mean function μ(x), the process variance σ², and the kernel parameter vector θ. In practice, μ(x) is simply assumed to be a constant, i.e., μ(x) = μ. This treatment simplifies the math greatly and does not hurt the model accuracy much. We will implement a GP with a constant mean in the coding sections.
So, training a GP model means estimating μ, σ², and θ. But how exactly should we do that? Enter maximum likelihood estimation.
6.2 Maximum likelihood estimation
Simply put, this estimation approach works by finding a specific combination of μ, σ², and θ, such that the likelihood of observing the labels (y₁, y₂,…, yₙ) of the training instances (x₁, x₂,…, xₙ) is maximized.
Since GP assumes that [y₁,…, yₙ] is generated from a multivariate Gaussian distribution, we could easily express the likelihood L based on the probability density function of the multivariate Gaussian distribution:

where y = [y₁,…, yₙ], 1 is a vector of ones, and K is the correlation matrix of the training instances.
In practice, we prefer to maximize the logarithm of the likelihood L to avoid round-off error:

Luckily, analytical expressions exist for the optimum values of μ and σ² if we set the derivatives of the log-likelihood with respect to μ and σ² to zero:

However, since θ is nested inside the correlation matrix, deriving an analytical formula for the optimum θ is not feasible. Therefore, we need to employ an optimization algorithm to find a θ that maximizes the likelihood L.
We can simplify the equation by substituting the optimum σ² back into L. After some algebraic manipulations, we arrive at the following optimization problem:

In this article, we adopt a multi-start strategy to estimate the optimum θ. Basically, this strategy runs an optimizer multiple times, each time starting from a different initial θ value. Finally, the θ value that yields the best objective function value is selected as the final optimization result.
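For reference, under the constant-mean GP assumed here, the closed-form estimates and the concentrated log-likelihood take the standard form (up to additive constants):

```latex
\hat{\mu} = \frac{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{y}}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}},
\qquad
\hat{\sigma}^2 = \frac{(\mathbf{y}-\mathbf{1}\hat{\mu})^{\top}\mathbf{K}^{-1}(\mathbf{y}-\mathbf{1}\hat{\mu})}{n},
\qquad
\ln L \;\propto\; -\frac{n}{2}\ln\hat{\sigma}^2 \;-\; \frac{1}{2}\ln\lvert\mathbf{K}\rvert .
```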
7. Training A GP Model (Code)
In this section, we develop the GaussianProcess.Neglikelihood(self, theta) and GaussianProcess.fit(self, X, y) methods.
One thing worth mentioning here: from an implementation point of view, it is beneficial to search for θ on a logarithmic scale to accelerate the convergence of the optimization algorithm. Therefore, in the following code, we will be optimizing μ, σ², and 10^θ, with the exponent θ as the optimization variable. A suitable search range for θ is [-3, 2], which translates to a correlation length varying between 0.001 and 100.
7.1 The likelihood function
First, we develop GaussianProcess.Neglikelihood(self, theta) to calculate the negative likelihood. In the code snippet below, self.X and self.y constitute the training dataset. They are arrays of shape (n_samples, n_features) and (n_samples, 1), respectively.
Several things are worth mentioning here:
- We add a small term (1e-10) to the diagonal of K. This term is also known as a "nugget" term, and it helps alleviate numerical instability issues when inverting K;
- We return the negative of the computed likelihood value. This step is necessary since we will be using scipy.optimize.minimize later, and minimizing the negative likelihood value is equivalent to maximizing the original likelihood value;
- To ease understanding, I've naively employed np.linalg.inv() and np.linalg.det() to calculate the inverse and the determinant of K. In practice, however, those operations can be time-consuming and may cause numerical instability. A more reliable alternative is to compute the Cholesky factorization of K and use the resulting lower triangular matrix for further matrix multiplications and the determinant calculation. See the companion Notebook for this more reliable implementation.
7.2 The model fitting function
Now we develop GaussianProcess.fit(self, X, y) to enable model fitting. In the code snippet below, we use a Latin Hypercube method to generate starting points of θ for the optimizer. Since pyDOE.lhs only generates random samples within [0, 1], we need to scale the samples to our target range [-3, 2].
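A corresponding sketch of the fitting routine is shown below (again, inside the GaussianProcess class; the exact handling of bounds is an assumption of this sketch).

```python
def fit(self, X, y):
    """Fit the GP model via multi-start maximum likelihood estimation."""
    self.X, self.y = X, y.reshape(-1, 1)
    lb, ub = self.bounds                      # e.g. (-3, 2), on a log10 scale
    m = X.shape[1]                            # number of features

    # Latin Hypercube samples in [0, 1]^m, scaled to [lb, ub] as starting points
    initial_points = lb + (ub - lb) * lhs(m, samples=self.n_restarts)

    # run the local optimizer from each starting point
    results = [minimize(self.Neglikelihood, x0,
                        method=self.optimizer,
                        bounds=Bounds(lb * np.ones(m), ub * np.ones(m)))
               for x0 in initial_points]

    # keep the run with the lowest negative log-likelihood, then
    # re-evaluate at the optimum so the cached quantities match it
    best = min(results, key=lambda res: res.fun)
    self.Neglikelihood(best.x)
    return self
```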
8. GP Model Prediction (Theory)
Now that we know how to train a GP model, let's switch gears and see how to make predictions with the trained GP model. Our goal is to predict the underlying function value at a test site x*. We denote the prediction as f*.
Recall that we mentioned at the very beginning that the GP prediction f* is not a deterministic value but rather a random variable following a Gaussian distribution. This result is obtained by computing the distribution of f* conditioned on y, i.e., the training instances' observed labels. In Bayesian language, we are computing the posterior distribution of f*.
In the following, we discuss how to calculate this conditional distribution of f*. We will start from the joint distribution of f* and y, i.e., P(y, f*). From there, we can derive the desired conditional distribution P(f*|y).
8.1 Joint distribution of testing/training data
First of all, notice that, by the definition of the Gaussian random process, we could express the joint distribution P(y, f*) as a multivariate Gaussian distribution:

where K is the correlation matrix of the training instances, and k* is the correlation vector between the test and training instances:

8.2 Conditional distribution of the testing data
In a second step, we derive P(f*|y), which describes how f* is distributed given the observed labels y.
A nice property of the Gaussian random process is that the desired conditional distribution of f* is also Gaussian, i.e., f*|y ~ N(μ*, Σ*), where

f*|y ~ N(μ*, Σ*) fully characterizes the GP prediction at x*. In practice, we use the mean μ* as the expected prediction value and the variance Σ* to indicate the prediction uncertainty.
Deriving P(f*|y) from the joint distribution P(f*, y) is analytically tractable. To keep our discussion focused, I skip the lengthy algebra and refer readers to the relevant derivation on StackExchange for more details.
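For reference, with the constant-mean GP used in this article, the conditional mean and variance take the following standard form (treating the estimated mean as fixed):

```latex
\mu^{*} = \hat{\mu} + \mathbf{k}^{*\top}\mathbf{K}^{-1}\left(\mathbf{y} - \mathbf{1}\hat{\mu}\right),
\qquad
\Sigma^{*} = \hat{\sigma}^2\left(1 - \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{k}^{*}\right).
```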
9. GP Model Prediction (Code)
In this section, we develop the GaussianProcess.predict(self, X_test) method to enable GP model prediction.
In the code snippet below, we predict multiple test instances at once. As a result, we compute a correlation matrix k between all the test instances and the training instances (instead of only a correlation vector, as in the theory section). Consequently, the calculated f* and SSqr are arrays.
Finally, let's develop the GaussianProcess.score(self, X_test, y_test) method to evaluate the model accuracy in terms of the root-mean-squared error (RMSE).
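Below is a sketch of both methods (inside the GaussianProcess class), consistent with the conditional-distribution formulas above and with the quantities cached in Neglikelihood().

```python
def predict(self, X_test):
    """Return the GP mean and variance at the test locations."""
    n = self.X.shape[0]
    one = np.ones((n, 1))

    # correlation matrix between test and training instances, shape (n_test, n_train)
    k = self.Corr(X_test, self.X)

    # conditional mean and variance (see the theory section)
    f = self.mu + k @ self.K_inv @ (self.y - self.mu * one)
    SSqr = self.SigmaSqr.item() * (1 - np.diag(k @ self.K_inv @ k.T))

    return f.flatten(), SSqr

def score(self, X_test, y_test):
    """Root-mean-squared error of the GP mean prediction on a test set."""
    y_pred, _ = self.predict(X_test)
    return np.sqrt(np.mean((y_pred - y_test.flatten()) ** 2))
```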
10. Benchmark
This section puts our developed GP class in action and tests its capability in approximating functions. We will first test the GP class on a 1D function. Later, we extend the testing to a 2D case.
10.1 1D Test
For this case study, we aim to train a GP model that can accurately approximate the following function:

This function is plotted below. As we can see, this function is highly nonlinear and possesses different levels of "activity" before and after x=0.6. Those features present challenges in terms of building a globally accurate model.

To train a GP model, we select 8 training instances distributed within the [0, 1] range. In the code snippet below, we first create a training and a test dataset. Then, we initiate a GaussianProcess instance and adopt the "L-BFGS-B" algorithm to find the optimum θ. Since we are using a multi-start optimization strategy, we let the optimizer run 10 times, each time starting from a different initial point. After training the model on the training dataset, the model is used to predict the test data labels.
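A usage sketch of this workflow is shown below. The exact 1D test function from the article is not reproduced here, so f(x) is a hypothetical stand-in; everything else follows the settings described above.

```python
import numpy as np

def f(x):
    """Hypothetical stand-in for the article's 1D test function."""
    return np.sin(12 * x) * x

# 8 training instances in [0, 1] and a dense test grid
X_train = np.linspace(0, 1, 8).reshape(-1, 1)
y_train = f(X_train)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = f(X_test)

# multi-start training: 10 restarts of the L-BFGS-B optimizer
gp = GaussianProcess(n_restarts=10, optimizer='L-BFGS-B', bounds=(-3, 2))
gp.fit(X_train, y_train)

y_pred, y_var = gp.predict(X_test)
print('RMSE:', gp.score(X_test, y_test))
```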
The prediction results are shown below. We can see that the GP predictions are almost identical to the true function. The 95% confidence bands of the GP predictions, which are calculated as μ(x) ± 1.96√Σ(x), are also shown in the figure. We can see that the true function lies entirely within the confidence bands.

Therefore, we can conclude that the developed GP class is working properly, and a GP model is accurate enough to approximate the current 1D test function.
10.2 2D Test
For this case study, we aim to train a GP model that can accurately approximate a 2D Rosenbrock function:

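For reference, the standard 2D Rosenbrock function is usually written as follows (the specific domain used in this case study is not reproduced here):

```latex
f(x_1, x_2) = \left(1 - x_1\right)^2 + 100\left(x_2 - x_1^{2}\right)^2 .
```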
This function is plotted below.

To train this GP model, we select a total of 25 training instances, which are depicted below. Those training samples are generated by a Latin Hypercube sampler.

The following code trains a GP model and makes predictions on a grid of test data. It is common to scale the GP input features to [0, 1]. Therefore, we adopt a MinMaxScaler to do the scaling and integrate it into a pipeline.
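A sketch of such a pipeline is shown below. It assumes the 25 Latin Hypercube training samples are stored in X_train/y_train and the test grid in X_test; a basic Pipeline only needs the final estimator to expose .fit() and .predict(), which our class does (full scikit-learn integration, e.g. GridSearchCV, would additionally require get_params/set_params).

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# scale both input features to [0, 1] before feeding them to the GP
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('gp', GaussianProcess(n_restarts=10, optimizer='L-BFGS-B', bounds=(-3, 2))),
])

pipe.fit(X_train, y_train)            # X_train: (25, 2) Latin Hypercube samples
y_pred, y_var = pipe.predict(X_test)  # predictions on the test grid
```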
Finally, we test the model prediction accuracy by using the .score method. The calculated RMSE is 2.44, which is quite low considering that our target function varies from 0 to 2400. To visually assess the model accuracy, we can plot the GP predictions in a contour plot:

Indeed, the GP approximations are almost identical to the underlying true function, indicating that the trained GP model is highly accurate and our implementation of the GaussianProcess class is working properly.
11. Takeaway
In this article, we've talked about the fundamental theory of the Gaussian Process modeling approach. By building a prototype from scratch, we translated the theory into code and got down to the nitty-gritty from a practical implementation point of view.
There is definitely more to say about how one can go from the fundamental GP to more advanced versions and achieve fascinating analyses. The post below covers some of those advanced concepts:
An introduction to surrogate modeling, Part III: beyond basics
About the Author
I'm a Ph.D. researcher working on uncertainty analysis and reliability analysis for aerospace applications. Statistics and data science form the core of my daily work. I love sharing what I've learned in the fascinating world of statistics. Check my previous posts to find out more and connect with me on Medium and LinkedIn.