Gaussian Process: First Step Towards Active Learning in Physics

Maxim Ziatdinov
Towards Data Science
9 min read · Nov 1, 2021


Maxim Ziatdinov¹ ² & Sergei V. Kalinin¹

¹ Center for Nanophase Materials Sciences and ² Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States

Despite the extreme disparity in terms of objects and study methods, some tasks are common across multiple scientific fields. One such task is interpolation. Imagine having measurements of some property of interest, such as temperature (if you are a meteorologist) or soil composition or the presence of useful ores (if you are a geologist), over some area. It would be extremely useful to interpolate these numbers between the (usually small) number of measurement points to obtain 2D maps amenable to the human eye. This can be approached using multiple methods, including splines, kernel density approximations, neural network fits, and many others. However, when doing so, the second natural question is the uncertainty of these interpolated values, or to what extent they can be trusted. The uncertainties are clearly minimal at the locations where the measurements were taken but grow when moving away from the measurement points. Finally, the third and perhaps most interesting question is whether we can use the knowledge of the interpolated function and its uncertainty to guide our search strategy. In other words, given the measurements of a property of interest in several locations, can we choose the next measurement point based on the anticipated reward or the probability of obtaining one? All these problems can be addressed in a principled manner using the Gaussian Process (GP) and GP-based Bayesian Optimization.

Similar problems, albeit defined over different parameter spaces, emerge in multiple other scientific domains. The search for materials with interesting and useful properties often implies the exploration of binary, ternary, and quaternary phase diagrams, i.e., 1D, 2D, and 3D compositional spaces. Tuning a microscope for optimal resolution implies adjusting several control parameters, such as tip bias and tunneling current (for Scanning Tunneling Microscopy) or driving bias and setpoint (for Piezoresponse Force Microscopy). Even in theory/simulations, we often want to explore the phase diagram of a particular Hamiltonian by tuning a small number of exchange parameters. Clearly, all these problems can be solved via a grid search, that is, by evaluating our system, running a microscope, or synthesizing a material sample for each combination of parameters. Equally clearly, this is extremely impractical. For example, for materials synthesis, if we want to explore compositions in steps of 1% concentration, we will need 100 samples for a binary system, 10⁴ samples for a ternary system, and 10⁶ samples for a quaternary one. That will take a while!

In the physical sciences, these problems are typically addressed using a combination of intuition, approximate methods, and experiments/simulations. For example, to define the phase diagram of the Ising Hamiltonian as a function of temperature, approximate methods can be used to obtain the system's behavior at very high and very low temperatures. Similarly, the universality class of the system and its behavior in the vicinity of the phase transition can be analyzed. The remaining task is then finding the relevant transition temperature and the critical exponents describing the system's behavior in its vicinity. In experimental sciences (e.g., tuning a microscope), there are usually well-understood expectations of how the system behaves when a parameter changes. Finally, materials scientists use a combination of thermodynamic theory and measurements to explore a phase diagram within a small number of iterations. However, these approaches are very heterogeneous and domain-specific. How would a data scientist do it?

One approach to these problems is the Gaussian Process. The formal definition and discussion of this approach are available in a number of excellent open publications and textbooks, particularly the one by Rasmussen and Williams. Here, we would like to offer a slightly unusual way of describing the GP, centered on how experiments improve our understanding of the world and how this knowledge is encoded in a formal mathematical structure.

Imagine having a scalar function (or property of interest) defined over a low-dimensional parameter space. This can be temperature over the globe, materials hardness or magnetism over a ternary phase diagram, or contrast over a microscope image. Prior to any measurement (or calculation), we have no information on the possible values of the function, i.e., everything is possible. Upon measuring the values of the function at several locations, we, strictly speaking, know only the values at these locations and nothing else. However, it is not unreasonable to postulate that the values of the function we are interested in are somehow correlated across the parameter space. In physics, we often look for some specific functional form, either phenomenological or stipulated by fundamental physical laws. From the data science perspective, however, we simply postulate that the relationship between function values at adjacent locations is defined by a kernel function such as exp(-(xᵢ-xⱼ)²/2l²). The latter is known as the radial basis function (RBF) and is one of the most common kernel functions in GP. The parameter l defines the characteristic lengthscale of the process and is usually inferred from the data.
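For concreteness, here is a minimal sketch of such a kernel in JAX; the function name is ours, and the σ² prefactor is the output-scale hyperparameter that appears in the model below.

```python
import jax.numpy as jnp

def rbf_kernel(X1, X2, sigma, length):
    """RBF (squared-exponential) kernel: sigma^2 * exp(-(x_i - x_j)^2 / (2 l^2)).

    X1 and X2 are 1D arrays of input locations; sigma is the output scale
    and `length` is the characteristic lengthscale l.
    """
    sq_dist = (X1[:, None] - X2[None, :]) ** 2   # pairwise squared distances
    return sigma ** 2 * jnp.exp(-0.5 * sq_dist / length ** 2)
```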

More formally, given the dataset D = {xᵢ, yᵢ}, i = 1, …, N, where x and y are the input features and output targets, the Gaussian Process can be defined by the following probabilistic model:

f(x) ∼ MVN(m(x), K(x, x′; σ, l))    (1)

where MVN is a multivariate normal distribution, m is a prior mean function (usually chosen to be some constant), and K is our kernel function with (hyper)parameters σ and l, for which we choose weakly-informative log-normal priors. We also assume normally distributed observation noise, ϵ ~ Normal(0, s²I), such that the observed targets are y = f + ϵ. In practice, we absorb the noise into the computation of the kernel matrix by adding it to the diagonal.
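To make this concrete, here is a minimal NumPyro sketch of such a model, reusing the rbf_kernel helper above. The variable names and the specific LogNormal(0, 1) priors, including the one on the noise, are illustrative assumptions rather than the authors' exact choices.

```python
import numpyro
import numpyro.distributions as dist

def gp_model(X, y=None):
    """GP regression model in the spirit of Eq (1)."""
    # Weakly-informative log-normal priors on the kernel (hyper)parameters
    sigma = numpyro.sample("k_scale", dist.LogNormal(0.0, 1.0))
    length = numpyro.sample("k_length", dist.LogNormal(0.0, 1.0))
    # Observation noise, absorbed into the diagonal of the covariance matrix
    noise = numpyro.sample("noise", dist.LogNormal(0.0, 1.0))
    K = rbf_kernel(X, X, sigma, length) + noise * jnp.eye(X.shape[0])
    # Constant (zero) prior mean and a multivariate normal likelihood
    numpyro.sample(
        "y",
        dist.MultivariateNormal(loc=jnp.zeros(X.shape[0]), covariance_matrix=K),
        obs=y,
    )
```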

Having set up the model and prior distributions, the next step is to obtain the predictive posterior for new/unseen data. Here, we get posterior samples for the GP model parameters using a Hamiltonian Monte Carlo (HMC) algorithm from the NumPyro probabilistic programming library. We then sample from the multivariate normal posterior over the model outputs at the provided new input points Xₙₑw:

y* | Xₙₑw, D, θ ∼ MVN(μ*, Σ*)    (2a)
μ* = K(Xₙₑw, X) K(X, X)⁻¹ y    (2b)
Σ* = K(Xₙₑw, Xₙₑw) - K(Xₙₑw, X) K(X, X)⁻¹ K(X, Xₙₑw)

where θ is a single HMC posterior sample containing the kernel hyperparameters and the model noise, μ* is the predictive mean, and Σ* is the predictive covariance.
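In code, a single draw from this predictive posterior for one posterior sample θ can be sketched as follows; the function name and the small diagonal jitter are our own choices, not the authors' exact implementation.

```python
def gp_predict(rng_key, X, y, X_new, sigma, length, noise):
    """Sample from the GP predictive posterior (Eq 2) for one HMC sample θ."""
    K = rbf_kernel(X, X, sigma, length) + noise * jnp.eye(X.shape[0])  # train covariance
    K_s = rbf_kernel(X_new, X, sigma, length)                          # cross-covariance
    K_ss = rbf_kernel(X_new, X_new, sigma, length)                     # test covariance
    K_inv = jnp.linalg.inv(K)
    mean = K_s @ K_inv @ y                                             # Eq (2b)
    cov = K_ss - K_s @ K_inv @ K_s.T                                   # predictive covariance
    cov = cov + 1e-6 * jnp.eye(X_new.shape[0])                         # jitter for stability
    return mean, dist.MultivariateNormal(mean, cov).sample(rng_key)    # draw from Eq (2a)
```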

Let’s see how it works for a simple 1D function for which we get some sparse noisy observations.

Figure 1. Noisy observations of periodic function sin(10x). Figure by the authors.
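The observations in Figure 1 can be emulated with a few lines; the number of points, their locations, and the noise level below are assumptions for illustration, not the authors' exact values.

```python
import numpy as np

# Sparse, noisy observations of sin(10x) on [-1, 1] (hypothetical stand-in for Figure 1)
np.random.seed(0)
X = jnp.asarray(np.sort(np.random.uniform(-1.0, 1.0, 15)))
y = jnp.sin(10 * X) + 0.1 * jnp.asarray(np.random.normal(size=len(X)))
```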

We run the HMC for the model defined in Eq (1) on the data shown in Figure 1 to obtain posterior samples of the model parameters. Then we compute the posterior predictive distribution according to Eq (2) over a dense uniform grid of points between -1 and 1. The results are shown below:

Figure 2. The GP predictions sampled from the multivariate normal posterior in Eq 2a. The center of mass of the sampled means (Eq 2b) is also shown. Figure by the authors.
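In code, these two steps might look as follows, reusing gp_model and gp_predict from above; the sampler settings and grid size are assumptions.

```python
from jax import random
from numpyro.infer import MCMC, NUTS

def predict_on_grid(rng_key, X, y, X_new, samples):
    """Posterior predictive mean and uncertainty: one MVN draw per HMC sample."""
    keys = random.split(rng_key, len(samples["k_length"]))
    draws = jnp.stack([
        gp_predict(k, X, y, X_new, s, l, n)[1]
        for k, s, l, n in zip(keys, samples["k_scale"], samples["k_length"], samples["noise"])
    ])
    return draws.mean(axis=0), draws.std(axis=0)

# Run HMC (NUTS) on the observations from Figure 1
rng_key, key_fit, key_pred = random.split(random.PRNGKey(0), 3)
mcmc = MCMC(NUTS(gp_model), num_warmup=1000, num_samples=1000)
mcmc.run(key_fit, X, y)

# Posterior predictive over a dense uniform grid between -1 and 1
X_new = jnp.linspace(-1.0, 1.0, 200)
y_mean, y_std = predict_on_grid(key_pred, X, y, X_new, mcmc.get_samples())
```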

Note that a unique aspect of the GP compared to many other interpolation methods is that it provides information both on the expected function values and on the uncertainty at each predicted point. The latter is simply the variance (or standard deviation) of the sampled predictions. For the data in Figure 2, we get a large uncertainty between -0.75 and -0.25 since there are no observations in that area. Note that the GP model still did a pretty good job of predicting the mean function value over that region.

Before proceeding to how the uncertainty can be used to navigate a parameter space, let us quickly see what happens if we change the kernel function. Specifically, we will replace our RBF kernel with a periodic kernel of the form exp(-2sin²(π(xᵢ-xⱼ)/p)/l²) with a uniform prior over the period, p ~ Uniform(0.1, 1.0). We then re-run the HMC and re-compute our posterior predictive distribution. The results for the periodic kernel are shown below.

Figure 3. The GP prediction with the periodic kernel. Figure by the authors.
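A sketch of this kernel, following the same conventions as the rbf_kernel helper above; the commented lines show how the period prior could enter the model.

```python
def periodic_kernel(X1, X2, sigma, length, period):
    """Periodic kernel: sigma^2 * exp(-2 sin^2(pi (x_i - x_j) / p) / l^2)."""
    sin_term = jnp.sin(jnp.pi * (X1[:, None] - X2[None, :]) / period)
    return sigma ** 2 * jnp.exp(-2.0 * sin_term ** 2 / length ** 2)

# Inside the model, the period gets a uniform prior and the kernel is swapped in:
#   period = numpyro.sample("period", dist.Uniform(0.1, 1.0))
#   K = periodic_kernel(X, X, sigma, length, period) + noise * jnp.eye(X.shape[0])
```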

Clearly, the predictions are much better now (and the model is much more certain about them!). This suggests that the selection of an appropriate GP kernel can be guided by the physics of the problem.

The GP-predicted mean function value and the associated uncertainty can be used to derive the so-called acquisition function for the selection of the next measurement point in the parameter space. One of the commonly used acquisition functions is the upper confidence bound (UCB), which is essentially a linear combination of the predicted mean and uncertainty,

UCB(x) = μ(x) + k σ(x)    (3)

where μ(x) and σ(x) are the GP predictive mean and standard deviation, and k < 0 for minimization problems and k > 0 for maximization problems.
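As a simple sketch, reusing the predictive mean and uncertainty computed above:

```python
def ucb(mean, std, k=-2.0):
    """UCB acquisition (Eq 3): a linear combination of predicted mean and uncertainty.
    Use k < 0 for minimization and k > 0 for maximization."""
    return mean + k * std

# For a minimization problem, the next measurement point minimizes the acquisition
acq = ucb(y_mean, y_std, k=-2.0)
x_next = X_new[jnp.argmin(acq)]
```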

Below we show a simple 1D example of how the acquisition function works. Suppose we want to minimize some black-box function for which we have about a dozen and a half noisy observations (Figure 4). Without taking the uncertainty into account, our next measurement point would be at ~0.15, which appears to be the function's minimum based on the available observations and the GP reconstruction. However, there is relatively high uncertainty between 0.4 and 0.6, as well as above ~0.65, so maybe there is another, deeper minimum hiding there. The UCB acquisition function takes this uncertainty into account and outputs ~0.5 as the next point to measure. We then perform a measurement at that point, update our observations, and (if necessary) "re-train" our GP model to derive another acquisition function for selecting the next measurement point, as sketched in the code after Figure 4. This balance (or trade-off) between exploiting the regions close to the already measured points and exploring the "unknown" is at the heart of Bayesian optimization. There are, of course, many other, more sophisticated acquisition functions, and we refer interested readers to the review on Bayesian optimization co-authored by Nando de Freitas.

Figure 4. The GP prediction of the unknown objective function (left) and the derived UCB acquisition function (see Eq 3) for k = -2 (right). Figure by the authors.
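The exploration loop described above can be sketched as follows, reusing the helpers from the earlier snippets; measure_at is a hypothetical stand-in for a real measurement or simulation, and the number of steps is arbitrary.

```python
def measure_at(x):
    """Hypothetical stand-in for measuring the black-box objective at point x."""
    return float(jnp.sin(10 * x)) + 0.1 * float(np.random.normal())

for step in range(10):
    rng_key, key_fit, key_pred = random.split(rng_key, 3)
    # 1) (Re)train the GP on the current observations
    mcmc.run(key_fit, X, y)
    # 2) Predict mean and uncertainty over the candidate grid and build the acquisition
    y_mean, y_std = predict_on_grid(key_pred, X, y, X_new, mcmc.get_samples())
    acq = ucb(y_mean, y_std, k=-2.0)
    # 3) Measure at the suggested point and augment the dataset
    x_next = X_new[jnp.argmin(acq)]
    X = jnp.append(X, x_next)
    y = jnp.append(y, measure_at(x_next))
```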

To illustrate the Bayesian optimization process for a real physical system, we explore the phase diagram of the Ising model, as described in detail in our recent paper. Here, we used GP-based Bayesian optimization with the UCB acquisition function for the effective exploration of the parameter space of the lattice Hamiltonian and for mapping the regions where specific macroscopic functionalities are maximized. In Figure 5 below, we show, via individual snapshots, an example of discovering the regions of the parameter space of the next-nearest-neighbor Ising model where the heat capacity is maximized. One can see that the algorithm quickly zeroes in on the regions of interest, allowing the discovery of areas with the highest heat capacity far more efficiently than a regular grid search.

Figure 5. GP-based Bayesian optimization of heat capacity in the Ising model. Shown are all the points discovered by GP at steps 100, 200, 300, 400, and 600, as well as a full grid simulation (‘ground truth’). Figure by the authors.

To summarize, here we have introduced the Gaussian Process and Bayesian Optimization methods and illustrated their implementation and application to a real physics problem. In addition to the analysis of the Ising model, the GP has been used to study hysteresis loops in ferroelectrics, explore the compositional spaces of hybrid perovskites in automated synthesis, achieve super-resolution in scanning probe and electron microscopy and spectroscopy, and even control an operational microscope exploring polarization dynamics in ferroelectric thin films.

Yet despite the power of the GP, the simple implementation discussed here has a number of significant limitations. The first and most obvious one is that the parameter space in which the GP operates tends to be fairly low-dimensional, with optimal performance usually limited to fewer than ~6 dimensions. The second limitation is that the structure of the correlations is defined by the kernel function, and while a large number of kernel functions, including anisotropic and periodic ones, have been proposed, most of these are phenomenological in nature. Finally, the GP tends to be computationally intensive, since exact inference scales cubically with the number of observations.

However, as in many cases, the basic GP approach can be readily extended in other directions. For example, the combination of deep neural networks and Gaussian processes yields the family of deep kernel learning (DKL) methods that combine the expressivity of deep learning with the flexibility of the GP. It is also possible to directly inject prior physical knowledge by substituting the GP's constant mean function with a probabilistic model of the expected system behavior (a structured GP, or sGP). Stay tuned for future blog posts about DKL and sGP applications in the physical sciences!

Please also check out our GPim and gpax software packages for applying GPs to scientific data, with a focus on imaging and spectroscopy. Finally, as is customary in the scientific world, we acknowledge the sponsor that funded this research. This effort was performed at and supported by Oak Ridge National Laboratory's Center for Nanophase Materials Sciences (CNMS), a U.S. Department of Energy, Office of Science User Facility. You can take a virtual walk through it using this link and tell us if you want to know more.

The executable Google Colab notebook with basic GP examples discussed in the text is available here.
