Smart Chemical Systems

Black-Box Chemical Process Optimization

Relying on decision support for which experiment to perform next.

Georgi Tancev
Towards Data Science
12 min read · Oct 24, 2023


Introduction

Designing and optimizing chemical processes is one of the main tasks in process engineering. When setting up a chemical system (e.g., a unit operation) with a large number of design parameters, the question of how to quickly arrive at the optimal design(s) arises frequently. If one had a model of the system, one could solve the problem numerically, i.e., by optimizing with respect to a specified metric (e.g., yield, material properties, cost, and so on). Often, however, this is not possible because the relationships (e.g., kinetics, physical phenomena) are not fully understood, or not even known at all, so formulating equations is not an option.

In such cases, the only option left is to find an optimal design by means of empirical models fed with data from experiments. Traditionally, for instance, one can refer to response surfaces and central composite designs to identify optimal operating conditions. These make use of local second-order approximations and gradient ascent/descent to locate the best configuration.

This article, however, is devoted to an alternative strategy, namely Bayesian optimization, which is related to reinforcement learning and has been successfully applied to the design of materials, chemical reactions, and drugs. It offers advantages like higher flexibility of models and processing of multi-fidelity information. The latter refers to the fact that mixed-quality data from different sources can be used for optimization, for example when physical models are at least rudimentarily available.

Problem Definition

Multi-Armed Bandits

You may ask, why do we need yet another optimization method? Let me answer this question by setting up the following scenario for you. You find yourself in front of a slot machine with k arms, that is, you have k arms to pull, and each arm i has a probability pᵢ to give you a reward rᵢ. Evidently, your goal will be to maximize your total reward R, but you only have a finite number of attempts (or budget) T. This is the so-called multi-armed bandit problem.

This problem is difficult because we do not initially know the probabilities or the rewards. With our T attempts, we have to simultaneously explore in order to “learn” the pᵢ’s and rᵢ’s but also to exploit rewarding arms (i.e., pull arms with a high expected reward pᵢrᵢ as frequently as possible) in order to accumulate them. This is the exploration-exploitation dilemma.

On the one hand, we need to experiment and try out different arms; on the other hand, we have to stick to the same, promising arm(s). We have to balance both objectives carefully due to the limited budget. The situation is analogous when searching for optimal design settings: we need to learn as much as possible about our objective function through experiments, but also remain in the proximity of promising maximizers. We transition into the realm of Bayesian optimization when we move from the discrete setting with k arms to the continuous setting with infinitely many arms.
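To make the trade-off concrete, here is a minimal epsilon-greedy simulation of a three-armed bandit; all probabilities, rewards, and the value of epsilon are illustrative assumptions, not taken from any real system.

# Minimal epsilon-greedy bandit simulation (illustrative values).
import numpy as np

rng = np.random.default_rng(42)
p = np.array([0.2, 0.5, 0.8])  # unknown success probabilities
r = np.array([1.0, 1.0, 1.0])  # rewards per arm
T, epsilon = 100, 0.1  # budget and exploration rate
estimates = np.zeros(3)  # running estimates of the expected rewards
counts = np.zeros(3)
total_reward = 0.0

for t in range(T):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    i = rng.integers(3) if rng.random() < epsilon else int(np.argmax(estimates))
    reward = r[i] if rng.random() < p[i] else 0.0
    counts[i] += 1
    estimates[i] += (reward - estimates[i]) / counts[i]  # incremental average
    total_reward += reward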

Bayesian Optimization

Roughly speaking, the idea is to sequentially learn the function f(x) that we want to optimize over a domain of interest 𝒳 and to move towards its maximum xᵒᵖᵗ,

xᵒᵖᵗ = arg max f(x), s.t. x ∈ 𝒳.

Typically, Bayesian optimization works well for optimization over continuous domains of less than 20 dimensions, and it tolerates stochastic noise in function evaluations. The method has two main “ingredients”: Gaussian processes and acquisition functions.

Gaussian Processes

The first ingredient in Bayesian optimization is the Gaussian process, which is best understood by deriving it from Bayesian linear regression. Bayesian statistics is an alternative theory in the field of statistics based on the Bayesian interpretation of probability in which probability expresses a degree of belief or information (i.e., knowledge) about an event. This differs from the frequentist interpretation that views probability as the limit of the relative frequency of an event after many trials.

Let us assume that the function f(x) that we want to maximize follows a model

y = f(x) + ϵ = βᵀx + ϵ, with x, β ∈ ℝᵈ.

Although this model is linear in the parameters, non-linearity can still be captured by applying a basis expansion to the inputs. Furthermore, we stack the n data points/samples into a matrix X = [x₁, …, xₙ]ᵀ and a vector y = [y₁, …, yₙ]ᵀ.

Bayesian statistics revolves around Bayes’ theorem, which states that

p(β|y) = p(y|β) ⋅ p(β) / p(y).

Although not explicitly stated, some of the distributions are also conditionally dependent on the data realization X. In particular, we are interested in the posterior distribution p(β|y), given our prior knowledge p(β) and the data distribution p(y|β). The prior expresses our knowledge before seeing the data, the posterior after. In the Bayesian interpretation of statistics, a parameter realization of the posterior distribution is essentially one possible model among many. If one interprets a model as a theory, then the prior covers “all possible theories”, but the data support some theories more than others, resulting in higher posterior probabilities for certain theories, i.e., parameter combinations.

If we assume a normal (i.e., Gaussian) prior p(β) = 𝒩(0, I) for the parameters and a normal likelihood p(y|β) = 𝒩(Xβ, σₙ²I) for our data, the posterior will be analytically tractable and normal as well; the multi-variate posterior distribution p(β|y) = 𝒩(μ, Σ) becomes

μ = (XᵀX + σₙ²I)⁻¹Xᵀy,

Σ = (σₙ⁻²XᵀX + I)⁻¹.

If we evaluate the model at the samples in X, we obtain the posterior predictive distribution for the output vector

p(y|X) = 𝒩(Xμ, XΣXᵀ + σₙ²I).

Fig. 1 illustrates one simple example (a polynomial function of degree four) in the one-dimensional case. Note how the ground truth is contained in the predicted interval.

Fig. 1: Predictive posterior distribution (μ ± 2σ). © Georgi Tancev.

The plot shows how the uncertainty increases with increasing distance from the data. The overall uncertainty consists of an epistemic part (XΣXᵀ) that decreases as the amount of data increases (the entries of XᵀX grow with more data, so its inverse, and with it Σ, shrinks), and an aleatoric part (σₙ²I) that remains constant regardless of the amount of data. This uncertainty is key because it suggests where knowledge is missing, but also where the maximum might be, so we could take a new measurement at those locations in the domain 𝒳.
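These equations translate directly into code; below is a minimal numpy sketch of the posterior and the predictive distribution (the function names are mine and purely illustrative).

# Bayesian linear regression: posterior and predictive distribution.
import numpy as np

def posterior(X, y, sigma_n):
    # mu = (X^T X + sigma_n^2 I)^-1 X^T y, Sigma = (sigma_n^-2 X^T X + I)^-1.
    d = X.shape[1]
    Sigma = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(d))
    mu = np.linalg.solve(X.T @ X + sigma_n**2 * np.eye(d), X.T @ y)
    return mu, Sigma

def predictive(X_new, mu, Sigma, sigma_n):
    # Mean X mu; epistemic part X Sigma X^T, aleatoric part sigma_n^2 I.
    mean = X_new @ mu
    cov = X_new @ Sigma @ X_new.T + sigma_n**2 * np.eye(len(X_new))
    return mean, cov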

Observe that p(y|X) is a joint distribution over the output values, and it captures their covariance (or similarity) with each other. In general, points that are closer to each other will have more similar y-values than points that are further away. The key insight is that we can directly operate on the function values (instead of the parameter values) through a kernel (or covariance) function k(x₁, x₂) that measures the similarity between samples (i.e., data points), e.g., a linear kernel (with a scale parameter σₛ)

k(x₁, x₂) = σₛ ⋅ x₁ᵀx₂.

However, there are more expressive/flexible kernels such as the squared-exponential (RBF) kernel

k(x₁, x₂) = σₛ ⋅ exp(-γ ⋅ ‖x₁ - x₂‖²).

A kernel has to be symmetric and positive semidefinite — just like the covariance matrix. Furthermore, more sophisticated kernels are obtained through kernel engineering. For instance, for two kernels k₁(x₁, x₂) and k₂(x₁, x₂), their sums and products are also kernels. With such kernels, a kernel matrix K with pair-wise similarities can then be constructed. This is how the Gaussian process arises.
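To make kernel engineering concrete, here is a small numpy sketch that combines the two kernels from above into a sum kernel and assembles the kernel matrix K; the parameter values and names are illustrative.

# Kernel engineering: sums (and products) of kernels are again kernels.
import numpy as np

def k_linear(x1, x2, sigma_s=1.0):
    return sigma_s * np.dot(x1, x2)

def k_rbf(x1, x2, sigma_s=1.0, gamma=1.0):
    return sigma_s * np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2))**2))

def k_combined(x1, x2):
    return k_linear(x1, x2) + k_rbf(x1, x2)

# Kernel matrix K with pair-wise similarities.
X = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
K = np.array([[k_combined(a, b) for b in X] for a in X])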

A Gaussian process y ∼ 𝒢𝒫(μ, k) is defined as a stochastic process in which every finite collection of random variables has a multi-variate normal distribution. Simply put, the kernel matrix “captures” how the individual points correlate with each other, and how this projects to (new) function values.

It is usually assumed that the (prior) mean function is zero. This can be easily ensured by standardizing the output values (i.e., subtracting the empirical average). To obtain the value for a new test point x*, we simply condition on the known data 𝒟 = {X, y}.
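Although not spelled out above, this conditioning step has a well-known closed form, implemented in the following numpy sketch for a squared-exponential kernel (the helper names and the noise level are my assumptions).

# Gaussian process posterior by conditioning on the data D = {X, y}.
import numpy as np

def rbf(A, B, sigma_s=1.0, gamma=1.0):
    # Pair-wise squared-exponential kernel between rows of A (n x d) and B (m x d).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma_s * np.exp(-gamma * d2)

def gp_posterior(X, y, X_new, sigma_n=0.1):
    # mu* = k(X*, X)(K + sigma_n^2 I)^-1 y,
    # k*  = k(X*, X*) - k(X*, X)(K + sigma_n^2 I)^-1 k(X, X*).
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    K_s = rbf(X_new, X)
    mu = K_s @ np.linalg.solve(K, y)
    cov = rbf(X_new, X_new) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, cov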

An important observation is that this method makes inferences directly from the memorized data; in many other methods, a loss function must first be optimized. Nonetheless, the kernel hyperparameters (e.g., σₛ, γ, σₙ) need to be fixed, which can be done by maximizing the marginal likelihood p(y), which is itself an optimization problem.

Acquisition Functions

With the same data as before, we can now fit a Gaussian process with a squared-exponential kernel including a diagonal noise term (Fig. 2).

Fig. 2: Predictive posterior distribution (μ ± 2σ) with a Gaussian process. © Georgi Tancev.

The uncertainty or confidence band suggests plausible function values at different locations in our domain. Observe that the uncertainty is larger than in Fig. 1, as the squared-exponential kernel offers more flexibility by implicitly including more basis functions, making a larger set of function values plausible.

The question we are asking is where in the domain we believe the maximum to be. In particular, the maximum of the upper confidence bound provides information about where in the domain the highest function values can be expected. Using this upper confidence bound,

x′ = arg max μ*(x) + θₜ √k*(x, x), s.t. x ∈ 𝒳,

as an acquisition function, we can acquire a plausible maximizer of f(x), i.e., a point x′ at which we believe (or expect) the optimum of f(x) to be, given our current knowledge about f(x). The factor θₜ balances exploration and exploitation. If we are too exploitative (low θₜ), we get stuck in local optima, i.e., in the proximity of previously discovered maxima. If we are too explorative (high θₜ), we make little use of the information from previously identified optima.

Theoretical results suggest schedulers such as θₜ ∝ √(log t) for optimal performance, although these can lead to excessive exploration in practice. Alternatively, θₜ can be kept constant; to avoid falling into a local optimum, a random experiment can occasionally be run instead of the experiment suggested by the acquisition function.
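Such an occasional random experiment could look as follows; this is a sketch in which epsilon and the domain bounds are assumptions, and x_proposed denotes the acquisition maximizer computed as in the next section.

# Occasionally replace the suggested experiment with a random one.
import numpy as np

rng = np.random.default_rng()
epsilon = 0.1  # probability of a random experiment (assumption)
low, high = -10.0, 10.0  # domain bounds

x_next = rng.uniform(low, high) if rng.random() < epsilon else x_proposed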

Examples

An example is illustrated in the code block below; “GP” refers to the Gaussian process model instance. The predict call with return_std follows scikit-learn’s API; GPy or GPflow models offer analogous functionality with slightly different signatures.

# Imports.
import numpy as np
from scipy.optimize import minimize

# Define acquisition function (negative upper confidence bound, since we minimize).
def acquisition(x, model, theta=1.0):
    x = np.asarray(x)
    y_pred, y_std = model.predict(x.reshape(-1, 1), return_std=True)
    return -(y_pred + theta * y_std)[0]  # scalar value for the optimizer

x0 = 0.0  # initial value for optimization
domain = [[-10, 10]]  # (safe) region of interest
res = minimize(acquisition, x0, args=(GP,), bounds=domain)
x_proposed = res.x  # retrieve solution
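For completeness, the model instance GP could, for example, be a scikit-learn Gaussian process whose hyperparameters are fitted by maximizing the marginal likelihood; the kernel choice and the data names X_data, y_data here are illustrative assumptions.

# Possible scikit-learn model instance (kernel choice is illustrative).
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

kernel = ConstantKernel() * RBF() + WhiteKernel()  # scale * RBF + diagonal noise
GP = GaussianProcessRegressor(kernel=kernel).fit(X_data, y_data)  # collected data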

Since this is not a convex problem, it is strongly recommended to run the optimization several times and to pick the best solution, as sketched below. In a second step, we then conduct an experiment under the conditions x′, add the freshly collected data {x′, y′} to our dataset, 𝒟 ← 𝒟 ∪ {x′, y′}, and refit our model (Fig. 3). We can see that after one such iteration, we have not quite reached the optimum yet, but we have learned more about f(x). If we repeat the same process and ask for a new experimental condition, we may already be done.
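A minimal multi-start sketch, reusing acquisition, GP, and domain from above (the number of restarts is arbitrary):

# Multi-start optimization of the acquisition function.
import numpy as np
from scipy.optimize import minimize

starts = np.random.uniform(-10.0, 10.0, size=20)  # random initial values
results = [minimize(acquisition, x0, args=(GP,), bounds=domain) for x0 in starts]
best = min(results, key=lambda res: res.fun)  # acquisition is negated, so minimize
x_proposed = best.x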

Fig. 3: Predictive posterior distribution (μ ± 2σ) and updated predictive posterior distribution (μ ± 2σ) with a Gaussian process. © Georgi Tancev.

Let us look at another dataset, i.e., another realization from the same distribution, before and after acquiring a new data point (Fig. 4).

Fig. 4: Predictive posterior distribution (μ ± 2σ) and updated predictive posterior distribution (μ ± 2σ) with a Gaussian process. © Georgi Tancev.

In this example, the acquisition function suggests that the maximum of f(x) is at the border of the domain. After performing this experiment, we realize that this is not the case. Still, we have learned something from this experiment, and with the following one we will identify the maximizer of f(x). We can show this with a final dataset (Fig. 5).

Fig. 5: Predictive posterior distribution (μ ± 2σ) and updated predictive posterior distribution (μ ± 2σ) with a Gaussian process. © Georgi Tancev.

The upper confidence bound is not the only acquisition function available. Other popular choices include Thompson sampling, probability of improvement, and expected improvement. They differ, for example, in how they balance exploration and exploitation. Whether and how fast we actually end up identifying (i.e., converging to) the maximizer also depends on the sequence of θₜ’s and on a correct model specification (i.e., kernel choice).
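For illustration, expected improvement for maximization could be implemented analogously to the upper confidence bound above; this is a sketch, with y_best denoting the best function value observed so far (an assumption, not from the original).

# Expected improvement: EI(x) = (mu - y_best) * Phi(z) + sigma * phi(z).
import numpy as np
from scipy.stats import norm

def expected_improvement(x, model, y_best):
    x = np.asarray(x)
    mu, sigma = model.predict(x.reshape(-1, 1), return_std=True)
    z = (mu - y_best) / sigma
    return -((mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z))[0]  # negated for minimize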

This is just a very short summary of the things we need to know about Gaussian processes, acquisition functions, and Bayesian optimization. Evidently, many details had to be left out. For other topics (kernel engineering, optimization of kernel parameters, and so on) it is worth taking a look at the relevant literature. Let us now move on to a short case study on multi-fidelity information.

Multi-Fidelity Information

In practice, we may have partial knowledge about our system, i.e., data of lower fidelity from a simple model. For instance, we may have a numerical simulation or a mathematical model that covers only a subset of the potential effects or is valid only within a subset of our domain. It would be sensible to somehow include that information in our experimental designs.

Let us assume that a data point can come either from a model or from an experiment, resulting in multi-fidelity data. Furthermore, we have to correct our model for data points that come from experiments. Thus, we introduce a new variable w that tracks the origin of a data point; it will be the case that w = 1 if the sample comes from an experiment, and w = 0 otherwise. Then, we can decompose our function as follows.

f = fₘ + w fᵣ

fₘ refers to the contribution from our simple model and fᵣ is a correction term, which is only active for laboratory experiments (w = 1). By requiring independence (i.e., orthogonality) between fₘ and fᵣ, the resulting kernel will be

k((x₁, w₁), (x₂, w₂)) = kₘ(x₁, x₂) + w₁w₂ ⋅ kᵣ(x₁, x₂).

In GPy, we can define this as follows (assume each input row has length two, with x in the first column and w in the last):

# Import GPy.
import GPy as gp

# Define kernel: k_m acts on x (column 0), the linear kernel k_w on w (column 1).
k_m = gp.kern.RBF(input_dim=1, active_dims=[0])
k_w = gp.kern.Linear(input_dim=1, active_dims=[1])
k_w.constrain_fixed(1.0)  # fix the scale parameter of the linear kernel
k_f = gp.kern.RBF(input_dim=1, active_dims=[0])
k = k_m + k_w * k_f  # k_m(x1, x2) + w1 * w2 * k_f(x1, x2)

# Define model.
model = gp.models.GPRegression(X_model, Y_model, kernel=k, normalizer=True)

# Optimize hyperparameters by maximizing the marginal likelihood.
model.optimize(ipython_notebook=True)
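For reference, the two-column input could be assembled as follows; the variable names x_values, w_flags, and y_values are mine, not from the original.

# Assemble the multi-fidelity inputs: column 0 holds x, column 1 holds w.
import numpy as np

X_model = np.column_stack([x_values, w_flags])  # w = 0: model, w = 1: experiment
Y_model = y_values.reshape(-1, 1)  # GPy expects two-dimensional targets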

The noise is handled by the GPRegression class; we don’t need to specify it unless we expect different noise terms for experiments and model evaluations. Fig. 6 illustrates an example of the resulting Gaussian process, trained only on some model data. Evidently, our knowledge fails at the borders of the domain, where the ground truth deviates from our simple model and the resulting low-fidelity data.

Fig. 6: Multi-fidelity data. © Georgi Tancev.

We are now ready to perform our experiments. We query the acquisition function for proposals. This is shown in the simulations below (Fig. 7) with two different θₜ’s.

Fig. 7: Bayesian optimization with multi-fidelity data; left: θₜ=0.75, right: θₜ=0.95. © Georgi Tancev.

The acquisition starts directly at the borders of the domain, as this is where we expect the maximizer to be according to the behavior of our low-fidelity data (Fig. 6). In both cases, we find the maximum relatively quickly. If we notice that successive experiments are relatively close to each other, we can assume that we are done and can stop — even before we have used up our budget.
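A hypothetical stopping rule along these lines, where the tolerance and the name x_previous are assumptions:

# Stop early if successive proposals nearly coincide.
import numpy as np

tol = 1e-2
if np.linalg.norm(x_proposed - x_previous) < tol:
    print("Converged: successive experiments nearly coincide.")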

Conclusion

Having reached the end of the article, I am confident that I have been able to convince you of the merits of Bayesian optimization, which is currently a large research field. I personally believe that such tools will find wide application in decision support in the future, especially in experimental science and engineering, due to their ability to process knowledge from different sources. In this way, time and money can be saved, and new ideas can be generated, for example experiments that no one has ever thought of before.
