An Introduction To Surrogate Optimization: intuition, illustration, case study, and the code

Want to deliver faster optimization for time-consuming objective functions? Consider surrogate optimization.

Fig. 1 Optimizing a complex function. (Image by Author)

In engineering, a product optimization analysis involves finding the design parameter combination such that the objective function, e.g., product performance, manufacturing costs, etc., is maximized/minimized globally across the design parameter space.

The optimization routine requires many iterations to locate the global optimum. Within each iteration, high-fidelity, time-consuming computer simulations are usually conducted to evaluate the current design’s objective function. Consequently, the whole optimization process could easily take up to days or even weeks if the objective function is complex.

In our engineering team, we have implemented an optimization strategy called surrogate optimization to combat this issue. The results were quite impressive: we could easily achieve a roughly 50-fold increase in optimization speed!

In this post, I want to share with you the technical details behind this method. In particular, I will discuss

  • how to leverage the surrogate model to accelerate objective evaluations, and
  • how to employ active learning to refine the surrogate model on-the-fly to ensure optimization accuracy.

I’ve also created a companion Jupyter Notebook, which demonstrates how to do surrogate optimization in Python. All the figures illustrated in this article are reproduced in the notebook.

OK, let’s get started!

Table of Contents
  • 1. Surrogate Optimization: 1.1 The Idea, 1.2 Surrogate Model Training, 1.3 The Workflow
  • 2. Active Learning: 2.1 The Idea, 2.2 Estimation of Prediction Uncertainty, 2.3 Learning Function
  • 3. Expected Improvement Function: 3.1 Case Study Settings, 3.2 Intuition, 3.3 Formula
  • 4. Surrogate Optimization in Action: 4.1 The Workflow, 4.2 First Iteration, 4.3 Further Iterations
  • 5. Takeaways
  • Reference
  • About the Author


1. Surrogate Optimization

1.1 The Idea

Since calling computer simulations really slows down the optimization process, why not replace the expensive simulations with a pre-trained statistical model that accurately approximates the objective function given the same design parameters? Wouldn’t that improve optimization efficiency?

That’s exactly the basic idea of the surrogate optimization strategy. In the engineering domain, that statistical model is formally known as the surrogate model.

Fig. 2 Surrogate models can replace computer simulations to improve optimization efficiency. (Image by Author)

Essentially, a surrogate model is a statistical model that can accurately approximate a function’s output given the inputs. In general, a single evaluation of the surrogate model is much faster than a single run of the original expensive computer simulation. As a result, performing hundreds or thousands of output evaluations for various combinations of design parameters is no longer a problem. This enables us to freely explore the "landscape" of the objective function, thereby significantly accelerating the optimization.

1.2 Surrogate Model Training

Surrogate optimization cannot be achieved without a surrogate model. Training a surrogate model usually follows these steps:

First of all, we need to collect labeled training data. Toward that end, we probe the objective function at several intelligently selected locations in the design parameter space. At each of these locations, a full simulation is conducted to calculate the corresponding objective values. Later, we assemble the pairs of inputs (design parameters) and their outputs (objective values) into a training dataset.

Second, we need to select the model type. Popular choices include polynomial regressions, support vector machines, Gaussian Processes, neural networks, etc. Here, we could follow the established model selection practices [1] in supervised machine learning.

The final step is to train the model to approximate the objective function based on the collected training dataset.
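
As a minimal sketch of these three steps, consider the scikit-learn snippet below. Here, `run_simulation` is a hypothetical stand-in for your expensive solver, and the Latin hypercube design is just one common way to select probing locations; your setup may differ.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def run_simulation(x):
    """Hypothetical placeholder for the expensive, high-fidelity simulation."""
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

# Step 1: probe the objective at intelligently selected (space-filling) locations.
sampler = qmc.LatinHypercube(d=1, seed=42)
X_train = qmc.scale(sampler.random(n=8), l_bounds=[0.0], u_bounds=[1.0])
y_train = run_simulation(X_train).ravel()

# Steps 2 and 3: select a model type (here, a Gaussian Process) and train it
# on the assembled (design parameters, objective values) pairs.
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), normalize_y=True)
surrogate.fit(X_train, y_train)
```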

To read more on surrogate modeling, check out this post:

An introduction to surrogate modeling, Part I: fundamentals

1.3 The Workflow

Once the surrogate model is trained, it is ready to be integrated into the optimization procedure. Overall, the surrogate optimization workflow looks like this:

Fig. 3 The workflow of surrogate optimization. (Image by Author)

Obviously, the reliability of the optimization now depends on the accuracy of the underlying surrogate model. What if the objective function is so complex that we need a large number of training samples to build a sufficiently accurate surrogate model?

Since generating training samples is quite expensive, what good is surrogate optimization if it merely shifts the computational cost upfront?

The trick is that you don’t actually need to build a surrogate model that is accurate everywhere. The surrogate model only has to be reliable in the region where the global optimum might exist.

How exactly can we achieve that? Read on.


2. Active Learning

2.1 The Idea

Active learning refers to building a statistical model with a gradually enriched training dataset as the training progresses.

The motivation behind active learning is simple: as the statistical model learns more about the "landscape" of the underlying input-output function, the model would know in which regions it has already learned well and in which regions it still feels uncertain. Therefore, the model could actively ask for more training samples in those uncertain regions to gain further knowledge.

The steps illustrated below encode the fundamental idea of active learning.

Fig. 4 The fundamental idea of active learning. (Image by Author)

To successfully pull off active learning, we need two things:

  • A surrogate model that can estimate its prediction uncertainty;
  • A learning function.

2.2 Estimation of Prediction Uncertainty

First of all, the surrogate model should estimate its own prediction uncertainty, which usually takes the form of the variance or the standard deviation. This is crucial for achieving active learning, as prediction variance represents the state of the surrogate model’s knowledge regarding the "landscape" around this specific prediction site.

For example, a low variance value means that the surrogate model is rather confident in its prediction, thus indicating the model has already gained enough knowledge in the region surrounding this prediction site. A high variance value, on the other hand, means that the prediction is very uncertain, indicating the region around this prediction site remains underexplored for the model.

Various model types in supervised learning offer this kind of prediction uncertainty estimate. In general, if your model is trained using Bayesian statistics, you can use the posterior predictive distribution to quantify the prediction uncertainty. Our favorite model type is the Gaussian Process: it performs quite well in the small-data regime, and it assigns a normal distribution to each prediction, meaning that we automatically get the prediction variance as an uncertainty indicator.
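
For instance, assuming the `surrogate` GP fitted in the sketch above, scikit-learn exposes both the prediction mean and the standard deviation:

```python
import numpy as np

# Query the fitted GP; return_std=True yields the per-location standard
# deviation, i.e., the model's own uncertainty estimate.
X_query = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
mu, sigma = surrogate.predict(X_query, return_std=True)

# Small sigma: the model is confident (well-explored region).
# Large sigma: the model is uncertain (underexplored region).
print(X_query[np.argmax(sigma)])  # the least-confident location
```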

2.3 Learning Function

Besides having a surrogate model that can quantify its own prediction uncertainty, the other key enabler is a well-crafted learning function.

A learning function guides sample enrichment by taking into account the prediction uncertainty estimated by the surrogate model. Generally, a new sample is selected at the location in the parameter space where the learning function is maximized or minimized (depending on the learning objective). The true label of this sample is then computed by running the simulation. Later on, we add this newly labeled sample to the current training dataset, thus completing one iteration of active learning.

For example, suppose that we want to build a surrogate model that is accurate over the entire parameter space. What learning function should we use? Well, we could directly set the learning function to the prediction variance. Each time we enrich the training dataset, we look for the location in the parameter space where the surrogate model yields the largest prediction variance. In other words, we look for the location where the learning function is maximized.

This makes intuitive sense since the largest variance value indicates that the surrogate model feels least confident about its accuracy for this specific sample. Therefore, by providing the surrogate model with the true label of this particular sample, we would expect a larger amount of information gain than adding samples at other locations.
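
One possible sketch of this naive, variance-based learning function (reusing the `surrogate` from above) could look like this:

```python
import numpy as np

def next_sample_by_variance(surrogate, n_candidates=1000):
    """Naive learning function: pick the location of maximum prediction variance."""
    X_cand = np.linspace(0.0, 1.0, n_candidates).reshape(-1, 1)
    _, sigma = surrogate.predict(X_cand, return_std=True)
    return X_cand[np.argmax(sigma)]
```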

In practice, however, this naive learning function does not always perform well. Most of the time, it simply enriches the samples evenly across the whole parameter space. Consequently, if the investigated parameter space is large, we would need to enrich many samples to fill the space.

Clearly, this is not an efficient way to go. We need something better, something more tailored to our optimization goal.


3. Expected Improvement Function

Now is the time to introduce the learning function designed for global optimization purposes: the expected improvement function. In Sections 3 and 4, we will see how this learning function works in practice by optimizing a test function.

3.1 Case Study Settings

For this case study, our goal is to find the global minimum of an objective function f(x) using the surrogate optimization strategy. For demonstration purposes, we assume the true underlying function is f(x) = (6x−2)²sin(12x−4), x∈[0, 1]. By plotting the function, we can see that it reaches its global minimum at x=0.76.
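
In code, a quick sanity check of this claim is possible here only because the test function is cheap to evaluate:

```python
import numpy as np

def f(x):
    """The true underlying objective: f(x) = (6x - 2)^2 * sin(12x - 4)."""
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

# Brute-force grid check (for illustration only; infeasible for real simulations).
x = np.linspace(0.0, 1.0, 10001)
print(x[np.argmin(f(x))], f(x).min())  # roughly 0.757 (i.e., x ~ 0.76) and -6.02
```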

Fig. 5 The true underlying function. (Image by Author)

In practical settings, however, f(x) has an unknown expression and can only be evaluated via a complicated, time-consuming computer simulation. Therefore, to reflect reality, in the following optimization process, we will

  • treat f(x) as a black-box function;
  • limit the number of f(x) evaluations.

Those constraints rule out the feasibility of using a brute force approach to search for the global minimum. Instead, we will use the surrogate optimization strategy to achieve the goal.

Here, a Gaussian Process (GP) model is used as the surrogate model. You don’t have to be familiar with the GP method to understand how surrogate optimization works. As I mentioned before, surrogate optimization can be achieved with various model types; the only requirement is that they estimate their prediction uncertainty. For now, all that matters is that the prediction y of the GP model at a sample location x follows a normal distribution, i.e., y ~ N(μ(x), σ²(x)), where μ(x) and σ²(x) denote the prediction mean and variance, respectively.

3.2 Intuition

We kick off the optimization procedure by training an initial GP model with the 6 training samples shown in the figure below. If we decide to trust this GP model, we will capture the wrong global minimum (shown as the green dot). The reason is obvious: the trained GP model’s predictions are far off in the region x=0.6~1, where the actual global minimum is located. This simple test serves as a strong motivation for model refinement.

Fig. 6 The initial GP model failed to capture the true global minimum. (Image by Author)
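
A sketch of this initial fit is shown below, reusing f from above. The six sample locations are placeholders for illustration; the notebook's exact design may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Six initial training samples (placeholder locations for illustration).
X_init = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).reshape(-1, 1)
y_init = f(X_init).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), n_restarts_optimizer=10)
gp.fit(X_init, y_init)

# Naively trusting the surrogate mean can return the wrong minimizer (cf. Fig. 6).
x_grid = np.linspace(0.0, 1.0, 1001).reshape(-1, 1)
print(x_grid[np.argmin(gp.predict(x_grid))])
```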

Ok, we need to add more training samples. But without knowing the underlying function, how can we know which location is the best to allocate the next sample? The solution lies in the prediction uncertainty estimated by the GP model.

Recall that at any location x, the GP prediction y(x) follows a normal distribution, i.e., y(x) ~ N(μ(x), σ²(x)). Therefore, depending on the distribution’s extent, the true function value could be even lower than the current minimum yₘᵢₙ, as highlighted by the shaded area below.

Fig. 7 Due to the GP prediction uncertainty, there is an improvement potential even when the nominal prediction is larger than the current minimum. (Image by Author)

In that case, the GP model believes that, at location x, there is some potential for improvement, i.e., the potential of finding a y(x) value smaller than the current minimum yₘᵢₙ.

Naturally, we would be interested in identifying the location x in the parameter space such that the associated improvement potential is the highest. It then makes intuitive sense to refine the GP model at location x, which would bring us one step closer to the true global minimum.

In the next section, I will quantify the concept of "improvement potential."

3.3 Formula

Quantitatively, we can define the improvement I in the following way:

I(x) = max(yₘᵢₙ − y(x), 0),

i.e., the improvement at location x is the amount by which y(x) falls below the current minimum, and zero otherwise.

Since y(x) is a random variable, I(x) also becomes a random variable. Therefore, we could calculate the expected value of I(x) and use it to measure the improvement potential.

For the random variable I(x), its expected value E[I(x)] can be calculated as

E[I(x)] = ∫ (yₘᵢₙ − y) p(y) dy, with the integral taken over y from −∞ to yₘᵢₙ,

where p(y) is the probability density of the normal distribution N(μ(x), σ²(x)). Graphically, the expression for E[I(x)] can be understood as a weighted sum of the possible values that I(x) can take, where each value is weighted by the probability that y falls within the associated interval dy. In the limit of an infinitesimal interval, the sum becomes an integral.

Fig. 8 Illustration of how to calculate expected improvement. (Image by Author)
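
To make this weighted-sum picture concrete, here is a small Monte Carlo sketch at a single location, with assumed values for μ(x), σ(x), and yₘᵢₙ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, sigma_x, y_min = 1.0, 1.5, 0.5   # assumed values at one location x

# Draw y(x) ~ N(mu_x, sigma_x^2) and average the improvement I = max(y_min - y, 0).
y_draws = rng.normal(mu_x, sigma_x, size=1_000_000)
improvement = np.maximum(y_min - y_draws, 0.0)
print(improvement.mean())  # Monte Carlo estimate of E[I(x)], ~0.381 for these values
```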

Since we know that y(x) follows a normal distribution N(μ(x), σ²(x)), it is possible to derive an analytical expression for E[I(x)]:

E[I(x)] = (yₘᵢₙ − μ(x)) Φ(z) + σ(x) ϕ(z), with z = (yₘᵢₙ − μ(x)) / σ(x),

where Φ(.) and ϕ(.) are the cumulative distribution function and the probability density function of the standard normal distribution, respectively. The above equation for E[I(x)] is the expected improvement function at location x.
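
A direct translation into Python (a sketch, assuming SciPy's standard normal functions) reads:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_min):
    """Analytical E[I(x)] for minimization, given the GP mean and standard deviation."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (y_min - mu) / sigma
    return (y_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```

For the assumed values above, expected_improvement(1.0, 1.5, 0.5) returns about 0.381, matching the Monte Carlo estimate.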

4. Surrogate Optimization in Action

Now that we’ve found the key component of surrogate optimization, let’s see how this method works in practice.

4.1 The Workflow

To recap, here is the general workflow of surrogate optimization.

Fig. 9 The workflow of surrogate optimization. (Image by Author)

At each learning iteration, we look for a sample location x* such that it maximizes E[I(x)]. Subsequently, we calculate the true objective value f(x*) and enrich the current training dataset with [x*, f(x*)]. Finally, we update the GP model using the enriched dataset, thus completing one learning iteration.
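
Putting the pieces together, a compact sketch of this loop (reusing f, X_init, y_init, gp, and expected_improvement from the earlier sketches; the grid-based maximization of E[I(x)] and the 1e-5 stopping threshold are simplifications) might look like this:

```python
import numpy as np

X, y = X_init.copy(), y_init.copy()
x_grid = np.linspace(0.0, 1.0, 1001).reshape(-1, 1)

for iteration in range(10):
    gp.fit(X, y)                                      # (re)train the surrogate
    mu, sigma = gp.predict(x_grid, return_std=True)
    ei = expected_improvement(mu, sigma, y.min())
    if ei.max() < 1e-5:                               # negligible improvement left
        break
    x_star = x_grid[[np.argmax(ei)]]                  # x* maximizing E[I(x)]
    y_star = f(x_star).ravel()                        # the "expensive" evaluation
    X, y = np.vstack([X, x_star]), np.concatenate([y, y_star])

print("Estimated minimizer:", X[np.argmin(y)].item())
```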

4.2 First Iteration

Let’s walk through the first iteration. The figure on the left shows the initial GP model trained on the 6 initial samples. The 95% confidence intervals (μ±1.96σ) of the GP predictions are also plotted.

Fig. 10 The first iteration. (Image by Author)

On the right, we can see the distribution of E[I(x)] values at different x locations. A prominent spike is observed at x=0.52, indicating a high potential for improvement at that location. This makes sense, as the GP prediction μ(x) at x=0.52 is fairly close to the current minimum, and the associated prediction standard deviation σ(x) is rather large. Although σ(x) is also large in the region x=0.6~1.0, μ(x) there is much larger than the current minimum, so there is little hope of finding an improvement.

So, in this first iteration, we pick x=0.52 as the next sample and calculate f(x)=0.98.

4.3 Further Iterations

With the newly obtained data (x=0.52, f(x)=0.98), we can update the GP model and its predictions, as shown in the figure on the left below.

Fig. 11 The second iteration. (Image by Author)

Since the GP model is now more certain about the behavior of f(x) in the region x=0~0.6, its attention shifts to the underexplored area, x=0.6~1. There, the potential of finding the global minimum is very high, as shown in the E[I(x)] profile on the right.

A prominent spike appears at x=0.72. Therefore, in this second iteration, we pick x=0.72 as the next sample and calculate f(x)=-5.36. Wow, that’s a huge improvement. The GP model has spotted something!

We enrich the current training dataset with (x=0.72, f(x)=-5.36) and re-train the GP model. Here is what we’ve obtained.

Fig. 12 The third iteration. (Image by Author)

Notice that E[I(x)] now has only one spike, at x*=0.76. This means that the GP model has excluded the possibility of improvement in the other regions: its predictions there are quite certain and also much larger than the current minimum. Also, notice that the magnitude of E[I(x)] has dropped considerably. That’s a sign that we are not far from locating the global minimum.

Now we pick x=0.76 as the next sample and calculate f(x)=-6.01. We make another update on the GP model:

Fig. 13 The final iteration. (Image by Author)

We can see that the current GP model makes highly confident predictions almost everywhere. Meanwhile, the maximum E[I(x)] value has dropped to the order of 1e-5, indicating that we’ve identified the global minimum.

For this test function, we know the true global minimum, which is located exactly at x*=0.76. Therefore, we can call this GP-based surrogate optimization a success.

5. Takeaways

In this article, we’ve introduced the fundamental ideas of surrogate optimization and walked through a case study to see how this method is employed in practice. The key takeaways of this article include:

  • Surrogate optimization utilizes a surrogate model to approximate the expensive-to-evaluate objective function, thus greatly improving optimization efficiency;
  • Surrogate optimization employs an active learning strategy to gradually refine the surrogate model and ensure optimization accuracy;
  • Maximizing the expected improvement proves to be an efficient active learning strategy, enabling surrogate model refinement in the vicinity of the global optimum.

Reference

[1] David Schönleber, A "short" introduction to model selection, 2018, Towards Data Science.
[2] Alexander I. J. Forrester, András Sóbester, Andy J. Keane, Engineering Design via Surrogate Modelling: A Practical Guide, 2008.

About the Author

I’m a Ph.D. researcher working on uncertainty quantification and reliability analysis for aerospace applications. Statistics and data science form the core of my daily work. I love sharing what I’ve learned in the fascinating world of statistics. Check my previous posts to find out more, and connect with me on Medium and LinkedIn.

