
A Zero Maths Understanding of Bayesian Optimization

Building a simple intuition for Bayesian optimization for Machine Learning

Suppose I gifted you a sophisticated (and hypothetical) coffee machine and asked you to brew the best coffee for yourself by modulating the thousands of dials on the machine. Being an intelligent fellow, you quickly realise that this is an optimization problem. You have two options:

  1. Change the settings of the dials umpteen times, brew different types of coffees, taste all of them, find your best brew, and then die from caffeine overdose.
  2. Try to find a function brew-quality = f(brew-styles) by quantifying the various factors involved, and locate its global maximum with the gradient descent method, which relies on being able to calculate the derivatives of the function.
Source: Unsplash

There are two things to note here:

  1. We don’t really know what the function is; it is a black box. We brew coffee by modulating the dials and get the coffee as the output, but we don’t know what happens inside the machine. The only information we have is how the coffee tastes at a particular setting of the machine.
  2. Even if we knew what the brew-quality function was, it would be an expensive one to evaluate: we can’t drink thousands of cups of coffee, for obvious reasons of stomach capacity and health.

So, what shall we do?

There is a framework that can solve this problem and that is Bayesian optimization.

It’s applicable when the function is a black box (so gradient descent is not an option) and/or the function is expensive to evaluate. Both criteria are fulfilled here.

Bayesian optimization

Let’s assume the black-box function for the brew-quality is:

Original function, hidden from the user. It's a black-box function (image by author)

The function is a black box and we can only evaluate it for different inputs (brew styles).

Let’s say we want to find the best brew after sampling only 15 cups of coffee. What we’ll do is brew a few cups of coffee first (fewer than 15; in this case, we brewed 6) and fit an estimated function, shown below in red.

Estimated function in the first iteration (image by author)

Now, we’ll use this estimated function to determine where to evaluate next.

This estimate of the original function is produced by a Gaussian process (you don’t need to know the details at the moment). The estimated function itself is called the surrogate function (just nomenclature).

Estimated function in the second iteration (image by author)

Create the new estimated function with the help of this newly minted data.

Estimate new points of evaluation (third iteration) from the newly estimated function. Keep repeating this process till you exhaust the number of evaluations (in our case, 15), and hope that at the end of the 15 evaluations the estimated function is a good approximation of the original function.

If the estimated function is a decent approximation, this methodology will work and save you a lot of time.
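To make the loop concrete, here is a minimal sketch in Python. The brew_quality function is a made-up stand-in for the hidden black box, scikit-learn's GaussianProcessRegressor plays the surrogate, and a simple upper-confidence-bound rule (μ + 2σ) stands in for the acquisition functions discussed later; take it as an illustration of the idea, not the code used to generate the charts above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical black box: in reality we only taste the coffee, we never see this formula.
def brew_quality(x):
    return np.sin(3 * x) + 0.5 * np.cos(5 * x)

candidates = np.linspace(0, 10, 500).reshape(-1, 1)  # every dial setting we could try

# Start with a handful of random brews (6, as in the figures above).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(6, 1))
y = brew_quality(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(9):                        # 6 initial brews + 9 more = 15 tastings in total
    gp.fit(X, y)                          # refit the surrogate on everything tasted so far
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                # exploit high mu, explore high sigma
    x_next = candidates[np.argmax(ucb)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, brew_quality(x_next).ravel())

print("Best brew found at x =", X[np.argmax(y)].item(), "with quality", y.max())
```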

This method is much more intuitive, and imitates real life better, than evaluating each and every cup of coffee for brew quality. It is Bayesian in nature (discussed below). Let’s say that for a particular style of 100 ml brew in a cup, you can modulate the milk quantity from 15 ml to 50 ml in increments of 1 ml. If you already know that beyond 20 ml of milk the coffee is too weak for your liking, why would you evaluate the cups of coffee that have 20 to 50 ml of milk?

A random search, on the other hand, might evaluate many cups in the 20 to 50 ml range and thus waste evaluations.

Please note that the goal is not to recover the orange function everywhere; the goal is to use the acquisition function (introduced below) to pick the evaluation points that pin down the surrogate in the least number of evaluations.

This ends the zero-maths part of the write-up. I’ll continue with more details and try to stay away from complicated equations as much as possible.

A note about the Gaussian Process

As mentioned above, the surrogate we use to estimate the original function is a Gaussian process. Let’s try to understand what that means.

We use the estimated function (blue line) to determine the evaluation points (what types of coffee to sample next) and thus improve the estimated function (get as close as possible to the orange dotted function).

Now, based on the blue line, how can we infer more about the orange line?

We want two things here:

  1. We want to evaluate at the points where the output is high (remember that we are maximizing brew quality, so high values are what we want). This is exploitation.
  2. We should explore the areas of the curve that we know less about, e.g. we don’t know much about what’s going on between 2 and 3, or between 5 and 10, so new points of evaluation should come from there. This is exploration.

This estimation of new points is done via a function called the acquisition function. It is responsible for balancing the trade-off between exploration and exploitation and tells us where to evaluate next to strike that balance.

(The following equations are discussed in a post on Quora.)

Common acquisition functions include expected improvement and maximum probability of improvement.

This acquisition function is much cheaper to evaluate than the black-box objective itself, so optimizing it to choose the next point is much easier than attacking the objective directly.

Let’s define two values:

μ(x) = the estimated value of the function at any input x (the blue line)

σ(x) = the standard deviation of the estimate at x (the uncertainty with respect to the blue line)

Let’s define a function, ζ(x).
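A common textbook choice (the exact expression in the referenced Quora post may differ slightly) is ζ(x) = (μ(x) − f(x⁺)) / σ(x), where f(x⁺) is the best brew quality observed so far.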

On the basis of this ζ function, one can define a(x), the acquisition function.
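For the expected-improvement acquisition mentioned above, a standard form (again, take it as a textbook version rather than the exact equation from the referenced post) is a(x) = (μ(x) − f(x⁺)) · Φ(ζ(x)) + σ(x) · φ(ζ(x)), where Φ and φ are the standard normal CDF and PDF.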

What’s going on inside a(x) is beyond the scope of this simple post, but what a(x) is essentially doing is balancing high μ(x) against high σ(x): μ(x) accounts for exploitation and σ(x) for exploration.

This PyData talk is a useful one for understanding Gaussian processes in more depth.

This article on TDS discusses it as well.

What is Bayesian about Bayesian Optimization?

(The following part is taken from my old post on Bayesian Statistics)

Bayesian methods take the prior knowledge of the system into account. Bayesians think there is always something we know about the system, so why not use it to make better guesses about it?

The data (function evaluations, in this case) along with the prior knowledge are taken together to estimate what’s known as the posterior, and it is not a single value but a distribution.

The common vernacular of Bayes’ theorem is:
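P(β|y, X) = P(y|β, X) · P(β|X) / P(y|X)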

P(β|y, X) is called the posterior distribution of the values that need to be estimated, β, given the data X and y, where X is the input and y is the output.

P(y|β, X) is the likelihood of the data, which is multiplied by the prior probability of the parameters, P(β|X), and divided by P(y|X), which is known as the normalisation constant. This constant is required to make the values of P(β|y, X) sum to 1.

In a nutshell, we are using our prior information about the surrogate function to estimate the posterior.

In the case of Bayesian optimization:

  1. We place a prior (defined by a Gaussian process that tries to capture our beliefs about the function’s behaviour) on the surrogate function.
  2. Function evaluations are treated as data and used to update the prior to get the posterior distribution over the objective function.

Conclusion

BO is intuitive because it imitates real life: we have prior beliefs, we use them to estimate the output, that estimate becomes the new prior, and we obtain an updated posterior. It is much better to use BO than gradient descent and its flavours when we are dealing with expensive evaluations and a very large problem space.

In Machine Learning, BO is used for hyperparameter optimization: the number of trees in a random forest, their depth, the number of leaves, etc. can all be tuned with the help of BO. There are libraries in many languages, such as hyperopt, botorch, and bayes-optim, that you can use to explore the concepts further and witness their power hands-on.
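For a flavour of what this looks like in practice, here is a small, hedged sketch with hyperopt (whose default TPE algorithm is a cousin of the GP-based approach described above); the dataset, search space, and evaluation budget are made up purely for illustration.

```python
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# hyperopt minimizes the objective, so return the negative cross-validated accuracy.
def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=0,
    )
    return -cross_val_score(model, X, y, cv=3).mean()

# Search space for the two hyperparameters mentioned above.
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 25),
    "max_depth": hp.quniform("max_depth", 2, 20, 1),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30)
print("Best hyperparameters found:", best)
```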

We tried to introduce a difficult and terse mathematical process with the help of a scenario that is closer to daily life.

OK, not exactly! This hypothetical coffee machine doesn’t exist, but nevertheless, you now have the tools to find the best brew with the help of mathematical methods.

Sources:

https://static.sigopt.com/773979031a2d61595b9bda23bb81a192341f11a4/pdf/SigOpt_Bayesian_Optimization_Primer.pdf

Github for creating the charts: https://github.com/Prashantmdgl9/Bayesian_Optimization

