
A Parallel Implementation of Bayesian Optimization


An introduction to a parallelizable optimization method suitable for expensive, discontinuous, or nondeterministic functions.

Image Uploaded by Author

The concept of ‘optimization’ is central to data science. We minimize loss by optimizing weights in a neural network. We optimize hyper-parameters in our gradient boosted trees to find the best bias-variance trade-off. We use A/B testing to optimize behavior on our websites. Whether our function is a neural network, consumer behavior, or something more sinister, we all have something we want to optimize.

Sometimes the functions we are trying to optimize are expensive, and we wish to get to our destination in as few steps as possible. Sometimes we want to be confident that we find the best possible solution, and sometimes our functions don’t have a tractable gradient, so there is no nice arrow to point us in the right direction. Often, our functions have random elements to them, so we are really trying to optimize f(x) = y + e, where e is some random error term. Bayesian optimization is a function optimizer (maximizer) that thrives in these conditions.

Table of Contents

  1. What Is Bayesian Optimization
  2. Implementing From Scratch
  3. Implementing In Parallel
  4. Final Words

What Is Bayesian Optimization

Let’s say we have a function f, and we want to find the x which maximizes (or minimizes) f(x). We have many, many options. However, if our function fits the description right above the table of contents, we will definitely want to consider Bayesian optimization.

There are several different methods for performing Bayesian optimization. All of them involve creating an assumption about how certain things are distributed, making a decision based on that assumption, and then updating the assumption.

The method in this article uses Gaussian processes to create an assumption about how f(x) is distributed. These processes can be thought of as a distribution of functions – where drawing a random sample from a Gaussian distribution results in a number, drawing a random sample from a Gaussian process results in a function. If you are not familiar with Gaussian processes, this is a little hard to picture. I recommend this video, which is what made the concept click for me.
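To make this concrete, we can draw a few random functions from a Gaussian process prior ourselves. Here is a minimal base-R sketch, assuming a zero-mean process with an RBF (squared-exponential) kernel and a lengthscale of 5 (both choices are ours, purely for illustration):

```r
# Draw 3 random functions from a zero-mean Gaussian process prior
set.seed(1991)
x <- seq(0, 50, length.out = 100)
# RBF kernel: nearby inputs are highly correlated, distant inputs are not
rbf <- function(a, b, l = 5) exp(-outer(a, b, "-")^2 / (2 * l^2))
K <- rbf(x, x) + diag(1e-8, length(x))  # jitter for numerical stability
draws <- t(chol(K)) %*% matrix(rnorm(length(x) * 3), ncol = 3)
# Each column of 'draws' is one sampled function evaluated on the grid
```

Fitting a Gaussian process to data, as we do below, amounts to conditioning this prior on the observed points.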

The algorithm itself can be summarized as such:

Image Uploaded by Author

Implementing From Scratch

Here we walk through a single iteration of Bayesian optimization without using a package. The process is pretty straightforward. First, we define a toy function func we want to maximize, and then we sample it 4 times:

# Function to optimize
func <- function(input) {
  dnorm(input,15,5) + dnorm(input,30,4) + dnorm(input,40,5)
}
# Sample the function 4 times
func_results <- data.frame(input = c(5,18,25,44))
func_results$output <- func(func_results$input)
# Plot
library(ggplot2)
p <- ggplot(data = data.frame(input=c(0,50)),aes(input)) +
  stat_function(fun=func,size=1,alpha=0.25) +
  geom_point(data=func_results,aes(x=input,y=output),size=2) +
  ylab("output") +
  ylim(c(-0.05,0.2))
p + ggtitle("Our Function and Attempts")
Image Uploaded by Author

We are pretending we don’t know the true function, so all we see in practice are the 4 points we sampled. For the sake of keeping this walk-through interesting, we did a pretty miserable job of selecting our initial points. Let’s fit a Gaussian process to the 4 points to define our assumption about how output is distributed for each input.

library(DiceKriging)
set.seed(1991)
gp <- km(
    design = data.frame(input=func_results$input)
  , response = func_results$output
  , scaling = TRUE
)

Let’s take a look at our Gaussian process next to the points we have sampled and the true function value:

predGP <- function(x,grab) {
  predict(gp,data.frame(input=x),type = "UK")[[grab]]
}
a=1
cl = "purple"
plotGP <- function(grab,cl,a) {
  stat_function(
    fun=predGP,args=list(grab=grab),color=cl,alpha=a,n=1000
  )
}
p + ggtitle("Gaussian Process Results") +
  plotGP("mean",cl,a) +
  plotGP("lower95",cl,a) +
  plotGP("upper95",cl,a)
Image Uploaded by Author

The Gaussian process allows us to define a normal distribution of the output for each input. In the picture above, the purple lines show the Gaussian process. The middle line is the mean, and the upper/lower lines are the bounds of the 95% confidence interval of the normal distribution at that input. So, for example, if we wanted to know how we assume the output is distributed at input = 30, we could do:

predict(gp,data.frame(input=30),type="UK")[c("mean","sd")]
$mean
[1] 0.05580301

$sd
[1] 0.007755026

This tells us that we are assuming, at input = 30, our output follows a normal distribution with mean = 0.0558, and sd = 0.0078.
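Because we have a full distribution at every input, not just a point estimate, we can ask probabilistic questions directly with base R’s normal distribution functions. For example, using the mean and sd reported above:

```r
gpMean <- 0.05580301
gpSD <- 0.007755026
# Probability that the output at input = 30 exceeds 0.07, under our assumption
1 - pnorm(0.07, mean = gpMean, sd = gpSD)  # about 0.034
# A 95% interval for the output at input = 30
gpMean + c(-1, 1) * qnorm(0.975) * gpSD
```

This interval is essentially what the lower95/upper95 curves in the plot above trace out across all inputs.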

Now that we have defined our assumption about the distribution of the output, we need to determine where to sample the function next. To do this, we need to define how ‘promising’ an input is. We do this by defining an acquisition function. There are several to choose from:

Image Uploaded by Author

Of these, the upper confidence bound is the easiest to implement, so let’s define the function and plot it on our chart:

ucb <- function(x,kappa=3) {
  gpMean <- predGP(x,grab="mean")
  gpSD <- predGP(x,grab="sd")
  return(gpMean + kappa * gpSD)
}
a=0.25
p + ggtitle("Upper Confidence Bound") +
  plotGP("mean",cl,a) +
  plotGP("lower95",cl,a) +
  plotGP("upper95",cl,a) +
  stat_function(fun=ucb,color="blue")
Image Uploaded By Author

We can see that our upper confidence bound is maximized (green diamond) somewhere between 10 and 15, so let’s find the specific spot, sample it, and update our GP:

# Find exact input that maximizes ucb
acqMax <- optim(
    par = 12
  , fn = ucb
  , method = "L-BFGS-B"
  , control = list(fnscale = -1)
  , lower = 10
  , upper = 20
)$par
# Run our function at this spot
func_results <- rbind(
    func_results
  , data.frame(input = acqMax,output = func(acqMax))
)
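As an aside, the upper confidence bound is not our only option from the list above. Expected improvement is nearly as easy to write down. Here is a sketch against a generic posterior mean and sd (the function and argument names are ours, not from any package):

```r
# Expected improvement for maximization: the expected amount by which a
# sample at this input would beat the best output observed so far
ei <- function(gpMean, gpSD, best, eps = 0) {
  z <- (gpMean - best - eps) / gpSD
  (gpMean - best - eps) * pnorm(z) + gpSD * dnorm(z)
}
# At input = 30 our GP reported mean ~0.0558 and sd ~0.0078, and the best
# output sampled so far was ~0.0678 (at input = 18)
ei(0.0558, 0.0078, 0.0678)
```

The eps argument trades off exploration against exploitation, much like kappa does for the upper confidence bound.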

We have just completed one iteration of Bayesian optimization! If we continued to run more, we would see our chart evolve:

Image Uploaded by Author
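Those extra iterations are just the single iteration above run in a loop. For illustration, here is a fully self-contained sketch in base R; it swaps DiceKriging for a hand-rolled zero-mean GP with an RBF kernel and unit prior variance (so the numbers will differ from km’s fit), and maximizes the upper confidence bound on a grid instead of with optim, but the fit, acquire, sample rhythm is identical:

```r
# A miniature Bayesian optimization loop: fit GP -> maximize UCB -> sample
func <- function(x) dnorm(x, 15, 5) + dnorm(x, 30, 4) + dnorm(x, 40, 5)
rbf <- function(a, b, l = 5) exp(-outer(a, b, "-")^2 / (2 * l^2))

# Posterior mean and sd of a zero-mean, unit-variance GP at inputs xs
gpPost <- function(xs, X, y) {
  Kinv <- solve(rbf(X, X) + diag(1e-8, length(X)))  # jitter for stability
  Ks <- rbf(xs, X)
  list(
      mean = as.vector(Ks %*% Kinv %*% y)
    , sd = sqrt(pmax(1 - rowSums((Ks %*% Kinv) * Ks), 0))
  )
}

X <- c(5, 18, 25, 44)  # the same 4 starting points as before
y <- func(X)
grid <- seq(0, 50, length.out = 501)
for (i in 1:10) {
  post <- gpPost(grid, X, y)       # 1. fit the GP to what we have seen
  ucb <- post$mean + 3 * post$sd   # 2. score each candidate (kappa = 3)
  xNext <- grid[which.max(ucb)]    # 3. find the most promising input
  X <- c(X, xNext)                 # 4. sample the function there...
  y <- c(y, func(xNext))           # ...and repeat
}
X[which.max(y)]  # best input found
```

After enough iterations the sampled points should concentrate around the function’s peaks, which is exactly the evolution the chart above is sketching.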

Implementing in Parallel

We won’t implement this part from scratch. Instead, we will use the ParBayesianOptimization R package to do the heavy lifting. This package allows us to sample multiple promising points at once. If there is only 1 promising point, it samples the surrounding area multiple times. So, in our first example, we would sample all 5 of the local maximums of the acquisition function:

Image Uploaded by Author

Let’s get it up and running and see what comes out. We initialize the process with the 4 same points as above, and then run 1 optimization step with 5 points:

library(ParBayesianOptimization)
library(doParallel)
# Setup parallel cluster
cl <- makeCluster(5)
registerDoParallel(cl)
clusterExport(cl,c('func'))
# bayesOpt requires the function to return a list with Score
# as the metric to maximize. You can return other fields, too.
scoringFunc <- function(input) return(list(Score = func(input)))
# Initialize and run 1 optimization step at 5 points
optObj <- bayesOpt(
  FUN = scoringFunc
  , bounds = list(input=c(0,50))
  , initGrid = list(input=c(5,18,25,44))
  , iters.n = 5
  , iters.k = 5
  , acqThresh = 0
  , parallel = TRUE
)
stopCluster(cl)
registerDoSEQ()
# Print the input and score of the first Epoch
optObj$scoreSummary[,c("Epoch","input","Score","gpUtility")]
 Epoch    input        Score gpUtility
     0  5.00000 0.0107981936        NA
     0 18.00000 0.0677578712        NA
     0 25.00000 0.0573468343        NA
     0 44.00000 0.0581564852        NA
     1 35.59468 0.0916401614 0.6558418
     1 50.00000 0.0107985650 0.6326077
     1 13.74720 0.0773487879 0.5417429
     1 21.13259 0.0462167925 0.4734561
     1  0.00000 0.0008863697 0.1961284

Our score summary shows us that bayesOpt ran our 4 initial points (Epoch = 0) and then ran 1 optimization step (Epoch = 1) in which it sampled all 5 of the local optimums of the acquisition function. If we ran for more iterations, we would continue to sample 5 points at a time. If our function to maximize were actually expensive, this would allow us to find the global optimum much faster.

The acqThresh parameter in bayesOpt is crucial to the sampling process. This parameter represents the minimum percentage of the global optimum of the acquisition function that a local optimum must reach for it to be sampled. For example, if acqThresh=0.5, then each local optimum (upper confidence bound, in our case) must be at least 50% of the global optimum, or it will be ignored. We set acqThresh=0, so all local optimums would be sampled.

Notice the gpUtility field above. This is the scaled value of the acquisition function (upper confidence bound, in our case) at each of the points sampled. If you notice this value converging to 0 over Epochs, the Gaussian process is telling us there are not many promising points left to explore. A more thorough, package-specific explanation can be found here.

Final Words

Bayesian optimization is an amazing tool for niche scenarios. In modern data science, it is commonly used to optimize hyper-parameters for black box models. However, being a general function optimizer, it has found uses in many different places. I personally tend to use this method to tune my hyper-parameters in both R and Python.

