Building a Tree-Structured Parzen Estimator from Scratch (Kind Of)

An alternative to traditional hyperparameter tuning methods

Colin Horgan
Towards Data Science


Visualization of TPE for 2-dimensional hyperparameter tuning. Image by Alexander Elvers via Wikimedia Commons.

The way a machine learning model fits itself to data is governed by a set of initial conditions called hyperparameters. Hyperparameters help to restrict the learning behavior of a model so that it will (hopefully) be able to fit the data well and within a reasonable amount of time. Finding the best set of hyperparameters (often called “tuning”) is one of the most important and time-consuming parts of the modeling task. Historically, hyperparameter tuning has meant either a brute-force sweep over a grid of hyperparameter combinations (Grid Search) or sampling combinations at random (Random Search). Although popular, both methods lack any way of converging to a decent set of hyperparameters; they are purely trial and error. In this article we will tackle a relatively new approach to hyperparameter tuning, the Tree-Structured Parzen Estimator (TPE), and build an understanding of how it works through a step-by-step implementation in Python.

Figure 1: Animation of Grid Search and Random Search techniques.
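
To make the contrast concrete, here is a minimal sketch of both methods against a stand-in objective function (the objective and every name here are purely illustrative, not part of the example we build below): both methods just evaluate a fixed batch of candidates and keep the best one.

import numpy as np
from itertools import product

#A stand-in objective to minimize; any model evaluation metric works here
def objective(m, b):
    return (m - 42)**2 + (b - 7)**2

#Grid Search: exhaustively evaluate every combination on a fixed grid
grid = product(np.linspace(0, 100, 10), np.linspace(-100, 100, 10))
grid_best = min(grid, key=lambda hp: objective(*hp))

#Random Search: spend the same budget on randomly drawn combinations
rng = np.random.default_rng(1)
random_trials = zip(rng.uniform(0, 100, 100), rng.uniform(-100, 100, 100))
random_best = min(random_trials, key=lambda hp: objective(*hp))

#Note that neither loop uses earlier results to pick the next trial
print(grid_best, random_best)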

TPE is a Bayesian optimization algorithm. This means it allows us to start with some initial beliefs about what our best model hyperparameters are, and update these beliefs in a principled way as we learn how different hyperparameters impact model performance. This is already a significant improvement over Grid Search and Random Search! Instead of determining the best hyperparameter set through trial and error, over time we can try more combinations of hyperparameters which lead to good models and fewer that do not.

TPE gets its name from two main ideas: 1. using Parzen Estimation to model our beliefs about the best hyperparameters (more on this later) and 2. the fact that the hyperparameter search space it operates over can be tree-structured, with some hyperparameters only defined conditionally on the values of others. In this example we will ignore the “tree-structured” part, as our two hyperparameters are independent of one another. Additionally, we will not go into the blow-by-blow of Bayesian statistics, expected improvement, etc. The goal here is to develop a high-level conceptual understanding of TPE and how it works. For a more thorough treatment of these topics, see the original paper on TPE by J. Bergstra and colleagues [1].

Setting Up The Example

To build our implementation of TPE, we will need a toy example to work with. Let’s imagine we want to find the line of best-fit through some randomly generated data. In this example, we have two hyperparameters to tune — the line slope m and intercept b.

import numpy as np

#Generate some data
np.random.seed(1)
x = np.linspace(0, 100, 1000)
m = np.random.randint(0, 100)
b = np.random.randint(-5000, 5000)
y = m*x + b + np.random.randn(1000)*700
Figure 2: Randomly generated linear data to be used for TPE.

Since TPE is an optimization algorithm we also need some metric to optimize over. We will use root mean-squared error (RMSE). Let’s define the function rmse to calculate this metric as follows:

def rmse(m, b):
    '''
    Consumes coefficients for our linear model and returns the RMSE
    of its predictions against the globally defined data x and y.

    m (float): line slope
    b (float): line intercept
    '''
    preds = m*x + b
    return np.sqrt(((preds - y)**2).mean())

The last thing we will need is some initial beliefs about what our best hyperparameters are. Let’s say we believe that the slope of the line of best-fit is a uniform random variable on the interval (10,100) and intercept is also a uniform random variable on the interval (-6000, -3000). These distributions are called priors. That they are uniform random variables is equivalent to saying we believe the true values of the best hyperparameters are equally likely to lie anywhere within their respective intervals. We will implement a class to encapsulate these variables, and use them to define our initial search space.

class UniformDist:
    '''
    Class encapsulates behavior for a uniform distribution.
    '''
    def __init__(self, min_, max_):
        '''
        Initializes our distribution with the provided bounds.
        '''
        self.min = min_
        self.max = max_

    def sample(self, n_samples):
        '''
        Returns n_samples draws from our distribution.
        '''
        return np.random.uniform(self.min, self.max, n_samples)

#Define hyperparameter search space
search_space = {'m':UniformDist(10,100), 'b':UniformDist(-6000,-3000)}

With all this setup complete, we can move on to coding the algorithm itself.

Step 1: Random Exploration

The first step of TPE is to randomly sample sets of hyperparameters from our priors. This process gives us a first approximation of which areas of our search space produce good models. The function sample_priors consumes our initial search space and a number of random samples to draw from it. It then evaluates the resulting models using our objective function rmse and returns a Pandas DataFrame containing the slope, intercept, and RMSE of each trial.

import pandas as pd

def sample_priors(space, n_samples):
    '''
    Consumes a search space defined by priors and returns
    n_samples random trials evaluated with rmse.
    '''
    seed = np.array([space[hp].sample(n_samples) for hp in space])

    #Calculate rmse for each slope/intercept pair in the seed
    seed_rmse = np.array([rmse(m, b) for m, b in seed.T])

    #Concatenate and convert to dataframe
    data = np.stack([seed[0], seed[1], seed_rmse]).T
    trials = pd.DataFrame(data, columns=['m', 'b', 'rmse'])

    return trials
Figure 3: Results of random sampling hyperparameters for 30 iterations. Each ‘x’ denotes a hyperparameter sample.

Step 2: Partitioning the Search Space and Parzen Estimation

After generating some initial samples from our priors, we now split those samples into two groups using a quantile threshold γ, where γ is between 0 and 1. Let’s arbitrarily choose γ=0.2. Hyperparameter combinations which result in a model that performs in the top 20% of all models we have created thus far get grouped into a “good” distribution l(x). All other hyperparameter combinations belong to a “bad” distribution g(x).

Figure 4: KDE of l(x) and g(x) distributions after 30 rounds of randomly selecting hyperparameters. Darker colored areas indicate higher density.

It turns out that the next best combination of hyperparameters to test is the one that maximizes the ratio l(x)/g(x); the expected improvement of a candidate is a monotone function of this ratio (if you would like to see the derivation, see [1]). This intuitively makes sense: we want hyperparameters which are highly likely under our “good” distribution l(x) and not very likely under our “bad” distribution g(x). We can model each of l(x) and g(x) using Parzen Estimators, which is where the “PE” in “TPE” comes from. The rough idea of Parzen Estimation, a.k.a. Kernel Density Estimation (KDE), is that we average across a series of normal distributions, each centered on an observation belonging to l(x) or g(x) (respectively). The resulting distributions have high density over regions of our search space where samples are close together and low density over regions where samples are far apart. To perform Parzen Estimation we will use the KernelDensity object from the SKLearn library. The function segment_distributions consumes our trials DataFrame and our threshold γ and returns a Parzen Estimator for each of l(x) and g(x). The resulting distributions are visualized in Figure 4.
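
Before handing this off to a library, it may help to see what a Parzen Estimator does under the hood. Below is a minimal 1-D sketch (the function name and sample values are ours, for illustration only): each observation contributes a Gaussian bump, and the estimated density at a point is the average of those bumps.

import numpy as np

def parzen_pdf(points, observations, bandwidth=5.0):
    '''
    Minimal 1-D Parzen Estimate (illustrative only): the density at
    each point is the average of Gaussian kernels, one centered on
    each observation.
    '''
    #Distance of each point from each observation, scaled by bandwidth
    z = (points - observations[:, None]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2*np.pi)
    #Average the bumps and rescale so the result integrates to 1
    return kernels.mean(axis=0) / bandwidth

#Density is high near the cluster at 20-25, lower near the lone point at 80
obs = np.array([20., 22., 25., 80.])
print(parzen_pdf(np.array([21., 50., 80.]), obs))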

from sklearn.neighbors import KernelDensity

def segment_distributions(trials, gamma):
    '''
    Splits samples into l(x) and g(x) distributions based on our
    quantile cutoff gamma (using rmse as the criterion).

    Returns a kernel density estimator (KDE) for l(x) and g(x),
    respectively.
    '''
    cut = np.quantile(trials['rmse'], gamma)

    #Trials below the cutoff are "good"; everything else is "bad"
    l_x = trials[trials['rmse'] < cut][['m', 'b']]
    g_x = trials[trials['rmse'] >= cut][['m', 'b']]

    l_kde = KernelDensity(kernel='gaussian', bandwidth=5.0)
    g_kde = KernelDensity(kernel='gaussian', bandwidth=5.0)

    l_kde.fit(l_x)
    g_kde.fit(g_x)

    return l_kde, g_kde

Step 3: Determining the Next “Best” Hyperparameters to Test

As mentioned in Step 2, the next best set of hyperparameters to test is the one that maximizes l(x)/g(x). We can determine this set of hyperparameters in the following way. First, we draw N random samples from l(x). Then, for each of those samples, we evaluate the log-likelihood with respect to l(x) and g(x), selecting the sample which maximizes the ratio l(x)/g(x) (equivalently, the difference of the log-likelihoods) as the next hyperparameter combination to test. The SKLearn KernelDensity implementation we decided to use makes this computation very easy.

def choose_next_hps(l_kde, g_kde, n_samples):
    '''
    Consumes KDEs for l(x) and g(x), samples n_samples from
    l(x) and evaluates each sample with respect to l(x)/g(x).
    The sample which maximizes this quantity is returned as the
    next set of hyperparameters to test.
    '''
    samples = l_kde.sample(n_samples)

    #score_samples returns log-likelihoods, so the difference of
    #scores corresponds to the ratio l(x)/g(x)
    l_score = l_kde.score_samples(samples)
    g_score = g_kde.score_samples(samples)

    hps = samples[np.argmax(l_score - g_score)]

    return hps

All Together Now

All that is left for us to do is string all of the previously discussed components together into a single algorithm and we have our implementation of TPE! The decisions we have to make here are how many rounds of random exploration we want to perform, how many total iterations of the algorithm we want to complete, and what our cutoff threshold γ will be (yes, even TPE has hyperparameters). Here are a few things to consider when choosing these quantities.

  1. If the “best” set of hyperparameters is not captured by your priors before you start TPE, the algorithm may have a difficult time converging.
  2. The more rounds of random exploration you perform, the better your initial approximations of g(x) and l(x) will be, which may improve the results of TPE.
  3. The lower the value of γ, the fewer samples will end up in l(x). Having only a few samples to use when estimating l(x) may lead to poor hyperparameter selection by TPE.

def tpe(space, n_seed, n_total, gamma):
    '''
    Consumes a hyperparameter search space, a number of iterations for
    seeding, and a total number of iterations, and performs Bayesian
    Optimization. TPE can be sensitive to the choice of quantile cutoff,
    which we control with gamma.
    '''
    #Seed priors
    trials = sample_priors(space, n_seed)

    for i in range(n_seed, n_total):

        #Segment trials into l and g distributions
        l_kde, g_kde = segment_distributions(trials, gamma)

        #Determine next pair of hyperparameters to test
        hps = choose_next_hps(l_kde, g_kde, 100)

        #Evaluate with rmse and add to trials
        result = np.concatenate([hps, [rmse(hps[0], hps[1])]])

        #DataFrame.append was removed in pandas 2.0; use pd.concat instead
        trials = pd.concat(
            [trials, pd.DataFrame([result], columns=trials.columns)],
            ignore_index=True
        )

    return trials

Results

To perform TPE over the synthetic data we created previously, we run the following:

#Define hyperparameter search space
np.random.seed(1)
search_space = {'m':UniformDist(10,100), 'b':UniformDist(-6000,-3000)}

df = tpe(search_space,
         n_seed=30,
         n_total=200,
         gamma=.2)

That’s it! We can now analyze our results. For the sake of brevity, the code used to generate the following visualizations will not be shown. However, the source code can be found in the GitHub repo for this project.

First let’s compare our best set of hyperparameters from TPE with the actual best slope and intercept from a regression solver.

Figure 5: Line of best-fit using hyperparameters obtained by TPE and Linear Regression.

As we can see from Figure 5, with TPE we are able to closely approximate the best set of hyperparameters for our line. We wouldn’t expect TPE to outperform the regression solver, because linear regression has a closed-form solution. However, over an infinite number of trials we would expect TPE to converge to a similar solution.
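
For the curious, a comparison along these lines takes only a few lines of code. This is a sketch, assuming the df, x, y, and rmse defined earlier (best_trial, m_ols, and b_ols are names we introduce here); np.polyfit provides the closed-form least-squares solution.

#Best trial found by TPE across all 200 iterations
best_trial = df.loc[df['rmse'].idxmin()]

#Closed-form least-squares fit for comparison
m_ols, b_ols = np.polyfit(x, y, deg=1)

print(f"TPE: m={best_trial['m']:.2f}, b={best_trial['b']:.2f}, rmse={best_trial['rmse']:.2f}")
print(f"OLS: m={m_ols:.2f}, b={b_ols:.2f}, rmse={rmse(m_ols, b_ols):.2f}")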

TPE is an optimization algorithm, so we don’t just care that we were able to find a decent set of hyperparameters. We also need to check that over the 200 iterations our objective function decreases. Figure 6 demonstrates that after our initial 30 random samples our TPE implementation proceeds to minimize our objective function with a clear(ish) trend.

Figure 6: RMSE across all TPE trials. Note the difference in y-axis scale between the left and right plots. Red dot indicates trial with lowest RMSE.
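
The published plots live in the project’s GitHub repo, but a rough single-panel version of Figure 6 can be sketched as follows (assuming matplotlib and the df from above):

import matplotlib.pyplot as plt

#RMSE for every trial, with the best trial highlighted in red
plt.plot(df.index, df['rmse'], marker='.', linestyle='none')
plt.scatter(df['rmse'].idxmin(), df['rmse'].min(), color='red')
plt.axvline(x=30, linestyle='--', label='end of random exploration')
plt.xlabel('trial')
plt.ylabel('RMSE')
plt.legend()
plt.show()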

So far everything looks great! The last thing we want to check is how our beliefs about the “best” distribution of our hyperparameters changed across our 200 iterations.

Figure 7: Parzen Estimation of l(x) across 200 iterations of our TPE algorithm.

As we can see in Figure 7, we start with a very broad distribution of l(x) which rapidly collapses down to something approximating our end result. Figures 6 and 7 clearly illustrate how each of TPE’s three simple steps combine to give us an algorithm capable of exploring a search space in a complex yet intuitive way.
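
Again as a sketch rather than the exact figure code, one way to approximate Figure 7 is to re-fit l(x) on successively longer prefixes of our trials and scatter samples drawn from each fit:

import matplotlib.pyplot as plt

#Re-fit l(x) on prefixes of our trials and sample from each resulting KDE
for k in [30, 100, 200]:
    l_kde, _ = segment_distributions(df.iloc[:k], gamma=.2)
    pts = l_kde.sample(500)
    plt.scatter(pts[:, 0], pts[:, 1], alpha=.3, label=f'{k} trials')
plt.xlabel('m')
plt.ylabel('b')
plt.legend()
plt.show()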

Conclusions

Hyperparameter tuning is a critical part of the modeling process. While Grid Search and Random Search approaches are easy to implement, TPE as an alternative provides a more principled way of tuning hyperparameters and is pretty simple from a conceptual perspective. Several Python libraries exist with very good implementations of TPE including Hyperopt (which was created and is maintained by the authors of [1]) and Optuna. Whether for something as simple as our toy example or as complex as hyperparameter tuning for a neural network, TPE is a versatile, effective, and simple technique that has been gaining popularity in Data Science and Machine Learning over the last few years. Next time you find yourself tuning your model hyperparameters — maybe skip the Grid Search.
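
To see what this looks like with an off-the-shelf implementation, here is a rough sketch of our toy problem in Optuna (reusing the rmse function from earlier; sampler defaults may vary across versions):

import optuna

def objective(trial):
    #Search ranges mirror the priors from our implementation
    m_hp = trial.suggest_float('m', 10, 100)
    b_hp = trial.suggest_float('b', -6000, -3000)
    return rmse(m_hp, b_hp)

#TPESampler is Optuna's TPE implementation; n_startup_trials mirrors n_seed
study = optuna.create_study(
    direction='minimize',
    sampler=optuna.samplers.TPESampler(n_startup_trials=30, seed=1)
)
study.optimize(objective, n_trials=200)
print(study.best_params)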

References

  1. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B., Algorithms for hyper-parameter optimization (2011), Advances in neural information processing systems, 24.

All images unless otherwise noted are by the author.
