Early Stopping

Charles Brecque
Towards Data Science
Oct 8, 2018


Sometimes it isn’t worth going to the end, especially in hyper-parameter tuning

Most Machine Learning models have hyper-parameters: values fixed by the user before training that shape how the model is fit to the underlying data set. For example, you need to specify the number of trees and their depth (among other hyper-parameters) when training a random forest, and similar choices appear in almost every real-world model. Once the hyper-parameters have been set, training the model is tractable with standard optimizers such as gradient descent. Poorly specified hyper-parameters can lead to longer training times or biased models, and the quality of the chosen hyper-parameters is typically measured on a held-out test data set.
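
To make this concrete, here is a minimal sketch (using scikit-learn, with illustrative hyper-parameter values) of fixing a random forest's hyper-parameters before training and then measuring the quality of that configuration on a held-out test set.

```python
# Minimal sketch: user-chosen hyper-parameters for a random forest,
# evaluated on a held-out test set (values are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hyper-parameters are fixed by the user before training starts.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# The quality of this configuration is measured on the held-out data.
print("held-out accuracy:", model.score(X_test, y_test))
```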

Bayesian Optimization is an approach that aims to identify the optimal hyper-parameters in as few iterations as possible, but each iteration requires the training of the model for a given hyper-parameter configuration to complete before the quality of that configuration can be assessed and the next iteration can begin. In many hyper-parameter tuning problems, a Data Scientist knows that certain configurations won't lead to great results well before training has finished (or even started!). In this article we look at an approach, called Freeze Thaw¹, that systematises this kind of "early stopping" within Bayesian Optimization. As a result, it can lead to more efficient use of compute resources in hyper-parameter tuning as well as a drop in the overall tuning time.
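
For context, a plain Bayesian Optimization tuning loop might look roughly like the hypothetical outline below. The names suggest_config and train_to_completion are placeholders, not a real API: the point is simply that every iteration pays for a full training run before the configuration can be scored, which is exactly the cost early stopping tries to avoid.

```python
# Hypothetical outline of a standard Bayesian Optimization tuning loop
# (placeholder functions, assuming a higher held-out score is better).
def bayesian_optimization(suggest_config, train_to_completion, n_iterations=20):
    history = []  # (configuration, held-out score) pairs observed so far
    for _ in range(n_iterations):
        config = suggest_config(history)      # surrogate model proposes a promising configuration
        score = train_to_completion(config)   # full, expensive training run for that configuration
        history.append((config, score))
    return max(history, key=lambda pair: pair[1])  # best configuration found
```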

Freeze Thaw Bayesian Optimization

Freeze Thaw¹ uses the partial information acquired during the process of training a machine learning model in order to decide whether to pause training and start a new model, or resume the training of a previously-considered model.

The partial information provided during model training is essentially the training loss, which indicates how far the model's predictions are from the targets in the training data. An example is the mean squared error (MSE), the average squared loss per example, given by the following formula:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where y_i is the true value and \hat{y}_i the model's prediction for the i-th of n training examples.
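
As a small numeric illustration of this formula (with made-up predictions and targets):

```python
# Mean squared error computed directly from its definition, using NumPy.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # made-up targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)  # average squared loss per example
print(mse)  # 0.375
```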

For example, the optimal weights of a linear regression are those that minimize this loss, which is why in Figure 1 the loss in the first case is much higher (and the model a lot worse) than in the second.

Figure 1: loss of the trained model (red) for linear regression models (blue) trained on data points (yellow) (source)

Freeze Thaw works on the assumption that the training loss of most Machine Learning models roughly follows an exponential decay towards an unknown final value. This assumption is encoded in the prior of the authors' Bayesian Optimization approach, which allows them to forecast the final result of partially trained models.

In simpler terms:

  • Bayesian Optimization builds a surrogate model of the underlying function we are optimizing.
  • In this case the surrogate is a Gaussian Process, fit to the observed values of the function and used to predict the function at the remaining points of the domain.
  • In Freeze Thaw BO, the partial information from model training is used to predict the final loss, which is then factored into the surrogate model (a toy version of this forecasting step is sketched just after this list).
  • In other words, the surrogate model is a prediction of what the loss landscape would look like at the end of training.
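
To convey the intuition behind this forecasting step, here is a toy sketch that fits a simple parametric exponential decay to a partial training curve and reads off its asymptote as the predicted final loss. The paper itself uses a Gaussian Process with an exponential-decay covariance rather than this crude curve fit, and the numbers below are made up for illustration.

```python
# Toy forecast of a final training loss from a partial learning curve:
# fit y(t) = c + a * exp(-b * t) and treat the asymptote c as the prediction.
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, a, b, c):
    return c + a * np.exp(-b * t)

epochs = np.arange(1, 11)
partial_losses = np.array([2.1, 1.6, 1.3, 1.1, 0.95,
                           0.88, 0.82, 0.79, 0.77, 0.76])  # made-up curve

params, _ = curve_fit(exp_decay, epochs, partial_losses, p0=(2.0, 0.5, 0.7))
predicted_final_loss = params[2]  # the asymptote c
print("forecast of the final training loss:", predicted_final_loss)
```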

Once the surrogate has been predicted, the Bayesian Optimization strategy is adapted to assemble a basket of B = B_old + B_new candidate models with different hyper-parameter values, some partially trained and some not yet started. The procedure then determines which new configurations to try and which "frozen" configurations to resume. A more detailed explanation can be found in the paper¹.
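
A heavily simplified, hypothetical sketch of that decision step might look like the following; the greedy comparison of predicted losses stands in for the paper's entropy-search acquisition function, and all names are illustrative.

```python
# Hypothetical "basket" step: compare frozen, partially-trained runs (with
# forecast final losses) against fresh candidate configurations (with
# surrogate-predicted losses) and pick where to spend the next slice of compute.
def next_action(frozen_runs, new_candidates):
    # frozen_runs:    list of (config, forecast_final_loss) for paused trainings
    # new_candidates: list of (config, predicted_loss) from the surrogate model
    basket = [("resume", cfg, loss) for cfg, loss in frozen_runs] + \
             [("start", cfg, loss) for cfg, loss in new_candidates]
    # Greedy choice: the lowest predicted final loss gets the next training budget.
    return min(basket, key=lambda item: item[2])
```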

The performance gains of Freeze Thaw over alternative Bayesian Optimization approaches on well-known problems are illustrated in Figure 2.

Figure 2: Freeze Thaw (FT) vs GP EI MCMC on MNIST (a), LDA (b), and PMF/MovieLens (c)

The gains are significant but rely heavily on the assumption that the training loss follows an exponential decay; extensions to this work may study models where that assumption does not hold. Early stopping is a significant field of research at Mind Foundry and will very shortly be implemented within our API for Bayesian Optimization, OPTaaS.

[UPDATE: I have started a tech company. You can find out more here]

1: K. Swersky, J. Snoek, and R. P. Adams. Freeze-Thaw Bayesian Optimization. arXiv:1406.3896 [stat.ML], 2014.
