Hyper-parameter tuning is required whenever a Machine Learning model is trained on a new data set. Nevertheless, it is often forgone because it lacks a clear theoretical framework, which I have previously tried to demystify in an earlier article.
One approach that systematises intelligent and efficient hyper-parameter tuning is Bayesian Optimization, which builds a probabilistic surrogate of the problem being tuned in order to recommend optimal parameters. It gradually builds up its understanding of the problem by updating the surrogate after each iteration. The following plots illustrate the evolution of a surrogate model generated by OPTaaS to minimize the Beale function.
[Figure: evolution of the OPTaaS surrogate model of the Beale function, from 5 to 35 iterations]
As we can see in the first plot (5 iterations), Bayesian Optimization faces a cold-start problem: the prior is initially flat, and the optimizer needs to build up a representation of the underlying function before it can provide "meaningful" recommendations. After 35 iterations it has a much better understanding of the Beale function, and the surrogate closely resembles the true surface. In this article we will see how Warm Starting the surrogate model can dramatically improve performance.
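To make the mechanics concrete, here is a minimal sketch of Bayesian optimization on the Beale function. It uses scikit-optimize's gp_minimize as an open-source stand-in (an assumption on my part, not the OPTaaS API), with a few random cold-start evaluations before the Gaussian-process surrogate takes over.

```python
# Sketch: Bayesian optimization of the Beale function with a GP surrogate.
# scikit-optimize is used here as an open-source stand-in for OPTaaS.
from skopt import gp_minimize

def beale(params):
    """Beale function; global minimum f(3, 0.5) = 0."""
    x, y = params
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

# The Gaussian-process surrogate is refitted after each evaluation, and the
# next point is chosen by maximising an acquisition function over the surrogate.
result = gp_minimize(
    beale,
    dimensions=[(-4.5, 4.5), (-4.5, 4.5)],  # search bounds for x and y
    n_calls=35,           # total function evaluations
    n_initial_points=5,   # random "cold-start" evaluations before the GP kicks in
    random_state=0,
)

print("Best point found:", result.x, "value:", result.fun)
```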
Why Warm Start?
From the previous illustrative example, we understand that if the Bayesian Optimizer had some prior information about the overall shape or type of the function, its early recommendations could have been better, because it would not have needed to spend as many iterations initialising the surrogate.
Moreover, in the context of hyper-parameter tuning, certain hyper-parameter configurations of Machine Learning models may be valid but make little sense in practice. For example, a Random Forest with very few trees will typically have low accuracy, so it is probably not worth exploring that region of the configuration space regardless of the underlying data set it is being trained on.
Warm Starting for Random Forests
We are going to study the performance improvements from warm starting Bayesian Optimization for a Random Forest. By evaluating Random Forests with many different hyper-parameter configurations across many data sets, we can build a picture of how a Random Forest’s performance varies, on average, with each hyper-parameter.
With the knowledge from these tests, we can guide the optimizer to search in areas where the model has historically performed well, avoiding trialling historically poor configurations such as using a low number of trees.
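As an illustration of the idea (not the actual OPTaaS warm-starting mechanism, which is not described in detail here), the sketch below seeds a scikit-optimize run with a few Random Forest configurations that are assumed to have performed well on other data sets. The optimizer evaluates these first, so the surrogate starts from informative points rather than a flat prior. The configurations and the data set are illustrative placeholders.

```python
# Sketch: warm-starting a Bayesian optimizer with historically good configurations.
# scikit-optimize stands in for OPTaaS; the "historical" configurations below are
# illustrative placeholders, not real benchmark results.
from skopt import gp_minimize
from skopt.space import Integer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

space = [
    Integer(10, 500, name="n_estimators"),
    Integer(2, 30, name="max_depth"),
]

def objective(params):
    n_estimators, max_depth = params
    clf = RandomForestClassifier(
        n_estimators=int(n_estimators), max_depth=int(max_depth), random_state=0
    )
    # gp_minimize minimises, so return the negative cross-validated accuracy.
    return -cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

# Configurations that (hypothetically) performed well on other data sets.
historically_good = [[300, 12], [450, 20], [200, 8]]

result = gp_minimize(
    objective,
    dimensions=space,
    x0=historically_good,  # evaluated first, so the surrogate starts from informative points
    n_calls=20,
    random_state=0,
)

print("Best configuration found:", result.x, "accuracy:", -result.fun)
```

Alternatively, the historical scores could be passed alongside the configurations (via y0) so the surrogate starts from a fitted prior without re-evaluating them, at the cost of mixing scores measured on different data sets.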
Results
For the performance comparison, we used OPTaaS, a general-purpose Bayesian Optimizer, and compared it against a warm-started version on brand-new data sets that the warm start had not seen before. We ran the tests on 30 such data sets; the plots below show the results on the CMC, German credit and Sonar data sets.
[Figure: warm-started vs. cold-started OPTaaS on the CMC, German credit and Sonar data sets]
As we can see, the warm-started OPTaaS identifies better hyper-parameter configurations much faster than the cold-started version. The latter does catch up (reassuringly), but requires more iterations to build up its understanding of the underlying problem. Naturally, precautions need to be taken to ensure that the warm-started configurations are not over-fitted to the training data sets, in order to guarantee generalisable performance improvements.
Extensions
Warm starting surrogates presents a competitive advantage in the initial iterations by providing "reasonable" configurations to be tried first. However, there are a number of extensions that can help improve performance after the initial iterations. I will detail them in my next article.
In the meantime, don’t hesitate to reach out if you have any questions or if you would like to try OPTaaS.