Risk Implications of Excessive Multiple Local Minima during Hyperparameter Tuning

Our Epistemological Limitation and Illusion of Knowledge

Michio Suginoo
Towards Data Science


3D visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo

Excessive multiple local minima during hyperparameter tuning are a symptom of model performance that is highly sensitive to small changes in the hyperparameter values, as displayed in the chart above.

I encountered this very rugged performance landscape, with multiple dips and bumps, while performing the grid-search tuning on the hyperparameter pair reg_alpha and reg_lambda of the native XGBoost API. It was just one component of my predictive regression analysis on property prices. Nevertheless, I became largely preoccupied with the following questions.

  • Are those multiple local minima signaling any risk implications?
  • If so, what are they?

In this post, I would like to share my observations and thoughts on these questions, based on the case study of my project.

A. Overview of the Hyperparameter Tuning

Here, I will give only a very brief overview of the hyperparameter tuning process that I performed in the original project.

a) Hyperparameters

I used the native XGBoost API for Python.

Among the many hyperparameters built into the native XGBoost API, this analysis focuses on the following 8 hyperparameters, given the limited availability of computational resources.

For the definitions of these hyperparameters, please refer to the documentation of the native XGBoost API.

Apart from these 8 hyperparameters, I tuned the number of iterations, num_boost_round, by using early_stopping_rounds during the tuning process.
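
As an aside, below is a minimal sketch, not the code from the original project, of how num_boost_round can be determined with early_stopping_rounds through the native XGBoost API's xgb.cv. The synthetic data, the parameter values, and the variable names are illustrative assumptions.

```python
# Minimal sketch: tuning num_boost_round via early stopping with xgb.cv.
# The data and parameter values below are illustrative assumptions.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))
y_train = X_train @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)
dtrain = xgb.DMatrix(X_train, label=y_train)

params = {"objective": "reg:squarederror", "eval_metric": "mae",
          "max_depth": 6, "eta": 0.1}

cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=1000,        # upper bound on the boosting iterations
    nfold=5,                     # k-fold cross validation
    early_stopping_rounds=50,    # stop once the validation MAE stops improving
    seed=42,
)

# With early stopping, xgb.cv keeps rows only up to the best iteration.
best_num_boost_round = len(cv_results)
print(best_num_boost_round, cv_results["test-mae-mean"].min())
```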

For more on the XGBoost hyperparameters, here is a good summary written by Jason Brownlee: A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning.

b) Tuning Method

Ideally, one single joint tuning over all 8 hyperparameters would accurately capture all the possible interactions among them. Nevertheless, such a joint tuning would require an enormous number of search-grid datapoints and would thus be computationally expensive.

In order to economize the computational cost of the analysis, I formed 4 parameter pairs out of the 8 hyperparameters, namely:

  • (max_depth, eta)
  • (subsample, colsample_bytree)
  • (min_child_weight, gamma), and
  • (reg_alpha, reg_lambda) .

And I ran a pair-wise tuning for each of these 4 hyperparameter pairs. After each tuning, the best result was used to replace the old values of the hyperparameter pair. In this way, every tuning incrementally improved the performance profile of the tuned model.

This reduced the overall number of search-grid datapoints and economized the computational cost of the tuning process, at the expense of the impartiality of the grid search. Of course, the underlying assumption here was that the result obtained through the 4 pair-wise tunings would not materially deviate from that of the ideal single joint tuning over all 8 hyperparameters.

In terms of tuning methodology, here, I would like to highlight a few important footnotes.

  • I ran k-fold cross validation over the training dataset for the hyperparameter tuning.
  • I used the grid search method for parameter tuning (a minimal sketch of the grid-search loop follows this list).
  • I used the 3D visualization with a triangulation-interpolation technique to fill the gaps in-between the search-grid datapoints and render a smoothed-out surface of the performance landscape.
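
For concreteness, here is a minimal sketch of such a pair-wise grid search with k-fold cross validation through xgb.cv, applied to the (reg_alpha, reg_lambda) pair. It is not the original project's code: the synthetic data, the base parameters, and names such as `results` are illustrative assumptions.

```python
# Minimal sketch: pair-wise grid search over (reg_alpha, reg_lambda) with k-fold CV.
import itertools
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)
dtrain = xgb.DMatrix(X, label=y)

base_params = {"objective": "reg:squarederror", "eval_metric": "mae", "eta": 0.1}

alphas = [0, 1e-2, 0.1, 1, 2, 3, 4, 8, 10, 12, 14]
lambdas = [0, 1e-2, 0.1, 1, 2, 3, 4, 8, 10, 12, 14]

records = []
for reg_alpha, reg_lambda in itertools.product(alphas, lambdas):
    params = {**base_params, "reg_alpha": reg_alpha, "reg_lambda": reg_lambda}
    cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                early_stopping_rounds=50, seed=42)
    records.append({"reg_alpha": reg_alpha, "reg_lambda": reg_lambda,
                    "mae": cv["test-mae-mean"].min()})

results = pd.DataFrame(records)
best = results.loc[results["mae"].idxmin()]   # best on-the-search-grid datapoint
print(best)
```

In the incremental scheme described above, the best pair found in `results` would be written back into `base_params` before tuning the next hyperparameter pair.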

c) First Tuning Result

Here, we have the 3D visualizations of the results of the 4 pair-wise hyperparameter tunings. Each chart captures the performance landscape of each pair-wise hyperparameter tuning.

The performance landscape of the hyperparameter pair of `max_depth` and `eta`. 3D visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo
The performance landscape of the hyperparameter pair of `subsample` and `colsample_bytree`. 3D visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo
The performance landscape of the hyperparameter pair of `min_child_weight` and `gamma`. 3D visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo
The performance landscape of the hyperparameter pair of `reg_alpha` and `reg_lambda`. 3D visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo
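
For reference, this kind of surface can be rendered with Matplotlib's plot_trisurf. The sketch below assumes a pandas DataFrame named `results` with columns reg_alpha, reg_lambda, and mae, as in the hypothetical grid-search sketch earlier, rather than the project's actual data.

```python
# Minimal sketch: rendering a performance landscape with Matplotlib's plot_trisurf.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection="3d")

# plot_trisurf triangulates the scattered (reg_alpha, reg_lambda) datapoints and
# linearly interpolates a surface in-between them.
ax.plot_trisurf(results["reg_alpha"], results["reg_lambda"], results["mae"],
                cmap="viridis", linewidth=0.2)

ax.set_xlabel("reg_alpha")
ax.set_ylabel("reg_lambda")
ax.set_zlabel("validation MAE")
ax.set_title("Performance landscape of (reg_alpha, reg_lambda)")
plt.show()
```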

As you can see, the last performance landscape demonstrated very different behavior from the other 3 performance landscapes. It yielded multiple local minima, with multiple dips and bumps, and exhibited an extreme sensitivity to small changes in the values of the hyperparameter pair. In contrast, the first 3 cases rendered relatively consistent performance landscapes.

Below, I would like to present the first search-grid setting of the hyperparameter pair.

[
    (reg_alpha, reg_lambda)
    for reg_alpha in [0, 1e-2, 0.1, 1, 2, 3, 4, 8, 10, 12, 14]
    for reg_lambda in [0, 1e-2, 0.1, 1, 2, 3, 4, 8, 10, 12, 14]
]

The intervals between datapoints are deliberately uneven, in order to see the influence of the granularity of the search grid on the performance landscape. For both hyperparameters, I set a finer interval between 0 and 1 and a coarser one beyond 4.

The table below presents the top 10 best performance datapoints on the search-grid.

Produced by Michio Suginoo

The table identifies the best performance at the search-grid datapoint (reg_alpha = 0.1, reg_lambda = 8.0), and the second best at (reg_alpha = 3.0, reg_lambda = 0.1).

In order to have a closer look at the rugged performance landscape of the hyperparameter pair (reg_alpha, reg_lambda), I sliced the performance landscape along reg_alpha at the values of the top 2 datapoints, namely at reg_alpha = 0.1 and 3.0.

For each slice, I made a visualization of the sliced performance curve and a table of the top 10 best-performing search-grid datapoints on that curve.

Produced by Michio Suginoo
Produced by Michio Suginoo

The performance curve sliced at reg_alpha = 0.1 above confirms 3 local minima, at reg_lambda = 0.01, 1.00, and 8.00. There are salient dips and bumps between 0 and 2. The local minimum at reg_lambda = 0.01 and those dips and bumps would not have been captured if we did not have a granular local search-grid setting between 0 and 1.

Now, let’s look at the other performance curve, sliced at reg_alpha = 3.0, and its top 10 performance search-grid datapoints.

Produced by Michio Suginoo
Produced by Michio Suginoo

The 2nd slice of the performance landscape, at reg_alpha = 3.0, identifies an acute abyss formed between reg_lambda = 0.01 and reg_lambda = 1.0. The intense asperity in the shape of the abyss, along reg_lambda = 0.01, 0.1, and 1.00, was captured by the local search-grid setting, which was much finer there than in any other part of the search grid. In other words, if we had missed the single datapoint reg_lambda = 0.1 in our search grid, we would have failed to capture the presence of the local minimum between reg_lambda = 0.01 and reg_lambda = 1.0.
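
Such slices are easy to extract once the grid-search results are tabulated. The sketch below assumes the same hypothetical `results` DataFrame as earlier and is not the original project's code.

```python
# Minimal sketch: slicing the performance landscape at fixed reg_alpha values.
import matplotlib.pyplot as plt

for alpha in (0.1, 3.0):
    slice_df = results[results["reg_alpha"] == alpha].sort_values("reg_lambda")
    plt.plot(slice_df["reg_lambda"], slice_df["mae"], marker="o",
             label=f"reg_alpha = {alpha}")
    print(slice_df.nsmallest(10, "mae"))  # top 10 datapoints on the sliced curve

plt.xlabel("reg_lambda")
plt.ylabel("validation MAE")
plt.legend()
plt.show()
```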

d) Blindspot of the Grid-Search and the Visualization

These 2 sliced performance curves revealed a general pitfall of the grid search: the performance landscape captures only the performance results projected on the selected search grid. Call these the on-the-search-grid performances. We have no information whatsoever regarding the performance of any datapoint off the search grid, the off-the-search-grid performance.

In addition, the visualization made the situation worse by smoothly interpolating the on-the-search-grid performances to speculate about those invisible off-the-search-grid performances. The smoothed-out image over the off-the-search-grid intervals is only an artefact, not the fact at all. And the artefact can cause an illusion of knowledge in our precarious perception. In other words, we could deceive ourselves by blindfolding our vision with the artefact of the smoothed-out visualization.

A much finer granularity in the search-grid setting might reveal a much more intense asperity of the rugged performance landscape. That is to say, we might discover more dips and bumps by a more granular search-grid setting. It might uncover the presence of many other hidden local minima, which are invisible in the current performance landscape.

Furthermore, the ground-truth global minimum might well lie off the current search grid, at a far distance from the current minimum identified by the current search-grid setting.
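
A toy 1D example, unrelated to the project's data, illustrates this risk: on a rugged curve, a coarse grid and a fine grid can place the minimum in very different locations.

```python
# Toy illustration: a coarse search grid can hide dips and misplace the minimum.
import numpy as np

def f(x):
    return np.sin(10 * x) + 0.1 * x   # a rugged 1D "performance landscape"

coarse = np.linspace(0.0, 4.0, 5)     # coarse search grid
fine = np.linspace(0.0, 4.0, 401)     # much finer search grid

print("coarse-grid minimum at x =", coarse[np.argmin(f(coarse))])
print("fine-grid minimum at x =", fine[np.argmin(f(fine))])
# The coarse grid never sees the deeper dips, so its argmin lands far away
# from the minimum found on the fine grid.
```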

Simply put, the current performance landscape might not only be capturing a partial picture of its asperity, but also deceiving our vision.

e) 2nd Round Tuning of reg_alpha and reg_lambda

To test this hypothetical view, I made a more granular search-grid setting, as shown below, and re-ran the pair-wise tuning for this particular hyperparameter pair.

[
    (reg_alpha, reg_lambda)
    for reg_alpha in [0, 1e-2, 2e-2, 3e-2, 0.1, 0.5, 1, 1.5, 1.8, 1.9, 2,
                      2.1, 2.2, 2.5, 3, 3.5, 4]
    for reg_lambda in [0, 1e-2, 2e-2, 3e-2, 0.1, 0.5, 1, 1.5, 1.8, 1.9, 2,
                       2.1, 2.2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 8.5, 9, 9.5, 10,
                       10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5]
]

And here is the 3D visualization of the 2nd round tuning result.

3D visualization with Matplotlib’s plot_trisurf produced by Michio Suginoo

As expected, the 2nd round tuning of the hyperparameter pair (reg_alpha and reg_lambda) rendered a far more rugged performance landscape with more dips and bumps than the 1st round tuning.

Here are the top 10 performance search-grid datapoints of the 2nd round tuning.

Produced by Michio Suginoo

Now, this confirms that my earlier concern is valid.

  • First, the rough granularity of the 1st search grid underestimated the actual sensitivity of the performance landscape to small changes in the values of the hyperparameter pair.
  • Second, the top 10 performance datapoint table reveals that the 2nd round tuning with a finer granularity discovered its best performance datapoint in a location (reg_alpha = 0.5, reg_lambda = 13.5) totally different from the 1st round tuning result, (reg_alpha = 0.1, reg_lambda = 8.0).

B. Risk Implications of Multiple Local Minima and Methodologies in use

The analysis used the grid search to conduct the hyperparameter tuning and the triangulation interpolation to render the 3D visualization of the performance landscape over the hyperparameter values. Like any other methods, these methods have inherent limitations embedded in their design. The result of the pair-wise tuning of the hyperparameter pair (reg_alpha, reg_lambda), which demonstrated multiple local minima, a symptom of a highly sensitive and volatile performance, exposed the risk of those inherent design limitations. In a way, the extreme sensitivity of the performance amplified and exposed the risk of using these methods.

Here is a recap of those limitations.

Limitation 1) Blindspot of the Grid-Search:

  • The search grid is set as a collection of selected discrete hyperparameter datapoints. Call them on-the-search-grid datapoints. The grid search calculates the performance results only at those discrete datapoints in the selected search grid, the on-the-search-grid performance.
  • Tuning does not tell us what is going on in between two adjacent on-the-search-grid datapoints. We have no knowledge whatsoever of the values of the performance (evaluation metric) off the selected search grid, the off-the-search-grid performance. Therefore, the tuning results give us, at best, only a partial picture of the underlying performance landscape.

Limitation 2) Illusionary Visualization

  • The 3D visualization captures only the on-the-search-grid results.
  • And the 3D visualization smoothly interpolates those discrete datapoints to create an artefact of estimated off-the-search-grid evaluation metric values. The artefact creates an illusion of knowledge regarding the off-the-search-grid evaluation metric values.

If the current search grid captures only a partial image of its sensitivity, the ground-truth off-the-search-grid performance might be far more sensitive/volatile and far more rugged with more dips and bumps than it appears on the smoothed-out visualization. In other words, its high sensitivity indicates that the ground-truth off-the-grid performance could materially deviate from the artefact of the interpolated estimates projected on the 3D visualization.

Moreover, the high sensitivity of the performance landscape further casts a risk that the ground-truth global minimum might lie off the search grid, materially distant from the on-the-search-grid global minimum. In this sense, the multiple local minima, when compared with the 3 other cases with only a single local minimum each, amplified the risk of suffering from those inherent limitations of the grid search and the triangulation-interpolation technique of the visualization.

C. Bias-Variance trade-off in the Model Selection

Now, at this stage, given these 2 rounds of tuning results, we would like to make a tentative model-selection decision to answer the following question: which tuned model should we select, the 1st tuning result or the 2nd?

Can we say that the 2nd round tuning presents a better solution than the 1st round result, because it yielded a better performance during the tuning (the k-fold cross validation)?

Unfortunately, it’s not that simple. So far, we only used the training dataset for k-fold cross-validation.

In order to answer this question, we need to take into account the perspective of the bias-variance trade-off, or the underfitting-overfitting trade-off. In other words, we need to run the tuned models on the test dataset and compare their performances.
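
A minimal sketch of such a test-set comparison is shown below, again with synthetic data and assumed names rather than the original project's code; the tuned parameter values are taken, for illustration, from the 1st round result.

```python
# Minimal sketch: evaluating a tuned model on a held-out test dataset with MAE.
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Illustrative tuned hyperparameters (the 1st round best pair from this post).
tuned_params = {"objective": "reg:squarederror", "eval_metric": "mae",
                "eta": 0.1, "reg_alpha": 0.1, "reg_lambda": 8.0}

booster = xgb.train(tuned_params, dtrain, num_boost_round=300)
test_mae = mean_absolute_error(y_test, booster.predict(dtest))
print(f"Test MAE: {test_mae:.4f}")
```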

Here is a table that compares the performances on the cross-validation dataset and on the test dataset ("Tuning (Validation MAE)" and "Test Performance (MAE)", respectively) for 3 stages: the pre-tuning stage, the post-1st pair-wise tuning, and the post-2nd pair-wise tuning.

Produced by Michio Suginoo

The table suggests that the best test performance was actually yielded by the 1st round tuning, not by the 2nd round tuning. This indicates that the better cross-validation performance of the 2nd round tuning reflected variance, or overfitting.

In order to address bias-variance trade-off, we have to select the 1st round result and reject the 2nd round result for the model selection.

For data-driven machine learning, model instability is a critical problem arising from the bias-variance trade-off, or the underfitting-overfitting trade-off. A model well tuned during the development stage could exhibit unexpected behavior when a new type of dataset emerges in the deployment domain and enters the well-tuned model. The stability of the model performance needs to be assessed throughout the life cycle of model risk management.

D. Discussion

All that said, whether the first-round tuning result is the best solution in an absolute sense remains uncertain. Given the extraordinary sensitivity of the model, the granularity of the current search-grid might not be fine enough to capture the ground-truth global minimum. There might be better solutions (even for the test dataset) present off the search-grid used in the 1st round and 2nd round.

Shall we set a new search grid to test this doubt?

It would be a never-ending story.

Above all, we can never know the best solution in an absolute sense when we use grid-search parameter tuning. At best, a grid search can capture only the partial aspect of the performance landscape that is projected on the search grid. We have no information whatsoever regarding the off-the-search-grid performance.

In this sense, we do not have the whole picture of the entire performance landscape. In principle, our knowledge is limited to those discrete on-the-search-grid datapoints. To make it worse, the triangulation-interpolation visualization technique renders an artefact of a continuous performance landscape and creates an illusion in our precarious perception of the entire performance landscape. Overall, it would be naïve for us to believe that we can attain the whole picture of the ground-truth performance landscape, and thus the ground-truth global minimum.

Apart from those two limitations discussed repeatedly above, there is another obvious limitation of the selected tuning approach, namely the pair-wise tuning.

Limitation 3) Partiality of the pair-wise tuning

  • A single joint tuning of all the hyperparameters would better capture the actual performance interactions among all the hyperparameters, and it might generate a totally different conclusion. In this sense, the global minimum captured by the pair-wise tuning could be inaccurate and different from the ground-truth global minimum.
  • With only 4 pair-wise tunings, we are not covering all the possible combination pairs (28 = 8 × 7 / 2) of the 8 hyperparameters; see the small check after this list. Interactions among those missing combinations are not captured by the 4 pair-wise tunings.
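
As a quick sanity check of that count, the 8 hyperparameters used here form 28 distinct pairs:

```python
# Sanity check: 8 hyperparameters form C(8, 2) = 28 distinct pairs.
from itertools import combinations

hyperparams = ["max_depth", "eta", "subsample", "colsample_bytree",
               "min_child_weight", "gamma", "reg_alpha", "reg_lambda"]
print(len(list(combinations(hyperparams, 2))))  # prints 28
```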

A pair-wise tuning of reg_alpha and reg_lambda of the native XGBoost API demonstrated a model performance highly sensitive to small changes in the values of the hyperparameter pair, and revealed the epistemological limitations of the grid search and the interpolated 3D visualization.

  • The grid search can only capture the model performances over the search-grid, leaving us no information whatsoever regarding the performances off the search-grid.
  • To make matters worse, the triangular interpolation, by smoothing out the performance landscape, blindfolds our vision and creates an illusion of knowledge regarding the actual model performances off the grid.

In a nutshell, both of these limitations are embedded in the design of these methods. Therefore, they are architecturally inherent and cannot be resolved. That sets an epistemological limitation: we will never attain perfect knowledge of the ground-truth performance landscape, and thus of the ground-truth global minimum. That leads to a relevant question:

  • If it is epistemologically impossible to locate the ground-truth global minimum, would it be possible to establish a robust proxy for it? What constitutes a robust proxy for the global minimum?

Moreover, the 2nd round tuning with a more granular search-grid setting on the hyperparameter pair revealed the complex issue of the bias-variance trade-off. Although it improved the training performance, it deteriorated the test performance. Unfortunately, the more granular grid search only over-fitted the model to the training dataset and failed to optimize the general model performance. This raises another relevant question:

  • Does the ground-truth global minimum matter at all from the perspective of the general optimization of the tuned model?

Another point: in this analysis, the epistemological limitation did not seem to be problematic for the 3 other performance landscapes, namely for the pairs (max_depth, eta), (subsample, colsample_bytree), and (min_child_weight, gamma), because they were relatively consistent. Clearly, the epistemological limitation appeared on our radar only with the highly sensitive performance landscape of reg_alpha and reg_lambda.

That tells us something: as long as the performance landscape does not demonstrate a high sensitivity to small changes in the values of those hyperparameters, the combination of the triangular interpolation and the grid search appears to serve as a reasonably robust solution.

That translates into another perspective: beyond the general epistemological limitations of these methods, there is a much severer problem of an excessively sensitive performance landscape. The risk implications of excessive multiple local minima need to be addressed. A relevant question would be:

  • Would multiple local minima indicate the model instability of the tuned model in the deployment domain, where new types of dataset could be introduced into the model?

What do you think?

None of these questions seems to have a self-evident answer.

E. Conclusion

Now, I will close this post while keeping these questions unanswered. Going forward, I would like to continue monitoring the risk implications of excessive multiple local minima and exploring those unanswered questions. Moreover, I welcome readers to reach out to me and share their views on this matter.

Overall, this post revealed our epistemological limitation in finding the ground-truth answer for the global minimum of the performance landscape during the hyperparameter tuning.

The grid search only identified the model performance on the search grid, leaving the off-the-search-grid performances unknown. Making the situation worse, the 3D visualization rendered an artefact of the triangulation interpolation over the off-the-search-grid intervals and created an illusion of knowledge.

Moreover, from the perspective of the bias-variance trade-off, it turned out not to be self-evident whether we even need to attain knowledge of the ground-truth global minimum.

Thanks for reading this post.

If you are interested in the entire process of the original analysis, you can visit my GitHub repository to see the code implementation. As a footnote, I used Python and Matplotlib to generate all of the images.

ACKNOWLEDGEMENT

I would like to express my gratitude to TDS’s editing team, especially Katherine Prairie, for their invaluable editorial advice during the editing stage.

