
How to choose the best model?

Do not choose, use them all!

Image by quimono from Pixabay.

With this attention-grabbing title, I would like to draw your attention to a common problem in Machine Learning (ML): choosing the right model. I will not describe all the statistical methods that have been developed for model selection; for the most curious readers, I suggest reading [3]. In this article, I would like to talk about a lesser-known family of methods: the theory of aggregation of experts.

When I talk to other data scientists, most of them have never heard of it. Therefore, I decided to write this article to briefly describe these methods. It is based on the book Prediction, Learning, and Games [1], nicknamed "The Red Bible", and on the introduction to Pierre Gaillard's thesis [2].


The main idea behind the aggregation of experts

Initially, the theory of aggregation of experts was developed to make sequential predictions of a variable y at different times t. For this purpose, we assume that at each time t we have a group of K experts who each make a prediction. The idea is to aggregate these predictions sequentially to produce an aggregated prediction ŷ.

One of the most important points is that we do not need to know how these predictions are produced. They could be generated by machine learning algorithms, given as expert advice, or even come from your fortune-teller; it does not matter.

To create ŷ, we perform a convex combination (i.e. a weighted sum where the weights are nonnegative and sum to 1) of the experts’ predictions. Once the true value of y is known, we adjust the weight of each expert according to the difference between its prediction and the true value of y. The worse the prediction, the lower the weight. Thus, the more an expert is wrong, the less it counts in the aggregated prediction. And so on, until you decide to renew your expert group.
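As a purely illustrative example (the numbers are made up): suppose three experts predict 10, 12 and 15, with current weights 0.5, 0.3 and 0.2. The aggregated prediction is 0.5 × 10 + 0.3 × 12 + 0.2 × 15 = 11.6. If the true value turns out to be 11, the third expert is the furthest off, so its weight is reduced the most at the next step.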

Depending on what you compare your aggregated model to, you will use different strategies to update the experts’ weights. In fact, there are different goals in the theory of aggregation of experts: to make the aggregated model competitive with the best expert in your group (the model selection (MS) problem), to make it competitive with the best convex combination of your experts (the convex combination (CC) problem), or even to make it competitive with the best sequence of convex combinations of your experts.

In the following, I need to introduce some mathematical formalism to present the main aggregation strategy: exponentially weighted aggregation. So if you are allergic to mathematical formulas, go directly to the conclusion and share this article with your data science team.


Sequential prediction with expert advice

To evaluate the performance of a prediction we need a measure. In the theory of aggregation of experts it is called the loss function, denoted ℓ. It is a nonnegative function which takes two arguments: the first one is a prediction and the second one is a realization. This function is assumed to be convex in its first argument. For example, the square loss ℓ(x, y) := (x − y)² is convex in x. We can describe sequential prediction with expert advice with the following steps. At each time t = 1, 2, …, T:

1. each expert k = 1, …, K delivers a prediction x_{k,t};
2. the statistician combines them into the aggregated prediction ŷ_t;
3. the true value y_t is revealed;
4. each expert k suffers the loss ℓ(x_{k,t}, y_t) and the statistician suffers the loss ℓ(ŷ_t, y_t).

The main objective of a statistician is to minimize the cumulated loss of the aggregated prediction. It is defined by

L̂_T := Σ_{t=1,…,T} ℓ(ŷ_t, y_t).

We can decompose the formula of the cumulated loss as follows

L̂_T = min_{k=1,…,K} L_{k,T} + R_T,  with L_{k,T} := Σ_{t=1,…,T} ℓ(x_{k,t}, y_t) the cumulated loss of expert k and R_T := L̂_T − min_{k=1,…,K} L_{k,T}.

Thus, we obtain the classical decomposition in ML: the first term is the cumulated loss of the best expert in hindsight (the approximation error), and the remaining term R_T is named the cumulated regret. It reflects the regret of not knowing the best expert in advance. So, to minimize the cumulated loss, we have to control the cumulated regret.
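As a purely illustrative example: if, after T = 100 rounds, the aggregation has a cumulated loss of 42 while the best of the experts has a cumulated loss of 37, the cumulated regret is R_T = 42 − 37 = 5, i.e. an average regret of 0.05 per round.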

The exponentially weighted aggregation strategy (presented below) guarantees an average cumulated regret R_T/T of the order of √(ln(K)/T), which vanishes as the number of rounds T grows.

We can prove that this bound is optimal, but that is not the purpose of this article. I refer the most curious among you to Chapter 2.2 of [1].


Exponentially weighted aggregation strategy

As mentioned in the introduction, the aggregated prediction is a convex combination of the experts’ predictions. At each time t we have

ŷ_t = Σ_{k=1,…,K} w_{k,t} x_{k,t},  with w_{k,t} ≥ 0 and Σ_{k=1,…,K} w_{k,t} = 1.

And the weights are sequentially computed following the formula

w_{k,t} = exp(−η L_{k,t−1}) / Σ_{j=1,…,K} exp(−η L_{j,t−1}),

where η > 0 is called the learning rate and L_{k,t−1} := Σ_{s=1,…,t−1} ℓ(x_{k,s}, y_s) denotes the cumulated loss of expert k up to time t − 1. To calibrate this parameter, we use the fact that if the loss is convex in its first argument and valued in [a, b], then the cumulated regret of this strategy is uniformly bounded by

R_T ≤ ln(K)/η + η T (b − a)²/8.

Hence, this bound is minimal for

η = (1/(b − a)) √(8 ln(K)/T),  which yields R_T ≤ (b − a) √(T ln(K)/2).
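To give an order of magnitude (with purely illustrative numbers): with K = 10 experts, T = 1000 rounds and a loss valued in [0, 1], we get η = √(8 ln(10)/1000) ≈ 0.14, and the bound guarantees a cumulated regret of at most √(1000 ln(10)/2) ≈ 34, i.e. an average regret of about 0.034 per round.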

You might object that, in practice, we do not know the parameters T, a and b. And that is true. However, the theory of expert aggregation has been developed to be used in practice, so there are always techniques to get around this kind of difficulty. If you want to read more about the calibration of η, I refer you to Chapter 2.3 of [1].
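To make the strategy concrete, here is a minimal sketch of exponentially weighted aggregation in Python. It is not taken from [1] or [2]: the synthetic data, the square loss, the loss range and all parameter values are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, K = 1000, 5                        # number of rounds and of experts
y = np.sin(np.linspace(0, 20, T))     # synthetic target sequence in [-1, 1]

# Synthetic experts: noisy copies of the target, each with its own noise level.
noise_levels = np.array([0.1, 0.3, 0.5, 0.8, 1.2])
experts = y[:, None] + rng.normal(0.0, noise_levels, size=(T, K))

def square_loss(pred, obs):
    return (pred - obs) ** 2

# Rough (not rigorous) range b - a for the losses, used only to pick eta
# with the tuning formula eta = (1 / (b - a)) * sqrt(8 ln(K) / T).
loss_range = 4.0
eta = (1.0 / loss_range) * np.sqrt(8 * np.log(K) / T)

cum_losses = np.zeros(K)              # L_{k, t-1}: cumulated loss of each expert
agg_loss = 0.0

for t in range(T):
    # Weights depend only on losses up to time t - 1 (exponential weighting);
    # subtracting the minimum does not change the weights and avoids underflow.
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    w /= w.sum()

    y_hat = w @ experts[t]            # aggregated (convex-combination) prediction
    agg_loss += square_loss(y_hat, y[t])
    cum_losses += square_loss(experts[t], y[t])   # reveal y_t, update expert losses

best_expert_loss = cum_losses.min()
print(f"Aggregated cumulated loss : {agg_loss:.2f}")
print(f"Best expert cumulated loss: {best_expert_loss:.2f}")
print(f"Cumulated regret          : {agg_loss - best_expert_loss:.2f}")
```

On a run of this kind of script, the cumulated loss of the aggregation typically ends up close to that of the best expert, which is exactly what the regret bound promises.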


Conclusion

The theory of aggregation of experts is an amazing theory, and it is driven by practice. It allows the aggregation of multiple forecasting models to get the most out of each one. For example, EDF (France’s main electricity generation and distribution company) uses more than forty models to predict the electricity consumption of the French people. One of the main advantages is that by aggregating experts, very different models can be used simultaneously without having to worry about how they were created. Therefore, very general models can be combined with very specialized models. It’s the best of both worlds.

Right now, the theory of aggregation of experts is not very well known. But I think it is a particularly relevant approach today, when the number of algorithms keeps multiplying and model selection depends more and more on conditions that change over time. The best model of today may not be the best model of tomorrow, and vice versa.

Therefore, from now on, do not choose, use them all!


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/


References

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games (2006), Cambridge University Press.

[2] P. Gaillard, Contributions to online robust aggregation: work on the approximation error and on probabilistic forecasting. Applications to forecasting for energy markets (2015), Université Paris-Sud 11.

[3] P. Massart, Concentration Inequalities and Model Selection (2007), Vol. 6, Berlin: Springer.

