
Probabilistic Machine Learning Series Post 2: Model Comparison

How to compare two models?

source: https://www.pexels.com/photo/ask-blackboard-chalk-board-chalkboard-356079/

Many steps must be followed to transform raw data into a machine learning model. Those steps may be hard for non-experts, and the amount of data keeps growing. A proposed solution to the artificial intelligence skill crisis is Automated Machine Learning (AutoML). Some notable projects are Google Cloud AutoML and Microsoft AutoML. The problem of automated machine learning consists of different parts: neural architecture search, model selection, feature engineering, hyperparameter tuning and model compression. In this post, we will be interested in model selection.

Model selection could be seen as a trivial task, but we will see that many metrics are needed to get a full picture of the quality of a model. The usual metric that comes to mind when selecting a model is accuracy, but other factors need to be taken into account before moving forward. To explore this question, we will compare two similar model classes on the same dataset.

In a previous post, we were able to make probabilistic forecasts for a time series. This raises the question of whether the predicted probabilities correspond to empirical frequencies, which is called model calibration. Intuitively, for a classification problem, we would like predictions made with 80% confidence to be correct 80% of the time. One might wonder why accuracy is not enough in the end. If the results are used in a decision process, overly confident results may lead to higher costs if the predictions are wrong, while under-confident predictions may lead to lost opportunities. For example, let’s suppose that we have a model to predict the presence of precious minerals in specific regions based on soil samples. Since exploration drilling for precious minerals can be time-consuming and costly, the cost can be greatly reduced by focusing on high-confidence predictions when the model is calibrated.

As Justin Timberlake showed us in the movie In Time, time can be a currency, so the next aspect that we will compare is the time needed to train a model. Even though we will use a small dataset (the classic Iris data set), there are many reasons to keep track of the time needed to train a model. For example, some model testing techniques based on resampling (e.g. cross-validation and the bootstrap) require the model to be trained multiple times with different samples of the data. Thus, the model will not be trained only once but many times. When the algorithm is put into production, we should expect some bumps on the road (if not bumps, hopefully new data!), and it is important to know how much time it will take to retrain and redeploy the model. A good estimate of the time needed to train a model will also indicate whether an investment in bigger infrastructure is needed.

The final aspect (in this post) used to compare the models will be the predictive capacity/complexity of the model, using the Widely Applicable Information Criterion (WAIC). The criterion can be used to compare models on the same task even if they have completely different parameters [1]. It is a Bayesian version of the standard AIC (Akaike Information Criterion). Information criteria can be viewed as approximations to cross-validation, which may be time-consuming [3].

What are the data?

The data set used is now a classic of machine learning: the Iris classification problem. The classification is based on measurements of the sepals and petals. The data were introduced by the British statistician and biologist Ronald Fisher in 1936. In the next figure, the distributions of the lengths and widths are displayed by species. A linear classifier should be able to make accurate classifications except on the fringe of the virginica and versicolor species.

Pairplots of the features which will be used for the classification. A linear classifier should be able to make accurate classifications except on the fringe of the virginica and versicolor species.
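
For readers who want to reproduce this kind of figure, a minimal sketch is shown below (an assumed approach, not necessarily the exact code behind the figure above):

```python
import seaborn as sns

# The Iris data ships with seaborn: sepal/petal lengths and widths plus the species.
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")  # one panel per pair of features, colored by species
```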

What are the models that we want to compare?

A linear classifier will be trained for the classification problem. First, a value μ is calculated for each class using a linear combination of the features.

Equation for the calculation of μ for each of the K classes using the N features.
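
Since the equation itself is shown as an image in the original post, here is a reconstruction based on the caption (the θ’s are the per-class weights; whether an intercept term is also included is not visible from the caption alone):

\mu_k = \sum_{i=1}^{N} \theta_{k,i} \, z_i , \qquad k = 1, \dots, K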

The μ for each class is then used in our softmax function, which provides a value (pₖ) between zero and one. This value (pₖ) will be the probability for the class indexed by k.

Softmax function for the probability pₖ of belonging to class k based on the previous calculation of μ.
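
Written out (a reconstruction consistent with the caption and with the numerical example in the next paragraph, where the temperatures β scale the μ’s), the softmax reads

p_k = \frac{e^{\beta_k \mu_k}}{\sum_{j=1}^{K} e^{\beta_j \mu_j}}

and setting all the β’s to one recovers the standard softmax used by the model without temperatures.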

In the first model, the β’s are all constant and equal to one. This will be called the model without temperatures (borrowing from physics terminology, since the function is analogous to the partition function in statistical physics). The second model will have a different β for each class, which adds a little complexity to the model (more parameters) but hopefully will also give better results. The usage of temperature for calibration in machine learning can be found in the literature [4][5]. Changing the temperatures will affect the relative scale of each μ when calculating the probabilities. As an example, suppose that μ₁ = 1, μ₂ = 2 and μ₃ = 3. By fixing all the initial temperatures to one, we have the probabilities p₁ = 0.09, p₂ = 0.24 and p₃ = 0.67. Let’s now keep the same temperatures β₂ = β₃ = 1 but increase the first temperature to two (β₁ = 2). The resulting probabilities shift to p₁ = 0.21, p₂ = 0.21 and p₃ = 0.58. Finally, if we reduce the first temperature to 0.5, the first probability shifts downward to p₁ = 0.06 and the other two adjust to p₂ = 0.25 and p₃ = 0.69.

Summary of the resulting probabilities for different temperatures, keeping the μ values fixed.
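
These numbers are easy to verify; below is a minimal NumPy sketch, assuming the temperature-scaled softmax written above:

```python
import numpy as np

def softmax_with_temperatures(mu, beta):
    """Temperature-scaled softmax: p_k is proportional to exp(beta_k * mu_k)."""
    scores = np.exp(np.asarray(beta) * np.asarray(mu))
    return scores / scores.sum()

mu = [1.0, 2.0, 3.0]
print(softmax_with_temperatures(mu, [1.0, 1.0, 1.0]))  # ~[0.09, 0.24, 0.67]
print(softmax_with_temperatures(mu, [2.0, 1.0, 1.0]))  # ~[0.21, 0.21, 0.58]
print(softmax_with_temperatures(mu, [0.5, 1.0, 1.0]))  # ~[0.06, 0.25, 0.69]
```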

We represented the dependence between the parameters and the observations in the following graphical model. The shaded circles are the observations. The z’s are the features (sepal length, sepal width, petal length and petal width) and the class is the species of the flower, which is modeled with a categorical variable. The squares represent deterministic transformations of other variables, such as μ and p, whose equations were given above. The circles are the stochastic parameters whose distributions we are trying to find (the θ’s and β’s). The boxes mean that the parameters are repeated a number of times given by the constant in the bottom right corner.

Graphical representation of the model with temperatures.
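
To make the graphical model more concrete, here is a minimal PyMC3 sketch of the model with temperatures (the choice of PyMC3 and of the priors is an assumption for illustration, not necessarily what was used in this series):

```python
import pymc3 as pm
import theano.tensor as tt

def build_model_with_temperatures(X, y, n_classes=3):
    """X: (n, 4) array of features z; y: (n,) array of species indices."""
    n_features = X.shape[1]
    with pm.Model() as model:
        # Stochastic parameters: one theta per (feature, class), one beta per class.
        theta = pm.Normal("theta", mu=0.0, sigma=10.0, shape=(n_features, n_classes))
        beta = pm.Normal("beta", mu=1.0, sigma=1.0, shape=n_classes)
        # Deterministic transformations: mu and the temperature-scaled softmax p.
        mu = pm.math.dot(X, theta)
        p = pm.Deterministic("p", tt.nnet.softmax(beta * mu))
        # Observed class (the species), modeled as a categorical variable.
        pm.Categorical("obs", p=p, observed=y)
    return model
```

Calling pm.sample inside this model would produce the posterior draws of the θ’s and β’s whose distributions are shown further down; the model without temperatures is obtained by replacing beta with a vector of ones.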

Which model gives the best accuracy?

Even though it is not the only important characteristic of a model, an inaccurate model might not be very useful. The accuracy was calculated for both models on 50 different train/test splits (0.7/0.3). This was done because we wanted to compare the model classes and not a specific instance of the learned model. For the same model specification, many training factors will influence which specific model is learned in the end. One of those factors is the training data provided. Since the data set is small, the train/test split might induce big changes in the model obtained. As we can see in the next figure, the accuracy is on average slightly better for the model with temperatures, with an average accuracy on the test set of 92.97 % (standard deviation: 4.50 %) compared to 90.93 % (standard deviation: 4.68 %) when there are no temperatures.

Distribution of accuracies for fifty different random train/test splits for the models with and without temperatures. The model with temperatures has an average accuracy on the test set of 92.97 % (standard deviation: 4.50 %) compared to 90.93 % (standard deviation: 4.68 %) when there are no temperatures. The vertical lines are the corresponding means.
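
The evaluation procedure itself is a simple loop over random splits; a sketch is given below, where fit_and_score is a hypothetical helper standing in for the actual training and prediction code of either model:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
accuracies = []
for seed in range(50):
    # 0.7/0.3 train/test split, with a different random seed each time.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    accuracies.append(fit_and_score(X_tr, y_tr, X_te, y_te))  # hypothetical helper
print(np.mean(accuracies), np.std(accuracies))
```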

What are the distributions of my parameters?

In the next two figures, we notice that the distributions of some of the θ’s from the model with temperatures are more spread out than the ones from the model without temperatures. We usually want the values to be as peaked as possible. A more spread-out distribution means more uncertainty about the parameter value. The usual culprits that we have encountered are bad priors, not enough sampling steps, model misspecification, etc. Since we want to compare the model classes in this case, we will keep those settings fixed between each model training so that only the model changes.

Distribution of the parameters for the model without temperatures. There is no distribution for the temperatures since their value is always a constant set to one.
Distribution of the parameters for the model with temperatures. There is a distribution for each blank circle in the graphical representation seen above.

Are my probabilities trustworthy?

A probabilistic model can only base its probabilities on the data observed and the representation allowed by the model specification. In our example, we can only separate the classes based on a linear combination of the features. If this is not achievable, not only will the accuracy be bad, but the calibration should not be good either. Even before using those metrics, other signs based on the posterior samples will indicate that the specified model is not a good fit for the data at hand. To measure the calibration, we will use the Static Calibration Error (SCE) [2], defined as
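
(the formula is shown as an image in the original post; reconstructed from the description in [2] and the explanation that follows, it reads)

\mathrm{SCE} = \frac{1}{K} \sum_{k=1}^{K} \sum_{b=1}^{B} \frac{n_{b,k}}{N} \, \bigl| \mathrm{acc}(b,k) - \mathrm{conf}(b,k) \bigr|

where n_{b,k} is the number of predictions for class k falling in confidence bin b and N is the total number of predictions.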

The SCE [2] can be understood as follows. Separate the predictions into B times K bins, where B is the number of confidence intervals used for the calculation (e.g. between 0 and 0.1, 0.1 and 0.2, etc.) and K is the number of classes. For each of those bins, take the absolute deviation between the observed accuracy, acc(b,k), and the expected accuracy, conf(b,k). Take the weighted sum over the confidence bins, weighting by the number of predictions in each bin. Finally, take the class average of the previous sum. The calibration curves of two trained models with the same accuracy of 89 % are shown below to better illustrate the calibration metric.
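
For concreteness, here is one possible NumPy implementation of this procedure (a sketch assuming equal-width confidence bins; not necessarily the exact code behind the results below):

```python
import numpy as np

def static_calibration_error(probs, labels, n_bins=8):
    """probs: (N, K) predicted class probabilities; labels: (N,) true class indices."""
    N, K = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    sce = 0.0
    for k in range(K):
        conf_k = probs[:, k]                 # predicted probability for class k
        hit_k = (labels == k).astype(float)  # 1 if the true class is k, else 0
        bin_idx = np.digitize(conf_k, edges[1:-1])  # assign each prediction to a bin
        for b in range(n_bins):
            in_bin = bin_idx == b
            if not in_bin.any():
                continue
            acc = hit_k[in_bin].mean()       # observed accuracy acc(b, k)
            conf = conf_k[in_bin].mean()     # expected accuracy conf(b, k)
            sce += (in_bin.sum() / N) * abs(acc - conf)
    return sce / K
```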

An example of calibration curves for two models with the same accuracy of 89 %. The model with temperatures (blue) is better calibrated than the one without temperatures (orange). The green line is the perfect calibration line, which means that we want the calibration curve to be as close as possible to this line. The dots are the values obtained for each of the eight intervals used.

The green line is the perfect calibration line, which means that we want the calibration curve to be close to it. If we look at the high-confidence predictions (0.70 and up), the model without temperatures has a tendency to underestimate its confidence, and it overestimates its confidence in the lower values (0.3 and down). The model with temperatures is generally better calibrated (mean SCE of 0.042 with a standard deviation of 0.007) than the model without temperatures (mean SCE of 0.060 with a standard deviation of 0.006).

Distribution of the SCE (a calibration deviance metric) for fifty different random train/test splits for the models with and without temperatures. The model with temperatures (blue) is generally better calibrated (mean SCE of 0.042 with a standard deviation of 0.007) than the model without temperatures (orange) (mean SCE of 0.060 with a standard deviation of 0.006). The vertical lines are the corresponding means.

How long do I have to wait?

As expected, the model with temperatures, which is more complex, takes more time for the same number of iterations and samples. It took, on average, 467 seconds (standard deviation of 37 seconds) to train the model with temperatures compared to 399 seconds (standard deviation of 4 seconds) for the model without temperatures.

Computation times for fifty different random train/test splits for the models with and without temperatures. It took on average 467 seconds (standard deviation of 37 seconds) to train the model with temperatures (blue) compared to 399 seconds (standard deviation of 4 seconds) for the model without temperatures (orange). The vertical lines are the corresponding means.

What is the added complexity?

In this experiment, we compare the simpler model (without temperatures) to a more complex one (with temperatures). An interesting metric to use is the Widely Applicable Information Criterion, which is given by
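
(the formula appears as an image in the original post; following [3], it can be written as)

\mathrm{WAIC} = -2 \, (\mathrm{LPPD} - P)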

where LPPD is the log pointwise predictive density and P is the effective number of parameters. The WAIC is used to estimate the out-of-sample predictive accuracy without using unobserved data [3]. The lower the WAIC, the better: a model that fits the data well (high LPPD) lowers the WAIC, while an infinite number of effective parameters (infinite P) sends it to infinity. A model with an infinite number of effective parameters would be able to simply memorize the data and thus would not generalize well to new data. The effective number of parameters is therefore subtracted to correct for the fact that the model could fit the data well just by chance. The factor 2 comes from historical reasons (it arises naturally in the original derivation of the Akaike Information Criterion based on the Kullback-Leibler divergence and the chi-squared distribution). The LPPD (log pointwise predictive density) is estimated with S samples from the posterior distribution as defined below
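
(again reconstructed following [3]):

\mathrm{LPPD} = \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p\left(y_i \mid \theta^{(s)}\right) \right)

where θ⁽ˢ⁾ is the s-th posterior sample and yᵢ is the i-th observation.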

The effective number of parameters is estimated as the sum, over data points, of the variances of the log predictive density with respect to the posterior samples of the parameters [3].
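
Putting the two pieces together, here is a small NumPy sketch that computes the WAIC from a matrix of pointwise log-likelihoods (one row per posterior sample, one column per data point), following the definitions above:

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """log_lik: (S, n) array with log p(y_i | theta_s) for each sample and data point."""
    S = log_lik.shape[0]
    # LPPD: log of the posterior-mean likelihood, summed over data points.
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
    # Effective number of parameters: per-point variance over the posterior samples.
    p_eff = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_eff)
```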

As we can see in the next figure, the WAIC for the model without temperatures is generally better (i.e. lower). One of the reasons might be the high variance of some of the parameters of the model with temperatures, which induces a higher effective number of parameters and may give a lower predictive density. One might expect the effective number of parameters of the two models to be the same, since we can transform the model with temperatures into the model without temperatures by multiplying the θ’s by the corresponding β’s, but the empirical evidence suggests otherwise.

Widely Applicable Information Criterion (WAIC) for fifty different random train/test splits for the models with and without temperatures. The model without temperatures (orange) has a better WAIC, 247 with a standard deviation of 12, compared to the model with temperatures (blue), 539 with a standard deviation of 60. The vertical lines are the corresponding means.

Conclusion

The next table summarizes the results obtained to compare the two model classes on this specific task. The model with temperatures has better accuracy and calibration, but it takes more computing time and has a worse WAIC (probably caused by the variance in its parameters).

Summary of the mean results obtained for the metrics evaluated. The value in parentheses is the standard deviation. The green dots indicate the better value of the two model classes while the red dots indicate the worse value.

Since the computing time is not prohibitive compared to the gain in accuracy and calibration, the choice here is the model with temperatures. Before putting it into production, one would probably gain by fine-tuning it to reduce the uncertainty in the parameters where possible. One has to remember that this uncertainty may also yield better calibration by avoiding overconfidence. We saw that, to get a full picture of the quality of a model class for a task, many metrics are needed. In the case of AutoML, the system would automatically use those metrics to select the best model. As we saw, we can gain by interpreting them according to the needs of the user and the costs associated with the model usage. Fortunately for the data scientist, this also means that there is still a need for human judgement.

Thanks for reading

References and suggested readings

[1] A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin, Bayesian Data Analysis (2013), Chapman and Hall/CRC

[2] J. Nixon, M. Dusenberry, L. Zhang, G. Jerfel, D. Tran, Measuring calibration in deep learning (2019), ArXiv

[3] A. Gelman, J. Hwang, and A. Vehtari, Understanding predictive information criteria for Bayesian models (2014), Springer Statistics and Computing

[4] A. Vehtari, A. Gelman, J. Gabry, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC (2017), Springer Statistics and Computing

[5] A. Sadat Mozafari, H. Siqueira Gomes, W. Leão, C. Gagné, Unsupervised Temperature Scaling: An Unsupervised Post-Processing Calibration Method of Deep Network (2019), ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning

