Fitness navigator

5 ways to improve model quality

Kirill Tsyganov
Towards Data Science


You cannot get into great physical shape by doing only ab exercises. You will show up at the beach much sooner, and with a smile on your face, if you create a plan with your fitness coach and follow it diligently.

The same goes for machine learning models. Everyone knows you have to tweak model parameters (e.g. neural network weights) to obtain a good model, but that is not the only way to influence model quality. Let's take a step back and explore other ways to improve it. Some of them are widely used for certain machine learning problems, while others remain under the radar.

Below is my reflection on a great paper by J. Kukačka, V. Golkov and D. Cremers, Regularization for Deep Learning: A Taxonomy (2017) [1].

What is the optimal fit?

When we look at the picture below, it becomes clear what a "good" fit is: the model captures the general data patterns, but is not overly sensitive to patterns that are unjustified by the given data.

(1) Optimal fit. Image by author

Model complexity, i.e. the number of model parameters, controls the model's capacity to learn sophisticated data patterns. But when you feed simple data into a complex model, the result is overfitting: the model tries to utilize its "redundant" parameters and eventually "memorizes" the data points. Memorization is generally bad when the model makes an inference/prediction about a new data point. Underfitting occurs in the opposite situation, when a simple model is fed complex data. It is also bad, because the underfitted model cannot capture complex data patterns.

The optimal fit is obtained when the model complexity is in balance with the data complexity. Achieving this balance is a challenge, since the model learns patterns from the training sample while the final objective is good prediction quality on new data, i.e. test data whose distribution can be different. But as we will see below, there are many ways to approach this balance.
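To make the tradeoff concrete, here is a minimal sketch of under-, optimal and overfitting using polynomial regression in scikit-learn. The synthetic sine data, the noise level and the chosen degrees are illustrative assumptions, not taken from the article:

```python
# Underfit vs. optimal fit vs. overfit on noisy sine data (illustrative).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)
x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()

for degree in (1, 4, 15):  # too simple, roughly right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    train_err = mean_squared_error(y, model.predict(x))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Typically the highest degree drives the training error down while the test error climbs back up, which is exactly the training-test tradeoff in image (2).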

(2) Training-test error tradeoff. Image by author

The guiding principle for achieving the optimal fit is called Structural Risk Minimization [2]. It balances model complexity (regularization) against the quality of fitting the training data (empirical risk/error). The word "risk" refers to the abstract price we might have to pay for inaccurate predictions.

(3) Structural risk minimization. Image by author

How to measure the quality of fit?

We can quantify our perception of fit by aggregating the deviations between the actual data and the model predictions. Each deviation, i.e. error, represents a distance in a feature space, and the definition of distance depends on the problem context and domain.

(4) Example of error functions in different feature spaces. Image by author

We are quite flexible in choosing the error/loss function. In practice, however, there are limitations imposed by the optimization algorithms we use to adjust model parameters to minimize the error: they usually work only with globally continuous and differentiable loss functions.

In addition to the errors, we might want to account for our preferences about the model architecture or parameters via a regularization term that penalizes undesired model structure.

Combining the errors and the penalty, we obtain a measure of model performance on the given data: the cost function. It reflects both how well the model fits the data and our preferences about the model itself. During the training phase we aim to minimize this function to get a better fit.

(5) Cost function, general form. Image by author

Below you can see a popular cost function used in many regression problems: the MSE loss function combined with L2 weight regularization.

(6) Cost as MSE + L2. Image by author
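As a minimal sketch of the cost in image (6), here is the MSE term plus the L2 weight penalty in plain NumPy; the variable names, the data and the lambda value are illustrative assumptions:

```python
# Cost = empirical error (MSE) + regularization term (L2 weight penalty).
import numpy as np

def cost(y_true, y_pred, weights, lam=0.01):
    mse = np.mean((y_true - y_pred) ** 2)    # how well the model fits the data
    l2_penalty = lam * np.sum(weights ** 2)  # our preference for small weights
    return mse + l2_penalty

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
weights = np.array([0.5, -0.3, 0.8])
print(cost(y_true, y_pred, weights))  # ~0.0298
```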

Navigating the cost function

Let's consider different options for redesigning the cost function to address various challenges on the way to the optimal fit.

(1) Data transformation

There are two kinds of data transformation: those that preserve the data representation (data augmentation) and those that do not (feature engineering). Both can be applied to the model input, internal parameters and output, during both training and testing.

(7) Data transformation. Image by author

From this perspective, techniques such as Dropout and Batch Normalization originate from the same principles as data pre-processing or aggregating predictions over augmented inputs.

Many feature engineering and data augmentation techniques are domain-specific, but impressive results can also be achieved by applying a technique from one domain to a problem in another. For instance, translating audio signals into images via a Fourier or wavelet transform and then applying convolutional neural networks (CNNs) designed for computer vision problems.

(8) Changing problem domain from signal to image processing. Image by author
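Here is a minimal sketch of that domain change, assuming SciPy is available: a 1-D audio signal is turned into a 2-D spectrogram "image" that a CNN could consume. The synthetic two-tone signal and the STFT parameters are illustrative assumptions:

```python
# From a 1-D audio signal to a 2-D time-frequency image via the STFT.
import numpy as np
from scipy.signal import spectrogram

fs = 16_000                                   # sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

freqs, times, sxx = spectrogram(audio, fs=fs, nperseg=256, noverlap=128)
log_sxx = np.log1p(sxx)                       # compress the dynamic range

print(log_sxx.shape)  # (frequency bins, time frames): a 2-D input for a CNN
```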

One of the most counterintuitive techniques listed in the paper is adding random noise to the training data: paradoxically, under certain conditions on the noise distribution, it improves model quality and increases robustness.

(9) Smoothing effect of input noise. Image by author, based on the image from paper “The Effects of Adding Noise During Backpropagation Training on a Generalization Performance” [3]
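A minimal sketch of input-noise injection in that spirit: each epoch, the model sees a slightly perturbed copy of the training inputs while the targets stay unchanged. The noise scale is an illustrative assumption that would need tuning per problem, not a recipe from [3]:

```python
# Feed the model a freshly perturbed copy of the inputs every epoch.
import numpy as np

def noisy_batches(x_train, y_train, n_epochs, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        x_noisy = x_train + rng.normal(0.0, sigma, size=x_train.shape)
        yield x_noisy, y_train  # targets are left untouched

x = np.random.rand(100, 4)
y = np.random.rand(100)
for xb, yb in noisy_batches(x, y, n_epochs=3):
    pass  # model.fit(xb, yb) would go here
```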

(2) Model

Your imagination is the only limit on the model archetype/network architecture. Moreover, you can create a meta-model that combines the predictions of different base models via ensemble techniques such as bagging, boosting and stacking (sketched after the next paragraph).

Models are often the focus of discussions, so let's skip them for today. I just suggest reflecting on the neural network zoo by Fjodor van Veen.
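For completeness, here is a minimal sketch of the three ensemble flavours mentioned above, using scikit-learn's off-the-shelf meta-estimators; the base models and dataset are illustrative assumptions:

```python
# Bagging, boosting and stacking as ready-made scikit-learn meta-estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
boosting = GradientBoostingClassifier(n_estimators=100)
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, model.fit(X, y).score(X, y))
```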

(3) Loss function (error at one sample)

As discussed above, the problem domain drives the choice of loss function. Nevertheless, sufficient model quality can often be achieved even with domain-agnostic loss functions, e.g. MSE for regression and cross-entropy for classification.

In the case of imbalanced data, it makes sense to consider adjusting class weights; a similar effect can also be achieved by downsampling/oversampling at the data transformation level (1).

Adjusting sample weights is especially important for time-series forecasting: it is a convenient way of smoothing out data anomalies or paying more attention to the most recent patterns in the data.
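Both weighting ideas are sketched below, assuming scikit-learn; the class ratio and the exponential recency weighting are illustrative assumptions:

```python
# (a) class weights for imbalanced classification, (b) sample weights for time series.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

# (a) make the rare class count 10x more in the loss
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0})
# clf = LogisticRegression(class_weight="balanced")  # or let sklearn infer the ratio

# (b) let recent observations count more than old ones
n = 200
recency_weights = np.exp(np.linspace(-2.0, 0.0, n))  # oldest ~0.14, newest 1.0
X, y = np.random.rand(n, 3), np.random.rand(n)
Ridge().fit(X, y, sample_weight=recency_weights)
```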

(4) Regularization term

The regularization term is independent of the target and can therefore be computed for an unlabeled sample, whereas the loss function cannot. This distinction allows us to improve model robustness by combining labeled and unlabeled data in a semi-supervised learning manner.

Regularization terms based on weight decay are the most popular: L1 and L2.

(11) L1, L2 regularization terms. Image by author
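A minimal sketch of the two penalty terms from image (11), applied to a weight vector; lam is an illustrative hyperparameter:

```python
# L1 penalty encourages sparse weights, L2 penalty encourages small weights.
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.1])
lam = 0.01
l1_penalty = lam * np.sum(np.abs(w))
l2_penalty = lam * np.sum(w ** 2)
print(l1_penalty, l2_penalty)
```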

The idea of regularization is also applied in information criteria for model selection. For instance, in the Akaike criterion we can observe a penalty for the number of model parameters [4].

(12) Akaike criterion. Image by author
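For reference, the standard form behind image (12) is AIC = 2k - 2 ln(L_hat), where k is the number of fitted parameters and L_hat the maximized likelihood; the 2k term is the complexity penalty. A small worked example (with illustrative numbers):

```python
# AIC favours the simpler model unless extra parameters buy enough likelihood.
def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

print(aic(k=3, log_likelihood=-120.5))   # 247.0: simpler model
print(aic(k=10, log_likelihood=-115.0))  # 250.0: better fit, but penalized more
```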

(5) Optimization algorithm

The optimization algorithm is essentially the tool that searches for model parameters delivering the minimum value of the cost function on the training dataset.

The most basic optimization algorithm is Gradient Descent (GD). It is based on the idea that the gradient of a function is a vector pointing in the direction of the function's steepest increase. Therefore, taking steps in the opposite direction leads toward a local minimum, which is exactly what we want.

(13) Gradient descent. Image by author
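A minimal sketch of gradient descent on the MSE + L2 cost from image (6); the learning rate, lambda and synthetic data are illustrative assumptions:

```python
# Repeatedly step against the gradient of the cost until the weights settle.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, 100)

w = np.zeros(3)
lr, lam = 0.1, 0.01
for step in range(200):
    grad = -2 * X.T @ (y - X @ w) / len(y) + 2 * lam * w  # d(cost)/dw
    w -= lr * grad                                        # step downhill
print(w)  # close to true_w
```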

In classic Gradient Descent, all training samples are used for each parameter update: the path usually heads in the right direction, but learning can be quite slow. Using a single example or a subsample of the training set instead speeds up training, but gives a bumpier ride to the minimum of the cost function. This variation is called Stochastic Gradient Descent (SGD).

There are many ways to upgrade the basic Stochastic Gradient Descent optimizer. Some of them take into account first and second moments of the gradients on a per-parameter basis: the Adaptive Gradient Algorithm (AdaGrad), Root Mean Square Propagation (RMSProp) and Adam.
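As a minimal sketch of that idea, here is the per-parameter first/second-moment update used by Adam, with the commonly cited default hyperparameters; the toy gradient and the larger learning rate in the demo loop are illustrative assumptions:

```python
# One Adam step: track running mean and uncentered variance of the gradients.
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy demo: minimize (w - 3)^2 starting from 0
w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)  # larger lr for the toy demo
print(w)  # close to 3.0
```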

Another interesting higher-level approach is the snapshot ensemble method [5]. It aims to utilize multiple local minima of the cost function rather than a single global minimum: the resulting model is an ensemble of models whose parameters correspond to the local minima (snapshots). A quick and efficient local-minimum search is achieved by cycling the learning rate: instead of keeping it constant, we first decrease the learning rate to settle into a local minimum, save a model snapshot, then sharply increase the learning rate to jump out of that minimum and move toward the next one, repeating the process.
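A minimal sketch of the cyclic schedule behind [5]: within each cycle the learning rate is annealed toward zero (so the optimizer settles and a snapshot is saved), then reset to its initial value. The cycle length, peak rate and snapshot bookkeeping are illustrative assumptions:

```python
# Cosine-annealed learning rate with restarts; save a snapshot at each cycle end.
import math

def snapshot_lr(step, steps_per_cycle=1000, lr_max=0.1):
    progress = (step % steps_per_cycle) / steps_per_cycle
    return lr_max / 2 * (math.cos(math.pi * progress) + 1)  # lr_max -> ~0

snapshots = []
for step in range(5000):
    lr = snapshot_lr(step)
    # ... one optimizer step with this learning rate would go here ...
    if (step + 1) % 1000 == 0:
        snapshots.append(f"model_state_at_step_{step + 1}")  # placeholder snapshot

print(len(snapshots))  # 5 snapshot models to ensemble at prediction time
```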

As the method above shows, the stopping criterion for an optimizer is a non-obvious topic. Initialization of the model parameters can also be vital and can significantly accelerate the training procedure.

Besides the optimizers above, there is a class of optimizers that treat the cost function as a "black box" and approximate it with surrogate models. This is useful when gradients of the cost function are hard to compute or do not exist. The most popular technique in this class is Bayesian optimization: it approximates the cost function with a Gaussian process and updates its prior beliefs about the function with each newly drawn sample [6].
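A minimal sketch, assuming the scikit-optimize package is available: gp_minimize treats the objective as a black box and builds a Gaussian process surrogate of it. The objective below is an illustrative stand-in for an expensive cost function:

```python
# Black-box minimization with a Gaussian process surrogate (scikit-optimize).
from skopt import gp_minimize

def expensive_cost(params):
    x = params[0]
    return (x - 2.0) ** 2 + 1.0   # pretend each evaluation is costly

result = gp_minimize(expensive_cost, dimensions=[(-5.0, 5.0)], n_calls=20, random_state=0)
print(result.x, result.fun)       # best parameter found and its cost
```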

Equivalence of techniques

Remarkably, methods that influence different components of the cost function are sometimes equivalent. For instance, injecting small-variance Gaussian noise into the inputs approximates a penalty on the model's Jacobian. There are many more examples in the paper [1].

Conclusion

We often focus on one specific component that can improve model quality, e.g. the model weights. But when you solve a real problem, there is usually no such limitation. The cost function is just a tool that helps us convey our objective to the computer/optimization algorithm. We have the power to manipulate all of its components, as long as it 1) helps us generalize data patterns better and 2) can be fed into an optimization algorithm. So the next time you feel stuck after long hours of training a model, check the other options that can give you a quality gain.

References

[1] Jan Kukačka, Vladimir Golkov and Daniel Cremers, Regularization for Deep Learning: A Taxonomy (2017)

[2] Google, Machine Learning Crash Course, Regularization

[3] Guozhong An, The Effects of Adding Noise During Backpropagation Training on a Generalization Performance (1996)

[4] Rob J Hyndman, George Athanasopoulos, Forecasting: Principles and Practice

[5] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q., Snapshot Ensembles: Train 1, get M for free (2017)

[6] Martin Krasser, Bayesian optimization (2018)

Special thanks to Stepan Zaretsky, who introduced me to the paper [1] back in the day and even validated some of the results about models benefiting from noise injection.
