Label Smoothing — Make your model less (over)confident

An intuitive explanation of how to tackle overconfidence in machine learning.

Parthvi Shah
Towards Data Science


A certain death of an artist is overconfidence — Robin Trower

Remember the time you switched off Google Maps because you were so confident you knew the way to your destination, only to find the road shut for construction... and you had to use Google Maps after all? That was you being overconfident. How do we tackle overconfidence in machines? Sometimes it is easy to confuse one thing with another, so it is better to be a little less confident about the things you feel 100% sure of.


Label smoothing prevents the network from becoming over-confident and has been used in many state-of-the-art models for tasks such as image classification, language translation, and speech recognition. It is a simple yet effective regularization technique that operates on the labels.

When we talk about overconfidence in machine learning, we are mainly talking about hard labels.

Soft label: A soft label assigns each class a probability/likelihood score, forming a distribution over the classes. Eg: (0.1, 0.3, 0.6)

Hard label: A hard label assigns an example entirely to one class. Each entry is binary in nature (0 or 1). Eg: (0, 0, 1)

For cross-entropy loss, we convert the hard labels into soft labels by taking a weighted average of the hard labels and the uniform distribution over the classes. Label smoothing is often used to increase robustness and improve generalization in classification problems.

Figure source: Delving Deep into Label Smoothing [3]

Label smoothing is a form of output distribution regularization that prevents overfitting of a neural network by softening the ground-truth labels in the training data in an attempt to penalize overconfident outputs.

The intuition behind label smoothing is to stop the model from learning that a specific input maps to a specific output with absolute certainty. This is related to overfitting, but it has more to do with overconfidence. Instead of assigning 100% probability to a single class index as shown above, we can lower the target for the true class from 100% to, say, 91% and distribute the remaining 9% across the other classes. This does not harm model performance and at the same time improves generalization.

TensorFlow makes it easy to apply label smoothing with cross-entropy loss by simply specifying label_smoothing as a parameter.
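For example, here is a minimal sketch using the Keras loss API; the smoothing factor of 0.1 and the toy tensors are arbitrary values chosen purely for illustration:

import tensorflow as tf

# Cross-entropy loss with label smoothing: 10% of the probability mass
# is spread uniformly across all classes before the loss is computed.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

y_true = tf.constant([[0.0, 1.0, 0.0]])    # hard, one-hot label
y_pred = tf.constant([[0.05, 0.9, 0.05]])  # predicted class probabilities

print(float(loss_fn(y_true, y_pred)))

The same loss object can be passed to model.compile(loss=loss_fn, ...), so the smoothing is applied to the targets during training without touching your dataset.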

You can perform label smoothing using this formula:

new_labels = original_labels * (1 - label_smoothing) + label_smoothing / num_classes

Example: Imagine you have three classes, a one-hot label [0, 1, 0], and a label_smoothing factor of 0.3.

Then, the new_labels according to the above formula will be:

= [0, 1, 0] * (1 - 0.3) + (0.3 / 3)

= [0, 1, 0] * 0.7 + 0.1 = [0.1, 0.8, 0.1]

Now, the new label will be [0.1, 0.8, 0.1] instead of [0, 1, 0]
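As a quick sanity check, here is a short NumPy sketch of the same formula; the helper name smooth_labels and the toy array are just for illustration:

import numpy as np

def smooth_labels(one_hot, label_smoothing, num_classes):
    # Weighted average of the hard labels and the uniform distribution.
    return one_hot * (1.0 - label_smoothing) + label_smoothing / num_classes

hard = np.array([0.0, 1.0, 0.0])    # one-hot label for the second class
print(smooth_labels(hard, 0.3, 3))  # -> [0.1 0.8 0.1]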

As you can see, the targets no longer demand extreme confidence from the model, and that extreme confidence is exactly what we wanted to avoid. Now, the penalty the model receives for an incorrect prediction is slightly lower than it would be with hard labels, which results in smaller gradients.

References:

[1] https://ai.stackexchange.com/questions/9635/about-the-definition-of-soft-label-and-hard-label

[2] When Does Label Smoothing Help?

[3] Delving Deep into Label Smoothing
