
Focal Loss: A better alternative to Cross-Entropy

Focal loss is said to perform better than Cross-Entropy loss in many cases. But why Cross-Entropy loss fails, and how Focal loss addresses…

Gradient Descent, Photo by Rostyslav Savchyn on Unsplash

Loss functions are mathematical equations that calculate how far the predictions deviate from the actual values. Higher loss values mean that the model is making significant errors, whereas lower loss values imply that the predictions are fairly accurate. The goal is to reduce the loss function as much as possible. Models use the loss function to learn the trainable parameters, such as weights and biases. Because the weight update equation for these parameters contains the first derivative of the loss function with respect to the weights or biases, the behaviour of this function has a significant impact on the gradient descent process.

Weight updating formula, Image Source : Author
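In its most common form, gradient descent updates a weight w using the learning rate η and the gradient of the loss L:

w \leftarrow w - \eta \frac{\partial L}{\partial w}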

There are numerous loss functions available right now. Each of them has a different mathematical equation and a different method of penalising the model’s errors. There are also benefits and drawbacks to each, which we must weigh before deciding on the best function to use.

Now that we’ve defined the loss function, let’s go over the issues that Categorical Cross-Entropy loss causes and how Focal loss solves them.

Categorical Cross-Entropy Loss

Categorical Cross-Entropy loss is traditionally used in classification tasks. As the name implies, it is based on entropy. In information theory, entropy measures the disorder, or uncertainty, of a system; here it quantifies the degree of uncertainty in the model’s predicted probabilities. Cross-entropy compares the predicted probability distribution with the true label distribution by summing the negative log-probabilities assigned to the true classes.

Expression for Cross Entropy, Image Source : Author
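In its usual discrete form, categorical cross-entropy can be written as:

CE(Y, p) = -\sum_{i} Y_i \log(p_i)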

Where Y is the true label and p is the predicted probability.

Note: The formula shown above is for a discrete variable. In the case of a continuous variable, the summation should be replaced by integration.

Because probabilities lie between 0 and 1, their logarithms are negative, so the summation term would be negative. As a result, we add a minus sign to invert the sign of the summation term and keep the loss positive. The graphs of log(x) and -log(x) are shown below.

Graphs for log(x) in red and -log(x) in blue, Image Source : Author. Graph plotted using Desmos.

Cases where Cross-Entropy loss performs badly

  • Class imbalance introduces bias into the training process. The majority-class examples dominate the loss function and the gradient descent updates, so the weights move in a direction that makes the model more confident in predicting the majority class while putting less emphasis on the minority classes (a small numeric sketch of this follows the list). Balanced Cross-Entropy loss handles this problem.
  • It fails to distinguish between hard and easy examples. Hard examples are those on which the model repeatedly makes large errors, whereas easy examples are those it classifies confidently and correctly. As a result, Cross-Entropy loss pays no extra attention to hard examples.
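To make the first point concrete, here is a small, purely illustrative NumPy sketch (the class counts and predicted probabilities are invented for the example): each majority-class example contributes only a tiny loss, yet their sheer number lets them dominate the total loss and therefore the gradients.

```python
import numpy as np

# Hypothetical imbalanced batch: 990 easy majority-class examples
# vs. 10 hard minority-class examples.
p_majority = np.full(990, 0.95)   # confident, mostly correct predictions
p_minority = np.full(10, 0.30)    # poor predictions on the rare class

# Cross-entropy contribution of each group (-log of the true-class probability)
loss_majority = -np.log(p_majority).sum()
loss_minority = -np.log(p_minority).sum()

print(loss_majority)  # ~50.8: the "easy" majority still dominates the total loss
print(loss_minority)  # ~12.0: the "hard" minority barely influences the updates
```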

Balanced Cross-Entropy Loss

Balanced Cross-Entropy loss adds a weighting factor to each class, represented by the Greek letter alpha, α ∈ [0, 1]. Alpha can be the inverse class frequency or a hyper-parameter determined by cross-validation. In the Cross-Entropy equation, alpha takes the place of the true-label term, scaling each class’s contribution to the loss.

Expression for Balanced Cross Entropy, Image Source : Author
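Keeping the notation of the cross-entropy expression above, one common way to write the α-weighted loss is:

CE_{balanced}(Y, p) = -\sum_{i} \alpha_i Y_i \log(p_i)

With one-hot labels this reduces to -\alpha_t \log(p_t), where t indexes the true class.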

Although this loss function addresses the issue of class imbalance, it still cannot distinguish between hard and easy examples. Focal loss solves that problem.

Focal Loss

Focal loss focuses on the examples that the model gets wrong rather than the ones it can already predict confidently, so that predictions on hard examples improve over time instead of the model becoming overly confident about easy ones.

How exactly is this done? Focal loss achieves this through something called Down Weighting. Down weighting is a technique that reduces the influence of easy examples on the loss function, resulting in more attention being paid to hard examples. This technique can be implemented by adding a modulating factor to the Cross-Entropy loss.

Expression for Focal Loss, Image Source : Author
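Using the notation of the original paper, where pₜ denotes the model’s predicted probability for the true class (so that plain cross-entropy is -\log(p_t)), focal loss is written as:

FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)

Here (1 - p_t)^{\gamma} is the modulating factor.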

Where γ (Gamma) is the focusing parameter to be tuned using cross-validation. The image below shows how Focal Loss behaves for different values of γ.

Down weighting increases with an increase in γ, Image Source : Focal Loss Research Paper

How does the gamma parameter work?

  • For a misclassified sample, the predicted probability of the true class, pₜ, is small, so the modulating factor (1 − pₜ)^γ is approximately or very close to 1. That leaves the loss almost unaffected, and Focal loss behaves like Cross-Entropy loss.
  • As the confidence of the model increases, that is, as pₜ → 1, the modulating factor tends to 0, down-weighting the loss value for well-classified examples. The focusing parameter γ ≥ 1 rescales the modulating factor so that easy examples are down-weighted more than hard ones, reducing their impact on the loss function. For instance, consider predicted probabilities of 0.9 and 0.6 with γ = 2: the loss for 0.9 comes out to about 4.5e-4, down-weighted by a factor of 100 (since (1 − 0.9)² = 0.01), while the loss for 0.6 comes out to about 3.5e-2, down-weighted by a factor of 6.25 (since (1 − 0.6)² = 0.16). A quick numeric check of these values follows this list. In their experiments, γ = 2 worked best for the authors of the Focal Loss paper.
  • When γ = 0, Focal Loss is equivalent to Cross Entropy.
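Here is a quick numeric check of the example above with γ = 2. Note that the quoted loss values correspond to a base-10 logarithm; using the natural log would only rescale every loss by the same constant factor.

```python
import numpy as np

def focal_loss_true_class(p_t, gamma=2.0):
    """Focal loss for the true-class probability p_t, using log base 10
    to match the figures quoted in the example above."""
    return -((1.0 - p_t) ** gamma) * np.log10(p_t)

for p_t in (0.9, 0.6):
    print(p_t, focal_loss_true_class(p_t), (1.0 - p_t) ** 2)
# 0.9 -> loss ~4.6e-4, modulating factor 0.01  (down-weighted 100x)
# 0.6 -> loss ~3.5e-2, modulating factor 0.16  (down-weighted 6.25x)
```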

In practice, we use an α-balanced variant of the focal loss that inherits the characteristics of both the weighting factor α and the focusing parameter γ, yielding slightly better accuracy than the non-balanced form.

Expression for α-balanced Focal Loss, Image Source : Author
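In the notation used above, the α-balanced variant simply multiplies the focal term by the class weight αₜ of the true class:

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)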

Focal Loss also naturally mitigates class imbalance: examples from the majority class are usually easy to predict, while those from the minority class tend to be hard, either because there is little data for them or because the majority class dominates the loss and the gradient updates. Because hard examples and minority-class examples largely overlap, Focal Loss can address both problems at once.
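To tie everything together, here is a short, illustrative NumPy sketch of the α-balanced focal loss for one-hot labels (the function name, the example α values, and the defaults are assumptions made for the example, not reference code from the paper):

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha, gamma=2.0, eps=1e-7):
    """
    Alpha-balanced focal loss for multi-class classification.
    y_true: (N, C) one-hot labels
    y_pred: (N, C) predicted probabilities (each row sums to 1)
    alpha:  (C,) per-class weights, e.g. inverse class frequencies
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)      # numerical stability
    p_t = np.sum(y_true * y_pred, axis=1)         # probability of the true class
    alpha_t = np.sum(y_true * alpha, axis=1)      # weight of the true class
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return loss.mean()

# Example: one easy and one hard sample over three classes
y_true = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.9, 0.05, 0.05], [0.3, 0.3, 0.4]], dtype=float)
alpha = np.array([0.25, 0.25, 0.5])               # illustrative class weights
print(focal_loss(y_true, y_pred, alpha))          # ~0.083 with these numbers
```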

Thank you for reading the article 😃 . Have a nice day.

