
Recently, I came across the amazing paper presented at CVPR 2019 by Jon Barron on a robust and adaptive loss function for machine learning problems. This post is a review of that paper along with the necessary background concepts, and it also contains an implementation of the loss function on a simple regression problem.
Problem with Outliers and Robust Loss:
Consider one of the most commonly used losses in machine learning problems: Mean Squared Error (MSE). As you know, it is of the form (y − x)². One of the key characteristics of MSE is its high sensitivity towards large errors compared to small ones; a model trained with MSE will be biased towards reducing the largest errors. For example, a single error of 3 units is given the same importance as 9 errors of 1 unit each, since 3² = 9 × 1².
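As a quick numerical illustration of that quadratic weighting (a toy check of my own, not from the original post):

import numpy as np

one_big_error = np.array([3.0])        # a single residual of 3 units
nine_small_errors = np.ones(9)         # nine residuals of 1 unit each

print(np.sum(one_big_error ** 2))      # 9.0
print(np.sum(nine_small_errors ** 2))  # 9.0 -- identical total squared error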
I created an example using Scikit-Learn to demonstrate how the fit of a simple data-set varies with and without the effect of outliers.
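The exact script isn't reproduced here, but a minimal sketch of that kind of comparison (the slope, intercept and outlier placement below are my own choices) looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(size=(40, 1))
y = 0.7 * x[:, 0] + 0.15 + rng.normal(scale=0.03, size=40)
y_outliers = y.copy()
y_outliers[:5] += 1.5            # corrupt a few points to act as outliers

fit_clean = LinearRegression().fit(x, y)
fit_dirty = LinearRegression().fit(x, y_outliers)
print(fit_clean.coef_, fit_clean.intercept_)   # close to the true 0.7 and 0.15
print(fit_dirty.coef_, fit_dirty.intercept_)   # pulled away by the outliers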

As you can see, the fit line computed on the data including the outliers gets heavily influenced by them, whereas the optimization problem should be driven more by the inliers. At this point you may already think of Mean Absolute Error (MAE) as a better choice than MSE, due to its lower sensitivity to large errors. There are various types of robust losses (MAE among them), and for a particular problem we may need to test several of them. Wouldn't it be amazing to test various loss functions on the fly while training a network? The main idea of the paper is to introduce a generalized loss function whose robustness can be varied, and this hyperparameter can be trained jointly with the network to improve performance. This is far less time consuming than finding the best loss by, say, performing grid-search cross-validation. Let's get started with the definition below –
Robust and Adaptive Loss: General Form:
The general form of the robust and adaptive loss is as below –
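For reference, the general form given in the paper is

f(x, α, c) = (|α − 2| / α) · [ ((x / c)² / |α − 2| + 1)^(α/2) − 1 ]

where x is the residual (for regression, the difference between prediction and target), α is the robustness parameter and c > 0 is the scale.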

α controls the robustness of the loss function, while c can be considered a scale parameter that controls the size of the quadratic bowl near x = 0. Since α acts as a continuous hyperparameter, for different values of α the loss function takes familiar forms. Let's see below –
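Written out (taking limits where the expression above is undefined), the familiar special cases from the paper are –
- α = 2: ½ (x/c)², the L2 (squared error) loss
- α = 1: √((x/c)² + 1) − 1, a smoothed L1 (Charbonnier / pseudo-Huber) loss
- α = 0: log(½ (x/c)² + 1), the Cauchy (Lorentzian) loss
- α = −2: 2 (x/c)² / ((x/c)² + 4), the Geman-McClure loss
- α → −∞: 1 − exp(−½ (x/c)²), the Welsch (Leclerc) loss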

The loss function is undefined at α = 0 and α = 2, but taking the limits we can make approximations. From α = 2 to α = 1 the loss smoothly transitions from the L2 loss to the L1 loss. For different values of α we can plot the loss function to see how it behaves (fig. 2).
We should also spend some time with the first derivative of this loss function, since the derivative is what matters for gradient-based optimization. The derivative with respect to x is given below. In figure 2, I have also plotted the derivatives along with the loss function for different values of α.
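Differentiating the general form with respect to x gives

∂f/∂x = (x / c²) · ((x / c)² / |α − 2| + 1)^(α/2 − 1)

which reduces to x/c² for α = 2 and to (x/c²) / √((x/c)² + 1) for α = 1.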

Behaviour of Adaptive Loss and Its Derivative:
The figure below is very important for understanding the behaviour of this loss function and its derivative. For the plots below, I have fixed the scale parameter c at 1.1, so x = 6.6 can be thought of as x = 6c. We can draw the following inferences about the loss and its derivative –
- The loss function is smooth with respect to x, α and c > 0, and is thus well suited for gradient-based optimization.
- The loss is always zero at the origin and increases monotonically for |x| > 0. The monotonic nature of the loss can also be compared with taking the log of a loss, another monotonic transformation.
- The loss is also monotonically increasing with α. This property is important for robust estimation because we can start with a higher value of α and then gradually (and smoothly) reduce it during optimization, enabling robust estimation while helping to avoid local minima.
- When |x| < c, the derivative is approximately linear in x for all values of α, i.e. the derivative is proportional to the residual's magnitude when the residual is small.
- For α = 2 the derivative is proportional to the residual's magnitude everywhere. This is exactly the behaviour of the MSE (L2) loss.
- For α = 1 (which gives a smoothed L1 loss), the derivative's magnitude saturates to a constant value (exactly 1/c) for |x| > c, so the influence of a residual never exceeds a fixed amount.
- For α < 1, the derivative's magnitude decreases again as |x| grows beyond c, so as a residual gets larger it has less and less effect on the gradient; outliers therefore have little effect during gradient descent (see the short numerical check after this list).
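These gradient behaviours are easy to verify numerically. Here is a small sanity check of my own, written directly from the general form above (not using the library introduced later), that evaluates the loss and its gradient with PyTorch autograd for a few values of α:

import torch

def general_loss(x, alpha, c):
    # general robust loss written from the formula above;
    # it is undefined at alpha = 0 and alpha = 2, so we avoid those values here
    return (abs(alpha - 2.0) / alpha) * (
        ((x / c) ** 2 / abs(alpha - 2.0) + 1.0) ** (alpha / 2.0) - 1.0)

c = 1.1
x = torch.linspace(-6 * c, 6 * c, 1001, requires_grad=True)
for alpha in [-2.0, -1.0, 1.0, 1.99]:
    loss = general_loss(x, alpha, c)
    grad, = torch.autograd.grad(loss.sum(), x)
    # for alpha = 1 the gradient magnitude saturates near 1/c ≈ 0.91;
    # for alpha < 1 it shrinks again once |x| grows well past c
    print(alpha, grad.abs().max().item())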

Below, I have also plotted surface plots of the robust loss and its derivative for different values of α.

Implementing the Robust Loss: PyTorch and Google Colab:
Now that we have gone through the basics and properties of the robust and adaptive loss function, let us put it into action. The code used below is only slightly modified from what can be found in Jon Barron's GitHub repository. I have also created an animation to show how the adaptive loss finds the best-fit line as the number of iterations increases.
Rather than cloning the repository and working with it, we can install it locally using pip in Colab.
!pip install git+https://github.com/jonbarron/robust_loss_pytorch
import robust_loss_pytorch
We create a simple linear data-set with normally distributed noise and some outliers. Since the library uses PyTorch, we convert the NumPy arrays x and y to tensors using torch.
import numpy as np
import torch

n = 50  # number of data points (any reasonable value works; it was left implicit)
scale_true = 0.7
shift_true = 0.15
x = np.random.uniform(size=n)
y = scale_true * x + shift_true
y = y + np.random.normal(scale=0.025, size=n)  # add noise
# flip roughly 10% of the points far away from the line to act as outliers
flip_mask = np.random.uniform(size=n) > 0.9
y = np.where(flip_mask, 0.05 + 0.4 * (1. - np.sign(y - 0.5)), y)
x = torch.Tensor(x)
y = torch.Tensor(y)
Next, we define a linear regression model using PyTorch modules as below –
class RegressionModel(torch.nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        # single-input, single-output linear layer: applies y = w * x + b
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        # add a feature dimension for nn.Linear, then drop it again
        return self.linear(x[:, None])[:, 0]
Next, we fit the linear regression model to our data, starting with the general form of the loss function. Here we use a fixed value of α (α = 2.0), and it remains constant throughout the optimization procedure. As we have seen, for α = 2.0 the loss function replicates the L2 loss, which we know is not optimal for problems with outliers. For optimization we use the Adam optimizer with a learning rate of 0.01.
regression = RegressionModel()
params = regression.parameters()
optimizer = torch.optim.Adam(params, lr=0.01)

for epoch in range(2000):
    y_i = regression(x)
    # use the general loss to compute MSE: fixed alpha, fixed scale
    loss = torch.mean(robust_loss_pytorch.general.lossfun(
        y_i - y, alpha=torch.Tensor([2.]), scale=torch.Tensor([0.1])))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
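As a sanity check on what is being minimized here: for α = 2 the general form reduces (in the limit) to ½ (x/c)², so with scale = 0.1 the loss above is simply a rescaled MSE on the residuals y_i − y.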
Using the general form of the robust loss with a fixed value of α, we obtain the fit line. The original data, the true line (the line with the slope and bias used to generate the data points, excluding the outliers) and the fit line are plotted below in fig. 4.

The general form of the loss function doesn't allow α to change, so we would have to fine-tune it by hand or by performing a grid-search. Also, as the figure above suggests, the fit is affected by the outliers because we used the L2 loss. That is the usual scenario; but what happens if we use the adaptive version of the loss function? We call the adaptive loss module, just initialize α, and let it adapt itself at each iteration step.
regression = RegressionModel()
adaptive = robust_loss_pytorch.adaptive.AdaptiveLossFunction(
    num_dims=1, float_dtype=np.float32)
# alpha and scale are now trainable parameters alongside the regression weights
params = list(regression.parameters()) + list(adaptive.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)

for epoch in range(2000):
    y_i = regression(x)
    # lossfun expects a 2-D input, hence the extra trailing dimension
    loss = torch.mean(adaptive.lossfun((y_i - y)[:, None]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
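It is also interesting to look at where α ends up. The AdaptiveLossFunction module exposes the fitted values (via its alpha() and scale() methods in the version of the repository I used; treat the exact call as an assumption), so after training something like the following prints them:

# assumes adaptive.alpha() / adaptive.scale() as in the repository's example notebook
print('final alpha = {:.3f}, scale = {:.3f}'.format(
    adaptive.alpha()[0, 0].item(), adaptive.scale()[0, 0].item()))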
Using this, plus a little extra code with the Celluloid module, I created the animation below (figure 5). Here you can clearly see how, as the iterations progress, the adaptive loss finds the best-fit line, which is close to the true line and only negligibly affected by the outliers.
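The animation code itself isn't reproduced here, but a sketch of how such an animation can be put together with Celluloid (the frame interval, styling and file name below are my own choices, not necessarily what produced figure 5) looks like this:

import matplotlib.pyplot as plt
from celluloid import Camera   # pip install celluloid

fig = plt.figure()
camera = Camera(fig)

regression = RegressionModel()
adaptive = robust_loss_pytorch.adaptive.AdaptiveLossFunction(
    num_dims=1, float_dtype=np.float32)
params = list(regression.parameters()) + list(adaptive.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)

x_plot = torch.linspace(0., 1., 100)
for epoch in range(2000):
    y_i = regression(x)
    loss = torch.mean(adaptive.lossfun((y_i - y)[:, None]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:                       # grab a frame every 50 iterations
        plt.scatter(x.numpy(), y.numpy(), s=10, c='tab:blue')
        with torch.no_grad():
            plt.plot(x_plot.numpy(), regression(x_plot).numpy(), c='tab:red')
        camera.snap()                          # celluloid stores the current axes as a frame

animation = camera.animate()                   # stitch the frames together
animation.save('adaptive_fit.gif')             # writer availability depends on your setup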

Discussion:
We have seen how the robust loss with the hyperparameter α can be used to find the best loss function on the fly. The paper also demonstrates how the robustness of the loss function, with α as a continuous hyperparameter, can be introduced into classic computer vision algorithms. Examples of using the adaptive loss for variational autoencoders and monocular depth estimation are shown in the paper, and that code is also available in Jon's GitHub. However, the most fascinating part for me was the motivation and the step-by-step derivation of the loss function described in the paper. It's easy to read, so I suggest taking a look!
Stay strong and cheers!!
References:
[1] "A General and Adaptive Robust Loss Function"; J. Barron, Google Research.
[2] Robust-Loss: Linear regression example; Jon Barron’s GitHub.
[3] Surface plot of robust loss and animation: GitHub Link.