
Loss Functions in Machine Learning

Understand the most common loss functions and when to use each one

Photo by Brett Jordan on Unsplash

Loss functions have an important role in machine learning as they guide the learning process of the model and define its objective.

There is a large number of loss functions available and choosing the proper one is crucial for training an accurate model. Different choices of a loss function can lead to different classification or regression models.

In this article we will discuss the most commonly used loss functions, how they operate, their pros and cons, and when to use each one.

What is a Loss Function?

Recall that in supervised machine learning problems, we are given a training set of n labeled samples: D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where xᵢ represents the features of sample i and yᵢ represents the label of that sample. Our goal is to build a model whose predictions are as close as possible to the true labels.

A loss function measures the model’s prediction error for a given sample, i.e., the difference between the model’s predicted value and the true value for that sample. It takes two arguments, the true label of the sample y and the model’s prediction ŷ, and is typically written as L(y, ŷ).

During the training of the model, we tune its parameters so as to minimize the loss function on the given training samples.

Note that a **loss function** calculates the error per sample, while a **cost function** calculates the error over the whole data set (although these two terms are sometimes used interchangeably).
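
As a quick illustration, here is a minimal NumPy sketch of this distinction, using squared error as the per-sample loss (the arrays are my own example):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])

# Loss: the error of a single sample
sample_loss = (y_true[0] - y_pred[0]) ** 2

# Cost: the error aggregated over the whole data set (here, the mean loss)
cost = np.mean((y_true - y_pred) ** 2)

print(sample_loss, cost)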

Desired Properties of a Loss Function

Ideally, we would like the loss function to have the following properties:

  • The loss function should reflect the objective the model is trying to achieve. For example, in regression problems our goal is to minimize the differences between the predictions and the target values, while in classification our goal is to minimize the number of misclassification errors.
  • Continuous and differentiable everywhere. Most optimization algorithms, such as gradient descent, require the loss function to be differentiable.
  • Convex. A convex function has only one global minimum point, thus optimization methods like gradient descent are guaranteed to return the globally optimal solution. In practice, this property is hard to achieve, and most loss functions are non-convex (i.e., they have multiple local minima).
  • Symmetric, i.e., the error above the target should cause the same loss as the same error below the target.
  • Fast to compute.

Loss Functions and Maximum Likelihood

Many of the loss functions used in machine learning can be derived from the maximum likelihood principle (see my previous article for an explanation of maximum likelihood).

In maximum likelihood estimation (MLE) we are trying to find a model with parameters θ that maximizes the probability of the observed data given the model: P(D|θ). To simplify the likelihood function, we typically take its logarithm, and then we try to maximize the log likelihood: log P(D|θ).

Therefore, we can define a loss function for a given sample (x, y) as the negative log likelihood of observing its true label given the prediction of our model:

Loss function as the negative log likelihood: L(y, ŷ) = −log P(y|ŷ)

Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the loss.
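
For a training set of independent samples, the likelihood factorizes over the samples, so maximizing the log likelihood of the whole data set is the same as minimizing the sum of the per-sample losses:

arg max_θ log P(D|θ) = arg max_θ Σᵢ log P(yᵢ|xᵢ; θ) = arg min_θ Σᵢ [−log P(yᵢ|xᵢ; θ)]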

Note that to use this technique to define loss functions, we need to assume that the data set is generated from some known probability distribution.

In the next sections we will discuss the most common loss functions used in different types of problems (regression, binary classification and multi-class classification).

Regression Problems

In regression problems, both the target label and the model’s prediction take continuous values. The three most commonly used loss functions in regression problems are squared loss, absolute loss, and Huber loss.

Squared Loss

Squared loss is defined as the squared difference between the target label and its predicted value:

Squared loss: L(y, ŷ) = (y − ŷ)²

This loss function is used in ordinary least squares (OLS), which is the most common method for solving linear regression problems.

Pros:

  • Continuous and differentiable everywhere
  • Convex (has only one global minimum)
  • Easy to compute
  • Under the assumption that the labels have Gaussian noise, minimizing the squared loss is equivalent to maximizing the likelihood of the model given the data. You can find a proof of this statement in my previous article.

Cons:

  • Sensitive to outliers due to the squaring of the errors. A small number of samples that are distant from the other samples can cause a large change in the model (as will be demonstrated later).

Absolute Loss

The absolute loss is defined as the absolute difference between the true label and the model’s prediction:

Absolute loss: L(y, ŷ) = |y − ŷ|

Pros:

  • Not overly affected by outliers
  • Easy to compute

Cons:

  • Non-differentiable at 0, which makes it harder to use in optimization methods such as gradient descent.
  • Does not have a maximum likelihood interpretation

Huber Loss

Huber loss is a combination of squared loss and absolute loss. For absolute errors smaller than a predefined parameter δ, it uses the squared error, and for larger errors it grows linearly like the absolute error.

The mathematical definition of Huber loss is:

Huber loss: L(y, ŷ) = ½(y − ŷ)² if |y − ŷ| ≤ δ, and δ(|y − ŷ| − ½δ) otherwise

δ is typically set to 1.

Huber loss is commonly used in deep learning where it helps to avoid the exploding gradient problem due to its insensitivity to large errors.

Pros:

  • Continuous and differentiable everywhere
  • Less sensitive to outliers than squared loss

Cons:

  • Slower to compute
  • Requires tuning of the hyperparameter δ
  • Does not have a maximum likelihood interpretation
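
Here is a minimal NumPy sketch of the three losses, showing how differently they react to a large error (the function names and example values are my own):

import numpy as np

def squared_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def huber_loss(y, y_pred, delta=1.0):
    error = np.abs(y - y_pred)
    # Quadratic for small errors, linear for large ones
    return np.where(error <= delta,
                    0.5 * error ** 2,
                    delta * (error - 0.5 * delta))

# A single large error (e.g., an outlier) dominates the squared loss
print(squared_loss(10.0, 1.0))   # 81.0
print(absolute_loss(10.0, 1.0))  # 9.0
print(huber_loss(10.0, 1.0))     # 8.5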

The following graph shows the three regression loss functions:

Loss functions for regression problems

Scikit-Learn Example

The SGDRegressor class fits a linear regression model to a given data set using stochastic gradient descent (SGD). Its loss parameter can be used to choose the loss function for the optimization. The options of this parameter are:

  • squared_error (squared loss). This is the default option.
  • huber (Huber loss)
  • epsilon_insensitive (the loss function used in Support Vector Regression)

Let’s examine the effect of using the Huber loss instead of squared loss on a sample data set that contains an outlier.

We first define our data set:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.5, 1.8, 2.4, 3.5, 4.2, 4.8, 5.8, 6.1, 7.2, 8.7, 10])
y = np.array([0.1, 0.2, 0.3, 0.4, 0.7, 1, 0.9, 1.2, 1.4, 1.8, 10])

Let’s plot the data points:

def plot_data(x, y):
    plt.scatter(x, y)
    plt.xlabel('$x$')
    plt.ylabel('$y$')
    plt.grid()
plot_data(x, y)
The training set

Clearly the point (10, 10) is an outlier.

Next, we fit two SGDRegressor models to this data set: one with a squared loss function and another with a Huber loss.

from sklearn.linear_model import SGDRegressor

X = x.reshape(-1, 1) # Convert x to a matrix with one column

reg = SGDRegressor(loss='squared_error')
reg.fit(X, y)

reg2 = SGDRegressor(loss='huber')
reg2.fit(X, y)

Let’s plot the two regression lines found by these models:

def plot_regression_line(x, y, w0, w1, color, label):
    p_x = np.array([x.min(), x.max()])
    p_y = w0 + w1 * p_x
    plt.plot(p_x, p_y, color, label=label)
plot_data(x, y)
plot_regression_line(x, y, reg.intercept_, reg.coef_[0], 'r', label='Squared loss')
plot_regression_line(x, y, reg2.intercept_, reg2.coef_[0], 'm', label='Huber loss')
plt.legend()
The regression lines found by squared and Huber losses

It is evident that the model trained with squared loss was much more affected by the outlier than the model trained with Huber loss.
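
To quantify the difference, you can also print the fitted parameters of the two models; with this data set, the slope found with squared loss is typically pulled upward by the outlier:

print(f'Squared loss: intercept={reg.intercept_[0]:.2f}, slope={reg.coef_[0]:.2f}')
print(f'Huber loss:   intercept={reg2.intercept_[0]:.2f}, slope={reg2.coef_[0]:.2f}')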

Binary Classification Problems

In binary classification problems, the ground truth labels are binary (1/0 or 1/-1). The predicted value of the model can be either binary (a hard label) or a probability estimate that the given sample belongs to the positive class (a soft label).

Examples of classification models that provide only hard labels include support vector machines (SVMs) and K-nearest neighbors (KNN), while models such as logistic regression and neural networks (with a sigmoid output) also provide a probability estimate.
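
For instance, in Scikit-Learn a logistic regression model exposes both hard labels and probability estimates, while a standard SVM classifier exposes only hard labels and raw decision values. A minimal sketch on a toy data set of my own:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

log_reg = LogisticRegression().fit(X, y)
print(log_reg.predict(X))        # hard labels (0/1)
print(log_reg.predict_proba(X))  # probability estimates for each class

svm = SVC().fit(X, y)
print(svm.predict(X))            # hard labels only
print(svm.decision_function(X))  # raw decision values, not probabilities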

0–1 Loss

The simplest loss function is the zero-one loss function (also called misclassification error):

Zero-one loss: L(y, ŷ) = I(y ≠ ŷ)

I is the indicator function that returns 1 if its input is true, and 0 otherwise.

For every sample that the classifier misclassifies, a loss of 1 is incurred, whereas correctly classified samples incur zero loss.

The 0–1 loss function is often used to evaluate classifiers, but is not useful in guiding optimization since it is non-differentiable and non-continuous.
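
A minimal NumPy sketch of the zero-one loss (the label arrays are my own example):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

# Zero-one loss per sample: 1 for a misclassified sample, 0 otherwise
zero_one = (y_true != y_pred).astype(int)
print(zero_one)         # [0 1 0 1 0]
print(zero_one.mean())  # misclassification rate: 0.4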


Log Loss

Log loss (also called logistic loss or binary cross-entropy loss) is used to train models that provide class probability estimates such as logistic regression.

Let us denote the probability estimate given by the model that the sample belongs to the positive class by p, i.e., p = P(y = 1|x).

Then log loss is defined as:

Log loss: L(y, p) = −[y log p + (1 − y) log(1 − p)]

How did we get to this loss function? Again we are going to use the maximum likelihood principle. More specifically, we will show that log loss is the negative log likelihood under the assumption that the labels have a Bernoulli distribution (a probability distribution of a binary random variable that takes 1 with probability p and 0 with probability 1 − p). Mathematically, this assumption can be written as y ∼ Bernoulli(p).

Proof:

Given a model of the data (the labels) as a Bernoulli distribution with parameter p, the probability that a sample belongs to the positive class is simply p, i.e., P(y = 1|p) = p.

Similarly, the probability that the sample belongs to the negative class is P(y = 0|p) = 1 − p.

We can write these two equations more compactly as P(y|p) = pʸ(1 − p)¹⁻ʸ.

Explanation: when y = 1, pʸ = p and (1 − p)¹⁻ʸ = 1, therefore P(y|p) = p. Similarly, when y = 0, pʸ = 1 and (1 − p)¹⁻ʸ = 1 − p, therefore P(y|p) = 1 − p.

Therefore, the log likelihood of the data is log P(y|p) = y log p + (1 − y) log(1 − p).

The log loss is exactly the negative of this function!


The log loss function is differentiable and convex, i.e., it has a unique global minimum.
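
As a sanity check, here is a minimal sketch that computes the log loss with NumPy and compares the mean against Scikit-Learn’s log_loss (the helper function and example values are my own):

import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.3])

print(binary_log_loss(y_true, p_pred))        # per-sample losses
print(binary_log_loss(y_true, p_pred).mean()) # mean log loss over the data set
print(log_loss(y_true, p_pred))               # scikit-learn computes the same mean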

Hinge Loss

Hinge loss is used for training support vector machines (SVMs), where the goal is to maximize the margin that separates the two classes while minimizing margin violations.

SVM

The hinge loss is defined as follows:

Hinge loss: L(y, ŷ) = max(0, 1 − y·ŷ)

Note that ŷ here is the raw output of the classifier’s decision function, i.e., ŷ = w·x (SVM does not provide probability estimates).

When y and ŷ have the same sign (i.e., the model predicts the correct class) and |ŷ| ≥ 1, the hinge loss is 0. This means that correctly classified samples that lie outside the margin do not contribute to the loss (the solution would be the same if these samples were removed). However, samples that lie inside the margin (|ŷ| < 1) still incur a small loss, even if the model’s prediction is correct. When y and ŷ have opposite signs, the hinge loss grows linearly with |ŷ|.
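
A minimal NumPy sketch of the hinge loss (the example predictions are my own):

import numpy as np

def hinge_loss(y, y_pred):
    # y is the true label in {-1, 1}; y_pred is the raw decision function output
    return np.maximum(0, 1 - y * y_pred)

print(hinge_loss(1, 2.5))   # 0.0 -- correct side of the boundary, outside the margin
print(hinge_loss(1, 0.4))   # 0.6 -- correct side, but inside the margin
print(hinge_loss(1, -1.0))  # 2.0 -- misclassified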

Support vector machines will be covered in more detail in a future article.


The following graph shows the three classification loss functions:

Loss functions for binary classification problems

Both log loss and hinge loss can be seen as continuous approximations to the 0–1 loss.
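
If you want to reproduce such a comparison yourself, here is a minimal sketch that plots each loss as a function of the margin y·ŷ, expressing log loss in terms of the margin via the logistic function (natural logarithms; the exact scaling in the original figure may differ):

import numpy as np
import matplotlib.pyplot as plt

# Margin m = y * y_hat, with y in {-1, 1} and y_hat the raw model output
m = np.linspace(-2, 2, 200)

zero_one = (m < 0).astype(float)   # 0-1 loss
hinge = np.maximum(0, 1 - m)       # hinge loss
logistic = np.log(1 + np.exp(-m))  # log loss, with p = sigmoid(y_hat)

plt.plot(m, zero_one, label='0-1 loss')
plt.plot(m, hinge, label='Hinge loss')
plt.plot(m, logistic, label='Log loss')
plt.xlabel(r'$y \cdot \hat{y}$')
plt.ylabel('Loss')
plt.legend()
plt.grid()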

Multi-Class Classification Problems

In multi-class classification problems, the target label is 1 out of k classes. The label is usually encoded using one-hot encoding, i.e., as a binary k-dimensional vector y = (y₁, …, yₖ), where yᵢ = 1 for the true class i and 0 elsewhere.

A probabilistic classifier outputs for each sample a k-dimensional vector with probability estimates of each class: p = (p₁, …, pₖ). These probabilities sum to 1, i.e., p₁ + … + pₖ = 1.

Cross-Entropy Loss

The loss function used to train such a classifier is called cross-entropy loss, which is an extension of log loss to the multi-class case. It is defined as follows:

Cross-entropy loss: L(y, p) = −Σᵢ yᵢ log pᵢ

For example, assume that we have a three-class problem, the true class of our sample is class 2 (i.e., y = [0, 1, 0]), and the prediction of our model is p = [0.3, 0.6, 0.1]. Then the cross-entropy loss induced by this sample is L(y, p) = −(0·log 0.3 + 1·log 0.6 + 0·log 0.1) = −log 0.6 ≈ 0.511.
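
A quick NumPy check of this calculation:

import numpy as np

y = np.array([0, 1, 0])        # one-hot label for class 2
p = np.array([0.3, 0.6, 0.1])  # predicted class probabilities

cross_entropy = -np.sum(y * np.log(p))
print(cross_entropy)  # 0.5108..., i.e., -log(0.6)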

To see how the cross-entropy loss generalizes log loss, notice that in the binary case p₁ = 1 − p₀ and y₁ = 1 − y₀, therefore we get:

L(y, p) = −(y₀ log p₀ + y₁ log p₁) = −[y₀ log p₀ + (1 − y₀) log(1 − p₀)]

which is exactly the log loss for p = p₀ and y = y₀.

Similar to log loss, we can show that cross-entropy loss is the negative of the log-likelihood of the model, under the assumption that the labels are sampled from a categorical distribution (a generalization of Bernoulli distribution to k possible outcomes).

Proof:

Given a model of the data (the labels) as a categorical distribution with probabilities p = (p₁, …, pₖ), the probability that a given sample belongs to class i is simply pᵢ.

Therefore, the probability that the true label of the sample is y is P(y|p) = ∏ᵢ pᵢ^yᵢ.

Explanation: if the correct class of the given sample is i, then yᵢ = 1, and for all j ≠ i, yⱼ = 0. Hence, P(y|p) = pᵢ, which is the probability that the sample belongs to class i.

Therefore, the log likelihood of our model is log P(y|p) = Σᵢ yᵢ log pᵢ.

The cross-entropy loss is exactly the negative of this function!


Key Takeaways

  • In this article we have discussed various loss functions and shown how they are derived from basic principles such as maximum likelihood.
  • In regression problems, squared loss is the most common loss function. However, if you suspect that your data set contains outliers, using Huber loss may be a better choice.
  • In binary classification problems, the choice of a different loss function leads to a different classifier (logistic regression uses log loss while SVM uses hinge loss).
  • In multi-class classification problems, cross-entropy loss is the most common loss function; it is an extension of log loss to the multi-class case.

Final Notes

All images unless otherwise noted are by the author.

The code examples of this article can be found on my github: https://github.com/roiyeho/medium/tree/main/loss_functions

Thanks for reading!

