Generalized Huber Regression

Damian Draxler
Towards Data Science
7 min read · Aug 20, 2019


In this post we present a generalized version of the Huber loss function which incorporates ideas from Generalized Linear Models (GLMs) and is well suited for heteroscedastic regression problems. We will discuss how to optimize this loss function with gradient-boosted trees and compare the results to classical loss functions on an artificial data set.

Let’s first start with a brief recap of the Huber loss function and the basics of Generalized Linear Models (GLM).

Huber loss & Generalized Linear Models

The Huber loss [Huber] is a robust loss function for regression problems defined as

𝓛_α(y, ŷ) = ½ (y − ŷ)²  for |y − ŷ| ≤ α,   𝓛_α(y, ŷ) = α (|y − ŷ| − α/2)  otherwise,   (1)

where y is the target variable, ŷ denotes the corresponding prediction and α ∈ ℝ⁺ is a hyperparameter. It is tempting to look at this loss as the negative log-likelihood of an underlying heavy-tailed error distribution. Indeed, for absolute errors smaller than α the corresponding distribution resembles the normal distribution; outside this region it coincides with the more heavy-tailed Laplace distribution. This is precisely the reason why this loss is robust against outliers.
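For concreteness, a minimal NumPy sketch of equation (1) might look as follows (the function and variable names are ours, not from any particular library):

import numpy as np

def huber_loss(y, y_pred, alpha=1.0):
    """Elementwise Huber loss, equation (1): quadratic for small residuals,
    linear for residuals whose absolute value exceeds alpha."""
    residual = y - y_pred
    quadratic = 0.5 * residual**2
    linear = alpha * (np.abs(residual) - 0.5 * alpha)
    return np.where(np.abs(residual) <= alpha, quadratic, linear)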

That’s all we need to know for now about the Huber loss. Let us thus jump straight to GLMs, which are best understood by first recalling the assumptions of linear regression. In linear regression one assumes that the error term in the linear relationship between the dependent variable Y and some feature vector X is normally distributed with mean zero and constant variance σ², i.e. Y = X^⊤β + ε with ε ~ 𝓝(0, σ²) and β being the vector of regression coefficients. One is interested in finding the best estimate β̂ that minimizes a quadratic cost function (corresponding to the negative log-likelihood of the distribution of ε). The estimate ŷ for a given X is then simply ŷ(X) = E[Y|X] = X^⊤β̂. Note that (in a maximum-likelihood interpretation) Huber regression replaces the normal distribution with a more heavy-tailed distribution but still assumes a constant variance.

The GLM approach on the other hand relaxes the assumptions of linear regression in the following way:

1. Non-normality of the random component: the conditional distribution Y|X is allowed to be any member of the exponential family, not just the normal distribution.

2. Non-linearity introduced by a link function g: the conditional mean satisfies g(E[Y|X]) = X^⊤β.

The exponential family contains a variety of distributions, in particular some where the variance is a function of the mean, like the Poisson or Gamma distribution. This feature comes in handy for heteroscedastic problems, where the assumption of a constant error variance no longer holds, as is often the case for a target variable whose range spans several orders of magnitude. The link function additionally enhances the model’s flexibility by contributing non-linear effects. Note also that the link function is applied to the conditional mean E[Y|X], never to y itself. This stands in contrast with the common practice of fitting a model to the transformed target variable, which often leads to underestimating the mean once the predictions are back-transformed. This can be seen from Jensen’s inequality, which states that E[g(Y)|X] ≤ g(E[Y|X]) for any concave function g, as is the case for the logarithm or the Box-Cox transform. It is further possible to derive a closed-form expression for the GLM likelihood, which results in a broad class of loss functions; see e.g. [MIT] for an excellent introduction to GLMs by P. Rigollet. However, there seems to be no continuous exponential-family distribution that both has support on negative values and has a non-constant variance.
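To make the back-transformation bias tangible, here is a small simulated illustration of ours (assuming a lognormal target and the natural logarithm as transform g):

import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive, right-skewed target

# Back-transforming the mean of log(y) underestimates E[Y], exactly as Jensen's
# inequality predicts for the concave logarithm: exp(E[log Y]) <= E[Y].
print(np.exp(np.mean(np.log(y))))  # roughly 1.0
print(np.mean(y))                  # roughly exp(0.5) ≈ 1.65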

Wouldn’t it thus be possible to combine the idea of a link function with the Huber loss while still having a non-constant variance? The next section tries to address that question.

Generalized Huber Loss Function

For any invertible function g: ℝ → ℝ we define the Generalized Huber Loss (GHL) function as

with α ∈ ℝ⁺, y the target variable and ŷ the continuous output of some predictive model. The most important observation here is that the case distinction is made on the “link scale” defined by g(y), whereas the range is measured on the original scale. This loss function cannot be reduced to a single-variable problem [as is the case for equation (1)]. However, at fixed ŷ_0 one might consider 𝓛(y, ŷ_0) as the negative log-likelihood of an error distribution 𝓟(y, ŷ_0) ~ exp(-𝓛(y, ŷ_0)). Further below we will illustrate that the variance of 𝓟(y, ŷ_0) is a (monotonically) increasing function of ŷ_0.

Let us now discuss what would happen if we measured the range on the link scale as well, i.e. if we took g(y) together with ŷ instead of y and g⁻¹(ŷ) on the right-hand side of equation (2). This would simply correspond to first transforming the target variable and thus estimating E[g(Y)|X]. However, since E[g(Y)|X] ≤ g(E[Y|X]) for any concave function g, we would end up underestimating the mean.

The second option, taking y and g⁻¹(ŷ) in the case distinction of equation (2) as well and thus applying it on the original scale, wouldn’t help much either. Keep in mind that we would like to address problems where the range of y can vary over several orders of magnitude. In such a case we would in general not be able to find a single value of α for which both branches of the case distinction are actually used across the whole range of y. In other words, only with the choice made in equation (2) do we obtain a distribution of non-constant variance.

The loss function in equation (2), however, has jump discontinuities along the lines |g(y) - ŷ| = α, which can be removed in one way or another. The following smoothed version of equation (2) turned out to work nicely in practice:

where ∓ = sgn(g(y) - ŷ).
As shown in Figure [1], the function 𝓛_s(y_0, ŷ) for fixed y_0 has no local minima but is not convex either. Moreover, 𝓛_s(y_0, ŷ) exhibits a region of very small slope, which can lead to convergence issues in gradient-based optimization routines. However, such issues can typically be overcome by choosing a good starting vector. We will discuss this in the example further below.

Figure 1: Left: Smoothed generalized Huber function with y_0 = 100 and α =1. Right: Smoothed generalized Huber function for different values of α at y_0 = 100. Both with link function g(x) = sgn(x) log(1+|x|).
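For reference, the link function used in the figures, g(x) = sgn(x) log(1+|x|), and its inverse can be written down in a few lines (the helper names are ours):

import numpy as np

def g(x):
    """Sign-preserving log link: g(x) = sgn(x) * log(1 + |x|)."""
    return np.sign(x) * np.log1p(np.abs(x))

def g_inv(z):
    """Inverse link: g_inv(z) = sgn(z) * (exp(|z|) - 1)."""
    return np.sign(z) * np.expm1(np.abs(z))

This g is invertible on all of ℝ and grows roughly logarithmically for large |x|, which is what makes a case distinction on the link scale meaningful for targets spanning several orders of magnitude.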

In Figure [2] we illustrate the aforementioned increase of the scale of 𝓟(y, ŷ_0) with increasing ŷ_0. It is precisely this feature that makes the GHL function robust and applicable to heteroscedastic problems. The scale of 𝓟(y, ŷ_0) also increases with increasing α, as shown on the right-hand side of Figure [2]. Note that we did not normalize 𝓟(y, ŷ_0); the corresponding normalization factor would depend on ŷ_0, and it would be interesting to investigate whether a closed-form expression for it can be derived.

Figure 2: Left: (Unnormalized) Error distribution for different values of ŷ_0 at α = 1. Right: (Unnormalized) Error distribution at ŷ_0 = 1 for different values of α. Both with link function g(x) = sgn(x) log(1+|x|).

In order to optimize 𝓛_s(y, ŷ) with gradient methods we need the gradient and the Hessian of this function. These are, however, rather lengthy expressions and are therefore deferred to the Appendix below. Let us now turn to a final example.

Example

Let us consider a simple one-dimensional problem with a skew-normal error distribution of non-constant variance, i.e. y = x + ε_skewnormal. We used the SciPy skewnorm object with skewness parameter a = 100, loc = 0 and scale = 1 + |x|², and created a linearly spaced grid of N = 10⁶ examples on the interval [-500, 500]. We then trained three LightGBM models: a root mean squared error (RMSE) model on g(y), a mean absolute error (MAE) model on g(y) and a GHL model, all with link function g(x) = sgn(x) log(1+|x|). The models were trained with early stopping on the R²-score metric using 3-fold cross validation. The free parameter of the GHL function was also cross-validated and chosen to be α = 0.67. All sampling parameters in LightGBM were set to 1 and num_leaves = 31, leaving all other parameters at their defaults.
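A possible data-generation sketch for this setup (the seed and exact sampling details are our assumptions; the original notebook [GIT] may differ):

import numpy as np
from scipy.stats import skewnorm

N = 10**6
x = np.linspace(-500, 500, N)

# Skew-normal noise with skewness a = 100, location 0 and a scale growing
# like 1 + |x|^2, i.e. a strongly heteroscedastic error term.
noise = skewnorm.rvs(a=100, loc=0, scale=1 + np.abs(x)**2,
                     size=N, random_state=42)  # seed is an assumption
y = x + noise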

In general one needs a good starting vector in order to converge to the minimum of the GHL loss function. Here we first trained a small LightGBM model of only 20 trees on g(y) with the classical Huber objective (Huber parameter α = 2). The output of this model was then used as the starting vector (init_score) of the GHL model, as sketched below.
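In code, this warm start could look roughly as follows (a sketch using the LightGBM scikit-learn interface; x, y and g are taken from the sketches above, and ghl_grad_hess is a placeholder for the gradient and Hessian given in the Appendix, not an existing function):

import lightgbm as lgb

X = x.reshape(-1, 1)  # single feature from the data-generation sketch above

# Stage 1: small warm-up model (20 trees) with the built-in Huber objective
# (Huber parameter alpha = 2), fitted to the transformed target g(y).
warmup = lgb.LGBMRegressor(objective="huber", alpha=2.0, n_estimators=20)
warmup.fit(X, g(y))
init_score = warmup.predict(X)

# Stage 2: the GHL model on the original-scale target, warm-started via
# init_score. ghl_grad_hess stands for (gradient, hessian) of the smoothed
# GHL function evaluated at the current predictions.
def ghl_objective(y_true, y_pred):
    return ghl_grad_hess(y_true, y_pred, alpha=0.67)

# n_estimators is a stand-in here; the post uses early stopping with 3-fold CV.
ghl_model = lgb.LGBMRegressor(objective=ghl_objective, n_estimators=1000,
                              num_leaves=31)
ghl_model.fit(X, y, init_score=init_score)

# Original-scale predictions would then be obtained as g_inv(ghl_model.predict(X)).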

Figure [3] shows a (binned) scatter plot of the test set predictions (0.7/0.3 train/test split) of all three models, with the predictions closest to the trend line coming from the GHL model. The table below also summarizes some metrics and shows that the GHL model achieves significantly better results on the test set while at the same time needing fewer trees than the other two models.

A Jupyter Notebook of this example as well as a linear estimator package can be found here [GIT]. The linear estimator can also be installed with pip:

pip install ghlestimator

Figure 3: Comparison of RMSE, MAE and (smoothed) Generalized Huber predictions. Predictions are partitioned into 150 equally sized bins and within each bin the mean prediction is plotted against the true mean of that partition. Black dots show a 45-degree trend line.

Finally, I’d like to thank my colleague Sebastian Hofer for fruitful discussions on this topic.

Appendix

The gradient of 𝓛_s(y, ŷ) is given by

and the Hessian is found to be

where we have defined A = (g⁻¹(ŷ ± α) − g⁻¹(ŷ)) and B = (y − g⁻¹(g(y) ∓ α)) with ∓ = sgn(g(y) − ŷ).
