R-NL: Robust Nonlinear Shrinkage

High-dimensional covariance estimation when tails are heavy

Jeffrey Näf
Towards Data Science

--

In this article, I discuss a new covariance estimation method from our recent paper "R-NL: Covariance Matrix Estimation for Elliptical Distributions based on Nonlinear Shrinkage". I introduce the problem we are solving, try to give some intuition about how we solve it, and briefly present the simple code we developed. Along the way, I touch upon some interesting concepts, like the (robust) "Tyler's estimator", which I feel are somewhat underused in modern data science, probably because authors rarely provide code with their papers, or maybe because most of the work on this topic appears in the signal processing community, which is another reason it may get overlooked by data scientists.

Nonlinear Shrinkage

Nonlinear shrinkage is a powerful tool in high-dimensional covariance estimation. In the new paper, my coauthors and I introduce an adapted version that tends to produce even better results in heavy-tailed models, while keeping the strong results in other cases. To showcase this new approach and the problem it solves, let's start with an R example. We first load the necessary functions and define the dimension p and the number of observations n, as well as the true covariance matrix we want to study.

# For simulating the data:
library(mvtnorm)
# The NL/QIS method available on https://github.com/MikeWolf007/covShrinkage/blob/main/qis.R
source("qis.R")
# Set the seed, n and p
set.seed(1)
# p quite high relative to n
n<-300
p<-200
# Construct the dispersion matrix (an AR(1)-type correlation structure)
Sig<-sapply(1:p, function(i) sapply(1:p, function(j) 0.7^abs(i-j)))

The covariance matrix we defined here corresponds to an AR(1) process. That is, while the observations are independent, the dimensions X_i and X_j are less related the larger the difference between i and j is in absolute value. This means correlations get exponentially smaller the farther away one moves from the diagonal, so there is actually some structure to be learned here.
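As a quick sanity check (an addition to the original code), the first row of Sig indeed decays geometrically away from the diagonal:

# Correlations decay as 0.7^|i-j| away from the diagonal
round(Sig[1, 1:6], 3)
# 1.000 0.700 0.490 0.343 0.240 0.168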

First, let’s simulate an i.i.d. sample of random vectors from a Gaussian distribution with the correlation structure defined before:

### Multivariate Gaussian case
X<-rmvnorm(n = n, sigma = Sig)

In the nonlinear shrinkage article, I explained what the optimal estimator is if one is only allowed to modify the eigenvalues of the sample covariance matrix, but has to leave the eigenvectors intact (which is what a lot of shrinkage methods do): Let thus u_j, j = 1, ..., p, be the eigenvectors of the sample covariance matrix and

U = (u_1, ..., u_p)

the matrix of eigenvectors. Then the optimal values are given by

d*_j = u_j' Σ u_j,  j = 1, ..., p,    (1)

resulting in the optimal estimator

Σ* = U diag(d*_1, ..., d*_p) U'.

Note that the optimal values in (1) are not exactly equal to the true eigenvalues in general, but instead linear combinations of the true eigenvalues, weighted according to the (fixed) sample eigenvectors.
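As a small sketch (not part of the original code), the optimal estimator in the display above can be assembled explicitly; it is the best we can do while keeping the sample eigenvectors fixed:

# Sketch: assemble the optimal rotation-equivariant estimator from (1)
U <- eigen(cov(X))$vectors               # sample eigenvectors u_1,...,u_p
d_star <- diag(t(U) %*% Sig %*% U)       # optimal values d*_j = u_j' Sig u_j
Sig_star <- U %*% diag(d_star) %*% t(U)  # optimal estimator with fixed eigenvectors

Of course, Sig_star is an oracle that uses the true Sig; it only serves as the benchmark that nonlinear shrinkage tries to approach from data.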

So let’s calculate the sample covariance matrix and the nonlinear shrinkage matrix and check how close we are to this ideal case:

## Sample Covariance Matrix
samplespectral<-eigen(cov(X))
## Nonlinear Shrinkage
Cov_NL<-qis(X)
NLvals<-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors ), decreasing=T)
## Optimal: u_j'*Sig*u_j for all j=1,...,p
optimalvals<-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors ), decreasing=T)
plot(sort(samplespectral$values, decreasing=T), type="l", cex=1.5, lwd=2, lty=2, ylab="Eigenvalues")
lines(optimalvals, type="l", col="red", cex=1.5, lwd=2, lty=1)
lines(NLvals, type="l", col="green", cex=1.5, lwd=2, lty=3)
title(main="Multivariate Gaussian")
legend(200, 8, legend=c("Sample Eigenvalues", "Attainable Truth", "NL"),col=c("black", "red", "green"), lwd=2, lty=c(2,1,3), cex=1.5)

which gives the plot:

Source: Author

The plot shows the optimal values from (1), as well as the sample and nonlinear shrinkage estimates. It looks like one would hope: the sample eigenvalues show excess dispersion (too large for large values, too small for small ones), while nonlinear shrinkage is pretty close to the ideal values. This is what we would expect, simply because the dimension p=200 is quite high compared to the sample size n=300. The larger p is relative to n, the worse this would look for the sample covariance matrix.
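To put a single number on this (a small addition, reusing the oracle Sig_star assembled above), one can also compare distances to the true covariance matrix in Frobenius norm; in our experience, the sample covariance error comes out markedly larger than the nonlinear shrinkage error in this setting:

# Frobenius-norm distances to the true covariance matrix (smaller is better)
norm(cov(X) - Sig, type = "F")      # sample covariance
norm(Cov_NL - Sig, type = "F")      # nonlinear shrinkage
norm(Sig_star - Sig, type = "F")    # oracle with fixed sample eigenvectors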

Now we do the same, but simulating from a multivariate t-distribution with 4 degrees of freedom, a (very) heavy-tailed distribution:

### Multivariate t case
X<-rmvt(n=n, sigma=Sig, df=4)
## Truth: the covariance of a t distribution with df=4 is df/(df-2) times the dispersion matrix
Sig_true<-4/(4-2)*Sig
## Sample Covariance Matrix
samplespectral<-eigen(cov(X))
## Nonlinear Shrinkage
Cov_NL<-qis(X)
NLvals<-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors ), decreasing=T)
## Optimal: u_j'*Sig_true*u_j for all j=1,...,p
optimalvals<-sort(diag( t(samplespectral$vectors)%*%Sig_true%*%samplespectral$vectors ), decreasing=T)
plot(sort(samplespectral$values, decreasing=T), type="l", cex=1.5, lwd=2, lty=2, ylab="Eigenvalues")
lines(optimalvals, type="l", col="red", cex=1.5, lwd=2, lty=1)
lines(NLvals, type="l", col="green", cex=1.5, lwd=2, lty=3)
title(main="Multivariate t")
legend(200, 40, legend=c("Sample Eigenvalues", "Attainable Truth", "NL"),col=c("black", "red", "green"), lty=c(2,1,3), cex=1.5, lwd=2)

which gives the plot:

Source: Author

Now, this doesn’t look as good anymore! The nonlinear shrinkage values in green also show some excess dispersion: Large values get way too large, while small values are a bit too small. Apparently, in finite samples, heavy tails can distort nonlinear shrinkage. It would be nice to have a method that displays the same amazing results in the Gaussian case but also keeps good results in heavy-tailed models. This is the motivation behind our new method. I now go into some details, beginning with the key to the approach: elliptical distributions.

Elliptical distributions

The class of elliptical distributions includes a reasonable range of different distributions, such as the multivariate Gaussian, the multivariate t, the multivariate generalized hyperbolic, and so on. If a random vector X follows an elliptical distribution with mean zero (which we assume throughout for simplicity), it can be written as

X = R * H^{1/2} * S,

where S is uniformly distributed on the unit sphere in p dimensions and R is some nonnegative random variable independent of S. This may sound complicated, but it just means that an elliptical distribution can be reduced to a uniform distribution on a circle (in two dimensions) or on a sphere (in general). Thus these kinds of distributions have a very specific structure. In particular, we need to mention an important point: In the equation above, H is called the dispersion matrix, as opposed to the covariance matrix. In this article, we want to estimate the covariance matrix, which, if it exists, is given as

Cov(X) = E[R²]/p * H.
This is already interesting: In elliptical models, H exists by assumption, but the covariance matrix might not, namely if E[R²] is not finite! For instance, the multivariate t distribution with 2 or fewer degrees of freedom has no covariance matrix. Nonetheless, we could still estimate H in this case, so the dispersion matrix is in a sense the more general concept. In this article, however, we will assume the covariance matrix exists, and in this case, we see from the above that dispersion and covariance matrix are the same up to a constant.
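To make the representation above concrete, here is a small simulation sketch (not from the paper) that builds a Gaussian vector via X = R * H^{1/2} * S and checks the covariance formula numerically:

# Sketch: simulate X = R * H^(1/2) * S and check Cov(X) = E[R^2]/p * H
d <- 5                                   # small dimension for a quick check
H <- 0.5^abs(outer(1:d, 1:d, "-"))       # some dispersion matrix
A <- chol(H)                             # H = t(A) %*% A
W1 <- matrix(rnorm(1e5 * d), ncol = d)
W2 <- matrix(rnorm(1e5 * d), ncol = d)
S <- W1 / sqrt(rowSums(W1^2))            # S: uniform on the unit sphere
R <- sqrt(rowSums(W2^2))                 # R: radial part, independent of S, E[R^2] = d
Xsim <- (R * S) %*% A                    # rows are draws of X; this choice of R makes X Gaussian
max(abs(cov(Xsim) - H))                  # small, since E[R^2]/d * H = H

Plugging in a different radial distribution (for example a heavy-tailed one) changes the tails of X, but leaves S, and hence the structure exploited below, untouched.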

Interestingly, one can then look at Z = X/||X||, that is, the random vector X divided by its Euclidean norm, and the distribution of Z does not depend on R at all, only on H! In particular, if H is (proportional to) the identity matrix, Z is simply uniformly distributed on the p-dimensional sphere, no matter how heavy the tails of X are. We can see this in an example for p=2.

X<-rmvnorm(n = n, sigma = diag(2))
Y<-rmvt(n = n, sigma = diag(2), df=4)
# standardize by norm
ZGaussian<-t(apply(X,1, function(x) x/sqrt(sum(x^2)) ))
Zt <- t(apply(Y,1, function(x) x/sqrt(sum(x^2)) ))
par(mfrow=c(1,2))
plot(ZGaussian)
plot(Zt)

which gives

Source: Author

(Linear-Shrinkage) Tyler’s Estimator

Tyler's estimator of the dispersion matrix H, derived in [1], uses the fact that the distribution of Z = X/||X|| does not depend on R. Using the likelihood of that distribution, one can derive a maximum likelihood estimator (basically just taking derivatives and setting to zero) that looks like this:

H = (p/n) ∑_{i=1}^{n} (x_i x_i') / (x_i' H^{-1} x_i).    (2)

Note that this only defines H implicitly (it appears both on the left and on the right), so the natural way to try to get to H is iterative:

H̃_{t+1} = (p/n) ∑_{i=1}^{n} (x_i x_i') / (x_i' H_t^{-1} x_i),
H_{t+1} = p * H̃_{t+1} / trace(H̃_{t+1}),

where the second step is just a renormalization that is needed for technical reasons. One can show that indeed, this simple iterative scheme converges to the solution of (2). This estimator of H is what is referred to as "Tyler's estimator".
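To make the iteration concrete, here is a minimal sketch of the plain fixed-point scheme in R (my own illustration, not the implementation from our paper; it assumes zero-mean data and p < n):

# Minimal sketch of Tyler's fixed-point iteration (assumes zero mean, p < n)
tyler <- function(X, max_iter = 100, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  H <- diag(p)                                    # starting value
  for (iter in 1:max_iter) {
    w <- 1 / rowSums((X %*% solve(H)) * X)        # 1 / (x_i' H^{-1} x_i)
    H_new <- (p / n) * crossprod(X * w, X)        # (p/n) * sum_i w_i * x_i x_i'
    H_new <- p * H_new / sum(diag(H_new))         # renormalize so that trace(H) = p
    if (max(abs(H_new - H)) < tol) { H <- H_new; break }
    H <- H_new
  }
  H
}

Calling tyler(X) on the data from before returns an estimate of H normalized to have trace p; since the dispersion matrix is only identified up to a constant anyway, this normalization is harmless.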

Tyler's estimator is an iterative estimator of the dispersion matrix H from an i.i.d. sample of an elliptical distribution. It is derived using the fact that an elliptical random vector standardized by its Euclidean norm has a distribution that does not depend on the radial part.

Ok, so this is a way of robustifying the covariance or dispersion estimator against heavy tails. But the above Tyler's estimator only works for p < n and deteriorates when p gets close to n, so we still need to guard against the case when p is close to or even larger than n. A lot of papers in the signal processing community do this simply by using linear shrinkage (which I also explained here) in each iteration. This then looks like this:

H̃_{t+1} = (1 − ρ) * (p/n) ∑_{i=1}^{n} (x_i x_i') / (x_i' H_t^{-1} x_i) + ρ * I,
H_{t+1} = p * H̃_{t+1} / trace(H̃_{t+1}),

where different choices of ρ have inspired different papers. For instance, one of the papers that started this line of research is [2]. We now instead use nonlinear shrinkage together with Tyler's method.
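For reference, a rough sketch of such a linearly-shrunk iteration could look like this (with ρ simply fixed by hand; the cited papers derive data-driven choices, so this is not any specific published estimator):

# Sketch of a linear-shrinkage Tyler iteration with a hand-picked rho
shrink_tyler <- function(X, rho = 0.2, max_iter = 100, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  H <- diag(p)
  for (iter in 1:max_iter) {
    w <- 1 / rowSums((X %*% solve(H)) * X)              # 1 / (x_i' H^{-1} x_i)
    H_new <- (1 - rho) * (p / n) * crossprod(X * w, X) + rho * diag(p)
    H_new <- p * H_new / sum(diag(H_new))               # renormalize
    if (max(abs(H_new - H)) < tol) { H <- H_new; break }
    H <- H_new
  }
  H
}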

One can also robustify Tyler's estimator for high dimensions by using linear shrinkage in each iteration. Different papers have found smart adaptive ways to calculate the free parameter ρ.

Robust Nonlinear Shrinkage

The goal is now to use the same iterative scheme as above, but to iterate with nonlinear instead of linear shrinkage. Unfortunately, this is not so straightforward, and some trickery is needed to make it work. I won't go into detail in this article; instead, I refer to our implementation on GitHub and the paper. The implementation we provide may also be quite handy, because none of the above-mentioned papers from the signal processing community appear to provide code or implement their method in a package.

To showcase the performance and the code, we repeat the multivariate t analysis from the beginning with this new method, restating the whole procedure for completeness:

### Multivariate t case
X<-rmvt(n=n, sigma=Sig, df=4)
## Truth: the covariance of a t distribution with df=4 is df/(df-2) times the dispersion matrix
Sig_true<-4/(4-2)*Sig
## Sample Covariance Matrix
samplespectral<-eigen(cov(X))
## Nonlinear Shrinkage
Cov_NL<-qis(X)
NLvals<-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors ), decreasing=T)
## R-NL: source the RNL function from https://github.com/hedigers/RNL_Code first
Cov_RNL<-RNL(X)
RNLvals<-sort( diag( t(samplespectral$vectors)%*%Cov_RNL%*%samplespectral$vectors ), decreasing=T)
## Optimal: u_j'*Sig_true*u_j for all j=1,...,p
optimalvals<-sort(diag( t(samplespectral$vectors)%*%Sig_true%*%samplespectral$vectors ), decreasing=T)
plot(sort(samplespectral$values, decreasing=T), type="l", cex=1.5, lwd=2, lty=2, ylab="Eigenvalues")
lines(optimalvals, type="l", col="red", cex=1.5, lwd=2, lty=1)
lines(NLvals, type="l", col="green", cex=1.5, lwd=2, lty=3)
lines(RNLvals, type="l", col="darkblue", cex=1.5, lwd=2, lty=4)
title(main="Multivariate t")
legend(200, 40, legend=c("Sample Eigenvalues", "Attainable Truth", "NL", "R-NL"),
col=c("black", "red", "green", "darkblue"), lty=c(2,1,3,4), cex=1.5, lwd=2)

which gives the plot:

Source: Author

R-NL (the blue line) is almost perfectly on the red line and thus mirrors the good performance we saw for NL in the Gaussian case! This is exactly what we wanted. Also, though it might be obscured by the plotting code above, using the function is super easy: RNL(X) gives the estimator of the covariance matrix (if you can assume it exists), and RNL(X, cov=F) gives an estimate of H.
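Concretely, assuming the R file from the linked repository has been sourced so that RNL() is available:

# RNL() comes from https://github.com/hedigers/RNL_Code
Cov_RNL <- RNL(X)               # covariance matrix estimate (assuming it exists)
H_RNL   <- RNL(X, cov = FALSE)  # dispersion matrix estimate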

Finally, if we were to use the RNL function for the Gaussian example at the beginning, the values would look almost exactly the same as they do with nonlinear shrinkage. In fact, the figure below shows simulation results from the paper using our two methods, R-NL and a slight adaptation, R-C-NL, together with a range of competitors. The setting is almost exactly as in the code above, with the same dispersion matrix and n=300, p=200. The difference is that we now vary the tail-determining parameter (the degrees of freedom) of the multivariate t distribution on a grid from 3 (extremely heavy-tailed) to infinity (the Gaussian case). We won't go into details about what the numbers on the y-axis mean exactly; larger is better and 100 is the maximal value. The competitors "R-LS" and "R-GMV-LS" are two linear-shrinkage Tyler's estimators, as mentioned above, while "NL" is nonlinear shrinkage. It can be seen that we are (much) better than NL for heavy tails and then converge to the same values once we approach Gaussian tail behavior.

Simulation results from the paper on arXiv.

Conclusion

This article discussed the paper "R-NL: Covariance Matrix Estimation for Elliptical Distributions based on Nonlinear Shrinkage". The paper on arXiv contains many more simulation settings, indicating that the R-NL and R-C-NL estimators do exceedingly well in a wide range of situations.

Thus I hope that the estimator(s) can be successfully used in a lot of real applications as well!

References

[1] Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Annals of Statistics, 15(1):234–251.

[2] Chen, Y., Wiesel, A., and Hero, A. O. (2011). Robust shrinkage estimation of high dimensional covariance matrices. IEEE Transactions on Signal Processing, 59(9):4097–4107.
