High Dimensional Statistics

Nonlinear Shrinkage: An Introduction

Optimal covariance matrix estimation in high dimensions

Jeffrey Näf
Towards Data Science
9 min read · Jul 18, 2022


Many applications in statistics, machine learning, and other areas such as finance and biology need an accurate estimate of a covariance matrix. However, many of these applications nowadays involve high-dimensional data, and the usual (sample) covariance estimator just doesn't cut it anymore. A myriad of papers over the last two decades have therefore tried to tackle exactly this problem. An extremely powerful method that has emerged from this research is "nonlinear shrinkage" ([1]). In this article, I focus on the newest version of this approach, the quadratic-inverse shrinkage (QIS) of [2], which makes nonlinear shrinkage easy to implement and fast to compute.

Let's start with a basic introduction (if you have already dealt with covariance matrices, you can safely skip this section).

The Basics

A covariance matrix collects the variances and covariances of two or more random variables. That is, if we look at random variables X_1, X_2 and X_3, then the covariance matrix is given as:

Sigma = [ Var(X_1)        Cov(X_1, X_2)   Cov(X_1, X_3) ]
        [ Cov(X_2, X_1)   Var(X_2)        Cov(X_2, X_3) ]
        [ Cov(X_3, X_1)   Cov(X_3, X_2)   Var(X_3)      ]

Since the covariance between X_1 and X_2 is the same as the covariance between X_2 and X_1, the matrix is necessarily symmetric. From linear algebra, we moreover know that this matrix has a so-called eigendecomposition, i.e.,

Sigma = U * Lambda * U',

where U is an orthogonal matrix whose columns are the eigenvectors and U' denotes its transpose.

The important part here is the Lambda matrix, which is a diagonal matrix containing the eigenvalues of the covariance matrix:

Lambda = diag(lambda_1, ..., lambda_p).

Thus, not only can we diagonalize the covariance matrix, but these diagonal elements contain important information. For instance, they immediately tell us the rank of the covariance matrix: it is just the number of nonzero eigenvalues. If the matrix has full rank, i.e. if all eigenvalues are larger than zero, it means that the random variables are fully scattered in the p-dimensional space. This is also super useful because

Sigma^{-1} = U * Lambda^{-1} * U',  with Lambda^{-1} = diag(1/lambda_1, ..., 1/lambda_p).

In particular, we see that the covariance matrix is invertible if and only if all eigenvalues are different from zero, or if the rank of the matrix is p.

Every symmetric matrix can be decomposed into the product of orthogonal matrices and a (real) diagonal matrix. The values of the diagonal matrix reveal important properties and can be used to easily calculate the inverse if it exists.
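To make this concrete, here is a minimal R sketch (base R only, with a small made-up 3 x 3 covariance matrix purely for illustration) that computes the eigendecomposition with eigen() and rebuilds the inverse from the eigenvalues:

# A small, made-up 3 x 3 covariance matrix for illustration
Sig_small <- matrix(c(2.0, 0.5, 0.3,
                      0.5, 1.0, 0.2,
                      0.3, 0.2, 1.5), nrow = 3)
# Eigendecomposition: Sig_small = U %*% diag(lambda) %*% t(U)
e <- eigen(Sig_small)
U <- e$vectors
lambda <- e$values
# The rank is the number of nonzero eigenvalues
sum(lambda > 1e-12)
# The inverse follows by inverting the eigenvalues: U %*% diag(1/lambda) %*% t(U)
max(abs(U %*% diag(1 / lambda) %*% t(U) - solve(Sig_small)))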

The Problem

Assume now we have a sample of n i.i.d. random vectors in p dimensions, with an expected value of zero (just for simplicity). The p x p covariance matrix can be estimated by the usual sample covariance matrix (scm):

S = (1/n) * sum_{i=1}^{n} X_i * X_i'.

This turns out to be exactly the maximum likelihood estimator if your data follows a Gaussian distribution. This estimator has all kinds of favorable properties and is in particular known to converge to the truth when n goes to infinity and p stays constant (or grows slowly relative to n such that p/n goes to zero).

However, notice that in the above matrix we have p*(p+1)/2 distinct elements to estimate: p variances on the diagonal and p*(p-1)/2 covariances off the diagonal (p*(p-1) is the number of ordered pairs of distinct variables, and we divide by 2 because the matrix is symmetric). With a sample of size n, this becomes a problem as soon as n is not much larger than p. This is easiest to see when p > n: in this case the scm, a p x p matrix, can be shown to have rank at most n < p. In particular, even if the true covariance matrix is invertible (i.e. all its eigenvalues are larger than zero), the scm will have at least p - n zero eigenvalues and thus never be invertible. So if you need an estimate of the inverse of the covariance matrix, as in linear discriminant analysis or when evaluating a Gaussian density, this is a problem.
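To see this rank deficiency in action, here is a minimal sketch (not part of the original example; n = 20 and p = 50 are picked arbitrarily) that simulates more variables than observations and inspects the eigenvalues of the scm:

# Minimal sketch: with p > n the sample covariance matrix is singular
set.seed(123)
n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n)   # true covariance is the identity here
S <- cov(X)                           # the p x p sample covariance matrix
ev <- eigen(S, symmetric = TRUE)$values
sum(ev > 1e-12)                       # numerical rank: at most n - 1 = 19 (cov() centers the data)
min(ev)                               # the remaining eigenvalues are (numerically) zero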

This might be extreme, but even if you have, say, p=n/2, the estimation error of the scm can be substantial. Let’s now look at a powerful way of solving this issue.

The Solution

Admittedly, I am being a bit provocative with this heading. Of course, there are many, many solutions to this problem, and depending on the situation some work better than others. Even just changing how you measure success can change the ranking of methods.

An arguably simple idea that has led to a myriad of research and applications is to just take a linear combination of the scm and the identity matrix:

(1 - a) * S + a * I,  with shrinkage intensity 0 <= a <= 1.   (1)

The shrinkage intensity a can be chosen in various ways, though most often it is chosen automatically from the data according to theoretical considerations. If we plug in the eigendecomposition of the scm, S = U * Lambda * U' (with U now collecting the sample eigenvectors), and write the identity as I = U * U', we already see something peculiar:

(1 - a) * S + a * I = U * ( (1 - a) * Lambda + a * I ) * U'.

That is, we simply get new eigenvalues that are a convex combination of the scm eigenvalues and a, while the eigenvectors stay untouched. So if the i-th sample eigenvalue was 2, the regularized value is now (1-a)*2 + a*1. In particular, since the smallest eigenvalue of the scm can be 0 (as happens when p > n), the smallest regularized eigenvalue is at least a. So as long as a > 0 holds, the regularized matrix will always be invertible! The formula might also be familiar from L2-regularized (ridge) regression, where the coefficient estimate is

beta_hat = (X'X + lambda * I)^{-1} * X'y.
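The following minimal sketch (with simulated data; the intensity a = 0.3 is picked by hand here purely for illustration rather than chosen from the data) verifies numerically that linear shrinkage only moves the eigenvalues:

# Minimal sketch: linear shrinkage shifts the eigenvalues but keeps the eigenvectors
set.seed(123)
n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n)
S <- cov(X)
a <- 0.3                                   # shrinkage intensity, picked by hand here
S_shrunk <- (1 - a) * S + a * diag(p)
ev_S      <- eigen(S, symmetric = TRUE)
ev_shrunk <- eigen(S_shrunk, symmetric = TRUE)
# New eigenvalues are the convex combination (1 - a) * lambda_i + a
max(abs(ev_shrunk$values - ((1 - a) * ev_S$values + a)))
# The smallest regularized eigenvalue is (essentially) a, so S_shrunk is invertible
min(ev_shrunk$values)
# The leading eigenvector is unchanged (up to sign)
abs(sum(ev_S$vectors[, 1] * ev_shrunk$vectors[, 1]))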

The principle that the eigenvectors are left unchanged, but the eigenvalues are adapted, is an important one in these shrinkage methods. This makes sense because it is in general hopeless to estimate all the parameters correctly when p is close to or larger than n. But the p eigenvalues are a different story and might be attainable. This is the underlying idea behind nonlinear shrinkage, which takes this a step further.

Estimating the eigenvectors in high dimension is hopeless. So an important principle is to adapt the eigenvalues and just leave the eigenvectors unchanged.

If we choose a "correctly" in the linear shrinkage above, we can actually achieve optimal performance within this class in the asymptotic limit where both n and p go to infinity together. This was established in a 2004 paper ([3]), which derives the asymptotically optimal a, i.e. the optimal eigenvalue estimation when the class of estimators is the linear one above. That is, the class of linear shrinkage estimators is simply all estimators of the form (1) as a varies.

Nonlinear shrinkage derives an asymptotically optimal estimator within a much larger class (one that need not be a linear function of the scm). In particular, it solves the problem

minimize  l( U * Lambda * U' , Sigma )  over all diagonal matrices Lambda,

where U collects the eigenvectors of the scm, Sigma is the true covariance matrix, and l is some loss function. In words: we are looking for the choice of (shrunken) eigenvalues that brings the resulting matrix as close as possible to the true covariance matrix if we are only allowed to use the eigenvectors of the scm! As we have seen before, linear shrinkage is a special case of this, because there we also only change the eigenvalues. However, the way we determine Lambda is now not constrained to be linear, which offers much more flexibility!

Interestingly, the (unattainable) optimal solution of the above problem is quite intuitive:

lambda_j* = u_j' * Sigma * u_j  for j = 1, ..., p,

where u_j are simply the sample eigenvectors from before, i.e. the columns of the eigenvector matrix U of the scm.

That is, the best choice of eigenvalues in this context is not the true eigenvalues, but the values we obtain when we apply the scm eigenvectors to the true covariance matrix.

The best thing we can do for a range of loss functions is to estimate the values that result when the sample eigenvectors are applied to the true covariance matrix.
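For the curious, here is a short sketch of why this is the optimal choice, for the special case of the Frobenius loss (a standard argument in my own notation, not taken verbatim from [1] or [2]):

\min_{\Lambda \,\text{diagonal}} \bigl\| U \Lambda U^\top - \Sigma \bigr\|_F^2
  \;=\; \min_{\Lambda \,\text{diagonal}} \bigl\| \Lambda - U^\top \Sigma U \bigr\|_F^2,

because U is orthogonal and the Frobenius norm is invariant under rotations. The off-diagonal entries of U^\top \Sigma U are fixed, so the best diagonal matrix simply copies its diagonal:

\lambda_j^{*} \;=\; u_j^\top \Sigma \, u_j, \qquad j = 1, \dots, p.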

It turns out that the nonlinear shrinkage method is able to estimate these optimal elements consistently when n,p go to infinity together! Let’s now look at an example.

Code Example

So nonlinear shrinkage is beautiful in theory, but is it actually usable? This is where two very recent papers come in, which took nonlinear shrinkage from being a computation-heavy fancy idea to something that works in the real world. In particular, the QIS method (quadratic-inverse shrinkage, [2]) can be calculated in a few lines of code! But you don't even need to do that, since all necessary code is available here.

Let's look at an application in R, using the qis() function. In this example, we take a challenging true covariance matrix with entries

Sigma_ij = rho^|i - j|,  with rho = 0.7.

So in particular the variances are 1, and the farther apart the indices i and j are, the smaller the covariance between X_i and X_j. Thus the elements of the covariance matrix decay exponentially toward zero as we move away from the diagonal, but many of them are still larger than zero. Let's use n = 800 and p = 1000 in this example, so that p > n.

library(mvtnorm)
source("qis.R")
set.seed(1)
n <- 800
p <- 1000
rho <- 0.7
# Generate the true covariance matrix with entries rho^|i-j|
Sig <- sapply(1:p, function(i) { sapply(1:p, function(j) rho^abs(i - j)) })
# Take the eigendecomposition of the true covariance matrix
spectral <- eigen(Sig)
# Simulate data
X <- rmvnorm(n = n, sigma = Sig)
# Take the eigendecomposition of the scm
samplespectral <- eigen(cov(X))
# Use QIS and take the eigendecomposition
Cov_qis <- qis(X)
qisspectral <- eigen(Cov_qis)
# Rename (the QIS eigenvectors are the same as the scm eigenvectors)
qisspectral$U <- qisspectral$vectors
qisspectral$Lambda <- qisspectral$values
# The attainable truth: u_j' * Sig * u_j for all j = 1, ..., p
whatwewant <- diag(t(qisspectral$U) %*% Sig %*% qisspectral$U)
# Check on the first value whether it is really calculated correctly
(whatwewant[1] - t(qisspectral$U[, 1, drop = FALSE]) %*% Sig %*% qisspectral$U[, 1, drop = FALSE])

The code simulates data from a 1000-dimensional multivariate normal and calculates the sample covariance matrix with its eigenvalues. Notice that this works without problems, even though p > n. This is one of the dangers of the scm: it can always be computed, even when it is not appropriate. Then the QIS matrix is calculated together with its eigenvalues and eigenvectors (the latter are the same as the eigenvectors of the scm). Finally, we calculate the theoretically optimal solution discussed above.

Let’s do some plotting

plot(sort(samplespectral$values, decreasing=T), type="l", cex=0.8, lwd=1.5, lty=1)
lines(sort(spectral$values, decreasing=T), type="l", col="darkblue", cex=0.8, lwd=1.5, lty=2)
lines(sort(whatwewant, decreasing=T), type="l", col="darkred", cex=0.8, lwd=1.5, lty=3)
lines(sort(qisspectral$Lambda, decreasing=T), type="l", col="darkgreen", cex=0.8, lwd=1.5, lty=4)
legend(500, 20, legend=c("Sample Eigenvalues", "True Eigenvalues", "Attainable Truth", "QIS"),
col=c("black", "darkblue", "darkred", "darkgreen"), lty=1:4, cex=0.8)

that gives a plot of the sorted eigenvalues: the sample eigenvalues, the true eigenvalues, the attainable truth u_j'*Sig*u_j, and the QIS eigenvalues.

Why is this plot interesting? First, we see that the sample eigenvalues are quite off: they overshoot the largest eigenvalues and underestimate the smallest. In particular, the last 1000 - 800 = 200 sample eigenvalues are zero. This "overdispersion" is well known in high dimensions: small eigenvalues get estimated too small and large ones too large. On the other hand, we see that the nonlinearly shrunken eigenvalues (in green) are quite close to the true ones (in blue). More importantly, they are very close to the red line, which is the attainable truth from above, i.e. u_j'*Sig*u_j!

Indeed, in Frobenius norm the estimation error of the scm is over 30% larger than that of the QIS estimate:

((norm(cov(X)-Sig,type="F")-norm(Cov_qis-Sig, type="F"))/norm(Cov_qis-Sig, type="F"))
[1] 0.3088145

This difference can also be much larger, depending on the form of the true underlying covariance matrix. Importantly, QIS performs about the same as the usual sample covariance estimator when p is very small compared to n, and gets (much) better as soon as p grows relative to n. Thus, if n is reasonably large (say n > 100), it might even be beneficial to just always use QIS directly!
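If you want to see this pattern yourself, here is a small sketch (assuming qis.R has been sourced as above; the grid of dimensions and the seed are arbitrary choices) that compares the Frobenius errors of the scm and of QIS for several values of p at a fixed n:

# Sketch: compare the scm and QIS across several dimensions p at fixed n
library(mvtnorm)
source("qis.R")
set.seed(2)
n <- 300
rho <- 0.7
for (p in c(30, 150, 300, 600)) {
  Sig <- sapply(1:p, function(i) sapply(1:p, function(j) rho^abs(i - j)))
  X <- rmvnorm(n = n, sigma = Sig)
  err_scm <- norm(cov(X) - Sig, type = "F")
  err_qis <- norm(qis(X) - Sig, type = "F")
  cat("p =", p, " scm error:", round(err_scm, 2), " QIS error:", round(err_qis, 2), "\n")
}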

Conclusion

This article gives an introduction to the method of nonlinear shrinkage of the sample covariance matrix. After a conceptual/mathematical introduction, a small code example illustrated the method in R. Code for the method is also available in Matlab and Python.

There are also many real applications where the method has been used extensively, in particular in finance. Do you know of other applications from your own work where this could be useful? I am always happy to hear about interesting use cases and datasets.

References

[1] Ledoit, O. and Wolf, M. (2012). Nonlinear Shrinkage Estimation of Large-Dimensional Covariance Matrices. The Annals of Statistics, 40(2):1024–1060.

[2] Ledoit, O. and Wolf, M. (2022). Quadratic Shrinkage for Large Covariance Matrices. Bernoulli, 28(3):1519–1547.

[3] Ledoit, O. and Wolf, M. (2004). A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices. Journal of Multivariate Analysis, 88(2):365–411.
