An adaptive lasso for Python

How to build an oracle estimator that knows the truth (with code!)

Álvaro Méndez Civieta
Towards Data Science



This is my second post in the series about penalized regression. In the first one we talked about how to implement a sparse group lasso in Python, one of the best variable selection alternatives available nowadays for regression models. Today I would like to go one step further and introduce the adaptive idea, which can turn your regression estimator into an oracle: something that knows the truth about your dataset.

Today we will see:

  • What are the problems that lasso (and other non-adaptive estimators) face
  • What is the oracle property and why you should use oracle estimators
  • How to obtain the adaptive lasso estimator
  • How to implement an adaptive estimator in Python

Problems of lasso penalization

Let me start with a brief introduction to lasso regression. Imagine you are working with a dataset in which you know that only a few of the variables are truly related to the response variable, but you do not know which ones. Maybe you are dealing with a high dimensional dataset with more variables than observations, in which a simple linear regression model cannot be solved. For example, a genetic dataset formed by thousands of genes in which just a few genes are related to a disease.


So you decide to use lasso, a penalization that adds an L1 constraint to the β coefficients of the regression model.

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

Lasso formulation in linear regression.

This way, you obtain solutions that are sparse, meaning that many of the β coefficients will be sent to 0 and your model will make predictions based on the few coefficients that are not 0.

You have potentially reduced the prediction error of your model by reducing the model complexity (the number of variables different from 0). But as a side effect, you have increased the bias of your estimation of β (this is known as the bias-variance tradeoff).

Lasso provides sparse solutions that are biased, so the variables that lasso selects as meaningful can differ from the truly meaningful variables.


Other penalizations such as ridge regression and sparse group lasso face the same problem: they provide biased solutions and thus can fail to identify the truly meaningful variables in our model.

The oracle property

Our objective then is clear: we want a solution that is not biased, so that we can select the variables from our dataset as if we knew in advance which ones were the truly significant variables. Just as if our estimator were an oracle that knew the truth.

I know, calling a regression estimator an “oracle” can sound like something I came up with, but it actually has a formal mathematical definition, proposed by Fan and Li (2001). An estimator is oracle if it can correctly select the nonzero coefficients in a model with probability converging to one, and if the nonzero coefficients are asymptotically normally distributed.

This means that, given a set of p coefficients $\{\beta_1, \dots, \beta_p\}$, if we consider two subsets, the truly nonzero coefficients $\mathcal{A} = \{j : \beta_j \neq 0\}$ and the coefficients selected by the estimator $\hat{\mathcal{A}}_n = \{j : \hat{\beta}_j \neq 0\}$, then an oracle estimator selects the truly significant variables with probability tending to one: asymptotically, both subsets coincide, $P(\hat{\mathcal{A}}_n = \mathcal{A}) \to 1$.

The adaptive lasso

So… how can we obtain our oracle estimator? We can use, for example, an adaptive lasso estimator. This estimator was initially proposed by Zou (2006), and the idea behind it is pretty straightforward: add weights w that correct the bias in lasso.
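In Zou’s formulation this simply replaces the lasso penalty with a weighted version, where each coefficient gets its own weight wj:

$$\hat{\beta}^{\text{alasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert$$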

If a variable is important, it should receive a small weight: this way it is lightly penalized and remains in the model. If it is not important, a large weight ensures that we get rid of it by sending its coefficient to 0.

But this leads to the last question we will cover today:

How to compute these w weights

There are many alternatives for computing these weights, but today I will go with one of the simplest:

  1. Solve a simple lasso model.

  2. Compute the weights as the inverse of the absolute value of the estimated lasso coefficients, $w_j = 1 / \lvert \hat{\beta}_j^{\text{lasso}} \rvert^{\gamma}$, where γ > 0 (γ = 1 is a common choice).

  3. Plug the weights into a second lasso model and solve the adaptive lasso.

And that’s it. Now your estimator is an oracle, and you will obtain much better results (both in terms of prediction error and in terms of variable selection) than the ones you would obtain by using a simple lasso.
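To make these three steps concrete, here is a minimal sketch of the idea using plain scikit-learn (an illustrative implementation with arbitrary, untuned lambda and gamma values, not the asgl code we will use below):

import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(x, y, lambda_init=0.1, lambda_final=0.1, gamma=1.0):
    # Step 1: solve a simple lasso model
    initial_fit = Lasso(alpha=lambda_init).fit(x, y)
    # Step 2: compute the weights w_j = 1 / |beta_j|^gamma
    # (a small constant avoids dividing by zero for coefficients already sent to 0)
    weights = 1.0 / (np.abs(initial_fit.coef_) ** gamma + 1e-8)
    # Step 3: plug in the weights by rescaling each column of x,
    # solve a second lasso, and undo the rescaling on the estimated coefficients
    final_fit = Lasso(alpha=lambda_final).fit(x / weights, y)
    return final_fit.coef_ / weights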

But don’t just take my word for it: let’s test this in Python using the asgl package.

Moving into Python code

We start by installing the asgl package, which is available both on PyPI (via pip) and as a GitHub repository.

pip install asgl

Import libraries and generate data

First, let’s import the libraries that we will use. We will test the benefit of using an adaptive lasso estimator on a synthetic dataset generated using the make_regression() function from sklearn. Our dataset will have 100 observations and 200 variables, but among the 200 variables only 10 will be related to the response, and the remaining 190 will be noise.
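A minimal way to generate such a dataset (the noise level and random seed below are arbitrary choices):

import numpy as np
from sklearn.datasets import make_regression

# 100 observations, 200 variables, only 10 of them truly related to the response.
# coef=True also returns the true beta coefficients used to generate y.
x, y, true_beta = make_regression(n_samples=100, n_features=200, n_informative=10,
                                  noise=1.0, coef=True, random_state=42)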

Here, x is the matrix of regressors, of shape (100, 200), y is the response variable (a vector of length 100) and true_beta is a vector containing the true values of the beta coefficients. This way we will be able to compare the true betas with the ones estimated by lasso and adaptive lasso.

Train the model

We will compare a simple lasso model against an adaptive lasso model and see if the adaptive lasso actually reduces the prediction error and provides a better selection of meaningful variables.

For this, we consider a train/validate/test split of the dataset. We train the models for the different parameter values using the training set. Then, we select the best model using the validation set, and finally, we compute the model error using the test set (which was not involved in either model training or model selection). This can be done directly in the asgl package using the TVT class and its train_validate_test() function.

Lasso model

We will solve a linear model (model='lm') with a lasso penalization (penalization='lasso'), and define the values for lambda1, which is the parameter λ associated with the lasso penalization. We will find the optimal model in terms of the minimum mean squared error (MSE), and will use 50 observations for training the model, 25 for validation and the remaining 25 for testing. All of this is automatically performed by the train_validate_test() function.
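A sketch of how this call looks (based on the asgl version available when this post was written; argument names and the structure of the result dictionary may differ in newer releases):

import asgl

# Grid of candidate values for the lasso penalization parameter
lambda1 = 10.0 ** np.arange(-3, 1.51, 0.2)

# 50 observations for training, 25 for validation, the remaining 25 for testing
tvt_lasso = asgl.TVT(model='lm', penalization='lasso', lambda1=lambda1,
                     error_type='MSE', random_state=99,
                     train_size=50, validate_size=25)
lasso_result = tvt_lasso.train_validate_test(x=x, y=y)

# Test-set error of the best model and its estimated coefficients
lasso_prediction_error = lasso_result['test_error']
lasso_betas = lasso_result['optimal_betas'][1:]  # first element is the intercept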

The prediction error from the best lasso model (in terms of MSE) is stored in lasso_prediction_error, and the coefficients associated with that model are stored in lasso_betas.

Adaptive lasso model

Now we solve the adaptive lasso model. For this, we specify penalization='alasso' (which stands for adaptive lasso), and we select the technique used for computing the weights with weight_technique='lasso'. As described above, this way we will solve an initial lasso model, compute the weights, and then plug these weights into a second lasso model, which will be our final model.
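The call mirrors the lasso one (again, a sketch; argument and result names follow the asgl version used here and may differ in newer releases):

# Adaptive lasso: an initial lasso fit is used internally to compute the weights
tvt_alasso = asgl.TVT(model='lm', penalization='alasso', weight_technique='lasso',
                      lambda1=lambda1, error_type='MSE', random_state=99,
                      train_size=50, validate_size=25)
alasso_result = tvt_alasso.train_validate_test(x=x, y=y)

alasso_prediction_error = alasso_result['test_error']
alasso_betas = alasso_result['optimal_betas'][1:]  # first element is the intercept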

Final results

Finally, let’s compare the results. We will compare two metrics:

  • Prediction error: the MSE achieved by each model. The smaller, the better.
  • Correct selection rate: the percentage of variables that were correctly selected (non-meaningful variables identified as non-meaningful, plus meaningful variables identified as meaningful). This metric represents the quality of the variable selection performed by the model. The larger, the better, with 1 being the maximum and 0 the minimum.

In the following code snippet, the bool_* variables are used for the computation of the correct selection rate.
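A sketch of that computation, using the variables defined in the previous snippets:

# A variable is "selected" when its estimated coefficient is different from 0
bool_true_beta = true_beta != 0
bool_lasso_betas = lasso_betas != 0
bool_alasso_betas = alasso_betas != 0

# Fraction of variables whose selected / not-selected status matches the truth
lasso_correct_selection_rate = np.mean(bool_true_beta == bool_lasso_betas)
alasso_correct_selection_rate = np.mean(bool_true_beta == bool_alasso_betas)

print(f'Lasso prediction error: {lasso_prediction_error:.2f}')
print(f'Adaptive lasso prediction error: {alasso_prediction_error:.2f}')
print(f'Lasso correct selection rate: {lasso_correct_selection_rate:.2f}')
print(f'Adaptive lasso correct selection rate: {alasso_correct_selection_rate:.2f}')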

The results obtained by the adaptive lasso are much better than those from the simple lasso. We see that the adaptive lasso error is more than 8 times smaller than the lasso error (1.4 from the adaptive lasso compared to 11.8 from the lasso). And in terms of variable selection, while the lasso correctly selected only 13% of the 200 variables, the adaptive lasso correctly selected 100% of them. This means that the adaptive lasso was able to identify all the meaningful variables as meaningful, and all the noisy variables as noisy.

And that’s it for this post about the adaptive lasso. Remember, try to use oracle estimators, as they know the truth about your dataset. I hope you enjoyed this post and found it useful. Please contact me if you have any questions or suggestions.

For a deeper review of what the asgl package has to offer, I recommend reading the Jupyter notebook provided in the GitHub repository, and for a review of oracle estimators, I suggest a recent paper published as part of my Ph.D.: Adaptive sparse group LASSO in quantile regression.

Have a good day!

References

Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429

Mendez-Civieta A, Aguilera-Morillo MC, Lillo RE (2020) Adaptive sparse group LASSO in quantile regression. Advances in Data Analysis and Classification.

