
Penalised Regression With The New ASGL Python module

A new Python module to keep by your side

Photo by Ilya Pavlov on Unsplash

There are several Python modules for regression, and each has its own specifics and limitations. Which module to use depends strongly on the type of regression the user wants to perform and on its goals. If the regression is simple and the variables are continuous, the NumPy library often has specific methods to tackle it. On the other hand, if one is interested in more complex regression problems with both quantitative and qualitative variables, the Scikit-learn module offers numerous options depending on the specific circumstances and problem.

Recently, a new Python module has been made available to the public whose aim is to address some limitations of existing Python modules regarding penalised regression and to improve computational performance. The name of the module is asgl (adaptive sparse group lasso). In this article, I discuss what this module is about, what it is supposed to do, and whether you should use it in your regression problems.

The current state of affairs

Many software packages give the user the possibility to perform regression at various scales depending on the problem at hand. One can do regression in Python, Mathematica, R, and Matlab, just to mention a few.

Regression can be divided into several categories, and one of the most important is penalised regression. This type of regression is particularly important for high-dimensional data, where the number of predictors (p) is much larger than the number of observations (n).

The process of penalisation consists of adding a regulariser to the least-squares cost function C(x; D). LASSO is one of the most important regularisation methods; it adds an L1 penalty to the least-squares cost function. The goal of LASSO is to provide a sparse estimate of the coefficients of the linear model. Other important varieties of LASSO include the group LASSO and the sparse group LASSO.
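As a quick illustration of the sparsity that an L1 penalty induces, here is a minimal scikit-learn sketch on synthetic data of my own (a plain LASSO fit, not the asgl package itself — the data and penalty value are assumptions for the example):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 observations, 10 predictors, only 2 truly relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# The L1 penalty (alpha) shrinks irrelevant coefficients exactly to zero,
# which is what "sparse estimation" means in practice
model = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(model.coef_))  # far fewer than 10 nonzero coefficients
```

The group LASSO and sparse group LASSO extend this idea by zeroing out whole groups of coefficients at once.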

The sparse group LASSO is a generalisation of LASSO and group LASSO, and its goal is to produce regression solutions that are sparse both between and within groups.

The common issue with all the LASSO-based methods mentioned above is that they apply a constant penalisation rate 𝛌 to all coefficients, which can seriously affect the quality of variable selection and prediction accuracy. To solve this issue, some researchers have proposed the so-called adaptive sparse group lasso (asgl) method for the cost function C(x; D). The reader can find the mathematical description of this method in this article. The goal of this method is to provide very good error estimates in both high- and low-dimensional datasets.

As I mentioned above, it is possible to do different levels (LASSO, etc.) of regression in Python with Scikit-learn and also Statsmodels. However, since the asgl estimator is a relatively new concept, it is not included in the Python modules mentioned above. The asgl Python module extends the standard LASSO and group LASSO to the adaptive case for both linear and quantile regression, as I describe below.

asgl for linear and quantile regressions

I assume that the reader knows what linear regression is and how it is formally performed. However, the reader might be less familiar with quantile regression. This type of regression was formulated in 1978 by Koenker and Bassett, and it is suited to situations where heteroscedasticity and outliers are present. The goal of this method is to estimate a conditional quantile of the dependent variable as a function of the covariates.
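A quick NumPy sketch conveys the intuition (a toy example of my own, not from the asgl paper): a single outlier drags the mean, which least squares targets, while the median, the 𝜏 = 0.5 quantile, barely moves:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(data, 100.0)

# The mean (target of least squares) jumps; the median (tau = 0.5 quantile,
# target of quantile regression) stays put
print(np.mean(data), np.mean(with_outlier))      # 3.0 vs ~19.17
print(np.median(data), np.median(with_outlier))  # 3.0 vs 3.5
```

Quantile regression generalises this robustness to conditional quantiles, and other values of 𝜏 target other parts of the response distribution.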

The asgl estimator provides a framework in which the adaptive method can be used for linear and quantile regression. As I mentioned above, the adaptive method is not available in Statsmodels or Scikit-learn for linear and quantile regression. For example, at the time of writing, the Scikit-learn module does not give the user the possibility to perform quantile regression at all, let alone the adaptive case.

In the asgl method, the solution for the parameter vector 𝜷 found by the penalised adaptive method is given by:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ R(\boldsymbol{\beta}) + \lambda \alpha \sum_{j=1}^{p} \tilde{v}_j \lvert \beta_j \rvert + \lambda (1-\alpha) \sum_{l=1}^{K} \sqrt{p_l}\, \tilde{w}_l \lVert \boldsymbol{\beta}^{l} \rVert_2 \right\}$$

where R(𝜷) is the risk function for linear or quantile regression. In the case of linear regression, this function is given by:

$$R(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta} \right)^2$$

On the other hand, in the case of quantile regression, it is given by:

$$R(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} \rho_\tau \left( y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta} \right)$$

where the function 𝜌_𝜏 is the so-called check loss function. In the parameter vector equation above, 𝛌 is the penalisation rate that controls the penalisation weight, K is the number of groups, 𝜷^l is the vector of components of 𝜷 from the l-th group of the group LASSO, p_l is the size of the l-th group, and 𝛼 is a parameter that controls the balance between LASSO and group LASSO. The tilde vectors ṽ and w̃ are the weight vectors defined in the asgl model. More details can be found in the original research article.
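For concreteness, the check function 𝜌_𝜏(u) = u(𝜏 − 1[u < 0]) can be written in a few lines of NumPy (a sketch of the standard Koenker–Bassett definition, not code taken from the asgl package):

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function: tau*u for u >= 0, (tau - 1)*u for u < 0."""
    return u * (tau - (u < 0).astype(float))

residuals = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
# At tau = 0.5 this is half the absolute error, so minimising it yields the median
print(check_loss(residuals, tau=0.5))  # [1.   0.25 0.   0.5  1.5 ]
```

For 𝜏 ≠ 0.5 the loss is asymmetric, which is what steers the fit towards the 𝜏-th conditional quantile rather than the median.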

The key idea of the asgl method, incorporated in ṽ and w̃, is that important variables should receive small weights, and thus be lightly penalised, while less important variables should receive large weights and be heavily penalised. This gives the user more flexibility and improves both accuracy and variable selection.

How to implement asgl in Python?

Using the asgl module is quite straightforward, as with other Python packages. It can be installed with the following command:

pip install asgl

The asgl module depends on other Python modules: NumPy (version 1.15 or later), Scikit-learn (version 0.23.1 or later), and cvxpy (version 1.1.0 or later). The module also requires Python version 3.5 or later.

Another possibility would be to use GitHub and pull the following repository:

git clone https://github.com/alvaromc317/asgl.git

Then, after pulling the repository, run the following commands to execute the setup.py file:

cd asgl
python setup.py install

What can you do with asgl?

The asgl module is built around four main classes: the ASGL class, the WEIGHT class, the CV class, and the TVT class. With these classes, one can apply the adaptive method described above to real linear and quantile regression problems.

The ASGL class is the most important one; it can be used to perform LASSO, group LASSO, sparse group LASSO, and adaptive sparse group LASSO regression. The default parameters of the ASGL class are (for more details, see this arXiv article by the asgl module authors):

model = asgl.ASGL(model, penalization, intercept=True, tol=1e-5, lambda1=1, alpha=0.5, tau=0.5, lasso_weights=None, gl_weights=None, parallel=False, num_cores=None, solver=None, max_iters=500)

The ASGL class has three main methods: fit, predict, and retrieve_parameters_value. The fit method signature is:

fit(x, y, group_index)

where x is a 2D NumPy array of predictors, y is the 1D response variable vector, and group_index is a 1D NumPy array with length equal to the number of variables in the problem.
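To make the expected shapes concrete, here is how one might build x, y, and group_index with NumPy (synthetic arrays of my own; the fit call itself assumes the asgl package is installed, so it is shown as a comment, and the constructor arguments are taken from the defaults listed above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars = 50, 6
x = rng.normal(size=(n_obs, n_vars))  # 2D predictor matrix (n_obs rows, n_vars columns)
y = rng.normal(size=n_obs)            # 1D response vector

# One group label per variable: variables 0-1 form group 1, 2-3 form
# group 2, 4-5 form group 3 (the group penalties need this mapping)
group_index = np.array([1, 1, 2, 2, 3, 3])
assert group_index.shape[0] == x.shape[1]

# With the asgl package installed, the fit would look like:
# model = asgl.ASGL(model='lm', penalization='sgl', lambda1=0.1, alpha=0.5)
# model.fit(x, y, group_index)
```

The group_index array is what distinguishes the group-based penalties from plain LASSO, which ignores any grouping of the variables.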

The predict method signature is:

predict(x_new)

where x_new is a 2D NumPy array with the number of columns equal to the number of columns in the original matrix x. To make predictions, the user runs the following command:

predictions = model.predict(x_new)

So, one can see that the way these methods are used is fairly standard, similar to other Python modules such as Statsmodels or Scikit-learn.

The other important method is retrieve_parameters_value, whose purpose is to return the model parameters found by solving the penalised regression with the asgl method. To run it, one calls:

retrieve_parameters_value(param_index)

where param_index is an integer no larger than the length of the model.coef list. To display the solutions for the parameters, one needs to run:

model.retrieve_parameters_value(param_index=N)

where N is an integer.

As discussed above, the asgl package has three other main classes, each with its own importance, which can be used according to the user's needs. I do not discuss these classes here; the reader can consult the asgl package repository for more information.

Conclusion

Above I briefly discussed the new asgl Python package and its main purpose and characteristics. This package gives the user the possibility to use adaptive methods for linear and quantile regressions.

Should you use the asgl module? The short answer is: yes, you should give it a try. However, I recommend that you first understand the theory behind the asgl method and see whether it applies to your data science and machine learning problems. As stated by the authors, this new module gives the user the possibility to perform penalised quantile regression in Python for the first time and to use the adaptive methods provided by the asgl package to improve variable selection and prediction.


If you liked my article, please share it with friends who might be interested in this topic and cite/refer to it in your research studies. Do not forget to subscribe for other related topics that I will post in the future.

