
Top 4 Linear Regression Variations in Machine Learning

A Beginner-Friendly Guide to Implementation and Comparison

Machine Learning Algorithms for Regression (original image from my website)

In my previous post "Top Machine Learning Algorithms for Classification", we walked through common classification algorithms. Now let’s dive into the other category of supervised learning – regression, where the output variable is continuous and numeric. Specifically, we’ll look at how to implement and compare four common regression models: linear regression, lasso regression, ridge regression, and polynomial regression.

If you prefer a video walkthrough, please check out my YouTube video at the end of this article.


Linear Regression

Linear regression finds the optimal linear relationship between the independent variables and the dependent variable, and makes predictions accordingly. The simplest form is y = b0 + b1x. When there is only one input feature, the linear regression model fits a line in two-dimensional space that minimizes the residuals between predicted values and actual values. The most common cost function for measuring the magnitude of the residuals is the residual sum of squares (RSS).

linear regression (image by author)

As more features are introduced, simple linear regression evolves into multiple linear regression: y = b0 + b1x1 + b2x2 + … + bnxn. Feel free to check out my article below if you want a detailed guide to the simple linear regression model.

A Practical Guide to Linear Regression
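To make this concrete, here is a minimal sketch (using scikit-learn and a small synthetic dataset, both assumptions on my part) of fitting a simple linear regression and computing RSS by hand:

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: y roughly follows 2 + 3x plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 1, size=100)

lr = LinearRegression().fit(X, y)
print("intercept (b0):", lr.intercept_)
print("slope (b1):", lr.coef_[0])

# residual sum of squares – the quantity linear regression minimizes
rss = np.sum((y - lr.predict(X)) ** 2)
print("RSS:", rss)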

Lasso Regression

lasso regression (image by author)

Lasso regression is a variation of linear regression with L1 regularization. Sounds daunting? Simply put, it adds an extra term to the residuals (RSS) that the regression model is trying to minimize. It is called L1 regularization because the added regularization term is proportional to the absolute value of the coefficients – degree 1. The term above is based on the simplest linear regression form y = b0 + b1x.

Compared to ridge regression, lasso is better at shrinking the coefficients of some features all the way to 0, which makes it a suitable technique for feature elimination. You’ll see this in the later section on "Feature Importance".
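As a quick illustration (a sketch on synthetic data, not part of the original project), fitting lasso and ridge with the same regularization strength shows lasso driving weak coefficients to exactly zero, while ridge only shrinks them:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# synthetic data where only a few features are truly informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 2))  # uninformative features typically become exactly 0
print("ridge coefficients:", np.round(ridge.coef_, 2))  # small but non-zero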

Ridge Regression

ridge regression (image by author)

Ridge regression is another variation of linear regression, this time with L2 regularization. As you might infer, the regularization term here is based on the squared value of the coefficients – degree 2. Compared to lasso regression, ridge regression has the advantage of faster convergence and lower computation cost.

regularization strength (image by author)

The regularization strength of lasso and ridge is determined by the λ value. Larger λ values shrink the coefficients, which flattens the model and reduces its variance. This is why regularization techniques are commonly used to prevent model overfitting.
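One quick way to see this (again a hedged sketch on synthetic data; in scikit-learn λ is exposed as the alpha parameter) is to refit ridge with increasing alpha and watch the overall coefficient magnitude shrink:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

for alpha in [0.01, 1, 100, 10000]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # larger alpha => smaller total coefficient magnitude
    print(f"alpha={alpha:<7} sum of |coefficients| = {np.abs(ridge.coef_).sum():.1f}")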

Polynomial Regression

polynomial regression (image by author)

Polynomial regression is a variation of linear regression applied to polynomially transformed features. It adds interaction terms between independent variables. PolynomialFeatures(degree = 2) is applied to transform the input features up to a maximum degree of 2. For example, if the original input features are x1, x2, x3, this expands the feature set to x1, x2, x3, x1², x1x2, x1x3, x2², x2x3, x3². As a result, the fitted relationship is no longer a straight line, and the model can provide a non-linear fit to the data.
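Here is a minimal sketch (assuming scikit-learn’s PolynomialFeatures, as named above) showing how three input features expand into the nine terms listed:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0]])  # a single row with features x1, x2, x3

pf = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pf.fit_transform(X)

print(pf.get_feature_names_out(["x1", "x2", "x3"]))
# ['x1' 'x2' 'x3' 'x1^2' 'x1 x2' 'x1 x3' 'x2^2' 'x2 x3' 'x3^2']
print(X_poly)  # [[1. 2. 3. 1. 2. 3. 4. 6. 9.]]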


Regression Models in Practice

After all of the theory, it’s time to implement and compare these regression models, and explore how different lambda values affect model performance.

Please check out the code snippet if you are interested in the full code of this project.

1. Objectives and Dataset Overview

This project aims to use regression models to predict countries’ happiness scores based on the other factors: "GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity" and "Perceptions of corruption".

I used the "World Happiness Report" dataset on Kaggle, which includes 156 entries and 9 features. df.describe() is applied to provide an overview of the dataset.

dataset overview (image by author)

2. Data Exploration and Feature Engineering

1) drop redundant features

The feature "Overall rank" is dropped as it is a direct reflection of the target "Score". Additionally, "Country or Region" is dropped because it doesn’t bring any value to the prediction.

2) univariate analysis

Apply a histogram to understand the distribution of each feature. As shown below, "Social support" appears to be heavily left skewed whereas "Generosity" and "Perceptions of corruption" are right skewed – which informs the feature engineering techniques used for transformation.

# univariate analysis: histogram of each feature
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 8))
for i, column in enumerate(df.columns):
    sub = fig.add_subplot(2, 4, i + 1)
    sub.set_xlabel(column)
    df[column].plot(kind='hist', ax=sub)
plt.tight_layout()
plt.show()
univariate analysis (image by author)

We can also combine the histogram with the skewness measure below to quantify whether a feature is heavily left or right skewed.

# flag features whose skewness exceeds the threshold in either direction
skew_limit = 0.7
for col in df.columns:
    skewness = df[col].skew()
    if skewness < -skew_limit:
        print(col, ": left skewed", str(skewness))
    elif skewness > skew_limit:
        print(col, ": right skewed", str(skewness))
    else:
        print(col, ": not skewed", str(skewness))
skewness measure

3) square root transformation

np.sqrt is applied to transform the right skewed features – "Generosity" and "Perceptions of corruption". As a result, both features become more normally distributed.

square root transform (image by author)

4) log transformation

np.log(2 - df['Social support']) is applied to transform the left skewed feature, and the skewness significantly reduces from 1.13 to 0.39.

log transformation (image by author)
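Putting both transformations together, a sketch of this step might look like the following (column names follow the Kaggle dataset; in-place overwriting of the columns is an assumption on my part):

import numpy as np

# square root transform for the right skewed features
df['Generosity'] = np.sqrt(df['Generosity'])
df['Perceptions of corruption'] = np.sqrt(df['Perceptions of corruption'])

# reflect-and-log transform for the left skewed feature
df['Social support'] = np.log(2 - df['Social support'])

# re-check skewness after the transformations
print(df[['Generosity', 'Perceptions of corruption', 'Social support']].skew())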

5) bivariate analysis

sns.pairplot(df) can be used to visualize the correlation between features after the transformation. The scatter plots suggest that "GDP per capita", "Social support" and "Healthy life expectancy" are correlated with the target feature "Score", hence may have higher coefficient values. Let’s find out if that’s the case in the later section.

6) feature scaling

Since regularization techniques manipulate the coefficient values, model performance is sensitive to the scale of the features. Therefore, features should be transformed to the same scale. I experimented with three scalers – StandardScaler, MinMaxScaler and RobustScaler.

Check out my article "3 Common Techniques for Data Transformation" for a more comprehensive guide to data transformation techniques.

3 Common Techniques for Data Transformation

Please note that the scaler is fit using the training set only, and the transform is then applied to both the training and testing sets. So the dataset should be split first.

from sklearn.model_selection import train_test_split
X = df.drop(['Score'], axis=1)
y = df['Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)

Then, iterate through these three scalers to compare their outcomes.

feature scaling code (image by author)
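Since the original snippet is shown as an image, here is a minimal sketch of how such a loop might look (the dictionary structure is an assumption on my part; the key point is fitting on the training set only):

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
}

scaled_train, scaled_test = {}, {}
for name, scaler in scalers.items():
    # fit on the training set only, then transform both sets
    scaled_train[name] = scaler.fit_transform(X_train)
    scaled_test[name] = scaler.transform(X_test)
    print(f"{name}: train min={scaled_train[name].min():.2f}, max={scaled_train[name].max():.2f}")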

As you can see, scalers don’t affect the distribution and shape of the data, but they do change its range.

feature scaling comparison (image by author)

For more comprehensive guides on EDA and feature engineering, check out my curated list.

EDA and Feature Engineering

3. Regression Model Comparisons

Now let’s compare the three linear regression models below – linear regression, ridge regression and lasso regression.

from sklearn.linear_model import LinearRegression, Ridge, Lasso

lr = LinearRegression().fit(X_train, y_train)
l2 = Ridge(alpha = 0.1).fit(X_train, y_train)
l1 = Lasso(alpha = 0.001).fit(X_train, y_train)

1) Prediction Comparison

First, visualize the predicted values vs. the actual values of the three models in one scatter plot, which suggests that their predictions mostly overlap with each other under the current parameter settings.
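The chart was generated from code not shown in the text, so here is a minimal sketch of how such a comparison plot might be produced:

import matplotlib.pyplot as plt

models = {'Linear': lr, 'Ridge': l2, 'Lasso': l1}
plt.figure(figsize=(8, 6))
for name, model in models.items():
    plt.scatter(y_test, model.predict(X_test), alpha=0.5, label=name)

# reference line: perfect predictions fall on y = x
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
plt.xlabel('Actual score')
plt.ylabel('Predicted score')
plt.legend()
plt.show()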

regression prediction comparison (image by author)

2) Feature Importance

The second step is to experiment with how different lambda values (alpha in scikit-learn) affect the models. Specifically, how feature importance and coefficient values change as the alpha value increases from 0.0001 to 1.
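The original code is shown as an image; a sketch of such an experiment might look like this (the exact alpha grid and plotting layout are assumptions on my part):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge

alphas = [0.0001, 0.001, 0.01, 0.1, 1]
lasso_coefs, ridge_coefs = [], []
for alpha in alphas:
    lasso_coefs.append(Lasso(alpha=alpha).fit(X_train, y_train).coef_)
    ridge_coefs.append(Ridge(alpha=alpha).fit(X_train, y_train).coef_)

# rows = alpha values, columns = features
lasso_df = pd.DataFrame(lasso_coefs, index=alphas, columns=X_train.columns)
ridge_df = pd.DataFrame(ridge_coefs, index=alphas, columns=X_train.columns)

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
lasso_df.plot(kind='bar', ax=axes[0], title='Lasso coefficients by alpha')
ridge_df.plot(kind='bar', ax=axes[1], title='Ridge coefficients by alpha')
plt.show()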

feature importance code (image by author)
feature importance Lasso vs. Ridge (image by author)

Based on the coefficient values generated by both the lasso and ridge models, "GDP per capita", "Social support" and "Healthy life expectancy" appear to be the top 3 most important features. This aligns with the findings from the earlier scatter plots, suggesting that they are the main drivers of a country’s happiness score. The side-by-side comparison also indicates that increasing the alpha value impacts lasso and ridge to different degrees: features in lasso are more strongly suppressed. That’s why lasso is often chosen for the purpose of feature selection.

3) Apply Polynomial Effect

Additionally, polynomial features are introduced to enhance the baseline linear regression – which increases the number of features from 6 to 27.

from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree = 2, include_bias = False)
X_train_poly = pf.fit_transform(X_train)
X_test_poly = pf.transform(X_test)  # reuse the transformer fit on the training set

Have a look at their distributions after the polynomial transformation.

polynomial feature univariate analysis (image by author)

4. Model Evaluation

As the last step, evaluate the performance of lasso regression vs. ridge regression, before and after the polynomial effect. In the code below, I implemented four models:

  • l2: Ridge regression without polynomial features
  • l2_poly: Ridge regression with polynomial features
  • l1: Lasso regression without polynomial features
  • l1_poly: Lasso regression with polynomial features
regression model evaluation (image by author)

Common regression model evaluation metrics are MAE, MSE, RMSE and R-squared – check out my article "A Practical Guide to Linear Regression" for detailed explanations. Here I used MSE (mean squared error) to evaluate model performance.
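The evaluation code itself appears as an image above; a minimal sketch of fitting the four models and computing MSE (using a single alpha of 0.1 as an assumption, whereas the article sweeps alpha from 0.0001 to 1) might look like this:

from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

alpha = 0.1  # assumed value for illustration
models = {
    'l2 (Ridge)': Ridge(alpha=alpha).fit(X_train, y_train),
    'l2_poly (Ridge + poly)': Ridge(alpha=alpha).fit(X_train_poly, y_train),
    'l1 (Lasso)': Lasso(alpha=alpha).fit(X_train, y_train),
    'l1_poly (Lasso + poly)': Lasso(alpha=alpha).fit(X_train_poly, y_train),
}

for name, model in models.items():
    # evaluate each model on the matching version of the test set
    X_eval = X_test_poly if 'poly' in name else X_test
    mse = mean_squared_error(y_test, model.predict(X_eval))
    print(f"{name}: MSE = {mse:.3f}")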

1) Comparing ridge vs. lasso in one chart indicates that they have similar accuracy when alpha values are low, but lasso deteriorates significantly as alpha approaches 1. The same pattern is observed both before and after the polynomial transformation.

Ridge vs. Lasso in one chart (image by author)

2) Comparing before vs. after the polynomial effect in one chart, we can tell that the polynomial transformation decreases MSE in general – hence enhancing model performance. This effect is more significant in ridge regression as alpha increases toward 1, and more significant in lasso regression when alpha is closer to 0.0001.

Before vs. After polynomial effect in one chart (image by author)

However, even though the polynomial transformation improves the performance of the regression models, it makes the model harder to interpret – it is difficult to tell the main drivers of a polynomial regression model.

Lower error does not always guarantee a better model; it is about finding the right balance between predictability and interpretability based on the project objectives.


Thanks for reaching the end. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.


Take Home Message

Hope this article provides a general idea of the different types of regression models, including:

  • linear regression
  • ridge regression
  • lasso regression
  • polynomial regression

We also walked through:

  • Essential EDA and feature engineering techniques for regression models
  • Feature importance at different alpha values
  • Model comparison between Lasso and Ridge
  • Model comparison before and after polynomial features

More Articles Like This

Top 6 Machine Learning Algorithms for Classification

Practical Guides to Machine Learning

Get Started in Data Science


Originally published at https://www.visual-design.net on March 20th, 2022.

