3 Techniques for Building a Machine Learning Regression Model from a Multivariate Nonlinear Dataset

Everything about Data Transformation, Polynomial Regression, and Nonlinear Regression

Shambhu Gupta
Towards Data Science



A simple linear regression (SLR) model is straightforward to construct when the relationship between the target variable and the predictor variables is linear. When the relationship between the dependent variable and the independent variables is nonlinear, things become more complicated. In this article, I’ll show you three different approaches to building a regression model on the same nonlinear dataset:

1. Polynomial regression
2. Data transformation
3. Nonlinear regression

The Dataset:

The dataset that I have considered has been taken from Kaggle: https://www.kaggle.com/datasets/yasserh/student-marks-dataset

The data consists of students’ marks along with their study time and number of courses.

DataFrame details (Author Image)

If you examine the relationship of the target variable “Marks” with respect to study time and number of courses, you will find that the relationship is nonlinear.

Nonlinear relationship between Dependent and Independent variables (Author Image)

Challenge with a Simple Linear Model on this Dataset

I tried to build a linear regression model using the sklearn LinearRegression() class. I also defined a function to calculate various metrics for the model.
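Below is a minimal sketch of that setup; the file name Student_Marks.csv and the column names time_study, number_courses, and Marks are assumptions based on the Kaggle dataset page.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def print_metrics(y_true, y_pred):
    # R2: fraction of target variance explained by the model
    print(f"R2-Square Value: {r2_score(y_true, y_pred):.4f}")
    # RSS: residual sum of squares
    print(f"RSS: {np.sum((y_true - y_pred) ** 2):.3f}")
    mse = mean_squared_error(y_true, y_pred)
    print(f"MSE: {mse:.3f}")
    print(f"RMSE: {np.sqrt(mse):.3f}")

# Load the dataset and hold out 30% of the rows for evaluation
df = pd.read_csv("Student_Marks.csv")
X = df[["time_study", "number_courses"]]
y = df["Marks"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print_metrics(y_test, y_pred)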

When I called this function for my model, I got the output below:

R2-Square Value: 0.94
RSS: 1211.696
MSE: 12.117
RMSE: 3.481

A 94% r2-score is not bad, but we will shortly see that it can be improved with a nonlinear approach. The bigger problem lies with the assumptions behind the linear regression model.

SLR Assumption 1: Homoscedasticity

Homoscedasticity means that the residuals have equal or almost equal variance across the regression line. By plotting the error terms against the predicted values, we should confirm that there is no pattern in them. However, in this case, we can clearly see that the error terms follow a distinct shape.
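A minimal sketch of this check with matplotlib, reusing y_test and y_pred from the model above:

import matplotlib.pyplot as plt

residuals = y_test - y_pred  # error terms of the fitted SLR model
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted Marks")
plt.ylabel("Residual")
plt.show()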

Homoscedasticity (Author Image)

SLR Assumption 2: Error terms are normally distributed

Ideally, the error terms should show a bell-shaped, normal or nearly normal distribution. However, from the graph below, it is clear that we have a bimodal distribution. This assumption of linear regression is therefore also violated.
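A sketch of this check with seaborn, reusing the residuals computed above:

import seaborn as sns

# Ideally a single, roughly symmetric bell centred at zero
sns.histplot(residuals, kde=True)
plt.xlabel("Residual")
plt.show()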

Distribution of Error Terms (Author Image)

SLR Assumption 3: Error terms are independent of each other

There should be no auto-correlation between the error terms. However, the figure below reveals that the error terms appear to exhibit some degree of auto-correlation.
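One way to sketch this check is with pandas’ built-in autocorrelation plot, again reusing the residuals from above:

from pandas.plotting import autocorrelation_plot

# Correlation of the residual series with lagged copies of itself;
# values close to zero at all lags indicate independent error terms
autocorrelation_plot(pd.Series(residuals).reset_index(drop=True))
plt.show()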

Auto-Correlations among error terms (Author Image)

So far, we have verified that the data is nonlinear, yet we still built an SLR model. Although we achieved a respectable r2-score of 94%, none of the SLR assumptions are met. SLR is therefore not a wise choice for this type of data. We will now investigate other techniques for improving the model on the same dataset.

1. Modeling non-linear relationship using Polynomial Regression Model

Non-linear regression models the relationship between independent variables x and a dependent variable y with a non-linear function. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree k (the maximum power of x), for example:

y = a x³ + b x² + c x + d

Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example: y = log(x)

Or even more complicated ones, such as:
y = log(a x³ + b x² + c x + d)

But what happens if we have more than one independent variable?

For 2 predictors, the equation of the polynomial regression becomes:

Y = 𝜃0 + 𝜃1x1 + 𝜃2x2 + 𝜃3x1² + 𝜃4x2² + 𝜃5x1x2

where,
- Y is the target,
- x1, x2 are the predictors or independent variables
- 𝜃0 is the bias,
- and, 𝜃1, 𝜃2, 𝜃3, 𝜃4, and 𝜃5 are the weights in the regression equation

For n predictors, the equation covers all feasible combinations of polynomial terms of various orders. This is known as multi-dimensional polynomial regression and is notoriously difficult to implement by hand. We will construct polynomial models of varying degrees and evaluate their performance, reusing the train/test split prepared earlier.

We can establish a pipeline and pass it the degree and the class of model we wish to utilize to produce polynomials of various degrees. This is what the code below does for us:
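A minimal sketch, reusing the train/test split from earlier (the helper name polynomial_model is mine):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def polynomial_model(degree):
    # Expand the inputs into all polynomial terms up to `degree`,
    # then fit an ordinary linear regression on the expanded features
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("regressor", LinearRegression()),
    ])

# Fit one model per degree from 1 to 7
models = {degree: polynomial_model(degree).fit(X_train, y_train)
          for degree in range(1, 8)}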

If you wish to view all of the coefficients and intercepts, use the following code block. Keep in mind that the number of coefficients will vary with the degree of the polynomial:
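A sketch, assuming the models dictionary from the previous step:

for degree, model in models.items():
    reg = model.named_steps["regressor"]
    print(f"Degree {degree}: intercept = {reg.intercept_:.3f}")
    print(f"  coefficients = {reg.coef_}")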

And here is the output:

Coefficients/Intercepts for different degrees polynomial regression (Author Image)

This doesn’t tell us much about the performance of each model, so we will check the r2-score of each:
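A sketch of that check on the held-out test set:

from sklearn.metrics import r2_score

for degree, model in models.items():
    print(f"Degree {degree}: r2 = {r2_score(y_test, model.predict(X_test)):.4f}")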

r2-score for various degree of polynomial regression (Author Image)

So, we built polynomial models up to degree 7 using the sklearn Pipeline method and found that degree 2 and above yielded a 99.9% r2-score (compared to ~94% for SLR). On the same dataset, we will now see another technique for building a regression model.

2. Modeling Non-linear Relationships using Data Transformation

The linear regression framework assumes that the relationship between the response and predictor variables is linear. To continue utilizing the linear regression framework, we have to modify the data so that the relationship between the variables becomes linear.

Some Guidelines for data transformations:

  • Both the response and the predictor variables can be transformed
  • If the residual plot reveals the presence of nonlinear relationships in the data, a straightforward strategy is to utilize nonlinear transformations of the predictors. In SLR, these transformations can be log(x), sqrt(x), exp(x), reciprocal, and so on.
  • It is critical that each regressor has a linear relationship with the target variable. Transforming the dependent variable is one method for addressing the non-linearity issue.

In short, usually:

  • Transforming the y-values helps in dealing with the error terms and may help with non-linearity.
  • Transforming the x-values mostly fixes the non-linearity.
  • For further information on data transformation, see https://online.stat.psu.edu/stat462/node/155/.

In our dataset, when we plotted the dependent variable “Marks” against “time of study” and “number of courses”, we observed that Marks has a non-linear relationship with time of study. Hence, we will apply a transformation to the feature time of study.

time of study showing non-linear behavior with Marks (Author Image)
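A minimal sketch of the transformation, squaring the time_study column from earlier:

# Square the study-time feature; the squared value should vary
# roughly linearly with Marks if the original relationship was quadratic
df["time_study_squared"] = df["time_study"] ** 2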

After applying the above transformation, we can plot Marks against the new feature time_study_squared to see if the relationship has changed to linear.

The new feature exhibits a linear relationship (Author Image)

Our dataset is now ready for building an SLR model. On this transformed dataset, we will now create a simple linear regression model with the sklearn LinearRegression() method.
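A minimal sketch, reusing the print_metrics helper and the assumed column names from earlier:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Rebuild the feature matrix around the squared study-time feature
X_sq = df[["time_study_squared", "number_courses"]]
X_train_sq, X_test_sq, y_train_sq, y_test_sq = train_test_split(
    X_sq, df["Marks"], test_size=0.3, random_state=42
)

lr_sq = LinearRegression().fit(X_train_sq, y_train_sq)
print_metrics(y_test_sq, lr_sq.predict(X_test_sq))

When we print the metrics after building the model, we get the following result: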

R2-Square Value: 0.9996
RSS: 7.083
MSE: 0.071
RMSE: 0.266

A significant improvement over the previously built SLR model on the raw dataset (without any data transformation): we got an R2-Square Value of 99.9% as opposed to 94%. Now, we’ll validate the various assumptions of an SLR model to see if it’s a good fit.

All assumptions of a SLR model is Validated (Author Image)

So, in this section, we transformed the data itself. Knowing that the feature time_study is not linearly related to Marks, we created a new feature called time_study_squared, which is linearly related to Marks. Then we built an SLR model again and validated all the assumptions of an SLR model. We observed that all the assumptions are satisfied by this new model. Now, it’s time to explore our next and last technique for building a different model on the same dataset.

3. Modeling non-linear relationship using Non-Linear Regression Model

For a non-linear regression problem, we can try SVR(), KNeighborsRegressor(), or DecisionTreeRegressor() from the sklearn library and compare model performance. Here, we will develop our non-linear model using the sklearn SVR() class for demonstration purposes. SVR supports a variety of kernels. Kernels enable the linear SVM machinery to fit nonlinearly separable data. We will test three alternative kernels with the SVR algorithm and observe how they affect model accuracy (a sketch comparing all three follows the list below):

  • rbf (default kernel for SVR)
  • linear
  • poly
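A minimal sketch of the comparison, reusing the original (untransformed) train/test split; note that SVR is sensitive to feature scale, so in practice you may also want to standardize the inputs first:

from sklearn.metrics import r2_score
from sklearn.svm import SVR

for kernel in ["rbf", "linear", "poly"]:
    svr = SVR(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel}: r2 = {r2_score(y_test, svr.predict(X_test)):.4f}")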

i. SVR() using rbf kernel

And here are the model metrics: still a better R2-squared compared to our first SLR model.

R2-Square Value: 0.9982
RSS: 4053558.081
MSE: 0.363
RMSE: 0.602

A quick check on the error term distribution also looks OK.

Error term distribution for SVR model with rbf kernel (Author Image)

ii. SVR() using linear kernel

And here are the model metrics when we used the linear kernel: the R2-squared value dropped to ~93%, close to our first SLR model.

R2-Square Value: 0.9350
RSS: 4063556.3
MSE: 13.201
RMSE: 3.633

In this case too, the error terms seem to follow a near-normal distribution curve:

Error term distribution for SVR model with linear kernel (Author Image)

iii. SVR() using poly kernel

And here are the model metrics with the SVR poly kernel: the R2-squared value is ~98%, which is higher than with the linear kernel but lower than with the rbf kernel.

R2-Square Value: 0.9798
RSS: 4000635.359
MSE: 4.087
RMSE: 2.022

And here is the error term distribution:

Error term distribution for SVR model with poly kernel (Author Image)

So, in this section, we created a non-linear model using the sklearn SVR model with 3 different kernels. We got the best R2-squared value with the rbf kernel.

  • r2-score with rbf kernel = 99.82%
  • r2-score with linear kernel = 93.50%
  • r2-score with poly kernel = 97.98%

Conclusion:

In this post, we started with a dataset whose target variable was not linearly related to the predictors. We first constructed a simple linear regression model, which achieved an r2-score of 94%, and then investigated three distinct methods for modelling a nonlinear dataset: polynomial regression, data transformation, and a nonlinear regression model (SVR). We discovered that polynomial degrees of 2 and higher resulted in a 99.9% r2-score, whereas SVR with an rbf kernel resulted in a 99.82% r2-score. In general, whenever we have a nonlinear dataset, we should experiment with several strategies and see which one works best.

Find the data set and code here: https://github.com/kg-shambhu/Non-Linear-Regression-Model

You can contact me on LinkedIn: https://www.linkedin.com/in/shambhukgupta/
