Predictive Modelers’ Guide To Choosing The Best Fit Regression Model for Beginners

How to choose the best fit Regression model

Freeman Goja
Towards Data Science


When looking for the best fit model for prediction, finding the right algorithm has often proven to be the difference between the success and failure of the entire project. For newcomers to machine learning, as I once was and as every expert once was, choosing the best fit model can be very confusing. That confusion is usually not so much about low-level coding ability as about a poor grasp and application of the concepts, and a tendency to fall for the wrong metrics.

In this article, I am going to walk through the concepts and give a step-by-step guide to building the best fit model for any regression problem in three major steps. I assume you are at least familiar with these models, so I will focus on their application. We shall compare the performance of the following machine learning algorithms:

Linear Regression

K-Nearest Neighbour (KNN) Regressor

Decision Tree Regressor

Random Forests Regressor

Adaboost Regressor

XGBoost Regressor, and

Support Vector Machine (SVM) Regressor

Metrics: First, let’s start with the metrics, because the metrics chosen to evaluate the model are very important. R-Squared (R²), Adjusted R-Squared (Adj R²), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are very popular metrics for regressors.

R², the regression score or coefficient of determination, is an intuitive statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. A constant model that always predicts the expected value of y, regardless of the input features, gets an R² of 0, while a perfectly fitting model has an R² of 1.0. R² can even be negative for a model that performs arbitrarily worse than the constant model. In general, R² is a measure of the relative fit of a model. This metric has a pitfall worth noting, however: the R² score tends to increase whenever features are added, whether or not they actually improve the model’s fit. To overcome this drawback, another metric, Adjusted R², interpreted as the proportion of total variance explained by the model built from all the independent variables, is preferred. Adjusted R² takes the degrees of freedom into account: it increases when an added feature improves the model’s fit and decreases otherwise.

The Mean Squared Error (MSE) measures how close a regression line is to a set of points. It takes the vertical distances from the points to the regression line (the residuals, or errors) and squares them, both to remove negative signs and to penalize large deviations. The Root Mean Squared Error (RMSE) is the square root of MSE and indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model’s predicted values. RMSE is relatively easy to interpret because it is expressed in the same unit as the response variable. Lower RMSE values mean the regression line is close to the data points, indicating a better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion of fit when the main purpose of the model is prediction.
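As a quick illustration of these metrics with made-up numbers (sklearn gives us R² and MSE directly; Adjusted R² can be derived from R², the sample size n and the number of predictors k):

```python
# Illustrative only: synthetic y values, not the article's dataset.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([10.0, 12.5, 14.0, 15.5, 18.0])
y_pred = np.array([10.4, 12.1, 14.6, 15.0, 17.5])

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

n, k = len(y_true), 2  # k = number of predictors used by the (hypothetical) model
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R2: {r2:.3f}  Adj R2: {adj_r2:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}")
```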

Having established the basic concepts, let’s start getting our hands dirty. I have chosen to suppress warnings so they do not clutter the output. A word of caution when hiding all warnings: you must be sure your code is free from errors.

To make the code look more professional, I find importing all required libraries in one place a nice way to start.

1. Set directory and import relevant libraries
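Here is a minimal sketch of this step; the working-directory path is a placeholder, and the library list can be trimmed or extended to match what you actually use.

```python
import os
import warnings
warnings.filterwarnings("ignore")  # suppress warnings, as discussed above

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder path -- point this at your own project folder.
# os.chdir("/path/to/project")
```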

Import data, explore, pre-process and prepare it for modelling.

2. Explore and prepare data

Explore the data to learn about its dimensions, data types, basic statistics and so on. Depending on the dataset, categorical variables and missing values may have to be handled. In this example there are no null values, but the categorical values in cd, multi and ads had to be encoded, as sketched below.
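A sketch of the loading, exploration and encoding steps. The file name computers.csv is my placeholder for the dataset; cd, multi and ads are the categorical columns mentioned above.

```python
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("computers.csv")  # placeholder file name

print(df.shape)           # dimensions
print(df.dtypes)          # data types
print(df.describe())      # basic statistics
print(df.isnull().sum())  # count of missing values per column

# Encode the categorical columns as integers.
le = LabelEncoder()
for col in ["cd", "multi", "ads"]:
    if df[col].dtype == object:
        df[col] = le.fit_transform(df[col])
```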

Another important thing to check for is outliers. The way you prepare a dataset with outliers for modelling differs from the way you prepare one without, as you will see later with scaling. Here, the box plots sketched below reveal outliers in price and hd. Outliers in price are not a concern because it is the target variable, but the presence of outliers in predictors such as hd would affect the model’s performance. Detecting outliers and choosing a scaling method that minimizes their effect ultimately improves performance. You can find the full code and dataset in my Github repository here.
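Box plots are a simple way to make those outliers visible; this sketch inspects price and hd, the two columns called out above.

```python
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(y=df["price"], ax=axes[0]).set_title("price")
sns.boxplot(y=df["hd"], ax=axes[1]).set_title("hd")
plt.tight_layout()
plt.show()
```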

Since we are dealing with regression, it is important to see the correlation between the predictors and the target variable. This can be done with a correlation matrix and heatmap, as shown below.
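A standard recipe for this, assuming the encoded DataFrame from the previous step:

```python
corr = df.corr()
print(corr["price"].sort_values(ascending=False))  # correlation of each column with the target

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```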

From the correlation matrix, we can see that the independent variables are correlated with the target to varying extents. A low correlation means a weak linear relationship, but there may still be a strong non-linear relationship, so we cannot pass judgement at this stage; let the algorithms work for us.

Talking about scaling, there’s a wide range of scalers in sklearn, including MinMaxScaler, StandardScaler, RobustScaler, minmax_scale, MaxAbsScaler, Normalizer, QuantileTransformer and PowerTransformer. I compared the commonly used ones in this article to find the most suitable; feel free to add more or choose according to your requirements. The code below visualizes a feature’s distribution before scaling and under the three chosen scalers.
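The sketch below compares the distribution of one predictor (hd is used purely as an example) before scaling and under the three scalers examined next.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X_raw = df.drop("price", axis=1)
scalers = {
    "Before scaling": None,
    "MinMaxScaler": MinMaxScaler(),
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, scaler) in zip(axes, scalers.items()):
    data = X_raw if scaler is None else pd.DataFrame(
        scaler.fit_transform(X_raw), columns=X_raw.columns)
    sns.kdeplot(data["hd"], ax=ax)  # distribution of hd under each treatment
    ax.set_title(name)
plt.tight_layout()
plt.show()
```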

MinMaxScaler: As you can see from the second panel in the figure above, it rescales all feature values into the range [0, 1]. However, it is very sensitive to the presence of outliers.

StandardScaler: This scaler removes the mean and scales the data to unit variance. However, outliers influence both the mean and the standard deviation, so StandardScaler, like MinMaxScaler, doesn’t guarantee balanced feature scales in the presence of outliers.

RobustScaler: Unlike the first two, RobustScaler is based on percentiles (it centres on the median and scales by the interquartile range) and hence is not easily influenced by outliers.

PowerTransformer: Applies a power transformation to each feature to make the data more Gaussian-like.

MaxAbsScaler: Similar to MinMaxScaler (the two behave identically on positive-only data); it scales by the maximum absolute value and also suffers from the presence of large outliers.

QuantileTransformer: Maps each feature to a uniform distribution by default, or to a Gaussian distribution with output_distribution='normal'; either way, it introduces saturation artifacts for extreme values.

The ultimate choice depends on domain knowledge of the dataset or on trial and error. From the figure above, only RobustScaler returns a nice distribution, keeping the outliers outside the bulk of the data, and it is therefore the obvious choice for optimum performance here.

3. Train the Model and Make Predictions

Now let’s split the dataset and import the appropriate modelling libraries.
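A sketch of the split and the model imports (an 80/20 split and a fixed random_state are my assumptions; XGBRegressor requires the xgboost package):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

X = df.drop("price", axis=1)  # predictors
y = df["price"]               # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```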

Before building the models, I want each one to perform at its best, so it is important to do feature selection for Linear Regression and to tune the hyper-parameters of XGBoost, AdaBoost, Decision Tree, Random Forests, KNN and SVM to find the best settings to use in the models.

Linear Regression: When building a Linear Regression model, there is no need to include features that do not contribute meaningfully to changes in the target variable. With backward elimination, you drop features whose p-values exceed 0.05, meaning their apparent effect on the target cannot be distinguished from chance. The following code keeps only the features with p-values below 0.05.
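A minimal sketch of backward elimination with statsmodels: refit an OLS model, drop the predictor with the highest p-value, and repeat until every remaining p-value is below 0.05. The helper function name here is mine, not from the original code.

```python
import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    features = list(X.columns)
    while features:
        ols = sm.OLS(y, sm.add_constant(X[features])).fit()
        p_values = ols.pvalues.drop("const")
        worst = p_values.idxmax()
        if p_values[worst] > threshold:
            features.remove(worst)  # drop the least significant feature and refit
        else:
            break
    return features, ols

selected_features, ols_model = backward_elimination(X_train, y_train)
print(selected_features)
```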

See the selected features’ p_values:
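Continuing the sketch above, printing the retained features’ p-values (or the full regression table) shows that each one is below the 0.05 threshold:

```python
print(ols_model.pvalues[selected_features])  # p-values of the retained features
print(ols_model.summary())                   # full OLS regression table
```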

Tuning hyper-parameters is one key way to optimize the performance of models that have them, as shown below.

XGBoost: The common hyper-parameters to tune here include learning_rate, max_depth and n_estimators.

AdaBoost: learning_rate and n_estimators

Decision Tree: max_depth

Random Forests: max_depth and n_estimators
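As a sketch of this tuning step, here is one way to grid-search XGBoost (the grid values are my own illustrative choices, not the article’s); AdaBoost, Decision Tree and Random Forests follow the same pattern with the parameters listed above.

```python
xgb_grid = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror", random_state=42),
    param_grid={
        "learning_rate": [0.01, 0.1, 0.3],
        "max_depth": [3, 5, 7],
        "n_estimators": [100, 300, 500],
    },
    scoring="neg_root_mean_squared_error",
    cv=5,
)
xgb_grid.fit(X_train, y_train)
print(xgb_grid.best_params_)
```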

KNN: When features have values on very different scales, it is important to bring them to a common range so that the algorithm does not give features with larger values more weight than features with smaller values, regardless of their units. This is crucial for distance-based algorithms such as KNN. We found earlier that RobustScaler was the most suitable scaler for this dataset, so we shall use it here for scaling.

The common hyperparameters to tune here are n_neighbors and p.

SVM: Similarly, scaling is important for SVM. The hyper-parameters to tune here are gamma, C and kernel.
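For KNN and SVM, the scaler can be wrapped with the estimator in a Pipeline so that RobustScaler is re-fitted on each training fold during the search; the grids below are illustrative assumptions, not the article’s exact values.

```python
from sklearn.pipeline import Pipeline

knn_grid = GridSearchCV(
    Pipeline([("scaler", RobustScaler()), ("knn", KNeighborsRegressor())]),
    param_grid={"knn__n_neighbors": list(range(1, 21)), "knn__p": [1, 2]},
    scoring="neg_root_mean_squared_error", cv=5)
knn_grid.fit(X_train, y_train)

svm_grid = GridSearchCV(
    Pipeline([("scaler", RobustScaler()), ("svm", SVR())]),
    param_grid={"svm__kernel": ["rbf", "linear"],
                "svm__C": [1, 10, 100],
                "svm__gamma": ["scale", 0.01, 0.1]},
    scoring="neg_root_mean_squared_error", cv=5)
svm_grid.fit(X_train, y_train)

print(knn_grid.best_params_)
print(svm_grid.best_params_)
```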

Having selected the best features and tuned the hyper-parameters, it’s time to build optimized models using those parameters. This ensures that each model performs at its best on the dataset. Below is a demonstration of the use of the values we have found.
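A sketch of that step is below. The hyper-parameter values written out here are placeholders; substitute whatever your own grid searches returned.

```python
models = {
    "Linear Regression": LinearRegression(),
    "XGBoost": XGBRegressor(objective="reg:squarederror", learning_rate=0.1,
                            max_depth=5, n_estimators=300, random_state=42),
    "AdaBoost": AdaBoostRegressor(learning_rate=0.1, n_estimators=300, random_state=42),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forests": RandomForestRegressor(max_depth=7, n_estimators=300, random_state=42),
    "KNN": Pipeline([("scaler", RobustScaler()),
                     ("knn", KNeighborsRegressor(n_neighbors=5, p=1))]),
    "SVM": Pipeline([("scaler", RobustScaler()),
                     ("svm", SVR(kernel="rbf", C=100, gamma="scale"))]),
}

# Linear Regression uses only the features kept by backward elimination;
# every other model is trained on the full predictor set.
for name, model in models.items():
    X_tr = X_train[selected_features] if name == "Linear Regression" else X_train
    model.fit(X_tr, y_train)
```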

Now let’s predict
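Continuing the sketch, each fitted model predicts prices for the test set:

```python
predictions = {}
for name, model in models.items():
    X_te = X_test[selected_features] if name == "Linear Regression" else X_test
    predictions[name] = model.predict(X_te)
```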

Done with the dirty work; now let’s look at the results.
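One compact way to produce the scores reported below (a sketch; for simplicity, k is taken as the full number of predictors, which slightly understates Adjusted R² for the reduced Linear Regression model):

```python
from sklearn.metrics import r2_score, mean_squared_error

n, k = X_test.shape  # sample size and number of predictors
for name, y_pred in predictions.items():
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name}: R2 = {r2:.3f}, Adj R2 = {adj_r2:.3f}, RMSE = {rmse:.2f}")
```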

R² Scores:

Adjusted R² Scores:

RMSE Values:

As expected, the Adjusted R² score is slightly lower than the R² score for each model. If we evaluated on this metric alone, the best fit model would be XGBoost, with the highest Adjusted R², and the worst would be AdaBoost, with the lowest. However, recall that this metric is only a relative measure of fit, so we must also look at the RMSE values. Here, XGBoost and AdaBoost have the lowest and highest RMSE values respectively, and the remaining models fall in exactly the same order as their Adjusted R² scores. This further confirms that the best fit model for this dataset is XGBoost and the worst is AdaBoost. Note that this agreement does not always happen, so be careful: if one model has the highest Adjusted R² but a high RMSE, you are generally better off with a model that has a moderate Adjusted R² and a low RMSE, since RMSE is the absolute measure of fit.

Looking at the plots of predicted vs actual prices, you can also see that the XGBoost predictions sit much closer to the actual prices, while the AdaBoost predictions are far more scattered.
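A sketch of those plots for the two extremes, with a 45-degree reference line; points hugging the line indicate predictions close to the actual prices.

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, name in zip(axes, ["XGBoost", "AdaBoost"]):
    ax.scatter(y_test, predictions[name], alpha=0.4)
    lims = [y_test.min(), y_test.max()]
    ax.plot(lims, lims, color="red")  # ideal predicted == actual line
    ax.set_title(name)
    ax.set_xlabel("Actual price")
    ax.set_ylabel("Predicted price")
plt.tight_layout()
plt.show()
```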

Thank you for stopping by and I hope you enjoyed this article and found it helpful. You can find the full code and dataset on GitHub here.

Linkedin Profile: https://www.linkedin.com/in/freemangoja
