
Multiple Linear Regression Python 101

Step-by-step guide for data preparation and predictive modeling

Getting Started

Source: Adobe Stock License

Starting out building your first multiple linear regression predictive model using Python can feel daunting! This post offers a practical workflow, guide, and example code of one approach that builds on CRISP-DM. I hope you’ll find it useful and welcome your comments.

Conceptual Workflow for Linear Regression

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a leading process model that describes the data science life cycle. This project follows the tactical workflow below to build a linear regression model. The process diagram sequences sub-tasks for four CRISP-DM phases: Data Understanding, Data Preparation, Modeling and Evaluation.

Source: Author

The simple ideas inherent in the process flow include:

  • Split your data before too much profiling and analysis
  • Iterate through a common set of data preparation tasks for each feature
  • Iterate through a variety of models containing different features
  • Use functions for consistency in data transformations and code efficiency across the project

Now let’s get started and walk through a project step by step.


Data Preparation Work Stream

I’m using a King County, WA home sales dataset that is popular on Kaggle and with data science bootcamps. The data contains 21 columns across more than 20K completed home sales transactions in metro Seattle, spanning 12 months from 2014 to 2015. The multiple linear regression model uses Ordinary Least Squares (OLS) to predict the continuous variable ‘home sale price’.

The data, Jupyter notebook and Python code are available at my GitHub.

Step 1 – Data Prep Basics

To begin understanding our data, this process includes basic tasks such as:

  • loading data
  • combining and integrating data
  • converting data types
  • fixing data quality issues
  • removing duplicates
  • steps to make the data clean, formatted and usable

I bundle all my library imports together in a single place for easy access.
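The original import cell isn’t reproduced here, but a consolidated block along these lines covers the libraries used throughout this post (the notebook’s exact list may differ). The later snippets assume these imports are in place.

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics and modeling
import statsmodels.api as sm
from scipy import stats
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
```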

The next step is to load our .csv data into a Pandas dataframe and take a peek.
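A minimal sketch of that step (the file name is an assumption):

```python
# Load the King County home sales data (file name is an assumption)
df = pd.read_csv('kc_house_data.csv')

# Take a peek at the shape and the first few rows
print(df.shape)
df.head()
```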

The file contains standard attributes of a home listing; a data dictionary can be found in my notebook.

At first glance, notice the fractional values in bathrooms and the "nan" nulls in yr_renovated and waterfront. Using DataFrame.info(), we can also spot data type and null issues that need addressing.

So based on this and a few Series.value_counts() not shown here, we can make adjustments to clean our housing sales data.
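A minimal sketch of those adjustments, assuming the sparse nulls are filled with 0 and the sale date is parsed (the notebook’s exact cleanup choices may differ):

```python
# Fill the sparse nulls flagged above and convert to integer flags/years (illustrative choices)
df['waterfront'] = df['waterfront'].fillna(0).astype(int)
df['yr_renovated'] = df['yr_renovated'].fillna(0).astype(int)

# Parse the sale date and drop any duplicate rows
df['date'] = pd.to_datetime(df['date'])
df = df.drop_duplicates()
```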

Now let’s take a look at basic statistics and ranges using DataFrame.describe().

Review of the above table highlights:

  • Price: skewed positive with mean higher than median. Outlier at $7.7M.
  • Bedrooms: Outlier at 33
  • Sqft_Lot: Obvious outlier at 1.6M, and mean is nearly double median.
  • Waterfront: Sparsely populated, very few 1’s.
  • Yr_Renovated: Small proportion of homes given 75th percentile is 0.

After some consideration, I dropped the single record with 33 bedrooms. Only a handful of other homes had 8 or 9 bedrooms, so I treated this one as a likely data entry error. Even though there were other outliers on several attributes, I did not filter or drop any other records.

Dealing with duplicates, merging other data or any other general data prep tasks can be included in this Data Prep Basics process. Now the data is consistent and ready to be split for data profiling, correlation analysis and feature engineering.

Step 2 – Split Data Into Training and Test Sets

For a predictive model that will be used on new data in production, splitting the data prior to profiling and analysis helps prevent data leakage.

data leakage: when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict.

Given that this analysis is a one-time historical effort, I split the data after doing correlations, which means I developed features using the entire sample. I’m showing the split logic here at Step 2, however, to align with the best-practice diagram above.

The split step creates the X (predictors) and y (target) datasets. As candidate target (dependent) variables, I included price, the log of price, and the Box-Cox transformation of price. These three target options were developed iteratively during model development.
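A minimal sketch of that split (column names such as id and date are assumptions about the raw file):

```python
# Candidate target variables: raw price, log of price, and Box-Cox of price
y = pd.DataFrame({'price': df['price'],
                  'price_log': np.log(df['price'])})
y['price_boxcox'], price_lambda = stats.boxcox(df['price'])

# Predictors: everything except the target and identifier columns
X = df.drop(columns=['price', 'id', 'date'])

# 80/20 train-test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```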

Step 3 – Data Profile & Transform

Profiling and visually exploring data helps size up potential features and relationships. While any type of exploration is fair game, I focus iteratively on these objectives for each predictor X (e.g. bedrooms) and target y (e.g. price_log):

  • Check variable trends over time
  • Confirm linear relationships (X-y)
  • Investigate outliers
  • Assess variable distributions, normality and skew
  • Convert variable classes with one-hot encoding

Category Variable Plots

I created a function (called distplots, available in my Jupyter notebook) to plot any X predictor with a histogram, scatterplot, boxplot and top N cumulative values. A variable I assumed would be in my model was bedrooms since this is often a differentiating descriptor when buying or selling a home.
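The actual distplots function is in the notebook; a minimal sketch of the idea might look like this (the panel layout and styling are assumptions):

```python
def distplots(data, feature, target='price_log', top_n=10):
    """Profile one predictor: histogram, scatter vs. target, boxplot by value,
    and the cumulative share of the top N most frequent values."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))

    sns.histplot(data[feature], ax=axes[0, 0])
    axes[0, 0].set_title(f'{feature} distribution')

    sns.scatterplot(x=data[feature], y=data[target], alpha=0.3, ax=axes[0, 1])
    axes[0, 1].set_title(f'{feature} vs {target}')

    sns.boxplot(x=data[feature], y=data[target], ax=axes[1, 0])
    axes[1, 0].set_title(f'{target} by {feature}')

    # Cumulative percentage of observations for the most frequent values
    top = data[feature].value_counts(normalize=True).head(top_n).cumsum()
    top.plot(kind='bar', ax=axes[1, 1])
    axes[1, 1].set_title(f'Top {top_n} values, cumulative share')

    plt.tight_layout()
    plt.show()

# Example usage on the training data
distplots(X_train.join(y_train), 'bedrooms')
```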

Reviewing the output, a couple of observations:

  • After removing the outlier of 33 bedrooms, only 0.3% of homes have more than 6 bedrooms
  • The median logically falls somewhere between 3 and 4 bedrooms
  • The box-plot shows an increasing median price with number of bedrooms

…but ultimately the correlation between bedrooms and price was weak at only 0.32 and did not move my model performance, so I left it out.

I also used the distplots function to visualize the other variables.

Average Living Space Trend

If you have time-based variables, analyzing trends is useful to see direction or momentum. For instance, average living space in homes has increased over the past century. Since some older homes had higher bedroom counts, I plotted a second axis to see living square footage per bedroom. The chart is a dual-axis bar plot with a trend line, built with the Seaborn and Matplotlib libraries.
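A sketch of such a dual-axis plot, aggregating by decade built (the aggregation level, derived sqft_per_bedroom calculation and styling are assumptions):

```python
# Average living space and space per bedroom by decade built
by_decade = (df.assign(decade=(df['yr_built'] // 10) * 10,
                       sqft_per_bedroom=df['sqft_living'] / df['bedrooms'].replace(0, np.nan))
               .groupby('decade')
               .agg(avg_sqft=('sqft_living', 'mean'),
                    sqft_per_bedroom=('sqft_per_bedroom', 'mean'))
               .reset_index())

fig, ax1 = plt.subplots(figsize=(12, 6))

# Bars: average living square footage per decade
sns.barplot(data=by_decade, x='decade', y='avg_sqft', color='steelblue', ax=ax1)
ax1.set_ylabel('Avg living sqft')

# Second axis: trend line for square footage per bedroom
ax2 = ax1.twinx()
sns.lineplot(x=np.arange(len(by_decade)), y=by_decade['sqft_per_bedroom'],
             color='darkorange', marker='o', ax=ax2)
ax2.set_ylabel('Avg sqft per bedroom')

plt.title('Living space by decade built')
plt.show()
```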

The chart shows living space "taking off" during the 1970s in the Seattle area. For homes built since 1970, we see a 32% increase in living square footage, a 9% increase in average bedrooms and a 21% increase in square footage per bedroom. The feature sqft_per_bedroom made the final model cut as a complementary feature to living_sqft.

Pairplot Association Analysis

Another data profiling technique is to visually inspect intersections of variables for associations. Here I’ve used Seaborn’s pairplot, which creates a scatterplot for each combination of variables and a histogram on the diagonal. A few takeaways (a code sketch follows these observations):

  • Linear with price: Grade, Condition, Bathrooms
  • Inconclusive or non-linear with price: Year Built, Indicators, Bedrooms
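A sketch of the pairplot call, using a hand-picked subset of columns (the subset shown is an assumption):

```python
# Scatterplot matrix for a subset of candidate predictors and the target
cols = ['price', 'grade', 'condition', 'bathrooms', 'yr_built', 'bedrooms']
sns.pairplot(df[cols], corner=True, plot_kws={'alpha': 0.3})
plt.show()
```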

One-Hot Encoding

A common approach is one-hot encoding (OHE) to pivot categorical variables into features. I tested several OHE candidates such as condition, grade, bedrooms and bathrooms to see if they performed better than their continuous versions. I found each of those was better off left as a continuous variable.

I ended up with two OHE variables: zipcode and grade_grp (simplifying the residential building groups from 3–13 into 4 categories). I used the Pandas .get_dummies() method to pivot each feature into OHE columns, dropping the first class to avoid perfect multicollinearity.
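A minimal sketch of the encoding step (the grade_grp bin edges shown are assumptions; the notebook defines the actual grouping):

```python
# One-hot encode zipcode, dropping the first class to avoid perfect multicollinearity
zip_ohe = pd.get_dummies(df['zipcode'], prefix='zip', drop_first=True)

# Simplify the grade scale into 4 buckets, then encode (bin edges are illustrative)
df['grade_grp'] = pd.cut(df['grade'], bins=[0, 6, 8, 10, 13],
                         labels=['low', 'mid', 'high', 'luxury'])
grade_ohe = pd.get_dummies(df['grade_grp'], prefix='grade', drop_first=True)

df = pd.concat([df, zip_ohe, grade_ohe], axis=1)
```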

Step 4 – Correlation Analysis

The next step is looking at correlations with these goals:

  • Validate which predictor features (X) correlate to target (y)
  • Check correlations between predictors for multicollinearity

Pearson’s Correlation Coefficient (r) – Target y

Review the Pearson correlation coefficients for each feature (X) against the target price_log (y). Correlations provide insight into features that may perform well in a linear regression model. I used these results, along with correlation matrix heat maps, as a guide on which features to include (a code sketch follows the list below).

The four starred features had moderate or strong correlations and made the final OLS model.

  • Strong correlations (0.6–0.8): sqft_living, grade, sqft_above, zip_psf_median, sqft_living15
  • Moderate correlations (0.4–0.6): bathrooms, view
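A ranking like this can be produced with corrwith; a minimal sketch, run here on the training predictors:

```python
# Rank each numeric predictor by its Pearson correlation with price_log
correlations = (X_train.select_dtypes(include='number')
                        .corrwith(y_train['price_log'])
                        .sort_values(key=abs, ascending=False))
print(correlations.head(10))
```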

Check Multicollinearity

When independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. – Jim Frost "Regression Analysis"

Reviewing correlations between predictors is a key step to avoid picking features with strong multicollinearity that could invalidate the model. The approach is to create pairs of predictor variables and sort them by highest correlation.
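A sketch of that pairing logic, keeping only the upper triangle of the correlation matrix so each pair appears once:

```python
# Pairwise correlations between predictors, sorted strongest to weakest
corr = X_train.select_dtypes(include='number').corr()
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)  # upper triangle, excluding the diagonal
pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head(15))
```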

Highlighting two different scenarios (starred above):

  • Sqft_living with sqft_above at 0.876 correlation – I tried to incorporate sqft_above in my Model #3 but abandoned it due to multicollinearity.
  • Grade, Sqft_living at 0.76 correlation – even though highly correlated, I was able to include both features without significant issues.

The recommendation here is to be highly aware of predictor correlations using Pearson’s r or other statistics like the Variance Inflation Factor (VIF).
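For reference, a VIF check with statsmodels might look like this sketch (which predictors to screen is up to you):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each candidate predictor; values well above ~5-10 warrant a closer look
features = sm.add_constant(X_train.select_dtypes(include='number'))
vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns)
print(vif.drop('const').sort_values(ascending=False))
```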

Step 5 – Engineer New Features

The first time through a data set when creating a linear regression model, you may not initially need any new features. But after your first model iteration, potential transformations include:

  • Add a confounding variable to reduce model bias
  • Convert a feature to per-capita or express as a rate
  • Power transform a feature using inverse, log, square root or box-cox
  • Add a polynomial feature (e.g. X²) to better fit curvature
  • Add an interaction term to tease out shifts in linear relationships between two predictors

For the King County home price sales predictions, I explored all of the above except polynomials. For my baseline Model #1 – let’s just see how simple untransformed data works first.

  • Target variable: price (untransformed)
  • Predictor variables (top 6 correlations): sqft_living, sqft_above, sqft_living15, bathrooms, view, grade (all untransformed)

Yikes!

While Model #1 registered a 0.577 adjusted R-squared and a significant p-value, it had massive skew and multicollinearity, as the QQ-plot of overall model residuals shows.

We clearly need to look at predictor distributions and consider other features.

Check Normality and Create Log Functions

Given the non-normal residuals and skewed price distribution, I created the Python functions check_normality and create_log, which take in a dataframe column, create a log series, and display dual histograms for comparison.
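The notebook holds the real versions; a minimal sketch of the pair might look like this:

```python
def create_log(series):
    """Return the natural log of a positive-valued series."""
    return np.log(series)

def check_normality(original, transformed, name):
    """Print skew/kurtosis before and after a transform and show dual histograms."""
    print(f'{name}: skew {original.skew():.2f} -> {transformed.skew():.2f}, '
          f'kurtosis {original.kurtosis():.2f} -> {transformed.kurtosis():.2f}')
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(original, ax=axes[0]).set_title(f'{name} (original)')
    sns.histplot(transformed, ax=axes[1]).set_title(f'{name} (transformed)')
    plt.show()

# Example: log-transform price and compare distributions
price_log = create_log(df['price'])
check_normality(df['price'], price_log, 'price')
```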

Using these functions, I log transformed the King County price series into price_log. The log series shown below is closer to normal, with skew near 0 and kurtosis of just under 1. Prior to transformation, price had a skew over 4, kurtosis of 35 and nearly 200 observations beyond the 4th standard deviation. As a guide, I targeted a skew near 0 and kurtosis between 0 and 6, with 3 being the value for a normal distribution.

Linear regression assumptions do not require that dependent or independent variables have normal distributions, only that the model residuals are normal. I settled on price_log as my dependent variable for iterations Model #2 through Model #8 since it improved performance and removed the skew from the residual plots.

Create Indicator Features – Renovation and Basement

After data profiling and trying to use the year renovated as a feature, I settled on including categorical indicator features renovation_ind and basement_ind with 1’s representing ‘yes’.

Median Price Per Square Foot By Zip Code

Incorporating zipcode in the model is important since prices vary greatly by geographic area. In Model #6, I first used 70 zip codes as one-hot encoded categorical variables, but abandoned this model as residuals were very poor. Instead I added a feature "median price per square foot" that I applied to the data by zipcode. The box-cox transformed version of this feature ended up significant so was included in the final Model #8.
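A sketch of how such a feature could be built from the training data and applied to both sets (the names follow the article; the exact computation is an assumption):

```python
# Median price per square foot by zipcode, computed on the training data
train = X_train.join(y_train)
zip_psf = (train['price'] / train['sqft_living']).groupby(train['zipcode']).median()

# Map the zipcode-level median back onto each home
X_train['zip_psf_median'] = X_train['zipcode'].map(zip_psf)
X_test['zip_psf_median'] = X_test['zipcode'].map(zip_psf)

# Box-Cox transform of the new feature (this version made the final model)
X_train['zip_psf_median_box'], psf_lambda = stats.boxcox(X_train['zip_psf_median'])
X_test['zip_psf_median_box'] = stats.boxcox(X_test['zip_psf_median'], lmbda=psf_lambda)
```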

Miles From Seattle City Center

This feature used each home’s latitude and longitude to measure the distance to the center of Seattle. It was actually created before the zip-code price-per-square-foot feature above, based on the idea that prices are generally higher closer to urban population centers. Initially this feature was very significant, but it lost most of its influence once included alongside the zip-code median price-per-square-foot.
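A sketch of that calculation using the haversine formula and the dataset’s lat/long columns (the feature name is illustrative):

```python
def miles_from_seattle(lat, lon):
    """Great-circle (haversine) distance in miles from downtown Seattle."""
    seattle_lat, seattle_lon = np.radians(47.6062), np.radians(-122.3321)
    lat, lon = np.radians(lat), np.radians(lon)
    dlat, dlon = lat - seattle_lat, lon - seattle_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(seattle_lat) * np.cos(lat) * np.sin(dlon / 2) ** 2
    return 3959 * 2 * np.arcsin(np.sqrt(a))  # Earth radius of roughly 3,959 miles

df['miles_from_city_center'] = miles_from_seattle(df['lat'], df['long'])
```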

Step 6 – Select Features

The final step of data preparation is to choose which features to include in the next iteration of the regression model. This experimentation process is informed by your completed data analysis and prior model results.

My approach is to make a copy of train and test data containing only the features selected so I have a set of dataframes and model results dedicated to each model iteration.

I saved 8 model iterations in my notebook but ran at least twice that many experimenting on features, transformations, scaling and so forth. Four of my iterations I considered statistically valid while four I abandoned due to performance, residuals or multicollinearity. I stopped at Model #8 which contained the following features.


Modeling Work Stream

The Select Features decision is our transition point from the Data Preparation to the Modeling work stream. For this section, I’ll focus only on the final Model #8 process and results. Within each section Model Build and Model Predict below, I’ve included the relevant Model Evaluate steps.

Source: Author

Step 7 – Feature Scaling

Before building (fitting) a linear regression model, the step of Feature Scaling encompasses any approach you might need to standardize (normalize), center or otherwise scale a feature. Common sklearn functions you might employ include StandardScaler, MinMaxScaler or RobustScaler. For my final model, I did not scale any features.
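If a model iteration did need scaling, the usual pattern is to fit the scaler on the training data only and then apply it to both sets; a minimal sketch:

```python
from sklearn.preprocessing import StandardScaler

# Scale only the numeric columns; fit on train, transform both train and test
num_cols = X_train.select_dtypes(include='number').columns
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])
```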

Step 8 – Model Build

The function calc_sm_ols takes in X and y dataframes and fits a statsmodels Ordinary Least Squares (OLS) regression model. The function prints the model summary, root mean squared error (RMSE) and mean absolute error (MAE), handling conversion of price_log back to interpretable dollar values.
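The real calc_sm_ols lives in the notebook; a minimal sketch of the idea (X_train_model8 is a hypothetical dataframe holding only Model #8’s selected features):

```python
def calc_sm_ols(X, y_log):
    """Fit a statsmodels OLS on a log-transformed target and report dollar-scale errors."""
    X_const = sm.add_constant(X)
    model = sm.OLS(y_log, X_const).fit()
    print(model.summary())

    # Convert log-scale predictions back to dollars for interpretable error metrics
    preds = np.exp(model.predict(X_const))
    actuals = np.exp(y_log)
    print(f'RMSE: ${np.sqrt(mean_squared_error(actuals, preds)):,.0f}   '
          f'MAE: ${mean_absolute_error(actuals, preds):,.0f}')
    return model

# Fit Model #8 on the training features (X_train_model8 is a hypothetical name)
model8 = calc_sm_ols(X_train_model8, y_train['price_log'])
```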

Let’s fit our features for Model #8 on our training data and see how we look.

The yellow boxed statistics are key performance components:

  • Adjusted R-squared: 0.745 indicates model explains 74.5% of variance. This is the best of my models where I felt comfortable with the residuals.
  • Overall p-value: 0.00 indicates we can reject the null hypothesis; the model results are significant with p-value under our alpha of 0.05.
  • Akaike information criterion (AIC): 3329 was the lowest of the four models I consider statistically valid, indicating the best relative fit.
  • Feature p-values: All 10 features had p-values under 0.05, so we can reject the notion that results for any feature were due to randomness.
  • Skew: Model scored near 0 at 0.041 indicating residual distribution is close to normal.
  • Kurtosis: The score of 4.395 is within the desired range of 0–6, near the normal-distribution kurtosis of 3. This suggests slightly heavier tails in the residual distribution than normal, but acceptable.
  • JB and Prob(JB): Based on near 0 probability (JB), this further supports that we have near-normal distribution of regression errors.
  • Condition Number: 629 likely indicates multicollinearity amongst features, but not enough to throw a warning on the OLS summary. The statsmodels documentation suggests this value should be below 20.
  • Mean Absolute Error (MAE): $107,812 indicates the average error in predicting a home price. I focused on MAE over RMSE to reduce the impact of squaring large outlier residuals. Keep in mind the home prices range from under $100K to over $7M.

So in summary the overall model results looked solid, not perfect, but the best I had come up with given available data and my time allotment.

Overall QQ-Plots

Let’s use our function qqplot to visually inspect overall model residuals.
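A minimal version of such a helper, built on statsmodels’ qqplot:

```python
def qqplot(model):
    """QQ-plot of the model residuals against a standardized normal distribution."""
    sm.qqplot(model.resid, line='45', fit=True)
    plt.title('Model residuals vs. normal distribution')
    plt.show()

qqplot(model8)
```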

When reviewing a QQ-plot, perfectly normalized residuals would follow the red diagonal. Both of our tails here drop off, especially the last point on each end. These represent outliers where the model isn’t predicting correctly, generating larger RMSE values. In future iterations I’d work to improve this.

Feature Residual Plots

We also need to review the residuals distribution for each independent variable to confirm each is homoscedastic. If we see patterns instead of randomness in the residuals, our assumptions may not be trustworthy and we may be overestimating goodness of fit. The function plot_residuals loops through each feature to plot residuals.
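The notebook has the full version; a minimal sketch of the core loop might look like this:

```python
def plot_residuals(model, X):
    """Scatter model residuals against each feature to check for homoscedasticity."""
    for feature in X.columns:
        fig, ax = plt.subplots(figsize=(8, 3))
        ax.scatter(X[feature], model.resid, alpha=0.3)
        ax.axhline(0, color='red', linestyle='--')
        ax.set_xlabel(feature)
        ax.set_ylabel('residual')
        plt.show()

plot_residuals(model8, X_train_model8)
```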

Of the 10 features, 9 looked great and 1 had a trend in it. Let’s first look at an example where the residual plot looks good, on the sqft_living_log feature. In the top right box, you can see the residuals cluster both above and below 0 in a mostly random pattern. This is what we want to see.

On our zip_psf_median_box feature, the residuals show a slight downward trend. This likely indicates the feature needs some adjustment, or that we have a confounding variable we should incorporate in future iterations.

K-Folds Cross Validation

A final step to get comfortable with the model is to use k-folds from sklearn.model_selection. I’ve used 5 splits on our complete sample to ensure every observation from the original dataset has the chance of appearing in training and test sets iteratively.
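Because cross_val_score expects an sklearn-style estimator, the sketch below uses sklearn’s LinearRegression as an OLS stand-in; model8_features is a hypothetical list of the final feature names:

```python
from sklearn.linear_model import LinearRegression

# 5-fold cross-validation on the full sample, scoring MAE on the log scale
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X[model8_features], y['price_log'],
                         scoring='neg_mean_absolute_error', cv=kf)
print(f'Mean MAE (price_log scale): {-scores.mean():.4f}')
```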

Cross-validation using 5 k-folds results in mean MAE of 0.2123 versus our OLS model summary of 0.19907 (on price_log). While a 6.6% difference is slightly higher than I’d like, I can feel confident that the model reasonably held up.

I used a train-test split (80/20) and then also applied the k-folds to the entire sample.

Step 9 – Model Predict

Now in the final stretch, let’s use our X_test and y_test data that have not yet been exposed to our model. The function predict takes in both test and train data to apply the model, and outputs a comparison of RMSE and MAE for evaluation.
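A sketch of a predict helper along those lines, again converting price_log back to dollars (X_test_model8 is a hypothetical dataframe of the selected test features):

```python
def predict(model, X_train, X_test, y_train_log, y_test_log):
    """Apply the fitted OLS model to train and test sets and compare dollar-scale errors."""
    results = {}
    for label, X, y_log in [('train', X_train, y_train_log), ('test', X_test, y_test_log)]:
        preds = np.exp(model.predict(sm.add_constant(X)))
        actuals = np.exp(y_log)
        results[label] = {'RMSE': np.sqrt(mean_squared_error(actuals, preds)),
                          'MAE': mean_absolute_error(actuals, preds)}
    return pd.DataFrame(results)

print(predict(model8, X_train_model8, X_test_model8,
              y_train['price_log'], y_test['price_log']))
```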

This comparison looks solid as the predictions on test produced an MAE of $108,268 which is less than 1% different than the train MAE. We’re ready to move on to our final selection considerations.

Step 10 – Model Compare & Select

Only 4 of 8 formal models I developed were deemed worthy. The recommended Model #8 was by far the best, producing an MAE of $108K and 0.745 Adjusted R-Squared. Model #8 represented a cumulative improvement of $38K in MAE and 0.174 in Adjusted R-Squared versus Model #2.

One challenge in describing this multiple linear regression model to the business is the fact that we have 10 features and use several log transformations. This makes interpretability difficult. The following visualization filters 7 of the predictors to be "average", and then plots this subset of predictions (244 test observations) to show the relationship between Price and Living Square Feet.

Diving into a single prediction can describe the relative impact of features. Following the circled point above shows the composition of this individual prediction.

Given the log nature and categorical variables, the relative importance of features will vary with each home. But at least for this home we can assess the following:

  • The model constant, living square footage and price per square foot in their zipcode drove 85% of the predicted price.
  • Other features grade, square footage per bedroom, condition and miles from Seattle’s city center contributed the other 15%.

This particular home "Rainy Paradise" did not have four of the features in our model (all were 0 based on the data for this home sale).

  • Waterfront: If true, would have added +$189K to predicted price
  • Renovation: If true, would have added $54K to predicted price
  • Basement: If true, would have added $9K to predicted price
  • Grade 10 (vs 7): If true, would have added $240K to predicted price

Hopefully this analysis makes the model a little more real and informs on the relative feature importance and sensitivity.

Final Thoughts

Thanks for sticking with me through this lengthy post! I wanted to lay out an end-to-end multiple linear regression workflow in one place. Model #8, which I’m recommending, is good but I was disappointed with having a six-figure MAE. If I had more time, I would continue to optimize these models looking first at these areas:

  • Take a deeper look at outliers, or perhaps split out a luxury model versus a core model.
  • Use the latitude and longitude to try to isolate neighborhood and generate median price-per-square foot values at the neighborhood level versus zipcode.
  • Look at polynomial and interaction terms in greater detail.

Please do leave your comments or ideas!

