
Linear Regression – Occam’s Razor of Predictive Machine Learning Modeling

Machine learning modeling using linear regression in Python

crystal ball, by DALL.E 2

Are you familiar with Occam’s Razor? I remember it being mentioned in the Big Bang Theory TV series! The idea behind Occam’s Razor is that, all other things being equal, the simplest explanation for a phenomenon is more likely to be true than a more complex one (in other words, the simplest solution is almost always the best solution). I like to think of Linear Regression as the Occam’s Razor of predictive modeling in machine learning: it is arguably the simplest modeling methodology to use, and for certain tasks it can also be the best solution. This post covers an introduction to linear regression and its implementation.

Similar to my other posts, learning will be achieved through practice questions and answers. I will include hints and explanations in the questions as needed to make the journey easier. Lastly, the notebook that I used to create this exercise is linked at the bottom of the post, which you can download, run, and follow along with.

Let’s get started!

(All images, unless otherwise noted, are by the author.)



Data Set

In order to practice linear regression, we will use a data set of car prices from UCI Machine Learning Repository (CC BY 4.0). I have cleaned up parts of the data for our use and it can be downloaded from this link.

I will explain some of the math behind the linear regression model we will be using in the exercise. Understanding the math is not required to follow the content of this post, but I do recommend going through it to get a better sense of what is happening behind the scenes when we create a linear regression model.


Fundamentals of Linear Regression

In linear regression, linear predictors (or independent variables) are used to predict a dependent variable. One simple example is the formula for a line:
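y = a * x + c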

In this case, y is the dependent variable and x is the independent variable (c is a constant). The goal of a linear regression model is to determine the best coefficient (a in the example above) for x to most accurately predict y.

Now let’s generalize that example to what is known as multiple linear regression. In a multiple linear regression model, the goal is to find the line of best fit that describes the relationship between the dependent variable and multiple independent variables.
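y = a_1 * x_1 + a_2 * x_2 + … + a_n * x_n + c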

In this case, we have multiple independent variables (or predictors) from x_1 to x_n and each one is multiplied by its own coefficient to predict the dependent variable y. In a linear regression model, we will try to determine the values of coefficients a_1 to a_n to have the best prediction for the dependent variable y.

Now that we understand what a linear regression is, let’s move to Ordinary Least Squares (OLS) regression, which is a form of linear regression.


Ordinary Least Squares Regression

An ordinary least squares regression model estimates the coefficients of a regression model by minimizing the sum of the squares of the residuals. A residual is the vertical distance between the line (i.e. the predicted value) and the actual value, as shown in the figure below. The residuals are squared so that errors do not cancel each other out (when one prediction is higher than the actual and another is lower, both are still errors and should not offset one another).
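Expressed as a formula, OLS chooses the coefficients that minimize the sum of squared residuals:

Σ (y_i − ŷ_i)²

where y_i is the actual value and ŷ_i is the model’s predicted value for observation i.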

Ordinary Least Squares Regression – Regression Line and Residuals

Now that we understand the underlying concepts, we will start with exploring the data and the variables (or features) that we may be able to use to predict the car prices. Then we will split the data into train and test sets to build the regression model. We will then look at the performance of the regression model and finally will plot the results.

Let’s get started!


1. Exploratory Analysis

Let’s start by looking at the data, which can also be downloaded from here. First we will import Pandas and NumPy. Then we will read the CSV file including our data set and look at the top five rows of the data set.

# Import libraries
import pandas as pd
import numpy as np

# Show all columns/rows of the dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# To show all columns in one view
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Read the csv into a dataframe
df = pd.read_csv('auto-cleaned.csv')

# Display top five rows
df.head()

Results:

Column names are mostly self-explanatory so I will add only the ones that were not immediately obvious to me. You can ignore these for now and just refer to them during the course of the exercise if you need the definition of a column name.

  • symboling: A value assigned by insurance companies according to the car’s perceived riskiness. A value of +3 indicates that the car is risky, -3 indicates that it is safe
  • aspiration: Standard or turbo
  • drive-wheels: rwd for rear-wheel drive; fwd for front-wheel drive; 4wd for four-wheel drive
  • wheel-base: The distance between the centers of the front and rear wheels in centimeters
  • engine-type: dohc for Dual OverHead Cam; dohcv for Dual OverHead Cam and Valve; l for L engine; ohc for OverHead Cam; ohcf for OverHead Cam and Valve F engine; ohcv for OverHead Cam and Valve; rotor for Rotary engine
  • bore: Inner diameter of the cylinder in centimeters
  • stroke: The distance the piston travels inside the cylinder

Question 1:

Are there any missing values in the dataframe?

Answer:

df.info()

Results:

As we can see, there are 25 columns (numbered 0 to 24) and 193 rows, and there are no null values in any of the columns.


2. Feature Selection

Feature selection is the process of identifying and selecting a subset of relevant features (also known as "predictors," "inputs," or "attributes") for building a Machine Learning model. The goal of feature selection is to improve the model’s accuracy and interpretability by reducing the complexity of the model and eliminating irrelevant, redundant, or noisy features.

Question 2:

Create a table showing the correlation among the columns in the dataframe.

Answer:

We are going to use pandas.DataFrame.corr, which calculates pairwise correlation of columns. There are two points to consider:

  1. pandas.DataFrame.corr will exclude null values. We confirmed our data set does not include any null values but this might be important in exercises with null values.
  2. We will be limiting the correlation to numerical values only and will discuss categorical values later in the exercise.

As a refresher, let’s review what categorical and numerical variables are before we continue.

In machine learning, categorical variables are variables that can take on a limited number of values. These values represent different categories, and the values themselves have no inherent order or numerical meaning. Examples of categorical variables include gender (male or female) and marital status (married, single, divorced, etc.).

Numeric variables are variables that can take on any numerical value within a certain range. These variables can be either continuous (meaning they can take on any value within a certain range) or discrete (meaning they can only take on specific, predetermined values). Examples of numeric variables include age, height, weight, etc.

With those out of the way, let’s calculate the correlations.

corr = np.round(df.corr(numeric_only = True), 2)
corr

Results:

Question 3:

There are a lot of correlation values generated in the last question, but we care most about each feature’s correlation with the car price. Show the correlation with the car price, ordered from the largest to the smallest.

Answer:

price_corr = corr['price'].sort_values(ascending = False)
price_corr

Results:

This is quite interesting. For example, "engine-size" seems to have the highest correlation with the price, which is expected, while "compression-ratio" does not seem to be as highly correlated with the price. On the other hand, "symboling", which we recall is a measure of riskiness of the car, is negatively-correlated with the car price, which again makes intuitive sense.

Question 4:

In order to focus on the features most relevant to a car price model, filter out the columns with a weaker correlation with price, which we define as any feature whose correlation with price has an absolute value less than 0.2 (an arbitrarily-selected threshold for this exercise).

Answer:

# Set the threshold
threshold = 0.2

# Drop columns with a correlation less than the threshold
df.drop(price_corr.where(lambda x: abs(x) < threshold).dropna().index, axis = 1, inplace = True)

df.info()

Results:

We see that as a result of this, we are now left with 19 features (there are 20 columns but one of them is the price itself so there are 19 features or predictors).

Question 5:

Now that we have a more manageable number of features, take another look at them and see if we need to drop any of them.

Hint: Some features might be very similar and arguably redundant. And some might not really matter.

Answer:

Let’s look at the dataframe and then at the correlation among the features left.

df.head()

Results:

# Calculate correlations
round(df.corr(numeric_only = True), 2)

Results:

"wheel-base" (distance between the front and rear wheels) and "length" (total lenght of the car) are highly-correlated and seem to convey the same information. Additionally, "city-mpg" and "highway-mpg" are highly-correlated, so we can consider dropping one of them. Let’s go ahead and drop "wheel-base" and "city-mpg" and then look at the top five rows of the dataframe again.

# Drop the columns
df.drop(['wheel-base', 'city-mpg'], axis = 1, inplace = True)

# Return top five rows of the remaining dataframe
df.head()

Results:

As we see above, the new dataframe is smaller and does not include the two columns that we just removed. Next, we will talk about categorical variables.

2.1. Dummy Coding

Let’s look more closely at the values of columns "make" and "fuel-type".

df['make'].value_counts()

Results:

df['fuel-type'].value_counts()

Results:

These two columns are categorical values (e.g. Toyota or Diesel), and not numerical.

In order to include these categorical variables in our regression model, we are going to create "dummy codes" for these categorical variables.

Dummy coding is where the categorical values (or predictors) in one column are replaced by multiple binary columns. For example, let’s assume we had a categorical variable as shown in this table:

Categorical Variable – Before Dummy Coding

As we see in the above table, "random_categorical_variable" can have three categorical values of A, B and C. We would like to transform the categorical variable into a format that we can more easily use in our regression model using dummy coding, which will transform it into three separate columns of A, B and C, with binary values, as follows:

Categorical Variable – After Dummy Coding

Let’s see how dummy coding can be implemented in Python.
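As a minimal, self-contained sketch (the column name and values below are made up purely for illustration and are not part of our car data set), pandas’ get_dummies performs exactly this kind of transformation:

# Toy illustration of dummy coding (hypothetical data, for demonstration only)
import pandas as pd

toy = pd.DataFrame({'random_categorical_variable': ['A', 'B', 'C', 'A']})

# get_dummies replaces the single categorical column with one binary column per category
pd.get_dummies(toy, columns = ['random_categorical_variable'])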

Question 6:

Dummy code the categorical columns of our dataframe.

Answer:

Let’s first look at what the dataframe looks like before dummy coding.

df.head()

Results:

We know from the previous question that column "fuel-type" can take 2 distinct values (i.e. gas and diesel). Therefore, after dummy coding, we expect to replace "fuel-type" column with 2 separate columns. The same applies to other categorical columns, depending on how many unique values each has.

Let’s first only dummy code the "fuel-type" column as an example and look at how the dataframe changes, then we can go ahead and dummy code other categorical columns.

# Dummy code df['fuel-type']
df = pd.get_dummies(df, columns = ['fuel-type'], prefix = 'fuel-type')

# Return top five rows of the updated dataframe
df.head()

Results:

As expected, we now have 2 columns for the original "fuel-type", named "fuel-type_gas" and "fuel-type_diesel".

Next, let’s identify all the categorical columns and dummy code them.

# Select "object" data types
columns = df.select_dtypes(include='object').columns

# Dummy code categorical columns
for column in columns:
    df = pd.get_dummies(df, columns = [column], prefix = column)

# Return top five rows of the resulting dataframe
df.head()

Results:

Note that the above snapshot does not cover all the columns after dummy coding, since we now have 63 columns, which is too many to display legibly in a single snapshot.

Lastly, now that we have created all these new columns, let’s recreate the correlation between price and all the other columns and sort it from the highest to the lowest.

# Re-create the correlation matrix
corr = np.round(df.corr(numeric_only = True), 2)

# Return correlation with price from highest to the lowest
price_corr = corr['price'].sort_values(ascending = False)
price_corr

Results:

As we see above, some of the categorical variables have a high correlation with the price such as "drive-wheels" and "num-of-cylinders".

At this point, we have familiarized ourselves with the data and cleaned it up to a certain extent. Now let’s continue with the main goal of creating a model to predict the price of a car based on these attributes.


3. Splitting the Data Into Train and Test Sets

At this point, we are going to first break down the data into dependent and independent variables. The dependent variable, or "y", is what we are going to predict, which is "price" in this exercise. It is called the dependent variable because its value depends on the values of the independent variables. The independent variables, or "X", are all the other variables or features left in our dataframe at this point, including "engine-size", "horsepower", etc.

Next, we will break down the data into Train and Test sets. As the names suggest, the Train set will be used to train our regression model, and we will then test the performance of the model using the Test set. We split the data to ensure that the model does not see the Test set during training, so that the Test set can be a good representation of how well the model performs on unseen data. It is important to split the data into a training set and a test set because using the same data to fit the model and evaluate its performance can lead to overfitting. Overfitting occurs when the model is too complex and has learned the noise and random fluctuations in the data, rather than the underlying pattern. As a result, the model may perform well on the training data but poorly on new, unseen data.

Question 7:

Assign the dependent variable (target) to y and the independent variables (or features) to X.

Answer:

X = df.drop(['price'], axis = 1)
y = df['price']

Question 8:

Break down the data into a train and test set. Use 30% of the data for the test set, and use a random_state of 1234.

Answer:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1234)

Question 9:

Train a linear regression model using the training set.

Answer:

from sklearn.linear_model import LinearRegression

# First create an object of the class
lr = LinearRegression()

# Now use the object to train the model
lr.fit(X_train, y_train)

# Let's look at the coefficients of the trained model
lr.coef_

Results:

We will discuss what happened here, but let’s first look at how we can evaluate a machine learning model.
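As a quick aside, the raw array of coefficients above can be hard to read on its own. A small sketch (assuming the lr model and X_train dataframe defined above) pairs each coefficient with its feature name:

# Pair each learned coefficient with its corresponding feature name
coefficients = pd.Series(lr.coef_, index = X_train.columns)

# Sort from the largest positive to the most negative coefficient
coefficients.sort_values(ascending = False)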


4. Model Evaluation

4.1. R²

Question 10:

What is the score of the trained model?

Answer:

For this purpose, we can use LinearRegression’s score() method, which returns the coefficient of determination of the prediction, or R², calculated as follows:
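R² = 1 − (Σ (y_i − ŷ_i)²) / (Σ (y_i − ȳ)²)

where y_i are the actual values, ŷ_i are the predicted values, and ȳ is the mean of the actual values.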

The best possible score is 1.0. A constant model that always predicts the expected value of "y", regardless of the input features, would get an R² score of 0.0.

With that knowledge, let’s look at the implementation.

score = lr.score(X_train, y_train)

print(f"Training score of the model is {score}.")

Results:

Question 11:

Predict the values of the test set and then evaluate the performance of the trained model on the test set.

Answer:

# Predict y for X_test
y_pred = lr.predict(X_test)

score_test = lr.score(X_test, y_test)
print(f"Test score of the trained model is {score_test}.")

Results:

4.2. Mean Squared Error

Mean Squared Error (MSE) is the average of the squared errors and is calculated as follows:
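MSE = (1/n) * Σ (y_i − ŷ_i)²

where n is the number of observations, y_i are the actual values, and ŷ_i are the predicted values.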

Question 12:

Calculate the Mean Squared Error and R² for the predicted results of the test sets.

Answer:

from sklearn.metrics import mean_squared_error, r2_score

print(f"R^2: {r2_score(y_pred, y_test)}")
print(f"MSE: {mean_squared_error(y_pred, y_test)}")

Results:

Question 13:

How do you interpret the results of the previous question? What are your recommendations for the next steps?

Answer:

R² is relatively high, but the MSE is also quite large, which can suggest the prediction error is too high. Note that this really depends on the business needs and on what the model will be used for: in some cases an R² of 90.6% is good enough, and in other cases it is not. Part of this performance may be driven by features that are not strong predictors of price. Let’s see if we can identify and eliminate those features, then retrain the model and look at the scores again to see whether we can make improvements.

For this step, and in order to try something new, we are going to use ordinary least squares (OLS) from the statsmodels library. The steps of training the model and then predicting the values of the test set are the same as before.

# Import libraries
import statsmodels.api as sm

# Initialize the model
sm_model = sm.OLS(y_train, X_train).fit()

# Create the predictions
sm_predictions = sm_model.predict(X_test)

# Return the summary results
sm_model.summary()

Results:

This summary provides a nice presentation of the features along with a p-value measuring each feature’s statistical significance. For example, if we use a 0.05 (5%) significance level (i.e. a 95% confidence level), we can eliminate the features whose "P > |t|" value is larger than 0.05.

len(sm_model.pvalues.where(lambda x: x > 0.05).dropna().index)

Results:

43

There are 43 such columns. Let’s drop these columns and see if the results improve.

# Create a list of columns that meet the criteria
columns = list(sm_model.pvalues.where(lambda x: x > 0.05).dropna().index)

# Drop those columns
df.drop(columns, axis = 1, inplace = True)

# Revisit the process to create a new model summary
X = df.drop(['price'], axis = 1)
y = df['price']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1234)

# Train the model
sm_model = sm.OLS(y_train, X_train).fit()

# Create predictions using the trained model
sm_predictions = sm_model.predict(X_test)

# Return model summary
sm_model.summary()

Results:

The overall performance, as judged by the R-squared value, improved from 0.967 to 0.972 and we reduced the number of columns, which makes our model and analysis more efficient.


5. Plot of the Predictions vs. Actuals

Question 14:

Create a scatter plot of predictions vs. actuals. If all the predictions matched the actuals, we would expect all the points to lie along the straight line f(x) = x. Add such a line in red for comparison.

Answer:

# Import libraries
import matplotlib.pyplot as plt
%matplotlib inline

# Define figure size
plt.figure(figsize = (7, 7))

# Create the scatterplot
plt.scatter(y_pred, y_test)
plt.plot([y_pred.min(), y_pred.max()], [y_pred.min(), y_pred.max()], color = 'r')

# Add x and y labels
plt.xlabel("Predictions")
plt.ylabel("Actuals")

# Add title
plt.title("Predictions vs. Actuals")
plt.show()

Results:

Scatter Plot of the Trained Model’s Predictions vs. Actuals

As we expected, the values are scattered around the straight line, demonstrating a good level of prediction generated by the model. Where the dots fall to the right of the red line, the model predicted a larger price than the actual, while dots on the left side of the line indicate the reverse.


Notebook with Practice Questions

Below is the notebook with both questions and answers that you can download and practice.


Conclusion

In this post, we talked about how in some cases the simplest solution can be the most appropriate solution and introduced and implemented Linear Regression as such a solution in predictive machine learning tasks. We started by learning about the math behind linear regression and then implemented a model to predict car prices based on existing car attributes. We then measured the model’s performance and took certain measures to improve our model’s performance and finally visualized the comparison of the trained model’s predictions to the actuals using a scatterplot.


Thanks for Reading!

If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!


Related Articles