Towards Machine Learning – Linear Regression and Polynomial Regression

Which approach is better for predicting the CO2 output of your car: Linear or Polynomial Regression?


Photo by Ryan Stone on Unsplash

Recently I started the Applied Machine Learning course by the University of Michigan on Coursera. The course covers some of the most popular Machine Learning algorithms, so I decided to write a few brief articles on the topic, intended to help people new to the field dive into the interesting world of Machine Learning. My last article covered K Nearest Neighbours (KNN) classification; you can take a look at how to use KNN to classify cars into vehicle classes according to their engine size, cylinder count, fuel consumption and CO2 output. Today we go further and tackle Linear Regression, another extremely popular and widely used technique.


Structure of the article:

  • Introduction
  • Dataset loading and description
  • Data analysis
  • Model training and evaluation
  • BONUS: Polynomial Regression
  • Conclusion

Enjoy the read! 🙂


Introduction

When observing the fuel economy data in the last article, some of the features (like engine displacement and CO2 output) showed an interesting pattern: an increase in engine displacement (a "bigger" engine) resulted in higher CO2 output. Let's take a look at this simple scatter plot of the data. One would say the data shows somewhat of a linear relationship.

Some linearity in data distribution can be observed (source: author)

Wouldn’t it be interesting to be able to predict how much CO2 your car will produce based on its displacement? Therefore, today’s article takes a look at Linear Regression and how to apply it to a set of data using Python and the Scikit-Learn library.

What is Linear Regression, and why is it so popular?

Well, first and foremost, it’s a very SIMPLE algorithm. It attempts to describe the relationship between two variables with a straight line (a linear equation). If we again look at the above scatter plot, we can see that CO2 is the dependent variable and Displacement is the explanatory variable. Such plots are often used as a first step before investigating the relationship between two variables. The approach assumes that every value of Y (CO2) can be described as a linear function of X (Displacement) following the simple equation:

Y = w*X + b

where w is the slope of the line and b is the y-axis intercept. In Machine Learning terms, w is more often referred to as the weight and b as the bias.


However, some assumptions have to be taken into account:

  1. Linear relationship: a linear relationship between X and Y has to exist.
  2. Independence: there is no correlation between consecutive residuals in time series data.
  3. Homoscedasticity: the residuals have constant variance for every X.
  4. Normality: the residuals are normally distributed.

What are these so-called residuals? Well, a residual is the deviation of the observed value from the fitted line. Basically, it is the distance (red line) of the blue point from the fitted line (green line).

A Residual is the distance from the fitted line to the observed value (source: author)

Least-squares error

By now, you probably ask yourself: how do we calculate the w and b parameters? Well, we need to calculate the "loss" of the function, which is built from the residuals. Here the least-squares error comes into play. It’s a common approach where the squared distance from each observed point (red circle on the plot) to the fitted line (green line) is calculated, and all these squared distances are summed up. The goal of the approach is to find the w and b that yield the lowest possible sum of squared distances. The mean of these squared distances is called the "Mean Squared Error" or MSE.
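To make this concrete, here is a minimal sketch of a least-squares fit on a few made-up displacement/CO2 pairs (the numbers are purely illustrative):

```python
import numpy as np

# Toy data: engine displacement (litres) vs. CO2 output (illustrative values)
X = np.array([1.4, 2.0, 3.0, 3.6, 5.0])
Y = np.array([290, 350, 420, 460, 560])

# Closed-form least-squares fit of Y = w*X + b
w, b = np.polyfit(X, Y, deg=1)

# Residuals: deviations of the observed values from the fitted line
residuals = Y - (w * X + b)
mse = np.mean(residuals ** 2)
print(f'w={w:.2f}, b={b:.2f}, MSE={mse:.2f}')
```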

Let’s jump to a more practical example, on our car fuel consumption data. 🙂


Dataset loading and description

As always, first we import the dependencies, load the data and take a look at the dataframe.
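A minimal sketch of this step (the file name fuel_economy.csv is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the fuel economy data (the file name is assumed here)
df = pd.read_csv('fuel_economy.csv')
df.head()
```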

Slice of the dataframe (source: author)

Again, we will use the fuel economy dataset from Udacity. It contains technical specifications of 3920 cars, with data on cylinder count, engine size (displacement), fuel consumption, CO2 output, etc. We are interested only in engine size and CO2 output.


Data analysis

Because Scikit-learn requires a 2D feature array and we only have one input feature, we select the explanatory variable and reshape it with .reshape(-1, 1). Then the data is split into training and test samples using Scikit-learn’s train_test_split function.
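A sketch of this step, assuming the columns are named 'Displacement' and 'CO2':

```python
# Scikit-learn expects a 2D feature array, hence the reshape
# (the column names below are assumptions)
X = df['Displacement'].values.reshape(-1, 1)
y = df['CO2'].values

# Hold out part of the cars for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```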

Train/test split (source: author)

Model training and evaluation

In the next step we prepare our model. The LinearRegression estimator is used and fitted with the training data to train the model.
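Something along these lines:

```python
# Create the estimator and fit it on the training data
linreg = LinearRegression()
linreg.fit(X_train, y_train)
```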

We can check the w and b parameters of the regression function by calling the .coef_ and .intercept_ attributes. The R-squared (coefficient of determination) is calculated by calling .score, similar to KNN.
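For example:

```python
# Learned parameters: w (weight/slope) and b (bias/intercept)
print('w:', linreg.coef_[0])
print('b:', linreg.intercept_)

# R-squared on the training and test splits
print('R^2 (train):', linreg.score(X_train, y_train))
print('R^2 (test):', linreg.score(X_test, y_test))
```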

Results of the model (source: author)
Results of the model (source: author)

Now, let us plot the resulting linear regression on the training data.
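A possible way to produce this plot with Matplotlib:

```python
# Scatter the training points and draw the fitted line on top
plt.scatter(X_train, y_train, alpha=0.5, label='Training data')
plt.plot(X_train[:, 0], linreg.predict(X_train), 'g-', label='Fitted line')
plt.xlabel('Displacement (L)')
plt.ylabel('CO2 output')
plt.legend()
plt.show()
```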

Linear Regression – Training data (source: author)

Before we can plot the regression on the test data, we first calculate the predicted Y values according to our model. We do this as follows:
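Roughly:

```python
# Predict CO2 for the held-out cars, then plot predictions vs. observations
y_pred = linreg.predict(X_test)

plt.scatter(X_test, y_test, alpha=0.5, label='Test data')
plt.plot(X_test[:, 0], y_pred, 'g-', label='Model prediction')
plt.xlabel('Displacement (L)')
plt.ylabel('CO2 output')
plt.legend()
plt.show()
```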

Linear Regression – Test data (source: author)

As we can observe from the training and test data, our model is very good at predicting CO2 for cars that are average in terms of displacement, i.e. cars with 3–4 litre engines. The model shows weaker performance for so-called outliers, points with greater deviation from the mean of the observed class: for example, very frugal cars like hybrids with very low emissions, or, conversely, cars with very high power output. Also, for cars with very small engines, the model tends to significantly overestimate the CO2 output.


BONUS: Polynomial Regression

In simple terms, Polynomial Regression is a form of regression analysis where an n-th order polynomial, rather than a linear model, is used to describe the relationship in the data. This approach adds complexity to the model and can yield better results when a more complex relationship between the explanatory and dependent variables exists. In our case, at first glance the relationship between displacement and CO2 output seems to follow a linear trend, but we ask ourselves: could a quadratic or even higher-order function capture the relationship better?

Let’s find out! 🙂

To apply Polynomial Regression with Scikit-Learn, we will use the PolynomialFeatures class from the preprocessing module. Basically, it generates polynomial features which are then used in a least-squares linear regression. As we assume a quadratic relationship, we set the degree of the polynomial to 2.

We train the model.
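A sketch of the training step, reusing the train/test split from above:

```python
from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature into [1, x, x^2] ...
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# ... and fit an ordinary least-squares model on the expanded features
polyreg = LinearRegression()
polyreg.fit(X_train_poly, y_train)
```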

Next, we plot the results compared to the previously described Linear Regression model.
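One way to draw this comparison (a sketch, building on the models fitted above):

```python
# Sort the test points by displacement so the curves plot smoothly
order = X_test[:, 0].argsort()
X_sorted = X_test[order]

plt.scatter(X_test, y_test, alpha=0.5, label='Test data')
plt.plot(X_sorted[:, 0], linreg.predict(X_sorted), label='Linear (k=1)')
plt.plot(X_sorted[:, 0], polyreg.predict(poly.transform(X_sorted)),
         label='Polynomial (k=2)')
plt.xlabel('Displacement (L)')
plt.ylabel('CO2 output')
plt.legend()
plt.show()
```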

Polynomial vs. Linear Regression (source: author)

Applying second-order (k=2) polynomial regression, a quadratic function, yields marginally better results on our training and test datasets.

Results of Polynomial Regression (k=2) (source: author)

In the final step, we test the model accuracy for different orders of the polynomial.
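A sketch of such a sweep over the polynomial order, timing each fit:

```python
import time

# Compare test accuracy and fit time for increasing polynomial order k
for k in range(1, 16):
    poly_k = PolynomialFeatures(degree=k)
    X_tr = poly_k.fit_transform(X_train)
    X_te = poly_k.transform(X_test)

    start = time.time()
    model = LinearRegression().fit(X_tr, y_train)
    elapsed = time.time() - start

    print(f'k={k}: R^2 (test)={model.score(X_te, y_test):.4f}, '
          f'fit time={elapsed:.4f} s')
```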

Accuracy vs. computing time (source: author)

As seen from the above plot, increasing k further yields only a very marginal gain in accuracy, while the increase in computing time is drastic! Also, increasing k beyond 13 starts to decrease the accuracy, due to very high model complexity and overfitting.


Conclusion

In this article, a simple Linear Regression example was presented. The main advantages of Linear Regression are its simple trend description and stable predictions, but it is often inaccurate for outliers or extreme values (minima or maxima). Polynomial Regression enhances the performance, but adding too much model complexity tends to cause overfitting and a significant increase in computing time.


I hope everything was clearly explained and presented. Until next time, feel free to check out my other articles. 🙂

Towards machine learning – K Nearest Neighbour (KNN)

Measuring and Calculating Streamflow at the Plitvice Lakes National Park

Every comment or suggestion is welcome! LinkedIn

Feel free to follow my stories on medium. Cheers!

