
Baby Steps Towards Data Science: Multiple Linear Regression in Python

How to implement multiple linear regression and interpret the results. Source code and interesting basketball player dataset has been…

Image Source: Cross Validated - Stack Exchange

What is Multiple Linear Regression?

Let me get right into the subject. The picture you see above is the mathematical representation of Multiple Linear Regression. All the necessary explanation is given in the image.

As the name suggests, MLR (Multiple Linear Regression) models the average behavior of the dependent variable as a linear combination of multiple features/variables.

Let x1, x2, ..., xp be the independent variables and Y the dependent variable. Each beta value is the coefficient of its respective independent variable. Beta0, on the other hand, is the intercept, just as in Simple Linear Regression.
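For reference, the model described here can be written out in the usual textbook form. The symbol epsilon for the error term is the standard convention and is an assumption on my part, since it is not named in the text:

```latex
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon
```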

What’s the error term?

It is the error that is inherent in nature. Remember that in my previous article I said one can never predict the exact future value? That is because this error is always present. The error consists of everything that is not recorded or used in our model: things such as emotions and feelings that cannot be easily quantified, a simple lack of data, or human errors in recording the data.

But don’t worry about it. When we use this regression method, we only get the average behavior of the Y variable. The actual data might be greater than, less than or equal to this average prediction of Y. Since we deal only with the average of Y, the error terms are assumed to cancel each other out on average, and we end up with an estimated regression function that has no error term.
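To see the "cancelling out" idea in numbers, here is a minimal sketch on synthetic data (the data, coefficients and noise level are all made up for illustration, not taken from the basketball dataset). When a least-squares fit includes an intercept, the individual residuals are not zero, but they average out to essentially zero:

```python
import numpy as np

# Synthetic data: an assumption for illustration, not the article's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # three made-up features
true_beta = np.array([5.0, -2.0, 1.0])     # made-up coefficients
y = 10.0 + X @ true_beta + rng.normal(scale=2.0, size=200)  # add random error

# Add a column of ones so the fit includes an intercept (Beta0).
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares fit of the coefficients.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residuals = actual values minus fitted (average) values.
residuals = y - X_design @ beta_hat

print("first few residuals (not zero):", residuals[:5])
print("mean of all residuals (essentially zero):", residuals.mean())
```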

How do we decrease the error? Simple: invest more money and find more data.

Example:

Let us consider a firm’s profit as the dependent variable (Y) and its spending on RnD (x1), Advertising (x2) and Marketing (x3) as our independent variables. Let’s say that after doing all the math we come up with the regression equation, or function (the other name for the mathematical representation of any regression), shown below. Please bear in mind that this function is hypothetical.

Profit = 10000 + 5(RnD) - 2(Advertising) + 1(Marketing)

Interpretations:

  1. If the firm doesn’t invest in RnD, Advertising and Marketing, its average profit would be the intercept, 10,000 (in the same units as Profit).
  2. If the firm increases RnD spending by $100K (1 unit) while holding the other two constant, the average profit increases by $500K (5 units). Here the units of the variables are very important: only with the units in mind can you interpret the coefficients correctly (all variables are in units of $100K). This makes sense because more investment in RnD brings better products into the market, and thus better sales.
  3. If the firm increases Advertising spending by $100K while holding the others constant, the average profit decreases by $200K (2 units). More expenditure on advertisements might decrease the overall profits.
  4. If the firm increases Marketing spending by $100K while holding the others constant, the average profit increases by $100K (1 unit). More expenditure on marketing might increase the firm’s popularity and thus its profits.
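The interpretations above can be checked with a few lines of Python. This is only a sketch of the hypothetical equation; the function name predicted_profit is made up for illustration:

```python
def predicted_profit(rnd, advertising, marketing):
    """Average profit from the hypothetical regression equation.

    Inputs and result are in the same units used in the example.
    """
    return 10000 + 5 * rnd - 2 * advertising + 1 * marketing


baseline = predicted_profit(0, 0, 0)
print(baseline)                               # 10000 (the intercept)

# Change one variable by 1 unit while holding the others at zero.
print(predicted_profit(1, 0, 0) - baseline)   # 5  (effect of RnD)
print(predicted_profit(0, 1, 0) - baseline)   # -2 (effect of Advertising)
print(predicted_profit(0, 0, 1) - baseline)   # 1  (effect of Marketing)
```

Each coefficient is simply the change in the predicted average when its variable moves by one unit and nothing else changes.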

How are these estimates calculated?

There is a mathematical method called OLS (Ordinary Least Squares). Using certain matrix operations, you can find the estimated coefficient values. Explaining OLS is beyond the scope of this article. You can easily find tutorials on it online; please go through them if you really want to know how it works. In practice, modern programming languages compute these estimates for you.
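For the curious, here is a minimal sketch of what those matrix operations can look like, using the normal equations on a tiny made-up dataset. This is one common way to compute OLS estimates, not necessarily what any particular library does internally:

```python
import numpy as np

# Tiny made-up dataset generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equations: solve (X'X) beta = X'y for beta.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print(beta_hat)  # approximately [1, 2, 3]: intercept, then the two slopes
```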

Implementation in Python

Let us dive deep into Python, build an MLR model and try to predict the points scored by basketball players.

Photo by Andrey Krasilnikov on Unsplash

Before you move forward, please download the csv data file from my GitHub Gist.

https://gist.github.com/tharunpeddisetty/bc2508ab97c7c8460852f87ba4ad4efb
Once you open the link, you will find the "Download Zip" button in the top right corner of the window. Go ahead and download the files.
The download contains 1) the Python file and 2) the data file (.csv).
Rename the folder accordingly and store it in your desired location, and you are all set. If you are a beginner, I highly recommend that you open your Python IDE and follow the steps below, because I write detailed comments (the statements after #; these are ignored when you run the code) on how the code works. You can use the actual Python file as a backup or for future reference.

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd

Import Data and Define the X and Y variables

dataset = pd.read_csv('/Users/tharunpeddisetty/Desktop/PlayerStatsBasketball.csv')  # paste your own path inside the quotation marks
# iloc selects the values at the specified index locations; .values returns them as a NumPy array
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values

Let us look at our data and understand the variables:

Image by Author

All the variable names are self-explanatory. In this program our task is to predict the average points scored by a player based on his height, weight, field goal percentage and free throw percentage. Therefore, you can see that I have isolated the last column as our Y variable and put all the other variables into X as an array of features, using the pandas iloc indexer and some index slicing.
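If the slicing is new to you, here is a small sketch on a toy table. The column names and values are hypothetical and only mimic the shape of the real file:

```python
import pandas as pd

# A tiny made-up table; column names and values are hypothetical.
toy = pd.DataFrame({
    "height": [6.8, 6.3, 6.4],
    "weight": [225, 180, 190],
    "field_goal_pct": [0.442, 0.435, 0.456],
    "free_throw_pct": [0.672, 0.797, 0.761],
    "avg_points": [9.2, 11.7, 15.8],
})

X_toy = toy.iloc[:, :-1].values  # every row, all columns except the last
y_toy = toy.iloc[:, -1].values   # every row, only the last column

print(X_toy.shape)  # (3, 4): 3 players, 4 features
print(y_toy.shape)  # (3,): one target value per player
```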

Splitting data into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=0)
# test_size=0.2 tells the function to put 80% of the data into the training set and 20% into the test set. random_state=0 serves as a seed: it makes sure the same random split is produced in different environments. To be specific, you and I running this code on two different computers get the same rows in both the test and train data sets.
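A quick way to convince yourself of the seed behaviour is to split the same made-up data twice with the same random_state and compare. The arrays below are placeholders, not the basketball data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_train_a.shape, X_test_a.shape)       # (8, 2) (2, 2): an 80/20 split
print(np.array_equal(y_test_a, y_test_b))    # True: same rows both times
```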

scikit-learn is a well-known Python library used extensively for Machine Learning. You can check out all its awesome APIs for your reference at the link below. We are going to use them very often.

https://scikit-learn.org/stable/modules/classes.html

Training the model

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)
# regressor is an object (instance) of the class LinearRegression

Predicting on test set

# Predicting on new data, i.e., the test data, using the model that was trained on our train data
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1),Y_test.reshape(len(y_pred),1)),1))
# .reshape turns each vector into a column so it is displayed vertically instead of the default horizontal layout. The second argument of np.concatenate is the axis: axis=0 is vertical concatenation, axis=1 is horizontal concatenation.
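Here is the reshape-and-concatenate step in isolation, on made-up numbers, in case the one-liner above looks dense. As an aside, np.column_stack gives the same result with less typing:

```python
import numpy as np

# Made-up predicted and actual values, for illustration only.
predicted = np.array([7.5, 10.2, 12.1])
actual = np.array([7.9, 11.0, 9.4])

# Turn each 1-D vector into a column, then join the columns side by side.
side_by_side = np.concatenate(
    (predicted.reshape(len(predicted), 1), actual.reshape(len(actual), 1)),
    axis=1,
)
print(side_by_side.shape)  # (3, 2): one row per player, two columns

# np.column_stack builds the same two-column array directly.
same = np.column_stack((predicted, actual))
print(np.array_equal(side_by_side, same))  # True
```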

Results

Image by Author

This is the output of our last print statement. The first column shows the average points predicted by our model and the second column shows the actual average points scored by the players.
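Eyeballing two columns only goes so far. One possible way to put a number on the gap is with scikit-learn's metrics, sketched here on made-up values (these are not the results from the basketball model):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted values, for illustration only.
actual = np.array([7.9, 11.0, 9.4, 15.2])
predicted = np.array([7.5, 10.2, 12.1, 13.0])

# Mean absolute error: the average size of the miss, in points.
mae = mean_absolute_error(actual, predicted)

# R squared: share of the variation in the actual values that the predictions explain.
r2 = r2_score(actual, predicted)

print("MAE:", mae)
print("R^2:", r2)
```

With your own run, you would pass Y_test and y_pred instead of the made-up arrays.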

Printing coefficient and intercept values

print(regressor.coef_)
print(regressor.intercept_)

I believe you can print these out and interpret them yourself, based on the example I gave earlier.
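To make the interpretation easier, it can help to print each coefficient next to its feature name. The sketch below fits a model on synthetic, noise-free data so that it runs on its own; the feature names are hypothetical stand-ins for the columns in the CSV:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature names standing in for the dataset's columns.
feature_names = ["height", "weight", "field_goal_pct", "free_throw_pct"]

# Synthetic, noise-free data built from known made-up coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = 3.0 + X_demo @ np.array([1.0, -0.5, 2.0, 0.0])

model = LinearRegression().fit(X_demo, y_demo)

# Pair every coefficient with the feature it belongs to.
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {model.intercept_:.2f}")
```

With the article's model, you would loop over regressor.coef_ in the same way, in the order of the columns in X.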

Conclusion

Well, as you can see, our model didn’t perform as well as expected. There’s just one value, 7.55, that is quite close to the actual value of 7.9. There is a reason I chose this dataset to explain how multiple linear regression works: to show that no model is perfect and no one can build a perfect model. You must know the reasons why it isn’t performing up to the mark. Our data might have certain issues, such as:

  1. There might be potential problems with the dataset.
  2. Probably, we don’t have enough training data for the model to process and predict.
  3. There might be inconsistencies in the data such as entering wrong values (due to human or computer errors).
  4. The independent variables considered might not be the significant variables that explain the average points scored in a game by a player.

All the above points converge on one thing: ERROR. We might say that our data has a lot of error. We perhaps need more data regarding the players’ fitness, motivation, mood, etc. during the matches.

So does that mean we failed to implement Multiple Linear Regression? Certainly not, because a Data Scientist must understand that being able to write code and implement models isn’t the primary aim. They must gain in-depth knowledge of the data. How do you do that? Just talk to the people who provided the data, and maybe also speak with industry experts in the field where the data originated (basketball in our case), and try to find out whether there are any potential errors in the data.

Building models isn’t that simple unless you understand the data. One must know what their objective is. There’s no perfect model that can predict with 95% accuracy all the time, with different datasets. If anybody claims that their model is the best, that person is a fool.

Well, my purpose is served. I’ve shown how to build a Multiple Linear Regression model and what to consider carefully before jumping into regression modeling. One has to change the necessary parameters so that the model works optimally for the specific purpose of the business or individual. Try the same approach and code with different data sets, identify the X and Y variables and check the results; you can find tons of datasets online. The same code worked wonderfully for various other datasets, so it isn’t really useless after all. Happy Machine Learning!

