Simple and multiple linear regression with Python

Amanda Iglesias Moreno
Towards Data Science
11 min read · Jul 27, 2019


Linear regression is an approach to model the relationship between a single dependent variable (target variable) and one (simple regression) or more (multiple regression) independent variables. The linear regression model assumes a linear relationship between the input and output variables. If this relationship is present, we can estimate the coefficients required by the model to make predictions on new data.

In this article, you will learn how to visualize and implement linear regression in Python using several libraries, such as Pandas, Numpy, Scikit-learn, and Scipy. Additionally, we will measure the direction and strength of the linear relationship between two variables using the Pearson correlation coefficient, as well as the predictive accuracy of the linear regression model using evaluation metrics such as the mean squared error.

Now, let’s get started 💜

Analysis of the dataset

The dataset used in this article was obtained from Kaggle. Kaggle is an online community of data scientists and machine learning practitioners where a wide variety of datasets can be found. The selected dataset contains the height and weight of 5000 males and 5000 females, and it can be downloaded at the following link:

The first step is to import the dataset using Pandas. Pandas is an open-source Python library for data science that allows us to work easily with structured data, such as csv files, SQL tables, or Excel spreadsheets. After importing the csv file, we can print the first five rows of our dataset, the data types of each column, as well as the number of null values.
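
A minimal sketch of this step could look as follows (the file name weight-height.csv is an assumption about how the Kaggle file was saved locally):

```python
import pandas as pd

# read the csv file into a dataframe (file name is an assumption)
df = pd.read_csv('weight-height.csv')

# first five rows of the dataset
print(df.head())

# data types of each column
print(df.dtypes)

# number of null values per column
print(df.isnull().sum())
```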

As we can easily observe, the dataframe contains three columns: Gender, Height, and Weight. The Gender column contains two unique values of type object: male or female. The Height and Weight columns contain float values. Since the dataframe does not contain null values and the data types are the expected ones, it is not necessary to clean the data.

To better understand the distribution of the variables Height and Weight, we can simply plot both variables using histograms. Histograms are plots that show the distribution of a numeric variable, grouping data into bins. The height of the bar represents the number of observations per bin.
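
A simple way to create these histograms, using the plot method that Pandas builds on top of Matplotlib, might be:

```python
import matplotlib.pyplot as plt

# histogram of the height variable, grouping the data into 50 bins
df['Height'].plot(kind='hist', bins=50, title='Height distribution')
plt.xlabel('Height')
plt.show()

# histogram of the weight variable
df['Weight'].plot(kind='hist', bins=50, title='Weight distribution')
plt.xlabel('Weight')
plt.show()
```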

The previous plots depict that both variables Height and Weight present a normal distribution. As part of our exploratory analysis, it can also be interesting to plot the distribution of males and females in separate histograms.
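
One possible sketch, assuming the Gender column stores the strings 'Male' and 'Female':

```python
import matplotlib.pyplot as plt

# overlaid height histograms for each gender; the same idea applies to weight
# (assumes the Gender column stores the strings 'Male' and 'Female')
df[df['Gender'] == 'Male']['Height'].plot(kind='hist', bins=50, alpha=0.5, label='Male')
df[df['Gender'] == 'Female']['Height'].plot(kind='hist', bins=50, alpha=0.5, label='Female')
plt.legend()
plt.xlabel('Height')
plt.show()
```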

The previous plots show that both height and weight present a normal distribution for males and females. Although the averages of both distributions are larger for males, the spread of the distributions is similar for both genders. Pandas provides a method called describe that generates descriptive statistics of a dataset (central tendency, dispersion, and shape).
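
For example:

```python
# descriptive statistics for the whole dataframe
print(df.describe())

# descriptive statistics per gender
print(df.groupby('Gender').describe())
```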

Exploratory data analysis consists of analyzing the main characteristics of a data set, usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before we perform further evaluations. After performing the exploratory analysis, we can conclude that height and weight are normally distributed. The male distributions present larger average values, but their spread is very similar to that of the female distributions.

But maybe at this point you are asking yourself: Is there a relation between height and weight? Can I use the height of a person to predict their weight?

The answer to both questions is YES! 😃 💪 Let’s continue ▶️ ▶️

Scatter plots with Matplotlib and linear regression with Numpy

A scatter plot is a two-dimensional data visualization that shows the relationship between two numerical variables — one plotted along the x-axis and the other plotted along the y-axis. Matplotlib is a Python 2D plotting library that contains a built-in function to create scatter plots: the matplotlib.pyplot.scatter() function.

The following plot shows the relation between height and weight for males and females. The visualization contains 10,000 observations, which is why we observe overplotting. Overplotting occurs when data overlap in a visualization, making it difficult to distinguish individual data points. In this case, the cause is the large number of data points (5000 males and 5000 females). Another cause can be a small number of unique values; for instance, when one of the variables of the scatter plot is a discrete variable.
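
A sketch of such a scatter plot might look like this:

```python
import matplotlib.pyplot as plt

males = df[df['Gender'] == 'Male']
females = df[df['Gender'] == 'Female']

# scatter plot of height vs weight, one color per gender
plt.scatter(males['Height'], males['Weight'], s=10, alpha=0.3, label='Male')
plt.scatter(females['Height'], females['Weight'], s=10, alpha=0.3, label='Female')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.show()
```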

In the following plot, we have randomly selected the height and weight of 500 women. This plot does not present overplotting, and we can better distinguish individual data points. As we can observe in the previous plots, the weight of males and females tends to go up as height goes up, showing in both cases a linear relation.
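
One way to draw this sample, using a fixed random_state (an arbitrary choice) for reproducibility:

```python
import matplotlib.pyplot as plt

# random sample of 500 women; random_state fixes the seed for reproducibility
females = df[df['Gender'] == 'Female']
sample_females = females.sample(500, random_state=1)

plt.scatter(sample_females['Height'], sample_females['Weight'], s=10)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('500 randomly selected females')
plt.show()
```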

Simple linear regression is a linear approach to modeling the relationship between a dependent variable and an independent variable, obtaining a line that best fits the data.

y = a + bx

where x is the independent variable (height), y is the dependent variable (weight), b is the slope, and a is the intercept. The intercept represents the value of y when x is 0, and the slope indicates the steepness of the line. The objective is to obtain the line that best fits our data (the line that minimizes the sum of squared errors). The error is the difference between the real value y and the predicted value y_hat, which is the value obtained using the calculated linear equation.

error = y(real) - y(predicted) = y(real) - (a + bx)

We can easily obtain this line using Numpy. Numpy is a Python package for scientific computing that provides high-performance multidimensional array objects. The Numpy function numpy.polyfit(x, y, deg) fits a polynomial of degree deg to the points (x, y), returning the polynomial coefficients that minimize the squared error. In the following lines of code, we obtain the polynomials to predict the weight for females and males.
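
A sketch of those lines (the values in the comments are the coefficients reported later in this article):

```python
import numpy as np

males = df[df['Gender'] == 'Male']
females = df[df['Gender'] == 'Female']

# fit a degree-1 polynomial (a straight line) to each group;
# polyfit returns the coefficients in decreasing order: [slope, intercept]
male_fit = np.polyfit(males['Height'], males['Weight'], 1)
female_fit = np.polyfit(females['Height'], females['Weight'], 1)

print(male_fit)    # slope ≈ 5.96, intercept ≈ -224.50
print(female_fit)  # slope ≈ 5.99, intercept ≈ -246.01
```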

The following plot depicts the scatter plots as well as the previous regression lines.
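
One way to recreate this figure, reusing the fits from the previous snippet:

```python
import matplotlib.pyplot as plt
import numpy as np

# males, females, male_fit, and female_fit come from the previous snippets
plt.scatter(males['Height'], males['Weight'], s=10, alpha=0.3, label='Male')
plt.scatter(females['Height'], females['Weight'], s=10, alpha=0.3, label='Female')

# evaluate each fitted polynomial over the observed range of heights
x = np.linspace(df['Height'].min(), df['Height'].max(), 100)
plt.plot(x, np.polyval(male_fit, x), linewidth=2)
plt.plot(x, np.polyval(female_fit, x), linewidth=2)

plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.show()
```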

Scatter plots and linear regression line with seaborn

Seaborn is a Python data visualization library based on Matplotlib. We can easily create regression plots with Seaborn using the seaborn.regplot function. The number of lines of code needed is much lower in comparison to the previous approach.

The previous plot presents overplotting, as 10,000 samples are plotted. The plot shows a positive linear relation between height and weight for males and females. For a better visualization, the following figure shows a regression plot of 300 randomly selected samples.
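
A minimal sketch of such a regression plot, using the sample size of 300 mentioned above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# regplot draws the scatter plot and fits the regression line in a single call
sample = df.sample(300, random_state=1)
sns.regplot(x='Height', y='Weight', data=sample)
plt.show()
```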

Fitting a simple linear model using sklearn

Scikit-learn is a free machine learning library for Python. We can easily implement linear regression with Scikit-learn using the LinearRegression class. After creating a linear regression object, we can obtain the line that best fits our data by calling the fit method.
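
For example, fitting the model for the female subset:

```python
from sklearn.linear_model import LinearRegression

# fit a simple linear model for females: weight as a function of height
females = df[df['Gender'] == 'Female']
lr_females = LinearRegression()
lr_females.fit(females[['Height']], females['Weight'])

print(lr_females.intercept_)  # a (intercept)
print(lr_females.coef_)       # b (slope)
```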

The values obtained using Scikit-learn's linear regression match those previously obtained using the Numpy polyfit function, as both methods calculate the line that minimizes the squared error. As previously mentioned, the error is the difference between the actual value of the dependent variable and the value predicted by the model. The least squares method finds the optimal parameter values by minimizing the sum S of squared errors: S = Σ(yᵢ - ŷᵢ)².

Once we have fitted the model, we can make predictions using the predict method. We can also make predictions with the polynomial calculated with Numpy by employing the polyval function. The predictions obtained using Scikit-learn and Numpy are the same, as both methods use the same approach to calculate the fitting line.
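
A small sketch comparing both predictions (the height value 60 is a hypothetical example):

```python
import numpy as np

# lr_females and female_fit come from the previous snippets;
# 60 is a hypothetical height value used only for illustration
print(lr_females.predict([[60]]))   # scikit-learn prediction
print(np.polyval(female_fit, 60))   # numpy polynomial prediction
```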

Pearson correlation coefficient

Correlation measures the extent to which two variables are related. The Pearson correlation coefficient is used to measure the strength and direction of the linear relationship between two variables. This coefficient is calculated by dividing the covariance of the variables by the product of their standard deviations and has a value between +1 and -1, where 1 is a perfect positive linear correlation, 0 is no linear correlation, and −1 is a perfect negative linear correlation.

We can obtain the correlation coefficients of the variables of a dataframe by using the .corr() method. By default, the Pearson correlation coefficient is calculated; however, other correlation coefficients, such as Kendall or Spearman, can also be computed.
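
For instance, computed separately per gender:

```python
# Pearson correlation matrix of height and weight, per gender
females = df[df['Gender'] == 'Female']
males = df[df['Gender'] == 'Male']

print(females[['Height', 'Weight']].corr())
print(males[['Height', 'Weight']].corr())
```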

A rule of thumb for interpreting the size of the correlation coefficient is the following:

  • 1–0.8 → Very strong
  • 0.799–0.6 → Strong
  • 0.599–0.4 → Moderate
  • 0.399–0.2 → Weak
  • 0.199–0 → Very Weak

In previous calculations, we have obtained a Pearson correlation coefficient larger than 0.8, meaning that height and weight are strongly correlated for both males and females.

We can also calculate the Pearson correlation coefficient using the stats package of Scipy. The function scipy.stats.pearsonr(x, y) returns two values: the Pearson correlation coefficient and the p-value.
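
A sketch of this calculation, reusing the male and female selections from above:

```python
from scipy import stats

# pearsonr returns two values: the correlation coefficient and the p-value
corr_females, p_value_females = stats.pearsonr(females['Height'], females['Weight'])
corr_males, p_value_males = stats.pearsonr(males['Height'], males['Weight'])

print(corr_females, corr_males)
```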

As can be observed, the correlation coefficients using Pandas and Scipy are the same:

  • Females correlation coefficient: 0.849608
  • Males correlation coefficient: 0.8629788

Residual plots

We can use numerical values such as the Pearson correlation coefficient or visualization tools such as the scatter plot to evaluate whether or not linear regression is appropriate to predict the data. Another way to perform this evaluation is by using residual plots. Residual plots show the difference between actual and predicted values. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

We can use Seaborn to create residual plots as follows:
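
A minimal sketch (the sample of 300 observations is an arbitrary choice to avoid overplotting):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# residual plot of weight regressed on height
sample = df.sample(300, random_state=1)
sns.residplot(x='Height', y='Weight', data=sample)
plt.show()
```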

As we can see, the points are randomly distributed around 0, meaning linear regression is an appropriate model to predict our data. If the residual plot presents a curvature, the linear assumption is incorrect. In this case, a non-linear function would be more suitable to predict the data.

Multiple linear regression

Simple linear regression uses a linear function to predict the value of a target variable y, where the function contains only one independent variable x₁.

y = b₀ + b₁x₁

After fitting the linear equation to the observed data, we can obtain the values of the parameters b₀ and b₁ that best fit the data, minimizing the squared error.

Previously, we have calculated two linear models, one for men and another for women, to predict the weight based on the height of a person, obtaining the following results:

  • Males → Weight = -224.50+5.96*Height
  • Females → Weight = -246.01+5.99*Height

So far, we have employed one independent variable to predict the weight of a person, Weight = f(Height), creating two different models. Maybe you are thinking 💭 ❓ Can we create a model that predicts the weight using both height and gender as independent variables? The answer is YES! 😄 ⭐️ And here is where multiple linear regression comes into play!

Multiple linear regression uses a linear function to predict the value of a target variable y, where the function contains n independent variables x = [x₁, x₂, x₃, …, xₙ].

y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + … + bₙxₙ

We obtain the values of the parameters bᵢ using the same technique as in simple linear regression (least squares). After fitting the model, we can use the equation to predict the value of the target variable y. In our case, we use height and gender to predict the weight of a person: Weight = f(Height, Gender).

Categorical variables in multiple linear regression

There are two types of variables used in statistics: numerical and categorical variables.

  • Numerical variables represent values that can be measured and sorted in ascending or descending order, such as the height of a person.
  • Categorical variables represent values that can be sorted into groups or categories, such as the gender of a person.

Multiple linear regression accepts not only numerical variables, but also categorical ones. To include a categorical variable in a regression model, the variable has to be encoded as a binary variable (dummy variable). In Pandas, we can easily convert a categorical variable into a dummy variable using the pandas.get_dummies function. This function returns dummy-coded data where 1 represents the presence of a category and 0 its absence.

To avoid multicollinearity, we have to drop one of the dummy columns, as shown in the sketch below.
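
With drop_first=True, Pandas drops the first category alphabetically (Female), leaving a single Gender_Male column (Male=1, Female=0); the column name follows from the assumed 'Male'/'Female' values:

```python
import pandas as pd

# encode Gender as a dummy variable; drop_first=True drops the first
# category (Female), leaving a single Gender_Male column: Male=1, Female=0
df_dummy = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df_dummy.head())
```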

Then, we can use this dataframe to obtain a multiple linear regression model using Scikit-learn.
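
For example:

```python
from sklearn.linear_model import LinearRegression

# multiple linear regression: weight as a function of height and gender
# (df_dummy comes from the previous snippet)
mlr = LinearRegression()
mlr.fit(df_dummy[['Height', 'Gender_Male']], df_dummy['Weight'])

print(mlr.intercept_)  # b0
print(mlr.coef_)       # [b1 (height), b2 (gender)]
```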

After fitting the linear equation, we obtain the following multiple linear regression model:

  • Weight = -244.9235+5.9769*Height+19.3777*Gender

If we want to predict the weight of a male, the gender value is 1, obtaining the following equation:

  • Male → Weight = -244.9235+5.9769*Height+19.3777*1= -225.5458+5.9769*Height

For females, the gender has a value of 0.

  • Female → Weight = -244.9235+5.9769*Height+19.3777*0 =-244.9235+5.9769*Height

If we compare the simple linear models with the multiple linear model, we can observe similar prediction results. The gender variable of the multiple linear regression model changes only the intercept of the line. 🙌

Key takeaways

  1. Simple linear regression is a linear approach to model the relationship between a dependent variable and one independent variable.
  2. Multiple linear regression uses a linear function to predict the value of a dependent variable, where the function contains n independent variables.
  3. Exploratory data analysis consists of analyzing the main characteristics of a data set usually by means of visualization methods and summary statistics.
  4. Histograms are plots that show the distribution of a numeric variable, grouping data into bins.
  5. Pandas provides methods and attributes for exploratory data analysis, such as DataFrame.describe(), DataFrame.info(), DataFrame.dtypes, and DataFrame.shape.
  6. Scatter plots are two-dimensional data visualizations that show the relationship between two numerical variables — one plotted along the x-axis and the other plotted along the y-axis. Matplotlib and Seaborn provide built-in functions to create scatter plots.
  7. We can fit a simple linear regression model using libraries such as Numpy or Scikit-learn.
  8. Correlation measures the extent to which two variables are related. The Pearson correlation coefficient is used to measure the strength and direction of the linear relationship between two variables.
  9. Residual plots can be used to analyse whether or not a linear regression model is appropriate for the data.
  10. Categorical variables have to be converted into dummy variables to use them in multiple linear regression models.

Thanks for reading 🙌 😍 😉 🍀
