
Regression is the technique of predicting a continuous variable, that is, a numeric variable that can take an infinite number of values between any two values. Linear regression is one of the most important and widely used regression models, largely due to its simplicity and ease of interpretation. Other popular regression models include polynomial, ridge, lasso, elastic net, principal component regression, support vector regression, and Poisson regression, among others.
Linear regression is a statistical model that assumes a linear relationship between the input/independent features (x) and the target/predicted feature (y) and fits a straight line through the data based on that relationship. When there are many input features, x = (x₁, x₂, …, xₙ), where n is the number of predictor features, the model is referred to as multiple linear regression. Simple linear regression is when there is only one input feature (x). Linear regression usually requires only one target feature y to be predicted/estimated.
The simple linear regression model formula is:
ŷ = β₀ + β₁x
ŷ is the predicted value of y for a given x. This is the feature we are trying to estimate or predict. All ŷ values fall on the linear regression line. β₀ and β₁ are the regression coefficients.
- β₀ is called the intercept. This is where the line intercepts the y-axis, and it’s equivalent to the predicted value of y when x=0
- β₁ is the coefficient of the input feature x, and it’s the slope of the line. It represents the effect x has on y. Therefore the linear regression model assumes that if x increases by 1, y increases by β₁ (This is only true when x and y have a perfect linear relationship, which is rarely the case)
- β₀ and β₁ are both learned from the dataset by the model.
Thus, when you fit a linear regression model, the job of the model is to estimate the best values for β₀ and β₁ based on your dataset.
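To make this concrete, here is a minimal sketch of how the least-squares estimates of β₀ and β₁ can be computed by hand with numpy. The data (hours studied vs. exam score) and variable names are made up purely for illustration:
import numpy as np

# Hypothetical toy data: hours studied (x) and exam score (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 61, 67, 73], dtype=float)

# Closed-form least-squares estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

print(b0, b1)       # estimated β₀ and β₁ learned from the data
print(b0 + b1 * 6)  # predicted ŷ for a new x = 6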
ŷ=β₀+β₁x₁+β₂x₂+...+βₙxₙ
is the formula when there are multiple input features. ŷ is the predicted value of y for a given set of inputs. β₀, β₁, β₂, …, βₙ are the regression coefficients, where n is the number of input features.
A simple linear regression model (one x and one y feature) results in a two-dimensional line plot that is simple to display. A multiple regression model (several x features and one y feature) results in a higher-dimensional surface called a hyperplane.
Linear regression models also include errors, called residuals. These are the differences between the true values of y and the predicted values ŷ, or y − ŷ.

The linear regression model minimizes these errors by finding the ‘line of best fit’, that is, the line for which these errors are smallest. Think of the one line closest to all the points. This involves minimizing the sum of squared errors (SSE), also called the residual sum of squares (RSS), using a technique called the ordinary least squares (OLS) method. To compute the RSS for a given regression line through the data, we calculate the distance from each data point to the line, square it, and sum all the squared errors together. The RSS is smallest for the best-fit line.
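As a rough illustration, reusing the made-up x and y from the sketch above, the RSS can be computed for any candidate line, and it is smallest for the least-squares line:
import numpy as np

# Same hypothetical data as the earlier sketch
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 61, 67, 73], dtype=float)

def rss(y_true, y_pred):
    # Residual sum of squares: sum of squared residuals (y - ŷ)
    return np.sum((y_true - y_pred) ** 2)

print(rss(y, 46.9 + 5.1 * x))  # the least-squares line estimated earlier gives the smallest RSS
print(rss(y, 50.0 + 4.0 * x))  # any other line gives a larger RSS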
Another technique for computing the ‘best-fit line’ that is common in machine learning is gradient descent. This involves optimizing the coefficients by starting with random values, then gradually updating them in the direction that reduces the sum of squared errors.
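Below is a minimal gradient descent sketch for the same made-up data; the learning rate and number of iterations are arbitrary choices for illustration:
import numpy as np

# Same hypothetical data as before
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 61, 67, 73], dtype=float)

b0, b1 = 0.0, 0.0      # start from arbitrary coefficient values
learning_rate = 0.01

for _ in range(20000):
    error = (b0 + b1 * x) - y
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(b0, b1)  # approaches the least-squares estimates (46.9 and 5.1)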
Implementing Linear Regression in Python using scikit-learn
Let us now see a Multiple Linear Regression model in action using a scikit-learn dataset. You can download the code from this Github link and follow along.
Scikit-learn comes with a few small standard datasets that do not require you to download any file from an external website. Using the Boston house prices dataset, we will implement a multiple linear regression model and measure its performance by computing the R² score and Mean Squared Error, and also display the intercept (β₀) and the β coefficients for the input features. You can view the description of the Boston housing dataset by running print(data['DESCR']) once the dataset is loaded.
If you do not have Python installed on your machine, follow [these](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html) steps to download and install Anaconda, which is a Python environment platform. Once installed, use these steps to open a new Jupyter notebook from which we will run our code.
The first thing is to import the libraries you will need. These include the commonly used pandas, numpy, matplotlib, and seaborn. Also import the LinearRegression class and the datasets module, from which we will load our dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.linear_model import LinearRegression
from sklearn import datasets
The next step is to load the dataset.
data = datasets.load_boston()
The pre-set feature names will be our predictors, which we will store in a DataFrame called features, and the predicted (target) feature will be the pre-set target, ‘MEDV’, which we’ll store in a DataFrame called target.
features = pd.DataFrame(data.data, columns=data.feature_names)
target = pd.DataFrame(data.target, columns=["MEDV"])
features.shape
### Results
(506, 13)
target.shape
### Results
(506, 1)
The next step is to initialize an instance of the LinearRegression model, then fit the model instance to the data using model.fit(features, target), also known as training the model. This is the step where the best estimates of β₀, β₁, …, βₙ are derived. Finally, we will make predictions using model.predict(features).
Note: I used all 13 features to fit the model and make predictions, but you can use fewer features if you performed Exploratory Data Analysis (EDA) and settled on a few useful features. Read my article on basic EDA code blocks for guidance.
model = LinearRegression()
model.fit(features, target)
preds = model.predict(features)
Observe the first 5 values of the target column ‘MEDV’ and the first 5 predictions.
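One way to view them side by side (a quick sketch using the features, target, and preds objects defined above) is to put the actual and predicted values in a small DataFrame:
# Compare the first five actual and predicted values
comparison = pd.DataFrame({'MEDV': target['MEDV'].values[:5],
                           'Predicted': preds[:5].flatten()})
print(comparison)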

Sklearn has built-in methods and attributes that return the score (R² score), the coefficients (β₁, …, βₙ), and the intercept (β₀).
model.score(features, target) returns the R² score of our model, which is the proportion of the variance in the target that is explained by the model. It compares the current model with a constant baseline, chosen by taking the mean of the data and drawing a horizontal line at that mean. R² scores are always less than or equal to 1, and a higher value is preferred. A high R² may also be due to overfitting, in which case an adjusted R² can be calculated (a sketch follows below). More on the derived R² score here.
model.score(features, target)
###Results
0.7406426641094095
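As a rough sketch of the adjusted R² mentioned above, using the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors:
# Adjusted R² penalizes adding predictors that do not improve the model
n, p = features.shape
r2 = model.score(features, target)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adjusted_r2)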
We will also compute the Mean Squared Error (MSE), which is the average of the squared differences between the actual target values (y) and the predicted values (ŷ), or (1/n) Σ(y − ŷ)². This is a good performance indicator because squaring the differences puts a bigger emphasis on larger errors, signaling how good or bad the model is. A smaller MSE is preferred. To calculate the MSE, we use the mean_squared_error function imported from sklearn.metrics.
from sklearn.metrics import mean_squared_error
print('MSE', mean_squared_error(target, preds))
### Results
MSE 21.894831181729202
We can display the y-intercept (β₀) using model.intercept_.
Our model’s y-intercept is 36.45948839, which is the predicted value of y when all input features are 0.
print(model.intercept_)
### Results
[36.45948839]
model.coef_ displays the model’s estimated coefficients for the various input features. They are derived by the model during training.
print(model.coef_)
### Results
[[-1.08011358e-01 4.64204584e-02 2.05586264e-02 2.68673382e+00
-1.77666112e+01 3.80986521e+00 6.92224640e-04 -1.47556685e+00
3.06049479e-01 -1.23345939e-02 -9.52747232e-01 9.31168327e-03
-5.24758378e-01]]
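The raw array above is hard to read on its own; an optional small step is to pair each coefficient with its feature name:
# Pair each estimated coefficient with the feature it belongs to
coefficients = pd.Series(model.coef_[0], index=features.columns)
print(coefficients)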
Linear Regression Assumptions
Linear regression makes assumptions about the data which, if not met, may result in an inaccurate model with poor predictions. These are some of the assumptions made.
- Independent/predictor variables are error-free and are not random variables. We discussed above that errors are expected in the predicted values ŷ, which we called residuals (e = y − ŷ). However, linear regression assumes that we do not expect any errors in the individual input/predictor features.
- Linearity. The x and y features are assumed to have a linear relationship, meaning that a change in x is associated with a proportional change in y.
- Features are normally distributed. The features are assumed to follow a normal distribution; highly-skewed features or features with significant outliers can distort the relationships and lead to an inaccurate model.
- Linear regression assumes that the errors are uncorrelated with one another.
- The model assumes there is no multicollinearity, meaning no feature is redundant with other features (see the correlation check sketched after this list).
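As a quick, informal check of the multicollinearity assumption (not a formal test), you can inspect the pairwise correlations between the predictor features; values close to 1 or −1 suggest redundant features:
# Correlation matrix of the predictors, displayed as a heatmap
corr_matrix = features.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()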
Conclusion
In this article, we covered linear regression, one of the common regression models for predicting a response feature y using input features x₁, …, xₙ. We looked at the basics of linear regression and implemented a multiple linear regression model using one of sklearn’s toy datasets. Here is the complete code on Github. In practice, it is important to split the dataset into training and testing sets, so that the model is evaluated on data it has not seen (which helps detect overfitting), and to standardize/normalize the features. This is in addition to other practices such as EDA, cleaning the dataset, and feature engineering, among other data-preparation best practices.
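As a minimal sketch of that workflow (the test size and random_state are arbitrary illustrative choices):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)
print(model.score(X_test_scaled, y_test))  # R² on unseen data
Evaluating on the held-out test set gives a more honest picture of how the model will perform on new data than scoring it on the same data it was trained on.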