First and foremost, it is almost impossible to cover absolutely everything on this topic. This blog post aims to simplify the core concepts and show their practical applications as far as they concern a data scientist.
So let’s go!
Introduction
Linear regression modelling is a type of supervised machine learning algorithm that models the linear (straight-line) relationship between independent variables (X) and a continuous dependent variable (y). A linear regression problem is termed simple or multiple depending on its number of features (a single feature or multiple features respectively). Multiple linear regression should not be confused with multivariate linear regression, in which multiple dependent variables are predicted rather than a single scalar variable.
The term ‘continuous dependent variable’ here means that the output values are real-valued numbers (such as 112 or 15110.15), in contrast to the discrete output values of classification problems.
Consider the scatter plot of some sample data:

A linear regression model finds the best-fitting straight line (the regression line) through the sample points, which can then be used to estimate a target output (y) from the input features (X).
Implementing a linear model with the Scikit-Learn package, as shown below, gives an insight into the aim of linear regression modelling:
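Here is a minimal sketch of what that looks like; the small X and y arrays below are placeholder values, not the data from the scatter plot above:
import numpy as np
from sklearn.linear_model import LinearRegression
# Placeholder sample data: a single feature (X) and a continuous target (y).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
# Fit a straight line through the sample points.
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # the slope (m) and intercept (c) of the fitted line
print(model.predict([[6]]))           # estimate y for a new value of X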

As seen above, linear regression modelling aims to fit a predictive model that reflects the linear relationship between a set of independent variables (X) and the output variable (y) of an observed dataset of y and X values, such that if an additional value of X is given without its accompanying value of y, the fitted model can be used to predict the value of y from the input X.
# Linear regression function.
y = F(X)
where: y = dependent variable
       X = independent variable
The decision as to which variable is set as the dependent or independent variable in a dataset is often based on the presumption that the value of the dependent variable is influenced (linearly in this case) by the independent variable, i.e. a change in X causes some change in y. However, this is not always the case, as there may be some operational reasons to model a variable in terms of others.
Since a linear regression function is the linear combination of the input variables, the straight-line relationship between y and X can be expressed as:
y = mX + c
Where: m is the coefficient.
c is the intercept.
Getting optimal values for the coefficients and intercept of the function is key to defining the optimal linear predictor function that accurately derives the output (y) from the inputs (X). The optimal weights are those that produce the smallest prediction errors (also known as residuals).
Assumptions of Linear Models.
Linear models make certain assumptions about the features and target variables and their relationships when determining the prediction function. These assumptions include that:
- A linear relationship exists between the predictor and target variables (linearity).
- The residuals (prediction errors) of the model are normally distributed (multivariate normality).
- None of the features is highly correlated with any other feature (no multicollinearity).
- The predictor variables are error-free (weak exogeneity).
- The target variables have the same variance in their errors, regardless of the values of the predictor variables (homoscedasticity).
While some of these assumptions are unrealistic in practice (such as homoscedasticity and weak exogeneity), there are regression methods (such as weighted, generalized and partial least squares) that are able to handle the errors that arise when these assumptions are violated.
Now that we understand the aim of linear regression modelling, we should be concerned with how to achieve it, i.e. what concepts are useful in determining the optimal weights of a linear function?
Fitting Linear Regression Models
We will consider some of the most popular approaches to fitting linear regression models: the least squares method, the gradient descent method and penalized estimation methods.
Before delving into these methods and how they help fit accurate regression lines, let’s briefly discuss a term that will come up frequently in this section: the cost function.
Cost functions are used to measure the performance of machine learning models: they quantify the error between the predicted and true values and help to evaluate the accuracy of the model. Since we are interested in deriving accurate prediction functions, we can measure how well (or badly) our model is behaving using a cost function.
Some of the most common cost functions for evaluating regression models are the Mean Absolute Error (MAE), the Mean Squared Error (MSE), the Root Mean Square Error (RMSE) and the Residual Sum of Squares (RSS).
We will be utilizing some of these cost functions to study how accurate prediction lines are fitted for regression problems and study the others later on.
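As a quick illustration, here is a minimal sketch of how these costs could be computed for a set of predictions (the arrays are placeholders):
import numpy as np
y_true = np.array([3.0, 5.0, 7.5, 9.0])  # observed target values (placeholders)
y_pred = np.array([2.8, 5.4, 7.1, 9.6])  # model predictions (placeholders)
residuals = y_true - y_pred              # prediction errors
rss = np.sum(residuals ** 2)             # Residual Sum of Squares
mse = np.mean(residuals ** 2)            # Mean Squared Error
mae = np.mean(np.abs(residuals))         # Mean Absolute Error
rmse = np.sqrt(mse)                      # Root Mean Square Error
print(rss, mse, mae, rmse)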
Now on to the methods of fitting optimized regressions models:
1. The Ordinary Least Squares (OLS) Method.
The ordinary least squares method is the most common estimator for fitting linear models. It is based on the idea that the curve that best fits a given set of observations is the one with the minimum sum of squared vertical distances (residuals or errors) from the given data points. In other words, it is the prediction function with the least prediction error (cost function).
The residual sum of squared vertical distances (the cost function for the OLS method) for given data points [(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)] is defined as:
RSS = Σᵢ (yᵢ - (m·xᵢ + c))²

The idea behind the OLS method is that, since the residual sum of squares evaluates the total prediction (or residual) error for a particular prediction function, we can therefore use this cost function (RSS) to derive the parameter values (m, c) for the function when its prediction error is minimal.
This is achieved mathematically by setting the partial derivative of the RSS with respect to m to zero (its minimum) and solving for m. The resulting expression for m is:
m = Σᵢ (xᵢ - x̅)(yᵢ - ȳ) / Σᵢ (xᵢ - x̅)²

Where x̅ is the mean of the X variables and ȳ is the mean of the y variables.
Once the optimal coefficient m is determined using the expression above, we can calculate the optimal intercept c by substituting m into the equation:
c = ȳ - m·x̅

Thus, we can estimate the parameters of an optimal linear regression line as the parameters with the least sum of the squared errors to the sample points.
Perhaps a practical example can better explain how the ordinary least squares method is used to determine the optimal parameters of a prediction function.
Consider the sample data given below:
# independent variables
X = [8, 3, 2, 1, 2, 0, 1, 4, 3, 6, 5, 6, 8, 5, 6, 5, 5, 4, 8, 3]
# dependent variables
y =[12, 1, 2, 1, 5, 1, 2, 6, 4, 9, 6, 10, 14, 10, 6, 4, 9, 10, 8, 8]

As discussed earlier, we need a prediction function that fits the best line through the sample points, and we will use the expressions for m and c to get its optimal parameters. While we could work this out by hand, it is much easier to do with Python code.
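One possible NumPy implementation of the two expressions (a sketch, not necessarily the exact code behind the original figures):
import numpy as np
X = np.array([8, 3, 2, 1, 2, 0, 1, 4, 3, 6, 5, 6, 8, 5, 6, 5, 5, 4, 8, 3])
y = np.array([12, 1, 2, 1, 5, 1, 2, 6, 4, 9, 6, 10, 14, 10, 6, 4, 9, 10, 8, 8])
x_mean, y_mean = X.mean(), y.mean()
# m = Σ(x - x̅)(y - ȳ) / Σ(x - x̅)²
m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
# c = ȳ - m·x̅
c = y_mean - m * x_mean
print(f"m = {m:.2f}, c = {c:.2f}")  # roughly m ≈ 1.33 and c ≈ 0.76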

We can then define the optimal prediction function for the sample data to be:
y = 1.33x + 0.76

The fitted line has the minimum sum of the squared vertical distances from the given data points.
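A rough matplotlib sketch of how such a plot could be produced, assuming the X, y, m and c values from the previous snippet:
import numpy as np
import matplotlib.pyplot as plt
plt.scatter(X, y, label="sample points")
x_line = np.array([X.min(), X.max()])
plt.plot(x_line, m * x_line + c, color="red", label="y = 1.33x + 0.76")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()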
While the ordinary least squares solution solves this regression problem exactly, its application is limited because its objective function is fixed to the sum of squared distances. A more flexible approach is the gradient descent method, which can be applied to any differentiable objective function, not just squared distances.
2. The Gradient Descent (GD) Method.
Gradient descent is an iterative optimization algorithm for finding a local minimum of a function. It is based on the idea that an optimal function can be reached at the minimum of the cost function by repeatedly taking steps in the direction opposite to the gradient at the current point.
In simpler terms, gradient descent finds the optimal weights for a function by repeatedly reducing its cost: at each step, the parameters are moved in the direction opposite to the gradient of the cost function.

How does this help us fit an optimal regression line?
Well, if we can derive the gradient (first-order derivative) of our cost function at a given point, then we can arrive at an optimal prediction function at the global minimum of the curve by repeatedly descending in the direction opposite to the gradient (as shown above).
How then do we descend in the opposite direction of the gradient of our cost function to arrive at the global minimum?
To do this, let’s use the Mean Squared Error (MSE) as our cost function. It measures the average of the squared differences between the predicted and actual values. The MSE is given as:
MSE = (1/n)·Σᵢ (yᵢ - ŷᵢ)²
where ŷᵢ = m·xᵢ + c is the predicted value for the i-th sample and n is the number of samples.

When we substitute our linear prediction function into the equation, we get:
MSE = (1/n)·Σᵢ (yᵢ - (m·xᵢ + c))²

The MSE is differentiable and can therefore be used to optimize the prediction function. Using the MSE, we can derive the gradients of the cost function at a given point as the partial derivatives with respect to m and c:
D_m = ∂MSE/∂m = (-2/n)·Σᵢ xᵢ·(yᵢ - (m·xᵢ + c))
D_c = ∂MSE/∂c = (-2/n)·Σᵢ (yᵢ - (m·xᵢ + c))

We can then descend the gradients towards the minimum by "moving" in the opposite direction of the derived gradients in steps. Mathematically, we do this by updating the current values of m and c using the derived gradients:
m = m - D_m
c = c - D_c

Since we are descending along the gradients in steps, it is important to control the magnitude of the steps we take downwards so we don’t miss the global minimum, as seen below:

To avoid missing the global minimum, we include a parameter known as the learning rate (L), whose basic function is to control the magnitude of the steps taken towards the global minimum. The updates then become:
m = m - L·D_m
c = c - L·D_c

By repeating these steps many times (iterations) and continually computing new values for the parameters and gradients, we progressively move towards the global minimum, which gives the optimal parameters for the prediction function with minimal prediction error (cost).
If the gradient descent approach is not yet clear, perhaps a practical example will help. Let’s use this method to obtain an optimal prediction function for the same data used in the previous example, again with Python code.
# independent variables
X = [8, 3, 2, 1, 2, 0, 1, 4, 3, 6, 5, 6, 8, 5, 6, 5, 5, 4, 8, 3]
# dependent variables
y =[12, 1, 2, 1, 5, 1, 2, 6, 4, 9, 6, 10, 14, 10, 6, 4, 9, 10, 8, 8]
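Below is a minimal sketch of batch gradient descent for this data, assuming the MSE cost; the learning rate and iteration count mirror the values mentioned in the text, and tuning them changes how closely the result approaches the OLS solution:
import numpy as np
X = np.array([8, 3, 2, 1, 2, 0, 1, 4, 3, 6, 5, 6, 8, 5, 6, 5, 5, 4, 8, 3])
y = np.array([12, 1, 2, 1, 5, 1, 2, 6, 4, 9, 6, 10, 14, 10, 6, 4, 9, 10, 8, 8])
m, c = 0.0, 0.0        # initial guesses for the parameters
learning_rate = 0.001  # controls the size of each step
epochs = 1200          # number of iterations
n = len(X)
cost_history = []
for epoch in range(epochs):
    y_pred = m * X + c                        # current predictions
    error = y - y_pred
    cost_history.append(np.mean(error ** 2))  # MSE at this step
    # Partial derivatives of the MSE with respect to m and c.
    d_m = (-2 / n) * np.sum(X * error)
    d_c = (-2 / n) * np.sum(error)
    # Step in the opposite direction of the gradients.
    m -= learning_rate * d_m
    c -= learning_rate * d_c
print(f"m = {m:.2f}, c = {c:.2f}, MSE = {cost_history[-1]:.2f}")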

Notice the continual reduction in the cost (MSE) and how the parameters at the 1200th epoch are equivalent to those determined with the OLS method. We could also tune the hyperparameters, increasing or decreasing the learning rate and the number of iterations, to control the learning process.
Plotting the cost of the first 50 iterations with a learning rate of 0.001 shows the descent towards the global minimum.
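Assuming the cost_history list from the gradient descent sketch above, such a plot could be produced with:
import matplotlib.pyplot as plt
plt.plot(range(50), cost_history[:50])  # cost (MSE) over the first 50 iterations
plt.xlabel("Iteration")
plt.ylabel("Cost (MSE)")
plt.show()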

In our example, we calculated the Mean Squared Error over all the samples in each iteration before adjusting the weights through partial differentiation; this variant of gradient descent is known as Batch Gradient Descent (BGD). It is efficient for relatively small datasets (like our example) but can become computationally expensive for larger ones. An alternative for larger datasets is Stochastic Gradient Descent (SGD), where the parameters are adjusted after evaluating the cost of only a single random sample in each iteration. The most commonly used approach, though, is Mini-batch Gradient Descent, where a batch of samples is used at each iteration rather than a single random sample as in SGD or all the samples as in BGD.
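Scikit-Learn exposes stochastic gradient descent for linear models through its SGDRegressor estimator. A minimal sketch, reusing the X and y arrays from the gradient descent example (the hyperparameter values are illustrative):
from sklearn.linear_model import SGDRegressor
X_2d = X.reshape(-1, 1)  # SGDRegressor expects a 2D array of features
sgd = SGDRegressor(learning_rate="constant", eta0=0.001, max_iter=1200)
sgd.fit(X_2d, y)
print(sgd.coef_, sgd.intercept_)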
The examples so far have used simple linear regression problems for simplicity’s sake. The methods discussed can also be applied to multiple linear regression by following the relationship:
y = m₁X₁ + m₂X₂ + m₃X₃ + ... + mₙXₙ + c
Where: mᵢ is the coefficient for each feature.
c is the intercept.
A drawback of the linear regression models discussed so far is that the coefficients can sometimes take very high values, leading to unstable predictions, a problem popularly known in machine learning as overfitting. A common approach to tackling this problem is regularization.
3. Regularized Methods For Linear Regression.
Regularization, or shrinkage, is a way of controlling the coefficients (weights) of a prediction function so that they don’t take very large values.
As discussed earlier, linear models make certain assumptions about the features and target variables, such as the assumption of no multicollinearity. If, for instance, the data contains negatively correlated features that have opposite effects on the prediction model, the model can become unstable: small changes in a feature can lead to large differences in the prediction.
While this problem can be tackled by feature selection (keeping only one of the correlated features in such a scenario), using regularization (or penalized) methods is often a better approach as that would avoid leaving out useful data. These methods tackle this sort of problem by heavily penalizing the coefficients of such features such that they receive extremely low weights. To do this, we will change the cost function of our linear regression to include the coefficients and heavily shrink the coefficients that take very high values.
Ridge regression, LASSO and Elastic nets are the commonest shrinkage methods for Linear Regression problems.
3i. In ridge regression, we simply add the squared sum of the weights to our least-squares cost function:
Cost = Σᵢ (yᵢ - ŷᵢ)² + α·Σⱼ mⱼ²

By adding the squared sum of the weights to the cost function, the optimization routine has to heavily reduce the value of the coefficients when it tries to minimize the preceding function during training. The alpha parameter is used to control the amount of shrinkage. The greater the alpha value, the greater the shrinkage. Notice that only the weights are regularized and not the intercept. This type of regularization technique is known as the L2 regularization.
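In Scikit-Learn, this corresponds to the Ridge estimator. A minimal sketch on synthetic data (the alpha value and the make_regression settings are illustrative):
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
# Synthetic data with several features, purely for illustration.
X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
ridge = Ridge(alpha=1.0)  # larger alpha -> stronger shrinkage of the coefficients
ridge.fit(X_reg, y_reg)
print(ridge.coef_)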
A drawback of L2 regularization is that it does not promote sparsity and can therefore produce hard-to-interpret models for high-dimensional datasets. A model is termed sparse if most of its coefficients are reduced to zero. An approach to obtaining sparser models is to use Lasso regression with L1 regularization.
3ii. The Least Absolute Shrinkage and Selection Operator (LASSO) is an alternative regularization method that leads to sparser models than ridge regression.
In LASSO, many of the coefficients are reduced to zero. It is also useful for feature selection, as it keeps only one of a group of correlated features, unlike ridge regression, which assigns similar weights to correlated features (though this often gives ridge regression models more predictive ability).
LASSO achieves this by penalizing the coefficients with the sum of their absolute values rather than the sum of their squares as in ridge regression. This is known as L1 regularization:
Cost = Σᵢ (yᵢ - ŷᵢ)² + α·Σⱼ |mⱼ|

While the LASSO method produces sparse models, it is also limited in that relevant information can be lost during feature selection. A compromise between the two methods is to linearly combine them into a regularizer that has the benefits of both the L1 and L2 penalties.
3iii. Elastic Net regularization is a linear combination of the L1 and L2 regularization that shares the merits of both methods. The elastic net cost function is defined as:
Cost = Σᵢ (yᵢ - ŷᵢ)² + λ·[α·Σⱼ |mⱼ| + (1 - α)·Σⱼ mⱼ²]

Tuning the alpha parameter allows us to balance between the two regularizers.
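A matching sketch with Scikit-Learn’s Lasso and ElasticNet estimators, reusing the synthetic X_reg and y_reg from the ridge sketch. Note that in Scikit-Learn the mixing role described above is played by l1_ratio, while alpha sets the overall penalty strength:
from sklearn.linear_model import Lasso, ElasticNet
lasso = Lasso(alpha=1.0)                    # L1 penalty: drives many coefficients to exactly zero
lasso.fit(X_reg, y_reg)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)  # blends the L1 and L2 penalties
enet.fit(X_reg, y_reg)
print("Lasso coefficients set to zero:", (lasso.coef_ == 0).sum())
print("ElasticNet coefficients:", enet.coef_)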
While the elastic net regularizer might seem a safe choice when you are unsure which regularization method to use, a better choice can often be made based on the dataset’s structure and computational requirements.
Evaluation Metrics For Linear Regression Models
Since we have already covered how the Residual Sum of Squares and the Mean Squared Error are used to evaluate (and fit) regression models, we will now cover three other common evaluation metrics: the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE) and the R-squared score.
i. The Mean Absolute Error (MAE) measures the average absolute difference between the predicted and actual values. The MAE is given as:
MAE = (1/n)·Σᵢ |yᵢ - ŷᵢ|

The MAE depends on the scale of the target data and can only be used to compare models trained on data with similar scales.
ii. The Root Mean Square Error (RMSE), or Root Mean Square Deviation (RMSD), is the standard deviation of the prediction errors. It measures how spread out the prediction errors are and shows how concentrated the data is around the regression line. It is obtained by taking the square root of the MSE (as its name suggests) and is given as:
RMSE = √( (1/n)·Σᵢ (yᵢ - ŷᵢ)² )

iii. R-squared (R²) is a performance metric that compares the sum of squared prediction errors with the sum of squared deviations of the actual values from their mean. It essentially measures the model’s performance against a model that always predicts the average of the targets, which is often considered a benchmark for regression problems. The metric typically ranges from 0 to 1, with 0 being poor and 1 being perfect. An R-squared of 0 means the model performs no better than using the average of the targets as the predicted values.
The R-squared is computed as:
R² = 1 - Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)²
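All of these metrics are available in Scikit-Learn’s metrics module; here is a minimal sketch with placeholder arrays:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_true = np.array([3.0, 5.0, 7.5, 9.0])  # actual values (placeholders)
y_pred = np.array([2.8, 5.4, 7.1, 9.6])  # model predictions (placeholders)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the MSE
r2 = r2_score(y_true, y_pred)
print(mae, rmse, r2)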

In real-world applications, we are usually interested in more efficient implementations of the methods and concepts covered in this blog post. These are readily available in Scikit-Learn.
Finally, let’s approach a linear regression problem with the Scikit-Learn framework to see how these concepts come together to solve real-world problems.
Implementing Linear Regression Modelling With Scikit-Learn
The Data: We will use the Boston Housing dataset, which contains 506 samples of houses in Boston described by 13 features.
The Task: Build a linear regression model using the dataset to estimate the price of houses in the area given particular features.
- Loading and splitting the dataset: The dataset and some other popular datasets are available in the Scikit-Learn dataset package and can be loaded using the appropriate method call. The dataset is already preprocessed for use so we can go ahead and split it into train and test sets for model building.
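A sketch of this step is shown below. Note that load_boston ships with older Scikit-Learn releases (it has been removed from recent versions), so treat the snippet as illustrative:
import pandas as pd
from sklearn.datasets import load_boston          # available in older scikit-learn versions
from sklearn.model_selection import train_test_split
boston = load_boston()
features = pd.DataFrame(boston.data, columns=boston.feature_names)
target = boston.target
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)
features.head()  # a dataframe of the 13 features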
Output: A dataframe of the features

- Building the linear regression model: We build the model by creating an instance of Scikit-Learn’s LinearRegression estimator, setting parameters to normalize and centre the data for better processing and modelling. We then train the model by passing the training features and target to its fit method. Finally, we display the intercept and coefficients of the fitted model.
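A sketch of this step is shown below; the normalize parameter exists in older Scikit-Learn versions (recent releases expect the features to be scaled beforehand, e.g. with StandardScaler):
from sklearn.linear_model import LinearRegression
# fit_intercept=True centres the data around a fitted intercept; normalize=True
# (older scikit-learn versions) rescales each feature before fitting.
lin_reg = LinearRegression(fit_intercept=True, normalize=True)
lin_reg.fit(X_train, y_train)
print("Intercept:", lin_reg.intercept_)
print("Coefficients:", lin_reg.coef_)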
Output: Coefficients for the features
- Model Evaluation: We evaluate the model by plotting a residual plot and checking its Mean Absolute Error.
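A sketch of the evaluation step, assuming the fitted lin_reg model and the test split from the snippets above:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
y_pred = lin_reg.predict(X_test)
residuals = y_test - y_pred
print("MAE:", mean_absolute_error(y_test, y_pred))
# Residual plot: for a good fit, the residuals should scatter randomly around zero.
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color="red")
plt.xlabel("Predicted price")
plt.ylabel("Residual")
plt.show()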
The model has an MAE of 3.75. The residual plot shows that most of the residuals are scattered around zero.
Summary: This post highlights the main concepts, assumptions and evaluation metrics of linear regression models, both in theory and in practice. It covers the OLS and gradient descent approaches to fitting linear regression models, the use of regularized methods and common evaluation metrics such as MSE, MAE and RMSE. Finally, I showed a simple approach to linear regression modelling using the Scikit-Learn framework.