Problems with Multiple Linear Regression, in R

Mathematics is not on your side if you misuse it in any form

Flaviu Vadan
Towards Data Science

--

Careful with the straight lines… Image by Atharva Tulsi on Unsplash

Linear regression is a popular, old, and thoroughly developed method for estimating the relationship between a measured outcome and one or more explanatory (independent) variables. For instance, linear regression can help us build a model that represents the relationship between heart rate (measured outcome), body weight (first predictor), and smoking status (second predictor). In this post, we will see how parameter estimation is performed, explore how to perform multiple linear regression on a dataset built from US Census Bureau data, and discuss some problems that arise as a consequence of removing “bad predictors” as we attempt to simplify our model.

The typical way a linear model is represented is the potentially familiar:
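y = mx + b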

Here, y represents the outcome of a measurement estimated by a line with slope m and intercept b. We can rearrange the equation to have:
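y = b + mx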

and we can further change the variables to be represented as betas:
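\hat{y} = \beta_0 + \beta_1 x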

which is the standard way a linear regression model is written. Notice that we are using y_hat to denote an estimate, not an observation. We can extend this model to include more than one predictor variable:
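\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p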

where x_1, x_2, …, x_p are the predictors (there are p of them). There is a notion from linear algebra that can be invoked in this instance — linear combinations! Notice that the betas, and the predictors x_i (i is the index of the predictor) can be represented as individual vectors, giving us a general matrix form for the model:
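Y = X\beta + \varepsilon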

Imagine we have N outcomes and we want to find the relationship between the outcome and a single predictor variable. In addition to the N outcomes, we will have N observations of that predictor. Further, imagine that both the outcomes and the observations are stored in matrices. Since each outcome is a single number and there are N of them, the outcomes live in an N x 1 matrix, Y (a vector in this case). The observations are stored in an N x (p + 1) matrix, X, where p is the number of predictors (one in our case); we add a column of 1s to X because it lets us estimate the parameter that corresponds to the intercept of the model. Notice that we have also added an error term, epsilon, that represents the difference between the prediction (Y_hat) and the actual observation (Y).
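Written out element by element, for a general number of predictors p, the matrix form is:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} =
\begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}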

The term β is a (p + 1) x 1 vector containing the parameters/coefficients of the linear model. The additional term, ε, is an N x 1 vector that represents the errors of the measurements.

Assumptions

Similar to most, if not all, Statistics tools, linear regression has several assumptions that have to be satisfied in order to model a problem using its principles:

  1. Linearity (duh) — the relationship between the features and outcome can be modelled linearly (transformations can be performed if data is not linear in order to make it linear, but that is not the subject of this post);
  2. Homoscedasticity — the variance of the error term is constant;
  3. Independence — observations are independent of one another, i.e., outcome i does not influence outcome i + 1;
  4. Normality — for any fixed value of X, Y is normally distributed.

How to obtain β

When fitting a model, the aim is to minimize the difference between a measured observation and the predicted value of that observation. In linear regression, we typically minimize the mean squared error (MSE), the mean of the squared differences between the observations and their predictions:
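MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N} (Y - X\beta)^T (Y - X\beta)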

We can minimize the MSE by taking the gradient with respect to beta (parameters) and setting it equal to 0 to get a formulation for beta:
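\nabla_\beta \, MSE = -\frac{2}{N} X^T (Y - X\beta) = 0 \;\Rightarrow\; X^T X \beta = X^T Y \;\Rightarrow\; \hat{\beta} = (X^T X)^{-1} X^T Y

Note that this requires X^T X to be invertible, which is exactly where the multicollinearity issues discussed later come into play.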

With awareness of how β is derived, we can start the exercise.

We will be using a dataset created by aggregating different types of metrics collected by the US Census Bureau. We are going to try to predict life expectancy in years based on 7 predictors, all recorded per state: population estimate, income, illiteracy (population percentage), murder and non-negligent manslaughter rate per 100k members of the population, percent of high-school graduates, mean number of days with temperature below 32 degrees Fahrenheit, and land area in square miles. Note that the dataset is from ~1975, is not representative of current trends, and is used here exclusively for the purpose of exercising how to create a linear model:
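# A sketch of the data loading. Assumption: the description matches R's
# built-in state.x77 dataset (1975 US Census Bureau figures, one row per
# US state), so that is what is loaded here.
data <- data.frame(state.x77)  # column names become syntactic, e.g. Life.Exp, HS.Grad
head(data)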

R is a great tool, among many (Python is also great), for statistics, so we are going to take advantage of it here. To get started, we can create a simple regression model and inspect the significance of each predictor variable:
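# A minimal sketch of the fit, assuming the data frame built above;
# the variable name 'model' is our choice, not prescribed by anything.
model <- lm(Life.Exp ~ ., data = data)
summary(model)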

The syntax is interesting, so let’s go through it:

  1. lm — the command to create a linear model;
  2. Life.Exp~. — the variable we are predicting, Life.Exp, explained by (~) all the predictors (.);
  3. data — the data matrix we use to create the model.

We get the following summary (only displaying coefficients’ significance):

When a model is created, R performs significance testing for us and reports the p-values associated with the respective tests of each predictor. If we assume a p-value cutoff of 0.01, we notice that most predictors are… useless, given the other predictors included in the model. In this case, we can perform something akin to manual dimensionality reduction by creating a model that uses only a subset of the predictors (a manual form of stepwise regression known as backward elimination). We can drop predictors in descending p-value order, from most useless to least useless:
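# One pass at backward elimination. The exact removal order below is an
# assumption: after each update(), call summary(model) and drop whichever
# remaining predictor has the largest p-value.
model <- update(model, . ~ . - Area)
model <- update(model, . ~ . - Illiteracy)
model <- update(model, . ~ . - Income)
model <- update(model, . ~ . - Population)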

I highly recommend calling summary after each model update: the significance tests of the coefficient estimates are performed again after a feature is dropped, which changes the resulting p-values and can determine whether we continue removing features or not. I have skipped those calls here in the interest of saving space. A final summary of the model gives us:

We managed to reduce the number of features to only 3! However, there are problems with this approach.

Significance

Performing backwards elimination of variables, as we did in this exercise, only helps us simplify the model for computational purposes and, potentially, improve performance as measured by metrics such as the residual sum of squares. The fact that a variable is eliminated from the model does not mean that the variable is insignificant in real life. For example, we eliminated income, which is plausibly a “significant” factor in a person’s life expectancy. The world is very complex, and a simple model, such as the one we created, has several drawbacks:

  1. The significance tests that R performs are inherently biased because they are computed on the same data that the model was fitted to. This has to do with the tests themselves, not with R;
  2. There are multiple metrics that can be used to measure how “good” a model is. For example, R² (the coefficient of determination) is a metric that is often used to express the proportion (range 0 to 1) of the variation in the predicted variable that is explained by the predictors. The adjusted R² value also takes into consideration the number of variables used by the model, since that is indicative of model complexity. However, it is possible for a model to show high significance (low p-values) for the variables that are part of it, yet have an R² value that suggests weaker performance. According to the following table, we could argue that we should choose the third model as “the best one” and accept the compromise of keeping an “insignificant” variable in exchange for a higher R² value;

However, note that adding a variable, even an insignificant one, will never decrease the R² value and will never increase the (training) MSE. Albeit insignificant, the added variable can still explain a small percentage of the variation in the response variable, which causes R² to be higher and MSE to be lower; this is precisely why the adjusted R² penalizes the number of predictors;

  3. As already alluded to, models such as this one can be over-simplifications of the real world. There are many factors that can influence a person’s life overall and, therefore, their life expectancy. We have to be mindful of those factors and always interpret these models with skepticism. This pertains especially to this model, since it attempts to model a quantity that reflects people’s livelihoods, but it extends to any general model we build, be it a model of the climate, of the yield of chemicals in a manufacturing process, etc. In addition, the principle of skepticism applies to any model architecture, not only regression.

Multicollinearity

Recall how we mentioned linear combinations at the beginning: they play a role in multicollinearity as well. When a dataset exhibits multicollinearity, one or more of the measured features can be expressed, exactly or almost exactly, as a linear combination of the other features in the same dataset. This means that the information carried by such a feature (a column vector) is already encoded by the other features. Technically, the design matrix does not have full rank (or comes very close to losing it), which means not all of its columns are linearly independent. The consequence of this is numerical instability and potentially inflated coefficients, that is, inflated entries of β! Let us focus on the inverse of the matrix obtained by multiplying the transpose of X with X itself, the (X^T X)^{-1} term in the formula for β above.

Finding the inverse of a matrix A involves computing the determinant of the matrix; the reciprocal of the determinant, 1/determinant(A), is then multiplied by another term (the adjugate matrix) to obtain the inverse. When a matrix is rank deficient, its determinant is zero, and when it is nearly rank deficient, the determinant is very close to zero, which makes 1/determinant(A) a huge value. This is the part of the inversion that ends up inflating the coefficients! In R, we can check whether the determinant of X^T X is comfortably larger than 1, rather than vanishingly small, by writing out the matrix multiplication ourselves. Given the dataset we used in the exercise, we can write:
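# A sketch of the check; binding all seven predictor columns into one
# numeric matrix is an assumption about the exact column set used.
mtx <- cbind(data$Population, data$Income, data$Illiteracy,
             data$Murder, data$HS.Grad, data$Frost, data$Area)

# Compute X'X and check that its determinant is comfortably away from zero
det(t(mtx) %*% mtx) > 1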

Let’s break down the commands:

  1. cbind creates a matrix from the specified feature columns of data and stores it in mtx;
  2. t(mtx) takes the transpose of mtx;
  3. %*% is the matrix multiplication operator of R.

It turns out that for the dataset we have used in the example, the determinant is approximately 3e+41, so we get TRUE as the output: the determinant is nowhere near zero in this case. This is still an important element to check when performing multiple linear regression, as it not only helps us better understand the dataset, but a near-zero determinant also suggests that a step back should be taken in order to: (1) better understand the data; (2) potentially collect more data; or (3) perform dimensionality reduction using principal component analysis, or apply a regularized method such as ridge regression.

Linear regression is still a very popular method for modelling. As with any Statistics tool, care should be taken to: (1) understand the data in order to avoid spurious parameter estimates; (2) develop awareness of how the parameter estimation is performed in order to be able to diagnose potential problems before they occur; (3) explain why one coefficient is significant while another may not be, and what this reflects about the real-world phenomenon we are attempting to model. Here’s the final code sample:
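# A recap sketch of the full analysis, assuming the built-in state.x77
# dataset; variable names and the elimination order are our assumptions.

# 1. Load the data
data <- data.frame(state.x77)

# 2. Fit the full model with all seven predictors
model <- lm(Life.Exp ~ ., data = data)
summary(model)

# 3. Backward elimination: drop the least significant predictor, re-fit, repeat
model <- update(model, . ~ . - Area)
model <- update(model, . ~ . - Illiteracy)
model <- update(model, . ~ . - Income)
model <- update(model, . ~ . - Population)
summary(model)

# 4. Multicollinearity sanity check on X'X
mtx <- cbind(data$Population, data$Income, data$Illiteracy,
             data$Murder, data$HS.Grad, data$Frost, data$Area)
det(t(mtx) %*% mtx) > 1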
