
Linear Regression Explained (in R)

An explanation of residuals, sum of squared residuals, simple linear regression, and multiple linear regression with code in R

Linear Regression | Image: Wikipedia

Linear Regression is one of the first concepts we learn in data science and machine learning. Yet, many are confused by linear regression and the common terminology associated with it. In this article, we explore linear regression step-by-step. We discuss residuals, sum of squared residuals (or errors), simple and multiple linear regression, and linear regression terminology. We then bring everything together in a simple example of linear regression in R.

We are going to examine the different components of Linear Regression using a dataset based on the Seven Countries Study, which examined factors that affect cardiovascular disease around the world (please note that this dataset, while based on the study, contains simulated data). The dataset is available to download as a .csv here.

We are first going to examine the following two columns (variables): the ID of the country and the coronary heart disease (CHD) mortality (per 10,000 in the population):

Let’s consider the relationship between coronary heart disease mortality and the arbitrary ‘ID’ assigned to each country. Since the ID column is chosen arbitrarily, we would not expect it to have a strong relationship with coronary heart disease mortality. A scatter plot of the two variables confirms our hypothesis: there is not a strong linear relationship between these columns.

Let’s plot a line over these points. Since we have hypothesized that there is no relationship between the variables, let’s draw a line with a slope of 0, indicating no relationship.
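
As a rough sketch of these two steps in R (the file name "CHD_data.csv" and the exact column names ID and CHD are assumptions; adjust them to match your download):

```r
# Read the simulated Seven Countries data (file name assumed)
chd_data <- read.csv("CHD_data.csv")

# Scatter plot of CHD mortality against the arbitrary country ID
plot(chd_data$ID, chd_data$CHD,
     xlab = "Country ID", ylab = "CHD mortality (per 10,000)")

# A line with slope 0 drawn at the mean mortality, representing "no relationship"
abline(h = mean(chd_data$CHD), col = "blue")
```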

Now we are going to determine the vertical distance between each point and the line we just drew. Start by drawing arrows from the line to each point.

If we examine the distances between the line and the points, we will see that some points are very close to the line (e.g. +3, -4) and some are very far away (e.g. +11, -9). One point lies exactly on the line, so its distance is 0.

These vertical distances from the line to each point are called residuals. For data points above the line, the residual is positive, and for data points below the line, the residual is negative.

If we consider the line a "prediction" of where points should be (based on our hypothesis that there should be no relationship between our variables), the residuals are just the observed y-values (points) minus the predicted y-values (line). One way to think about residuals is that they are how far data "falls" from the line.

Now let’s add up all of the residuals above the line and all of the residuals below the line. We will see that the positive residuals (above the line) add up to +24 and the negative residuals (below the line) add up to -24. If your line is truly a line of best fit, your residuals will always sum to 0.

The problem with simply summing the residuals is that the positive and negative values cancel each other out, so the total tells us little about how far the points actually fall from the line. Our solution is to square the residuals. Squaring makes every value positive, so nothing cancels, and it also emphasizes large deviations.

By summing up these squared residuals, we get a good estimate of the error. (The error being the difference between the observed value and the predicted value). We call this the sum of squared residuals, or sum of squared errors (SSE). You may also hear this referred to as the residual sum of squares.

The equation for the sum of squared errors is SSE = Σᵢ (yᵢ − ŷᵢ)², where SSE is the sum of the squared errors (or residuals), yᵢ is the observed value, and ŷᵢ is the predicted value for observation i.
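
Continuing the sketch above (same assumed data frame and column names), the residuals from the flat line and their sum of squares can be computed directly:

```r
# Predicted values from the flat line: every country is predicted the mean mortality
predicted <- rep(mean(chd_data$CHD), nrow(chd_data))

# Residuals: observed minus predicted
resids <- chd_data$CHD - predicted

sum(resids)    # sums to (approximately) 0
sum(resids^2)  # the sum of squared errors (SSE)
```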

So now that we have learned about residuals and the sum of squared errors, we can discuss linear regression!

The goal of linear regression is to create a linear model that minimizes the sum of squared residuals.

So let’s go back to our data. We now want to look at another variable and examine its relationship with coronary heart disease mortality. This variable, "Smoking", is the average number of cigarettes smoked per adult per day in that country.

If we plot coronary heart disease mortality against average cigarettes per adult per day, we will see a strong linear relationship:

In simple linear regression, we try a number of candidate lines until we find the one that minimizes the sum of squared errors. Below, you can see a visual representation of three different candidate lines. For each line, the sum of squared errors is calculated, and the line with the lowest sum of squared errors is the line of best fit (here, the red line).
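
To make the idea concrete, here is a small sketch (again assuming a chd_data data frame with CHD and Smoking columns) that compares the SSE of a few arbitrary candidate lines with the least-squares fit:

```r
# SSE of a candidate line with the given intercept and slope
sse <- function(intercept, slope) {
  predicted <- intercept + slope * chd_data$Smoking
  sum((chd_data$CHD - predicted)^2)
}

# Three arbitrary candidate lines
sse(0, 1)
sse(5, 2)
sse(0, 3)

# The least-squares line minimizes the SSE over all possible lines
fit <- lm(CHD ~ Smoking, data = chd_data)
sse(coef(fit)[1], coef(fit)[2])
```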

Regression is simply a methodology used for the modeling and analysis of numerical data; it evaluates the relationships between two or more variables. Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.

If we look at the terminology for simple linear regression, we will find an equation not unlike our standard y = mx + b equation from primary school: y = β0 + β1x + ε.

The "y" is called the dependent variable, outcome variable, or response variable. The "x" is called the independent variable, predictor variable, explanatory variable, or regressor variable. β0 is referred to as the intercept, β1 is the slope, and ε is the random error term. The slope and intercept (β1 and β0, respectively) are also called the regression coefficients.

If we go back to our previous example, the equation of our fitted simple linear regression turns out to be y = 0.25 + 2.41x.

Let’s ask ourselves a couple of interpretation questions to make sure we understand our equation for simple linear regression:

  1. What is the interpretation of β1 = 2.41?
  2. If we had a 9th country with an average of 20 cigarettes per adult per day, what would you predict the mortality rate to be? What is the expected value of Y?

  1. What is the interpretation of β1 = 2.41? It is the expected change in CHD mortality for a 1-unit increase in average cigarettes per adult per day.
  2. If we had a 9th country with an average of 20 cigarettes per adult per day, what would you predict the mortality rate to be? What is the expected value of Y? If we plug 20 in for x in the equation above, we get 0.25 + 2.41 × 20 = 48.45, which we round to roughly 48 (a quick check in R is shown below).
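
The same prediction can be computed directly in R from the fitted equation:

```r
# Prediction from the fitted equation y = 0.25 + 2.41x for x = 20 cigarettes/day
beta0 <- 0.25
beta1 <- 2.41
beta0 + beta1 * 20  # 48.45, i.e. roughly 48 deaths per 10,000
```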

We have examined simple linear regression and we can determine the relationship between our outcome variable and regressor variable. But what if we had other potential regressor variables, such as those below?

  • Average Age of Population
  • Average Amount of Saturated Fat Consumed
  • Minutes of exercise daily on average
  • Country’s favorite number

This is where multiple linear regression comes in! Multiple linear regression is the same as simple linear regression, but now we have a regression coefficient for each of the regressor variables: y = β0 + β1x1 + β2x2 + … + βkxk + ε. This allows you to predict an outcome variable using many different variables, all of which may have an effect on the outcome.

In this example, age, diet, saturated fat, exercise, and smoking may all have an effect on the coronary heart disease mortality in a country and we want to include all of these in our linear regression model.


Code

The dataset is available to download as a .csv here.

Load the lme4 library in R. This library lets you fit linear (mixed-effects) regression models; the simple model below can also be fit with base R’s lm(). Here, you are doing a simple linear regression, predicting the mortality (CHD) using the regressor variable "Smoking", so you will set up your linear regression formula like this:

CHD ~ Smoking

Then you can plot your line of best fit over the scatter plot using abline().

To determine your regression coefficients and your sigma (the residual standard deviation, which estimates the spread of the random error term ε in your linear regression equation), you will fit the model, print the model, and print the sigma value of the model:
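
A minimal sketch of these steps, assuming the file is saved as "CHD_data.csv" (the file name and exact column names are assumptions; adjust them to match your download):

```r
library(lme4)  # only needed for the grouped multiple regression further below

# Read the simulated Seven Countries data (file name assumed)
chd_data <- read.csv("CHD_data.csv")

# Simple linear regression: predict CHD mortality from Smoking
model <- lm(CHD ~ Smoking, data = chd_data)

# Scatter plot with the line of best fit
plot(chd_data$Smoking, chd_data$CHD,
     xlab = "Average cigarettes per adult per day",
     ylab = "CHD mortality (per 10,000)")
abline(model, col = "red")

# Regression coefficients and residual standard deviation (sigma)
print(model)
sigma(model)
```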

Your output will show your regression coefficients, β1=2.41 and β0=0.246 in addition to your sigma value of 3.42.

Multiple linear regression in R is just as simple: you add a "+" between regressor variables. We also add (1|ID) to tell the model that ID is a group-level variable (a random intercept for each country).
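
A hedged sketch of what that model call might look like; the extra regressor column names (Age, SatFat, Exercise) are assumptions, and the (1|ID) term requires lme4’s lmer() and repeated observations per country ID:

```r
library(lme4)

# Multiple linear regression with a random intercept for country ID
# (column names Age, SatFat and Exercise are assumed and may differ in the file;
#  the (1|ID) term also assumes the data contain multiple observations per ID)
multi_model <- lmer(CHD ~ Smoking + Age + SatFat + Exercise + (1 | ID),
                    data = chd_data)

print(multi_model)
sigma(multi_model)
```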

Want to dive even deeper into using R for linear regression? Check out mixed effects models, which can also be done using the lme4 library!


In summary, we explored residuals, the vertical distances between each observed value and the regression line. We learned that the objective of linear regression is to minimize the sum of squared residuals. We developed an intuition for linear regression, which can be used for prediction, estimation, hypothesis testing, and modeling causal relationships. We learned that we can use multiple linear regression to take multiple regressor variables into account. Finally, we learned how to do linear regression in R!

