
Back to Basics – Linear Regression in R

Linear regression is one of the most fundamental concepts in statistics. Here's how to perform and interpret it in R.

Photo by Jean-Philippe Delberghe on Unsplash

It’s been a while since my last article on here and that’s because I have been busy preparing for my actuarial exam that is coming up in just two months. In the process of studying these past couple of weeks, I ran into a good old friend from way back in my first ever statistics class, linear regression.

As I started to learn more complex machine learning algorithms, I sometimes got caught up in building the fanciest model to solve a problem when, in reality, a simple linear model could easily have gotten the job done. It wasn't until I started refreshing my knowledge of linear regression that I gained a newfound appreciation for its simplicity and ease of use.

So, I thought it’d be a good idea to use this article to not only reinforce my understanding of linear regression but also take this opportunity to share this fundamental topic in statistics with others.


Introduction

To keep this article as concise as possible, I will not be covering every small detail of linear regression. Instead, I will offer just enough for you to gain a good understanding of how to perform and interpret it in R. I will also briefly touch upon ANOVA, that is, the analysis of variance, as well as the coefficient of determination, more commonly known as R-squared.

With that said, here we go!

Linear regression is a linear model that describes the relationship between a response variable and either a single explanatory variable (simple linear regression) or multiple explanatory variables (multiple linear regression).

Since we were talking about my actuarial exam, let’s just use that as an example.

Suppose I would like to study the relationship between the hours someone spends studying and their corresponding examination score. I will first need to collect data points for my study. Then, I will proceed to plot these data points on a graph where the x-axis represents the hours someone spent studying and the y-axis represents their final examination score.

Note that the final examination score in this example represents our response variable as it is the variable that we’re investigating. On the other hand, the hours spent studying is called an explanatory variable as it is the variable that will influence and "explain" the outcome of the response variable.

Line of best fit

But hold on, so far we have only discussed the data points that go onto the graph. How does the regression line actually come about?

If we try to draw a line through all the data points, chances are, not all of them will fit directly on that line. So how do we know how to best position the line? This is where the concept of line of best fit comes in.

The idea is to fit a line through all the data points such that the line minimises the sum of the squared distances, or in regression terms, the squared residuals, between the observed values and the fitted values. This process is called fitting a line of best fit.
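
For a simple linear regression, this least-squares line has a closed-form solution. As a rough sketch of what happens under the hood, here is the calculation written out in plain R, where x and y are placeholder vectors standing in for any explanatory and response variable:

# Least-squares estimates for a simple linear regression (illustrative sketch)
# x and y are placeholder vectors for the explanatory and response variables
slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept = mean(y) - slope * mean(x)
# These should match the coefficients returned by lm(y ~ x)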

I highly encourage checking out the video below for a more visual explanation of this particular concept.

As a result, we should expect to see a fitted line that trends upwards, or in other words, has a positive slope. That is because the more hours someone spends studying for an exam, quite naturally, the better that person tends to perform in the exam.

Case Study

Now that we have the basics out of the way, let us see how we can perform a simple linear regression in R.

Suppose we have a 6×2 dataframe which measures the estimated baby weights at different time periods during a pregnancy.

Plotting the data points

First, let’s store this dataframe into a variable called baby.

# Load data
baby = read.table("baby weights.txt", header = T)
# Show data
baby

We can then plot these 6 data points on a graph using a scatterplot.

# Plot data
plot(baby, main = "Estimated baby weights during pregnancy", xlab = "Gestation period (weeks)", ylab = "Weight (kg)")

main, xlab and ylab are simply arguments of the plot function in R that specify the main title, x-axis label and y-axis label of the plot, respectively. The following is the result of that plot.

Scatterplot of estimated baby weights during pregnancy

It does look like there is a positive relationship between our explanatory variable, gestation period, and the response variable, weight. In other words, the longer the baby stays in the womb, the heavier it gets. Sounds logical.

We can examine the correlation between the two variables by using the cor function in R and further test if it is statistically significant.

Also, we can attach the baby dataframe so that we can access its columns by name without repeatedly referencing the dataframe, just to make our analysis easier.

# Attach data
attach(baby)
# Compute correlation between gestation period and baby weight
cor(gestation, weight)
# Test whether correlation coefficient is zero
cor.test(gestation, weight)
Correlation test

There is a very strong positive correlation between the two variables, and the output clearly indicates that this result is statistically significant. We can now proceed to fit a regression line through those data points.

Fitting a linear regression model

Fitting a linear regression model in R is extremely easy and straightforward. The function to pay attention to here is lm, which stands for linear model.

Here, we are going to fit a linear model which regresses the baby weight on the y-axis against the gestation period on the x-axis. The order is important and worth remembering: the response variable always comes before the explanatory variables in the formula.

We will store the linear model in a variable called model so that we can access the output at a later stage.

# Fit linear model
model = lm(weight ~ gestation)
# Examine model
model
summary(model)

There are a number of ways we can examine the linear model but the most comprehensive one is using the summary function.

Summary

As we can see, there are a few sections in this particular output but don’t worry, we will go through each section and discuss exactly what they mean later. For now, let’s go ahead and add this linear model to our scatterplot from earlier on by using the abline function.

# Add regression line
abline(model, col = "red", lty = "dashed")
Scatterplot with regression line

Fitted values

Fitted values share the same x values as the observed data, except they lie precisely on the regression line.

In this section, we will look at how we can obtain these fitted values as well as how to add them to our existing regression line. Again, there are a few ways we can go about this and they all give the same result.

# Obtain fitted values 
model$fitted.values
fitted(model)
predict(model)
# Add fitted values to regression line 
points(gestation, fitted(model), col = "blue", pch = 16)
The blue points represent the fitted values

Analysis of variance (ANOVA) & R-squared

Variance is a statistical measure of the spread of the numbers in a dataset or, more specifically, how far each data point lies from the mean. Analysis of variance is, therefore, the study of how this variance can be broken down.

By dividing the total variance into two components, we can assess how much of the variance is accounted for by our regression line and how much is not, with the unexplained portion attributed to the residuals.
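
In symbols, the total variation splits into two pieces:

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²

that is, the total sum of squares equals the regression sum of squares plus the residual sum of squares, where ȳ is the mean of the observed values and the ŷᵢ are the fitted values.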

R-squared is the percentage of the variation in the response variable that the linear model explains. The higher the R-squared value, the smaller the differences between the observed values and the fitted values.

However, R-squared alone is not a sufficient indicator of whether a regression line provides a good fit. We also have to take into consideration the context of the study as well as the statistical significance of each explanatory variable in the regression model.

# ANOVA
anova(model)
Analysis of variance (ANOVA)

The p-value of 0.0003661 from the F-test confirms that there is a linear relationship between gestation and weight. This result is consistent with our earlier correlation test, which is no coincidence: in a simple linear regression, the F-test, the t-test on the slope and the test of zero correlation are all equivalent.

We can also see that the regression sum of squares is 2.92129 and the residual sum of squares is 0.09371, which gives a total sum of squares of 3.015.

There are two ways we can work out the sum of squares. One is by first principles and the other is by using the output from the anova function.

x = gestation; y = weight
# Total sum of squares
(SStot = sum((y - mean(y))^2))
sum(anova(model)[,2])
# Residual sum of squares
(SSres = sum((y-fitted(model))^2))
anova(model)[2,2]
# Regression sum of squares
(SSreg = sum((fitted(model)-mean(y))^2))
anova(model)[1,2]

Similarly, there are three ways to calculate the coefficient of determination or R-squared and all three of them give the same result of 0.9689173.

  1. From the summary output
  2. Square of correlation
  3. Dividing regression sum of squares by total sum of squares
# Method 1
summary(model)$r.squared
# Method 2
cor(gestation, weight)^2
# Method 3
SSreg / SStot 

Intercept & regression coefficient

Since we only have one explanatory variable in this particular study, we have two coefficients in total: one for the intercept and one for the slope.

Recall the output from the summary function.

Summary

We can see that the slope parameter has a p-value of 0.000366, which signifies that it is statistically significant or, in other words, significantly different from zero.
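
If you prefer to extract these numbers programmatically rather than read them off the printed summary, the estimates and the full coefficient table are stored in the model and summary objects. A quick sketch:

# Estimated intercept and slope
coef(model)
# Full coefficient table: estimates, standard errors, t-values and p-values
summary(model)$coefficients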

To take this a step further, we can also produce a confidence interval for the intercept and the slope parameter.

# Confidence interval for intercept and regression coefficient 
confint(model, level = 0.99)
Confidence interval for the intercept and regression coefficient

Making predictions

The goal of fitting a linear model is to make reasonably accurate predictions.

In this section, we will look at two different ways we can make predictions using a simple linear model.

  1. By first principles
  2. Using the predict function

In addition, we can produce confidence intervals (for the mean response) and prediction intervals (for an individual new observation) around these predictions to quantify their uncertainty.

Suppose we would like to predict a baby’s weight at week 33, here is how we can do it via two separate methods.

# First principles
coef(model)[1] + coef(model)[2] * 33
# Using predict function
newdata = data.frame(gestation = 33)
predict(model, newdata)
# Confidence interval 
predict(model, newdata, interval = "confidence", level = 0.9)
# Prediction interval
predict(model, newdata, interval = "prediction", level = 0.9)

Both methods should give the same prediction value of 2.141429.

Residuals

Recall that residuals are the differences between the observed values and the fitted values in a regression. An underlying assumption of linear regression is that the residuals follow a normal distribution.

In addition to checking the normality assumption, examining the trends in the residuals also lets us evaluate how well our regression line fits the data points, i.e. whether there is any particular region where the model is overestimating or underestimating the true values.

In this section, we will discuss how to obtain the residual values as well as how to use various plots to analyse the patterns that are present in these residuals.

There are two ways to obtain the residuals:

  1. Using the resid function
  2. Subtracting the fitted values from the observed values
# Method 1
resid(model)
# Method 2
weight - fitted(model)

Next, we will look at how to visualise the trends in the residuals. There are different plots that we can use here and each plot can provide us with different information about the behaviour of the residuals.

# Plot residuals vs fitted
plot(model, 1)
Residuals vs fitted
Residuals vs fitted

The residuals hover around 0, which is a good indicator that the linear model is a good fit. Point 2 does look like an outlier, but it is difficult to tell with only a few observations.

Another plot we can use is the QQ plot, which is a scatterplot that plots two sets of quantiles against one another. In our case, we want to see whether the residuals follow a normal distribution, so we plot the quantiles of the residuals against those of a normal distribution.

If our residuals are indeed normally distributed, we should expect the points to lie along a diagonal line running from the bottom left to the top right.

# QQ plot
plot(model, 2)
QQ plot

The majority of the points lie very close to the diagonal line, with the exception of point 2. This might suggest that it is worth dropping point 2 from the dataset to allow for a more accurate linear model.
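
As an extra check that goes beyond the diagnostic plots, a Shapiro-Wilk test on the residuals gives a quick numeric read on the normality assumption, though bear in mind that with only six observations the test has very little power:

# Shapiro-Wilk test of normality on the residuals
# A large p-value means there is no evidence against the normality assumption
shapiro.test(resid(model))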


Conclusion

Well done for making it all the way through this article. I hope you have learned a little bit more about the concept of linear regression as well as how to perform and interpret it in R.

To recap, in this article, we have covered:

  • What a linear model is
  • How to fit a line of best fit
  • Analysis of variance and R-squared
  • Significance of the intercept and regression coefficient
  • How to make predictions using a linear model
  • Residuals

Do check out my other articles if you are interested in learning about data science and machine learning.

Happy learning!

Customer Segmentation Using K-Means Clustering in R

What is Feature Scaling & Why is it Important in Machine Learning?

