
My manager thinks I know how to run a regression analysis using R. So, to save my butt, I decided to dedicate my whole weekend to learning how to do it. Think of this post as a crash-course intro to brute-forcing your way through one.
Skip to the section you want to read. Table of contents below:
- Part I | My scope of knowledge upon beginning to write this post
- Part II | How I searched for my resources
- Part III | Regression tips: learnings from an engineer
- Part IV | 7 copy & paste steps to run a linear regression analysis using R
- Part V | Next steps: Improving your model
Part I | My scope of knowledge upon beginning to write this post
First, to establish grounds, let me tell you what I do know about regression, and what I can do in R.
What I know about linear regression going into the weekend:
- The equation is in the format: y=ax+b, where y is the dependent variable, x is the independent variable, a is a coefficient, and b is a constant/y-intercept. I know what each of these terms means.
- It’s a way of figuring out the impact the independent variable x has on the dependent variable y. In order to do this, you take the existing data that you have and test all of the cases against this equation to find the most appropriate a and b in order to predict y values that you don’t have data for.
- You can add any number of independent variables with a coefficient attached to each to see the impact each has on the dependent variable. That said, too many variables will not improve the model and in some cases hurt it.
- It’s best to normalize your data so that you work with values between 0 and 1. That way, coefficients aren’t tiny or enormous because of the nature of the independent variables (e.g. 4634 days vs 13 years are two variables you can use in the same model, but because they are so different in size, the coefficients would probably be skewed).
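(To see that equation in action, here's a tiny toy sketch in R with made-up numbers, not from my dataset; lm() finds the a and b that best fit whatever data you give it.)

#toy example: fit y = a*x + b on made-up numbers
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
toy_fit <- lm(y ~ x)
coef(toy_fit) #returns the intercept b and the slope a (roughly 0.1 and 2 here)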
What I can do in R going into the weekend:
- Basic data wrangling in dplyr (mutate, filter, select, the pipe operator %>%, summarize, the dot placeholder, group_by, arrange, top_n)
- Plots in base R (plot, hist, boxplot)
- Plots in ggplot2 (the geoms, facet_grid, time series plots, axis transformations, stratification, boxplot, slope charts)
- I learned everything I know about R from the two online courses I've taken so far (1. R Basics, 2. Visualization).
Part II | How I searched for my resources
How I figured out what to focus on this weekend.
1. r-bloggers has been a good resource for me while learning R basics, so I decided to start there. I googled "r-bloggers regression."

These are the top four links that came up for me:
- https://www.r-bloggers.com/how-to-apply-linear-regression-in-r/
- https://www.r-bloggers.com/linear-regression-using-r/
- https://www.r-bloggers.com/regression-analysis-using-r/
- https://www.r-bloggers.com/linear-regression-from-scratch-in-r/
2. I skimmed through each of them and decided to focus on the second result because I saw at the bottom of the page that it was going to walk through a "real-world business problem."

I clicked the link "next blog," and BINGO! "Predict Bounce Rate based on Page Load Time in Google Analytics." In case I haven't mentioned it yet: I am in the performance advertising space, so this is literally right up my alley. They even do a part 3 on improving the model!
I’ve found what I’m going to focus on this weekend. Going to compile learnings here as I learn anything!
Part III | Regression tips: learnings from an engineer
I had a really helpful conversation with an engineer who entertained my questions this weekend, and I’d like to share with you some tips that he shared. In summary, running a regression analysis is just the start of your investigation in assessing whether some data has a relationship with other data. With that context, here are ways you can ensure you come up with an analysis that is honest and helps you figure out your next steps.
- Normalize the data so that you can compare coefficients as fairly as possible. Though there isn’t a set way to compare coefficients of independent variables apples to apples with each other, normalizing data allows you to at least be able to eyeball the impact that an independent variable has on the dependent variable. This is a great starting point for research: once you see that one coefficient is larger than another, you can begin to investigate what is causing any "high" coefficients. If you don’t normalize your data, you can have a massive range of values for each, resulting in the coefficients also ranging widely in order to compensate for the weight of larger values.
- p-value significance is an indicator of certainty. Even if a coefficient is high, if it is not statistically significant, it's meaningless at best and, at worst, is hurting the model. (See the section "The Asterisks" in this blog post to learn how to read p-values.)
- Remove outliers when running the regression, then, after creating the model, test the model with each of the outliers to compare the predicted vs. true values for the dependent variable. This lets you see how robust your model is. If the error for the outliers is low, that's a huge win for the model; if the error is still high, then you can keep treating them as outliers. (See the sketch after this list for one way to do this in R.)
- In addition to looking at the aggregate error, it’s also important to look at the error of each individual data point. By doing this, you can dig into any reasons or trends as to why a certain point or set of points might have a larger error than others.
- Crap in, crap out. Without good data, you’re not going to have good results. Make sure you know where your data is coming from, and make sure it’s high quality.
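To make the outlier tip above concrete, here's a minimal sketch of what "test the model with each of the outliers" can look like in R. It assumes you've already fit a model (called model here) with lm() using a formula of column names and a data= argument, and that outlier_rows is a data frame of the rows you set aside; Conversions is the dependent variable from my example, so all of these names may differ for you.

#sketch: compare predicted vs. true values for the rows you set aside as outliers
#"model", "outlier_rows", and "Conversions" are placeholder names from my example
predicted <- predict(model, newdata = outlier_rows)
data.frame(actual = outlier_rows$Conversions,
           predicted = predicted,
           error = outlier_rows$Conversions - predicted)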
Part IV | 7 copy & paste steps to run a linear regression analysis using R
So here we are. Time to actually run a regression analysis using R. As a note, I use RStudio.
General tips and instructions:
- For each step that introduces code, I’ve added a screenshot with my example, and then a code block of the same thing that you can copy & paste into your own R Script.
- The code blocks are for the case that you have 1 independent variable. Follow instructions in the R comments if you have more than one independent variable.
- In each code block, I've included brackets in italics that you can replace with your own code, like so: [[insert your code]].
- When you replace with your own code, make sure you remove both the brackets AND the text. Don't add any spaces when you do so.
- Use my screenshots as a guide, especially if you have more than one independent variable. You can execute at each step, or you can execute all at once when you’ve copied and pasted everything.
Here we go!
1. Obtain a dataset that includes all the variables you want to test. Choose the dependent and independent variables you want to model off of. Tip: All good regressions start with a question. In order to figure out what variables you want, ask yourself a real-life question to determine what you need. What is the relationship you want to test?
2. Clean up your data and save it as a csv file – remove all columns of variables you don't need. In my case, I want to see whether Conversions are dependent on Spend, imp, click, usv, and pv, so I'll leave those six and delete everything else. Also, make sure you delete any rows with grand totals in them. After cleaning it up, save it as a csv file.
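(If you'd rather do this cleanup in R than in a spreadsheet, here's a rough sketch; the column names are the ones from my example, so swap in your own, and you'll still want to delete any grand-total rows yourself.)

#optional: keep only the columns you need using R instead of a spreadsheet
library(dplyr)
raw <- read.csv('[[insert pathname of your raw export]]')
clean <- raw %>% select(Conversions, Spend, imp, click, usv, pv)
write.csv(clean, '[[insert pathname for your cleaned csv]]', row.names=FALSE)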

3. Import the csv file into RStudio with the function read.csv(). (See this [link](https://osxdaily.com/2015/11/05/copy-file-path-name-text-mac-os-x-finder/#:~:text=Right%2Dclick%20(or%20Control%2B,replaces%20the%20standard%20Copy%20option) for how to get the pathname on a Mac.)

#import data from csv file
data <- read.csv('[[insert your pathname]]')

4. Remove all NA values of the dependent variable using drop_na() in tidyverse. (See here for the reference that describes this function.)

#install tidyverse if you don't have it already
install.packages("tidyverse")
#launch tidyverse
library(tidyverse)
#remove na values of dependent variable
data <- data %>% drop_na([[insert column name of your dependent variable]])

5. Normalize your independent variables. I did this in R by dividing each of the independent variables by its respective max value.

#normalize data (add another piped mutate() line for each additional independent variable you have)
norm_data <- data %>%
  mutate([[insert a simple name of independent variable]]_norm = [[insert column name of independent variable]]/max([[insert column name of independent variable]], na.rm=TRUE))
To break down what I'm doing above, let's look at what I did with the column Spend: mutate(spend_norm = Spend/max(Spend,na.rm=TRUE))
- mutate(): the function used to append a new column to the existing dataset
- spend_norm: the name of my new column
- Spend/max(Spend): the normalization formula
- na.rm=TRUE: the argument used to remove the null values
Since I mutated each of the columns with new names, thus creating 5 extra columns, I used the function select() to keep just the relevant columns.

#select relevant columns (add additional commas and variable names for any number of independent variables)
select_data <- norm_data %>% select([[insert column name of dependent variable]],
[[insert new normalized column name of independent variable]])

6. Find and remove outliers.
I used the rule of excluding anything more than 1.5 times the interquartile range (IQR) below the 1st quartile or above the 3rd quartile. See here for the reference I used to determine this and the functions I copied. There are two mini-steps in this:
- Find the outliers. Determine the IQR and upper/lower ranges from the original dataset for each independent variable.
- Remove the outliers. Select only the data that falls between the upper and lower ranges found in step 1, starting from the updated dataset obtained after removing the previous independent variable's outliers.
I repeated these two steps for each independent variable and ended up with the subset removed5. See my code in RStudio below. (You'll see that I didn't do this in the most efficient way possible. Would love any suggestions to make it more efficient; one more compact sketch is included after the code block.)


#removing outliers
#1. run this code to determine the iqr and upper/lower ranges for the independent variable
x <- select_data$[[insert new normalized column name of independent variable]]
Q <- quantile(x, probs=c(.25,.75), na.rm=TRUE)
iqr <- IQR(x, na.rm=TRUE)
up <- Q[2]+1.5*iqr # Upper Range
low <- Q[1]-1.5*iqr # Lower Range
#2. run this code to select only the data that's between the upper and lower ranges
removed1 <- subset(select_data,
[[insert new normalized column name of independent variable]] > low &
[[insert new normalized column name of independent variable]] < up)
#if you're curious, see the new boxplot
ggplot(removed1, aes([[insert new normalized column name of independent variable]])) + geom_boxplot()
#this is the new dataset you'll be working with
View(removed[[insert # of total independent variables you normalized data for]])
########if you have two or more independent variables, copy and paste the code below as many times as you need:
#2nd independent variable ranges - repeating #1 and #2 above
#1. run this code to determine the iqr and upper/lower ranges for the independent variable
x <- select_data$[[insert new normalized column name of independent variable]]
Q <- quantile(x, probs=c(.25,.75), na.rm=TRUE)
iqr <- IQR(x, na.rm=TRUE)
up <- Q[2]+1.5*iqr # Upper Range
low <- Q[1]-1.5*iqr # Lower Range
#2. run this code to select only the data between the upper and lower ranges,
#   starting from the dataset you got after removing the previous variable's outliers
removed[[insert # for what number independent variable you are on]] <- subset(removed[[insert # of the previous independent variable]],
[[insert new normalized column name of independent variable]] > low &
[[insert new normalized column name of independent variable]] < up)
#if you're curious, see the new boxplot
ggplot(removed[[insert # for what number independent variable you are on]], aes([[insert new normalized column name of independent variable]])) + geom_boxplot()
#this is the new dataset you'll be working with
View(removed[[insert # of total independent variables you normalized data for]])
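Since I asked for efficiency suggestions above, here's one more compact way to do the same thing, sketched with dplyr. It assumes all of your normalized columns end in _norm (which they do if you followed the naming in step 5); within_iqr is just a helper name I made up, and if_all() needs a reasonably recent version of dplyr.

#sketch: drop outlier rows for every _norm column in one pass
within_iqr <- function(x) {
  Q <- quantile(x, probs=c(.25,.75), na.rm=TRUE)
  iqr <- IQR(x, na.rm=TRUE)
  x > (Q[1] - 1.5*iqr) & x < (Q[2] + 1.5*iqr)
}
removed_all <- select_data %>% filter(if_all(ends_with("_norm"), within_iqr))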

7. Regression time! Use the R function lm() with your data. Go back to the original post I'm learning from for an explanation of what you're doing here.

#add additional variables as needed with + sign
Model1 <- lm(removed[[insert # of total independent variables you normalized data for]]$[[insert column name of dependent variable]] ~
removed[[insert # of total independent variables you normalized data for]]$[[insert new normalized column name of independent variable]])
You’ve created your model! Now for the summary of results, run the final piece of code for the regression:
summary(Model1)
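One side note that isn't from the original post: if you write the formula with bare column names and pass the dataset through the data= argument, the model is easier to reuse later (for example with predict() on new rows, like the outlier check in Part III). A sketch with my removed5 dataset and hypothetical normalized column names (spend_norm, imp_norm):

#equivalent model written with column names and data= instead of $
Model1b <- lm(Conversions ~ spend_norm + imp_norm, data = removed5)
summary(Model1b)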

The engineer I mentioned above looked at these results and immediately said, "Yeah, your data sucks." He said the Std. Error for each variable is too large compared to its Estimate. He also mentioned that with just 15 rows of data, using five variables for the model just doesn't make sense.
But no matter… we did it! Hooray!
Part V | Next Steps: Improving your model
So we've run our regression analysis! …but this isn't the end. As the saying goes, "All models are wrong, but some are useful." So your next step is to figure out how to improve your model to make sure what you are pulling is actually useful. From the third post of the series I'm learning from this weekend, along with the support of a more detailed post, I found that the function regsubsets(), together with the metrics Adjusted R2, Cp and BIC, lets you figure out how many of the variables in your dataset are actually useful for the model in question. This helps for any model that has more than 2 independent variables.

Here's how I used the regsubsets function to see whether my variables are any good.

#install and launch leaps if you don't have it already
install.packages("leaps")
library(leaps)
#add any number of independent variables that you need to the equation (note: this will not work if you only have 1 independent variable)
leaps <- regsubsets(removed[[insert # of total independent variables you normalized data for]]$[[insert column name of dependent variable]] ~
removed[[insert # of total independent variables you normalized data for]]$[[insert new normalized column name of independent variable]],
data=removed[[insert # of total independent variables you normalized data for]],
nvmax=[[insert # of total independent variables you normalized data for]])
summary(leaps)
res.sum <- summary(leaps)
data.frame(
Adj.R2 = which.max(res.sum$adjr2),
CP = which.min(res.sum$cp),
BIC = which.min(res.sum$bic)
)
To explain what I’m doing above using the words of the posts I linked above, here’s an excerpt from one of the posts:
The R function regsubsets() [leaps package] can be used to identify different best models of different sizes. You need to specify the option nvmax, which represents the maximum number of predictors to incorporate in the model. For example, if nvmax = 5, the function will return up to the best 5-variables model, that is, it returns the best 1-variable model, the best 2-variables model, …, the best 5-variables models. In our example, we have only 5 predictor variables in the data. So, we'll use nvmax = 5. (Source)
And regarding the second half of the code above:
The summary() function returns some metrics – Adjusted R2, Cp and BIC (see Chapter […]) – allowing us to identify the best overall model, where best is defined as the model that maximize the adjusted R2 and minimize the prediction error (RSS, cp and BIC). The adjusted R2 represents the proportion of variation, in the outcome, that are explained by the variation in predictors values. The higher the adjusted R2, the better the model. (Source)
The best model according to each of these metrics, they mention, is identified by the second half of the code above (the res.sum and data.frame lines), which produced the following results for my data.

Based on the results, Adjusted R2 tells us that the best model is the one with 1 predictor variable, as do the Cp and BIC criteria. It's saying I should decrease the number of variables in my model from five down to one. This isn't surprising, since I only had 15 rows of data to begin with.
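If you want to act on that recommendation, the refit is a one-line change. A sketch, reusing the same dataset; the single predictor to keep is whichever one regsubsets included in its best 1-variable model (you can read that off summary(leaps)):

#refit with only the single predictor that regsubsets kept in the best 1-variable model
Model2 <- lm([[insert column name of dependent variable]] ~ [[insert the one predictor regsubsets kept]],
data=removed[[insert # of total independent variables you normalized data for]])
summary(Model2)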
And with that, I conclude my weekend bonanza of learning linear regression using R. If you're like me at all, brute forcing it and learning by doing is the perfect way to start getting to know a topic. I hope you got as much out of it as I did. Onward!