What Does It Mean to Control for a Variable in Regression?

A detailed explanation

Vivekananda Das
Towards Data Science


While interpreting the coefficient of one of the predictors (say, a continuous variable X1) of a linear model in which multiple explanatory variables (X1, X2, …, Xn) predict the value of an outcome variable (Y), you must have used statements such as the following:

“Controlling for other factors/holding other factors constant/accounting for other factors/keeping other factors fixed, a one-unit increase in X1 is, on average, associated with a b-unit increase in Y.”

As a beginner in data science, I really struggled with these questions:

  1. What does it mean to control for a variable?
  2. How exactly do we control for a variable?
  3. Why do we even need to control for a variable?

In retrospect, the confusion originated because I thought regression was only about fitting a straight line through some dots (which is correct, but there are other ways of conceptualizing regression). In this article, I will answer the above questions using a toy example.

I am intentionally using MS Excel for this article. Although, as a data analyst, I prefer other tools, I believe Excel is better at showing a wider audience exactly what is inside the black box.

Why do we estimate an empirical linear model using regression?

We fit/estimate empirical linear models using regression mainly for two purposes:

  1. Predicting the value of an outcome (for example, annual sales of a store)
  2. Estimating the causal effect of a particular treatment/action/intervention on an outcome (for example, the effect of accepting a shopper’s card on the monthly spending by customers at a store)

I will come back to the discussion on predictive modeling later. For now, let’s focus on the second purpose.

Estimating the causal effect: A hypothetical example

Last month, the grocery chain XYZ offered a shopper’s card to all the customers who visited its stores in the city of ABC. Some of the customers accepted the card; others did not. The chain has stores in 10 other cities, and before introducing the card in those cities, it wants to understand how accepting the shopper’s card affected customers’ monthly spending. You have been hired as the analyst to investigate the question.

First, you realize that the shopper’s card was neither randomly offered nor accepted. As you have been trained in causal modeling, you are worried about confounding variables, i.e., factors that affect both the treatment (accepting the shopper’s card) and the outcome (monthly spending at the store).

You do some background study on the characteristics of those who accepted the card (treatment group) and those who did not (control group). After careful exploration, you find that income is the only difference between the two groups.

Higher-income customers were more likely to accept the shopper’s card; moreover, based on your prior knowledge, you know that higher-income customers are more likely to spend more at the store. You are confident in making the assumption that other than income, the two groups are, on average, identical.

Before you estimate the model using a regression, you draw the following directed acyclic graph (DAG) to explicitly showcase your key assumption:

(Image by the author)

Now, let’s look at the dataset. For simplicity, let’s pretend there are only three income groups: $50k, $75k, and $100k.

(Image by the author)

Because income is the only confounding variable (Shopper’s Card Acceptance ← Income → Monthly Spending), we must “control for” income to identify the causal effect of accepting the shopper’s card on monthly spending. But why? 🤔

Controlling for income means comparing the average monthly spending by the two groups (i.e., customers who accepted the card and customers who did not) within a particular income category.

Again, you assumed that income is the only difference between the two groups. Given this assumption is correct:

  1. If you compare those who accepted the card and those who did not within a certain income category, it is as if the two groups are apples-to-apples (or as if randomly assigned).
  2. Therefore, on average, the two groups are identical, except that one group accepted the card while the other did not.
  3. Finally, within a certain income category, if there is any difference in monthly spending, you can attribute the difference to the shopper’s card acceptance.

Another way to think about this is that controlling for income closes the backdoor path: Shopper’s Card Acceptance ← Income → Monthly Spending, and so, the association from Shopper’s Card Acceptance to Monthly Spending can flow only through the true causal path: Shopper’s Card Acceptance → Monthly Spending.

Now, let’s calculate the income-category-specific difference in spending between those with and without a shopper’s card:

(Image by the author)

Within the $50k income group, the difference in average spending between the two groups is $13.25. Similarly, the difference within the $75k group is $15, and within the $100k group it is $16.25. These three numbers are income-group-specific average causal effects.

Did you notice that we held/kept the value of income constant in each of the three cases? 😊

However, your boss wants a specific number and not three separate numbers. You need to determine how to weight these three numbers and combine them into one estimate.

You weight each income-category-specific causal estimate by the proportion of people (in the entire sample) present in that specific income category.

In this simple example, we have 18 customers, and there are 6 customers in each of the 3 income categories. So, all three estimates get a weight of 6/18 = 1/3 each. Finally, you get the following:

Average Causal Effect = 13.25*(1/3) + 15*(1/3) + 16.25*(1/3) = 14.83
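If you prefer to see this stratify-and-weight logic as code rather than Excel formulas, here is a minimal Python sketch. The dataset, column names (income, card, spending), and spending values below are my own inventions (the article’s actual data lives in the screenshots above); I only constructed the numbers so that the within-income-group differences come out to $13.25, $15, and $16.25, matching the calculation above.

```python
import pandas as pd

# Hypothetical toy data: 18 customers, 6 per income group,
# 3 card acceptors and 3 non-acceptors in each group.
# Spending values are made up; only the within-group
# differences (13.25, 15, 16.25) match the article.
df = pd.DataFrame({
    "income":   [50] * 6 + [75] * 6 + [100] * 6,
    "card":     [0, 0, 0, 1, 1, 1] * 3,
    "spending": [95, 100, 105, 108, 113.25, 118.5,    # $50k group
                 115, 120, 125, 130, 135, 140,        # $75k group
                 135, 140, 145, 151, 156.25, 161.5],  # $100k group
})

# "Controlling for income" by hand: compare card vs. no-card means within each income category
group_means = df.groupby(["income", "card"])["spending"].mean().unstack()
group_effects = group_means[1] - group_means[0]   # 13.25, 15.00, 16.25

# Weight each income-specific effect by that income group's share of the whole sample
weights = df["income"].value_counts(normalize=True).sort_index()
print((group_effects * weights).sum())            # ≈ 14.83
```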

At this point, you might think this is a lot of work! 😓 Also, we do not get a standard error of the Average Causal Effect by this manual approach. 😔

This is exactly why, rather than doing the calculation above manually, you will prefer estimating the following model using linear regression:

Monthly Spending = b0 + b1*Shopper’s Card Acceptance + b2*Income50 + b3*Income75 + e

where Shopper’s Card Acceptance is a dummy variable that takes a value of 1 if a customer accepts the card and 0 otherwise; Income50 is a dummy variable that takes a value of 1 if a customer has a $50k income and 0 otherwise; and Income75 is a dummy variable that takes a value of 1 if a customer has a $75k income and 0 otherwise. The error term e consists of all the other causes of Monthly Spending; importantly, none of these other causes affect whether a customer accepts the shopper’s card (because, by assumption, acceptance depends only on Income, which is already accounted for in the model).

**As Income is a categorical variable with 3 categories, you need 2 dummies to incorporate it into your model. Here, the omitted (reference) category is the Income100 group. Please note that I am calling the categorical variable “Income” for the story’s sake; in reality, I could have called it anything. Please don’t be confused into thinking that income is a continuous variable here.**

Next, you estimate the above model using regression and get the following:

(Image by the author)

Efficient and lovely! Isn’t it? 🥰

With the multiple regression approach, you can control for Income and get the Average Causal Effect with standard error, t-stat, p-value, and 95% confidence interval!
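In case you want to see this regression outside of Excel as well, here is a minimal sketch using Python’s statsmodels (my choice of tool, not the author’s). It reuses the made-up toy data from the earlier sketch and dummy-codes Income exactly as described above, with the $100k group as the omitted category.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same hypothetical toy data as the earlier sketch (values invented;
# only the within-group differences match the article's numbers)
df = pd.DataFrame({
    "income":   [50] * 6 + [75] * 6 + [100] * 6,
    "card":     [0, 0, 0, 1, 1, 1] * 3,
    "spending": [95, 100, 105, 108, 113.25, 118.5,
                 115, 120, 125, 130, 135, 140,
                 135, 140, 145, 151, 156.25, 161.5],
})

# Two dummies for the three income categories; $100k is the omitted category,
# mirroring: Monthly Spending = b0 + b1*Card + b2*Income50 + b3*Income75 + e
df["income50"] = (df["income"] == 50).astype(int)
df["income75"] = (df["income"] == 75).astype(int)

model = smf.ols("spending ~ card + income50 + income75", data=df).fit()
print(model.summary())               # coefficient on `card` is the controlled estimate
print(model.conf_int().loc["card"])  # its 95% confidence interval
```

Because this toy dataset has an equal number of card acceptors and non-acceptors within every income group, the coefficient on card comes out at exactly the manually weighted 14.83; with less balanced data (like the article’s), the two numbers can differ slightly, as noted below.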

Also, if you are afraid that people will not agree with the causal assumption mentioned earlier (i.e., that income is the only confounder), you can interpret the coefficient of the Shopper’s Card Acceptance variable as: “Controlling for Income, people who accepted the shopper’s card, on average, spent $14.84 more per month at the store compared to people who did not accept the shopper’s card.”

The above is perhaps the third application of regression: you can estimate an empirical linear model using regression as a smarter form of descriptive analysis (i.e., use a multiple regression with categorical predictors to estimate sub-group-level average outcomes rather than calculating them manually).

**Given the complexity of the empirical world, unsurprisingly, the third application of regression is extremely popular in empirical research. With this approach, you can be wishy-washy about your assumptions because you prioritize defending yourself from the criticisms you would receive if you made explicit causal assumptions. You are basically trying to estimate/predict the average outcome for certain sub-groups included in the analysis. And you are leaving the interpretation open to the reader: if they are happy just by learning about the difference in average outcomes between two groups, they get what they want; if they want to make any causal inference, they can figure out the needed assumptions on their own.🤷**

**You may have noticed a slight difference between the regression-estimated effect (14.84) and the true effect (14.83). This is because regression provides a “weird” variance-weighted causal effect (and not the correctly weighted one we estimated manually). I created this fake dataset in such a way that the deviation is negligible. Importantly, this issue means that the third application of regression may go awry unless you are aware of it and correct it by estimating the appropriate model. I explain it in detail in another article.**

Predicting the Value of an Outcome

Now, let’s think about another situation. You have been tasked with predicting the annual sales of store XYZ for the next 10 years. In this case, the interpretation of each predictor’s coefficient is meaningless. All you care about is the predictive success of your model. You can literally throw the kitchen sink at your model. Add all the fancy polynomial and interaction terms (of course, don’t overfit 😉).

Most importantly, you care about the statistics showing how well the model fits the data (e.g., R², adjusted R², AIC). I have seen people interpreting the coefficients of predictive models. I am sorry, but it just does not make any sense to me. 🙇 Why? Think about the following situation (I explained it in detail in another article).

You are trying to predict the number of people attacked by sharks at a beach, and you come up with the following model:

Monthly Shark Attacks = -2.02 + 0.03*Monthly Ice Cream Sales

This model fits the data well. If you know the ice cream sales in a particular month, you can predict the value of shark attacks with reasonable precision.
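As a quick illustration of what you would actually look at in a purely predictive exercise, here is a small sketch on simulated data (the numbers are generated only to roughly follow the toy equation above, so nothing here is real): you check how well the model fits, and the coefficients are beside the point.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated monthly data, generated to roughly follow the toy model above
# (shark attacks ≈ -2.02 + 0.03 * ice cream sales, plus noise); all values are invented.
rng = np.random.default_rng(0)
ice_cream_sales = rng.uniform(500, 2000, size=120)
shark_attacks = -2.02 + 0.03 * ice_cream_sales + rng.normal(0, 3, size=120)
df = pd.DataFrame({"ice_cream_sales": ice_cream_sales,
                   "shark_attacks": shark_attacks})

model = smf.ols("shark_attacks ~ ice_cream_sales", data=df).fit()

# For prediction, these fit statistics are what you care about...
print(model.rsquared, model.rsquared_adj, model.aic)
# ...not a causal reading of the coefficient on ice_cream_sales.
```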

But what’s the point in interpreting the coefficient of the predictor? 🤦 Yes, you may say, “A one-unit increase in monthly ice cream sales is, on average, associated with a 0.03-unit increase in monthly shark attacks.” If you had another predictor Z (e.g., sales of potato chips), you would have added “Controlling for Z” at the beginning of the previous statement. But again, why bother interpreting? Can you say that “the positive association suggests that if ice cream sales were curtailed/banned, many lives could be saved from shark attacks”? That would be nonsensical! 😒

I will end this article by summarizing the key points I discussed above.

1. When we investigate the causal effect of a treatment/action/intervention with non-experimental data, usually, we “control for” confounding variables

2. Controlling for a variable means estimating the difference in average outcome between a treatment group and a control group within a specific category/value of the controlled variable

3. Regression is a convenient estimation strategy that helps us control for confounding variables

4. There is absolutely no point in interpreting the coefficients of the predictors of a predictive model (whether you control for other variables or not)
