Interpreting Regression Coefficients: Wish I Had Known This Before

A Detailed Explanation of the Regression Estimator

Vivekananda Das
Towards Data Science



If you have taken an introductory applied econometrics/statistics/data science course, you must be familiar with the interpretation of the coefficient of a predictor in a linear model estimated using regression. Today, I will share an example to highlight that, just like this author, you (probably) passed your introductory courses with a misunderstanding of the coefficient of interest.

**This article is a continuation of my previous article, and I will use the same toy example (with different numbers). In case you missed it, consider having a quick look. However, the reading is optional because I will repeat the necessary parts.**

A Hypothetical Example

Last month, the grocery chain XYZ offered a shopper’s card to all the customers who visited its stores in the city of ABC. Some of the customers accepted the card; others did not. The chain has stores in 10 other cities, and before introducing the card in those cities, it wants to understand the effect of accepting the shopper’s card on monthly spending at the store. You have been hired as the analyst to investigate the question!

First, you realize that the shopper’s card was neither randomly offered nor accepted. As you have been trained in causal modeling, you are worried about confounding variables, i.e., factors that affect both the treatment (accepting the shopper’s card) and the outcome (monthly spending at the store).

You do some background study on the characteristics of those who accepted the card (treatment group) and those who did not (control group). After careful exploration, you find that income is the only difference between the two groups. Higher-income customers were more likely to accept the shopper’s card; moreover, based on your prior knowledge, you know that higher-income customers are more likely to spend more at the store. You are confident in making the assumption that other than income, the two groups are, on average, identical.

Let’s look at the dataset. For simplicity, let’s pretend there are only three income groups: $50k, $75k, and $100k.

This fake dataset has 400 observations, but you only need to focus on these 16 unique observations: I copied and pasted them 25 times to reach a sample size of 400, so that I get statistically significant results :D

Estimating Average Causal Effect Manually

In their famous econometrics textbook “Mostly Harmless Econometrics,” Angrist and Pischke (2009) present the following expression for the Average Treatment Effect (ATE):

**I use the term Average Causal Effect in this article to refer to the same estimand (i.e., the thing I am trying to estimate)**
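In symbols, using the notation defined below:

Average Causal Effect = Σ_x [delta_x * P(X_i = x)]

where delta_x = E[Y_i | X_i = x, D_i = 1] − E[Y_i | X_i = x, D_i = 0], D_i is the treatment indicator, and Y_i is the outcome.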

In our example, delta_x is the difference in average monthly spending between those who accepted the card and those who did not accept the card at each value of X_i (income category). P(X_i=x) is the probability mass function for X_i (i.e., the proportion of total respondents in each of the three income categories).

In simpler words, you are going to do the following:

  1. Calculate the difference in average monthly spending between those who accepted the card and those who did not (delta_x) within each of the three income categories (X_i).
  2. Multiply each income-category-specific difference (delta_x) by the proportion of total respondents in that particular category P(X_i=x), and add them up into one estimate.

Average Causal Effect = 10.75*(5/16) - 20*(6/16) + 18.75*(5/16) = 1.72
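If you want to verify this number yourself, here is a minimal R sketch. It uses the dataset and column names from the R section later in the article (Monthly_Spending, Shoppers_Card, Income50, Income75) and assumes that customers with both income dummies equal to 0 belong to the $100k group:

library(dplyr)
library(readr)

# Import the toy dataset (the same file used in the R section below)
data <- read_csv("https://raw.githubusercontent.com/vivdas92/Medium-Blogs/main/regression_toy_example_data.csv")

data %>%
  # Recover the income category from the two dummies
  mutate(income = case_when(Income50 == 1 ~ "50k",
                            Income75 == 1 ~ "75k",
                            TRUE ~ "100k")) %>%
  group_by(income) %>%
  # Category-specific difference in means (delta_x) and category share P(X_i = x)
  summarise(delta = mean(Monthly_Spending[Shoppers_Card == 1]) -
                    mean(Monthly_Spending[Shoppers_Card == 0]),
            share = n() / nrow(data)) %>%
  # Proportion-weighted average of the deltas
  summarise(ACE = sum(delta * share))  # ~1.72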

Great! But, as we discussed in the previous article, we prefer estimating the Average Causal Effect using regression because we want to get the point estimate and the standard error (and also because we are smart 🤓).

Estimating Average Causal Effect Using Regression

You estimate the following conventional model using regression:

Monthly Spending = b0 + b1*Shopper’s Card Acceptance + b2*Income50 + b3*Income75 + e

where Shopper’s Card Acceptance is a dummy variable that takes a value of 1 if a customer accepts the card and 0 otherwise; Income50 is a dummy variable that takes a value of 1 if a customer has $50k income and 0 otherwise; and Income75 is a dummy variable that takes a value of 1 if a customer has $75k income and 0 otherwise. The error term e consists of all other causes of Monthly Spending; importantly, none of these other causes affects whether a customer accepts a shopper’s card (because, by assumption, acceptance depends only on Income, which is already accounted for in the model). Lastly, in this conventional model, b1 is the regression estimator of the average causal effect.

**This is how I would interpret b1; otherwise, what’s the point in using regression in this particular context? 😕**

This is what I get when I estimate the above model using regression:
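Here is a minimal sketch of that estimation in R, reusing the data frame imported in the earlier snippet:

# Conventional model: treatment and income dummies, no interactions
conventional <- lm(Monthly_Spending ~ Shoppers_Card + Income50 + Income75,
                   data = data)
summary(conventional)  # the Shoppers_Card coefficient comes out near -2.06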

Goodness! 😲

Although the true Average Causal Effect is 1.72, the regression-estimated Average Causal Effect is -2.06, despite controlling for the confounding variable!

Let’s forget causality for a moment.

Let’s pretend you are doing a descriptive/correlational study. In that case, you might draw one or both of these conclusions:

  1. Controlling for income, shopper’s card acceptance is negatively and significantly associated with monthly spending.
  2. Controlling for income, customers who accepted a shopper’s card spend, on average, $2.06 less per month than customers who did not accept one. This difference is statistically significant at the 5% significance level.

Guess what! Based on the regression estimate, your (descriptive/correlational) conclusion would be badly wrong. And your boss, who believes in data-driven decision-making, will interpret this conclusion as evidence that the shopper’s card has an overall negative impact on monthly spending. 🥺

Understanding the Regression Estimator

Let’s understand why the regression estimate may differ from the true Average Causal Effect.

Angrist and Pischke (2009) show that ordinary least squares (OLS) regression uses a different set of weights to estimate the average causal effect. The regression estimator can be expressed as:

**In our case, D is the treatment (Shopper’s Card Acceptance), and X is the confounding variable/covariate (Income)**
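Written out in this article’s notation, the estimator weights each category-specific difference by the conditional variance of treatment status in that category:

delta_Regression = Σ_x [delta_x * Var(D_i | X_i = x) * P(X_i = x)] / Σ_x [Var(D_i | X_i = x) * P(X_i = x)]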

Let’s use this formula and see if we get the same estimate as the one from the regression (i.e., -2.06). We use the following steps:

(1) Calculate the income-category-specific weights:
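Each weight is the within-category variance of treatment status times the category’s share of the sample. With Var(Shopper’s Card Acceptance | Income) equal to roughly 0.16, 0.25, and 0.16 for the $50k, $75k, and $100k groups (computed in the next section):

Weight($50k) = 0.16 * (5/16) = 0.05
Weight($75k) = 0.25 * (6/16) = 0.09375
Weight($100k) = 0.16 * (5/16) = 0.05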

(2) Calculate delta_Regression:

delta_Regression = (10.75*0.05 + (-20)*0.09375 + 18.75*0.05) / (0.05 + 0.09375 + 0.05) = -2.06

Interesting, isn’t it? 🤔

According to Angrist and Pischke (2009), “Regression puts the most weight on covariate cells where the conditional variance of treatment status is largest.” So, the regression estimator estimates a variance-weighted average causal effect, which is not the estimand we long for 🥺.

In our example, the variances of shopper’s card acceptance within the three income categories are:

**I used the VAR.S function in Excel**
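The three variances come out to approximately 0.16 ($50k), 0.25 ($75k), and 0.16 ($100k). You can compute the same numbers in R; here is a sketch, again recovering the income category from the two dummies:

data %>%
  mutate(income = case_when(Income50 == 1 ~ "50k",
                            Income75 == 1 ~ "75k",
                            TRUE ~ "100k")) %>%
  group_by(income) %>%
  summarise(var_acceptance = var(Shoppers_Card))  # var() is the sample variance, like Excel's VAR.S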

Because the $75k income group has the highest variance in Shopper’s Card Acceptance, the difference in monthly spending between those who accepted the card and those who did not within that group receives the highest weight.

What’s the consequence of this type of weighting? In our example, the average causal effects within the $50k, $75k, and $100k groups are 10.75, -20, and 18.75, respectively. The $75k group gets the highest weight in the regression estimator, which explains why we got a negative coefficient of the treatment variable (Shopper’s Card Acceptance) using regression (although the true Average Causal Effect is positive).

What’s the Remedy?

Schafer and Kang (2008) mention a solution to this issue. It involves the following steps:

(1) Center all the covariates by subtracting from each covariate the sample mean for that covariate. This can be done for all covariates by replacing the vector X_i with (X_i - E(X_i)).

(2) Compute the product of the treatment variable with each centered covariate.

Based on the above, in our example, we have to estimate the following model:

Monthly Spending = b0 + b1*Shopper’s Card Acceptance + b2*c_Income50 + b3*c_Income75 + b4*Shopper’s Card Acceptance*c_Income50 + b5*Shopper’s Card Acceptance*c_Income75 + e

Importantly, c_Income50 and c_Income75 are centered at the sample mean. In this modified model, b1 is the regression estimator of the average causal effect.

Now I estimate the above model using regression and get the following:
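Here is a sketch of that estimation in R, again assuming the data frame from the earlier snippets:

# Center the income dummies at their sample means
data_c <- data %>%
  mutate(c_Income50 = Income50 - mean(Income50),
         c_Income75 = Income75 - mean(Income75))

# Interaction model with centered covariates
centered <- lm(Monthly_Spending ~ Shoppers_Card + c_Income50 + c_Income75 +
                 Shoppers_Card:c_Income50 + Shoppers_Card:c_Income75,
               data = data_c)
summary(centered)  # the Shoppers_Card coefficient is now ~1.72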

Amazing 🤩 We tweaked our model and found a way to get the true Average Causal Effect by regression!

Implementing it in R

If you think implementing the centering will be a painstaking task when you have many covariates, here is some good news! We can implement the whole thing easily in R using the marginaleffects package. Here is the code:

library(dplyr)
library(marginaleffects)
library(readr)

# Import the dataset
data <- read_csv("https://raw.githubusercontent.com/vivdas92/Medium-Blogs/main/regression_toy_example_data.csv")

# Estimate the model
model <- lm(Monthly_Spending ~ Shoppers_Card + Income50 + Income75 +
              Shoppers_Card:Income50 + Shoppers_Card:Income75,
            data = data)
summary(model)

# Average Causal Effect
marginaleffects(model, variables = "Shoppers_Card") %>%
  summary()
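Note that in more recent releases of the marginaleffects package, the recommended call for the same quantity is avg_slopes(model, variables = "Shoppers_Card"), which returns the average of the observation-level effects directly.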

Here is the R output:

When Does the Regression Estimator Estimate the Average Causal Effect?

From the above example, we can make two observations:

(1) The conventional regression estimator estimates the true Average Causal Effect if the effect within each level of the added covariates (delta_x) is the same (homogeneous treatment effect). In our example, if the effect within each of the three income categories were 10, then:

Average Causal Effect = 10*0.3125 + 10*0.375 + 10*0.3125 = 10

delta_Regression = (10*0.05 + 10*0.09375 + 10*0.05) / (0.05 + 0.09375 + 0.05) = 10

So, if you are comfortable making the assumption that the treatment effect is homogeneous across the levels of the covariates added to the model, then you can claim that the conventional regression estimator estimates the true average causal effect.

(2) The conventional regression estimator estimates the true Average Causal Effect if the variance of treatment status is the same within each level of the added covariates. For example, let’s look at the following dataset:

With this data, if you estimate the conventional model without interactions using regression, you get the true Average Causal Effect (which is 18.11). Interestingly, here the causal effect is not homogeneous across the income categories; however, the variance of Shopper’s Card Acceptance is the same (0.3) across all three income categories. With equal variances, the variance terms cancel out of the regression weighting formula, and the weights reduce to the simple proportions P(X_i = x).

In general, the conventional regression estimator estimates the true Average Causal Effect if either of the two conditions holds:

(1) treatment effect homogeneity

(2) equal variance of the treatment status within the levels of the covariates

Another interesting observation we can make from the above dataset is that it looks exactly like block-randomized experimental data (or perfectly balanced data). Here, apparently, Income is not even a confounder! Why? Because all three income categories are equally likely to accept the shopper’s card. So this implies that if you are not willing to estimate the model with the interactions, then you have to find a way to achieve this kind of balance in your non-experimental dataset. Two popular ways of achieving a balanced analytical sample are (1) propensity score matching and (2) inverse probability weighting.

Final Thoughts

Regression has always been an amazing tool in the analyst’s toolbox and it will continue to be one in the future. Despite all the limitations, if properly modeled, regression can be the most appropriate estimation technique in many empirical contexts (especially if you are trying to estimate the Average Causal Effect of doing something).

I will finish by reiterating a key point that several methodologists have made over the years. Rather than treating any empirical method as a magic bullet and hoping it will automatically work as intended, it is a good idea, at the beginning of any analysis, to ask ourselves these two questions:

  1. What are we trying to estimate?
  2. Are we estimating the thing that we believe we are estimating?

Once we figure out the answers to these questions, the analysis plan should be made accordingly!

*Unless otherwise noted, all the images are by the author. These are screenshots from MS Excel, MS Word, and R.


References

Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279–313.
