
How to Interpret Linear Regression Coefficients

A complete guide from simple to advanced models

Photo by Vitalii Khodzinskyi from Unsplash

Searching online for how to interpret linear regression coefficients is a bit like searching for how to import a CSV file in Python: hardly anyone manages to keep all of it in their head. After teaching statistics to over 10,000 students for a decade, I still sometimes have to double-check the interpretation in some special cases (e.g., a binary outcome combined with a log-transformed explanatory variable). That is why I decided to write this article, which goes through a long list of different linear regression models and explains how to interpret the coefficients in each situation, including log-transformed variables, binary variables, and interaction terms.

Note that to fully understand the content of this article (although it is not strictly necessary), you should be familiar with two mathematical concepts: the partial derivative and the conditional expectation.

Before going through the list of different cases, let me introduce some important definitions and considerations (model definition, ceteris paribus, binary variables, multicollinearity).

Table of contents:

0. Definitions and important considerations

1. Intercept

2. Continuous dependent variable

2.1 Continuous independent variable

2.1.a level-level: Level outcome, level independent variable

2.1.b log-level: Log transformed outcome, level independent variable

2.1.c level-log: Level outcome, Log transformed independent variable

2.1.d log-log: Log transformed outcome, Log transformed independent variable

2.2 Binary independent variable

2.2.a Level dependent variable

2.2.b Log transformed dependent variable

3. Binary dependent variable

3.1 Continuous independent variable

3.1.a Level independent variable

3.1.b Log transformed independent variable

3.2 Binary independent variable

4. Interaction effects

4.1 Quadratic effect

4.2 Interaction between two continuous variables

4.3 Interaction between two binary variables

Special case: Difference-in-Difference


0. Definitions and important considerations

General definition

Definition of the model:

Yᵢ = β₀ + β₁ Xᵢ + Z’ λ+ ϵᵢ

Yᵢ is the dependent variable, Xᵢ is an independent continuous variable, Z a vector of control variables, and ϵᵢ an error term.

Analytically, β₀ is the intercept of the function, and β₁ is a slope parameter. Hence β₁ represents the change in Y after a one-unit increase in X, everything else equal (keeping the values in Z fixed).

Formally we can write β₁ = ∂ Y / ∂ X (the partial derivative of Y with respect to X).

Ceteris paribus

In multiple linear regressions, we use the term ceteris paribus, which means "all other things being equal". This is a direct consequence of the definition above. The partial derivative measures the changes in a function due to a change in one variable, with all other variables remaining fixed. Thus, in the model above, β₁ represents the effect of a change in X on Y while keeping everything else in the vector of control variables Z fixed.

Binary variables

The dependent or independent variable can be a binary variable, i.e. a variable taking either the value 1 or the value 0.

When the independent variable is a binary variable, the coefficient should be interpreted as a difference in expectation. Suppose that Dᵢ is a binary variable taking the value 1 if the person in the dataset is an adult (Age≥21) and 0 otherwise. The model would be Yᵢ = β₀ + β₁ Dᵢ + Z’ λ+ ϵᵢ. According to the general definition of the interpretation of the coefficient given above (partial derivative), a change of one unit here means a change from childhood to adulthood. Therefore, the coefficient should be interpreted as the ‘average’ difference in Y between children and adults. Formally, β₁ = E[Y |D=1,Z]-E[Y |D=0,Z]. In other words, β₁ represents the difference between the ‘average’ value of Y for adults and the average value of Y for children.

When the dependent variable is a binary variable, we are in the case of a Linear Probability Model (LPM). The coefficients of the regression represent a change in the probability that the dependent variable is equal to 1. LPMs are often criticized, mainly because they can predict negative probabilities or probabilities greater than 1. To avoid this problem, we could rely on logit (logistic regression) or probit models instead. However, LPMs are still often used because the coefficients are easy to interpret (see Section 3). In particular, with fixed effects, which are widely used in econometrics, LPMs are often more ‘suitable’ (cf: Estimating Fixed Effects Logit Models with Large Panel Data).

Perfect multicollinearity

If two or more independent variables can be expressed as an exact linear relationship to each other, we have a perfect multicollinearity problem. (Technically, the matrix of explanatory variables would not be of full rank and thus would not be invertible, which is essential for estimating the regression parameters.) For binary and categorical variables, this fact leads to an important point. You cannot include binary variables capturing all the different possible categories, otherwise you would have a perfect multicollinearity problem. For example, in the same regression, you cannot include a binary variable for adults and non-adults. In fact, this is not a problem because the regression coefficient associated with the variable (here Adult) represents the difference with the reference category (the one that is excluded, here Non-Adult). So it is not useful to have both.

Now, if you have more than two categories, for example Light, Medium and Heavy, you cannot have one binary variable for all of them. You have to exclude one of them.

Example:

Yᵢ = β₀ + β₁ Mediumᵢ + β₂ Heavyᵢ + Z’ λ + ϵᵢ

Here we excluded the category Light. Hence, β₀ actually represents the expectation of Y for observations belonging to the category Light. The excluded category becomes the reference category, meaning that the coefficients will always represent the difference in expectations compared to this category:

β₁ = E[Y|Medium=1,Z]-E[Y |Light=1,Z]

β₂ = E[Y|Heavy=1,Z]-E[Y |Light=1,Z]
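As a minimal illustration (hypothetical data; pandas and statsmodels are one possible toolset, not necessarily the one used elsewhere in this article), treatment coding with an explicit reference category looks like this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a continuous outcome Y and a three-level categorical variable
df = pd.DataFrame({
    "Y": [10, 12, 15, 11, 14, 18, 9, 13, 17],
    "weight_class": ["Light", "Medium", "Heavy"] * 3,
})

# Treatment coding automatically drops one category; here we force "Light" to be
# the reference, so each coefficient is a difference in means relative to Light
model = smf.ols("Y ~ C(weight_class, Treatment(reference='Light'))", data=df).fit()
print(model.params)  # Intercept = mean of Light; other terms = difference vs. Light
```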

Log transformation

Why do we log transform?

Note that the log transformation is often used in linear regression. Linear regression measures the average effect, so when a variable is heavily right-skewed, a common approach is to apply the natural logarithm to transform it. This strategy aims to reduce the skewness and therefore makes the mean more meaningful. My personal rule is that if the skewness (a measure of the asymmetry of the distribution) is greater than 3, I log transform the variable.
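As a quick sketch of this rule of thumb (simulated data; scipy's skew is one way to measure skewness, and the threshold of 3 is only a personal heuristic):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed variable (e.g., income)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1.2, size=5_000)

print(f"Skewness before log transform: {skew(income):.2f}")
log_income = np.log(income)  # use np.log1p instead if zeros are possible
print(f"Skewness after log transform:  {skew(log_income):.2f}")
```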

What should I look out for with the log transformation?

When a variable is transformed into a logarithm, the interpretation of the coefficient changes. This is not necessarily a bad thing and may even be beneficial. When a variable is log-transformed, the linear regression coefficient can be interpreted as an elasticity or semi-elasticity. As we will see later, this means that instead of looking at changes in the unit of the variable, we look at changes in percentages. In the case of a level regression (no log transformation), with the regression coefficients corresponding to a partial derivative (∂ Y / ∂ X ), a change of one unit in X implies a change of β₁ unit in Y (with Y the dependent variable, X the independent variable, and β₁ the regression coefficient associated with X). When both variables (dependent and independent) are log-transformed, we interpret the regression coefficient approximately as an elasticity: a 1% change in X implies a β₁% change in Y.

Careful Exploratory Data Analysis (EDA) should be performed before and after the transformation to make the right decision. You can follow my EDA recipe here: https://medium.com/towards-data-science/a-recipe-to-empirically-answer-any-question-quickly-22e48c867dd5 . In addition, please refer to "Log-transformation and its implications for data analysis" (Feng et al. (2002)) for more details on the potential issues.


1. Intercept

Definition of the model:

Yᵢ = β₀ + β₁ Xᵢ + Z’ λ+ ϵᵢ

Y is the dependent variable, X is an independent variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: β₀ is the expected value of Y when all the other variables are set to 0. Note that if the explanatory variable(s) can never equal zero (e.g., height or GDP), it makes no sense to interpret this coefficient.


2. Continuous dependent variable

In this Section 2, the dependent variable Yᵢ is always continuous.

2.1 Continuous independent variable

In this sub-Section 2.1, the independent variable Xᵢ is always continuous.

2.1.a level-level: Level outcome, level independent variable

Definition of the model:

Yᵢ = β₀ + β₁ Xᵢ + Z’ λ+ ϵᵢ

Yᵢ is the dependent variable, Xᵢ is an independent variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: A one-unit increase of X implies a β₁ unit change of Yᵢ on average (ceteris paribus, everything else held constant).
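Here is a minimal sketch with simulated data (numpy and statsmodels; the coefficients are purely hypothetical) showing how the fitted slope matches this reading:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: true beta_1 = 2, with one control variable Z
rng = np.random.default_rng(42)
n = 1_000
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = 1.0 + 2.0 * X + 0.5 * Z + rng.normal(size=n)

design = sm.add_constant(np.column_stack([X, Z]))
fit = sm.OLS(Y, design).fit()
print(fit.params)  # ~[1.0, 2.0, 0.5]: one extra unit of X shifts Y by ~2 units, ceteris paribus
```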

2.1.b log-level: Log transformed outcome, level independent variable

Definition of the model:

log(Yᵢ) = β₀ + β₁ Xᵢ + Z’ λ+ ϵᵢ

log(Yᵢ) is the log-transformed dependent variable, Xᵢ is an independent variable, and ϵᵢ an error term.

Interpretation: A one-unit increase of X implies a (exp(β₁) – 1) * 100 percent change of Y on average (ceteris paribus, everything else held constant). For a quick approximation, you can interpret the coefficient as a semi-elasticity: a one-unit increase of X implies a 100 * β₁ percent change in Y on average (ceteris paribus, everything else held constant).
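A quick numerical check, with a purely hypothetical coefficient, shows the gap between the exact and the approximate reading:

```python
import numpy as np

beta_1 = 0.30  # hypothetical coefficient from a log-level regression

approx_pct = 100 * beta_1               # quick approximation: 30%
exact_pct = (np.exp(beta_1) - 1) * 100  # exact effect: ~34.99%
print(approx_pct, round(exact_pct, 2))
```

The approximation is only reliable when β₁ is small (roughly below 0.1 in absolute value).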

2.1.c level-log: Level outcome, Log transformed independent variable

Definition of the model:

Yᵢ = β₀ + β₁ log(Xᵢ) + Z’ λ+ ϵᵢ

Yᵢ is the dependent variable, log(Xᵢ) is a log-transformed independent variable, and ϵᵢ an error term.

Interpretation: A one percent increase of X implies a β₁*log(1.01) change of Y on average (ceteris paribus, everything else held constant). For a quick approximation, you can interpret the coefficient as a semi-elasticity: A one percent increase of X implies a β₁ / 100 unit change in Y on average (ceteris paribus, everything else held constant).

2.1.d log-log: Log transformed outcome, Log transformed independent variable

Definition of the model:

log(Yᵢ) = β₀ + β₁ log(Xᵢ) + Z’ λ+ ϵᵢ

log(Yᵢ) is the log-transformed dependent variable, log(Xᵢ) is a log-transformed independent variable, and ϵᵢ an error term.

Interpretation: A one percent increase of X implies a (1.01^β₁ – 1) * 100 percent change of Yᵢ on average (ceteris paribus, everything else held constant). For a quick approximation, you can interpret the coefficient as an elasticity: a one percent increase of X implies a β₁ percent change in Y on average (ceteris paribus, everything else held constant).
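Again, a quick numerical check with a hypothetical coefficient shows how close the elasticity approximation is:

```python
beta_1 = 1.5  # hypothetical coefficient from a log-log regression

approx_pct = beta_1                     # quick reading: ~1.5% change in Y for a 1% change in X
exact_pct = (1.01 ** beta_1 - 1) * 100  # exact: ~1.504%
print(approx_pct, round(exact_pct, 3))
```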


2.2 Binary independent variable

In this sub-Section 2.2, the independent variable Dᵢ is a binary variable taking either the value 1 or the value 0.

2.2.a Level dependent variable

Definition of the model:

Yᵢ = β₀ + β₁ Dᵢ+ Z’ λ+ ϵᵢ

Yᵢ is the dependent variable, Dᵢ is an independent binary variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: Recall that formally β₁ = E[Yᵢ |Dᵢ=1,Z]-E[Yᵢ |Dᵢ=0,Z]. In other words, the difference in expectation when Dᵢ changes from 0 to 1 is equal to β₁, everything else equal.

To make it more concrete let me use the following example:

HoursOfSleepᵢ = β₀ + β₁ Adultᵢ+ Z’ λ+ ϵᵢ.

In this example, β₁ represents the "average" difference of hours of sleep between adults (when Adult = 1) and non-adults (aka children, when Adult = 0) everything else equal.

2.2.b Log transformed dependent variable

Definition of the model:

log(Yᵢ) = β₀ + β₁ Dᵢ+ Z’ λ+ ϵᵢ

log(Yᵢ) is the log-transformed dependent variable, Dᵢ is an independent binary variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: Recall that the coefficient of a binary variable represents a difference in "means" (conditional expectations). However, here due to the log transformation we have:

β₁ = log(E[Yᵢ |Dᵢ=1,Z]) – log(E[Yᵢ |Dᵢ=0,Z]) = log(E[Yᵢ |Dᵢ=1,Z]/E[Yᵢ |Dᵢ=0,Z])

⇔ exp(β₁) = E[Yᵢ |Dᵢ=1,Z]/E[Yᵢ |Dᵢ=0,Z]

To make it more concrete let me use the following example: log(HoursOfSleepᵢ) = β₀ + β₁ Adultᵢ+ Z’ λ+ ϵᵢ. In this example, exp(β₁) represents the ratio of the mean hours of sleep for adults (when Adultᵢ = 1) over non-adults (aka children, when Adultᵢ = 0) everything else equal. If exp(β₁) = 1.1 it would mean that the adults have 10% more hours of sleep compared to children. While if exp(β₁) = 1.5 it would mean that the adults have 50% more hours of sleep compared to children.

Note that in this context the mean is the geometric mean (for more details see: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/).
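As a small numerical illustration (with a purely hypothetical coefficient), the ratio interpretation is obtained by exponentiating the estimate:

```python
import numpy as np

beta_1 = 0.41  # hypothetical coefficient on Adult in a log(HoursOfSleep) regression

ratio = np.exp(beta_1)  # ratio of (geometric) mean hours of sleep, adults over children
print(f"Ratio: {ratio:.2f} -> adults sleep about {(ratio - 1) * 100:.0f}% more hours")  # ~51%
```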


3. Binary dependent variable

In this Section 3, the dependent variable Dᵢ is always binary (taking the value 1 or 0). In this case, the model is called a Linear Probability Model (see the note section 0. for more details).

3.1 Continuous independent variable

In this sub-Section 3.1, the independent variable Xᵢ is a continuous variable.

3.1.a Level independent variable

Definition of the model:

Dᵢ = β₀ + β₁ Xᵢ + Z’ λ+ ϵᵢ

Dᵢ is the binary dependent variable, Xᵢ is an independent continuous variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: A one-unit increase of X implies a change in the probability that D = 1 of β₁ on average (ceteris paribus, everything else held constant). For example, if β₁=0.1 it means that the probability of D being equal to 1 increases by 0.1 on average (ceteris paribus, everything else held constant).

3.1.b Log transformed independent variable

Definition of the model:

Dᵢ = β₀ + β₁ log(Xᵢ) + Z’ λ+ ϵᵢ

Dᵢ is the binary dependent variable, log(Xᵢ) is a log transformed independent continuous variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: A one percent increase of X implies a change in the probability that D = 1 of β₁ * log(1.01), i.e. approximately β₁ / 100, on average (ceteris paribus, everything else held constant). For example, if β₁ = 0.1, the probability of D being equal to 1 increases by about 0.001 on average following a one percent increase of X (ceteris paribus, everything else held constant).

3.2 Binary independent variable

Definition of the model:

Dᵢ = β₀ + β₁ Bᵢ + Z’ λ+ ϵᵢ

Dᵢ is the binary dependent variable, Bᵢ is an independent binary variable, Z is a vector of control variables, and ϵᵢ an error term.

Interpretation: When B changes from 0 to 1, the probability that D = 1 changes by β₁, everything else equal. For example, if β₁ = 0.1, the probability of D being equal to 1 increases by 0.1 on average when B changes from 0 to 1 (ceteris paribus, everything else held constant).

To make it more concrete let me use the following example:

Insomniaᵢ = β₀ + β₁ Adultᵢ + Z’ λ+ ϵᵢ.

Insomnia is a binary variable taking the value 1 if individual "i" suffers from insomnia and 0 otherwise. Adult is a binary variable taking the value 1 if individual "i" is strictly older than 20 years old and 0 otherwise. In this example, if β₁ = 0.1 it means that the probability of an Adult (Adult = 1) suffering from insomnia is larger than for a Child (Adult = 0) by 0.1 on average, everything else equal.


4. Interaction effects

Linear regressions are linear in parameters, which does not prevent the estimation of a non-linear function. We will see three different types of interactions. For the sake of parsimony, I will only use one continuous dependent variable in this section: Y.

4.1 Quadratic effect

Definition of the model:

Yᵢ = β₀ + β₁ Xᵢ + β₂ Xᵢ * Xᵢ+ Z’ λ+ ϵᵢ

or

Yᵢ = β₀ + β₁ Xᵢ + β₂ Xᵢ²+ Z’ λ+ ϵᵢ

Yᵢ is a continuous dependent variable, Xᵢ is an independent continuous variable, Z is a vector of control variables, and ϵᵢ an error term. This model includes a polynomial function of order 2 for X. Linear regressions could include higher-order polynomial functions.

Interpretation: The interpretation is more complex with a polynomial form because the partial derivative (the marginal effect) is no longer constant. In the current situation, ∂ Y / ∂ X = β₁ + 2 * β₂ Xᵢ. As the marginal effect depends on the value of X, we must evaluate it for different meaningful values of X.

To do so, I compute and plot the marginal effect of changing X on Y for different values of X. In STATA you could use the commands margins and marginsplot, in R you can use marginaleffects, while in Python I use the following code:
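A minimal sketch of such a computation, using simulated data (numpy, statsmodels, and matplotlib; the exact code may differ):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulated data with a true quadratic relationship (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = 1.0 + 2.0 * x + 1.0 * x**2 + rng.normal(size=500)

# Regress Y on X and X squared
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
b0, b1, b2 = fit.params
print(fit.params)

# Plot the data and the fitted quadratic curve
grid = np.linspace(-3, 3, 200)
plt.scatter(x, y, alpha=0.3, label="data")
plt.plot(grid, b0 + b1 * grid + b2 * grid**2, color="red", label="fitted quadratic")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
```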

The graph above reveals the quadratic relationship between X and Y. Hence, the marginal effect of X on Y is initially negative and then it becomes positive. The code below allows us to compute the marginal effect in this particular case. Note that I computed the derivative by hand for this second-degree polynomial function as you can see from the first line of code.
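Continuing the sketch above, the marginal effect β₁ + 2 β₂ X can be computed with the derivative written out by hand and then plotted:

```python
# Marginal effect of X on Y, with the derivative of the quadratic written by hand
marginal_effect = b1 + 2 * b2 * grid  # dY/dX = beta_1 + 2 * beta_2 * X

plt.plot(grid, marginal_effect)
plt.axhline(0, linestyle="--", color="grey")
plt.xlabel("X")
plt.ylabel("Marginal effect of X on Y")
plt.show()
```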

We can see from the graph above that for values of X approximately below -1 the marginal effect is negative, and then it becomes positive.

4.2 Interaction between two continuous variables

In some situations we expect two different variables to interact. In a recent publication in PNAS, jointly with my colleague Prof. Rohner, we explored the relationship between a strategic position to control maritime trade, trade openness, and conflicts (https://www.pnas.org/doi/abs/10.1073/pnas.2105624118?doi=10.1073%2Fpnas.2105624118). Trade openness and the strategic value of a position each have a direct effect on the probability of conflict. But there is also an important joint effect (an interaction), as shown in the series of graphs below.

Image by author. The figure reveals the effect of the proximity to a main strategic position to control maritime trade routes (X-axis) on the probability of conflict (Y-axis). The observations are split by quartile of proximity (q1, q2, q3 and q4), meaning that q4 are the closest regions to strategic positions.

The graph above shows that the closer we are to a strategic location, the higher the risk of conflict.

Image by author. The figure reveals the relationship of trade openness (X-axis) on the probability of conflict (Y-axis). The observations are split by quartile of trade openness (q1, q2, q3 and q4).

This second graph shows that the greater the trade openness, the higher the risk of conflict. Thus, years of trade booms have a higher risk of conflict.

Image by author. This bar graph represents the relationship between the distance to a strategic position divided into quartiles (near_dist_q), the trade openness during the year (tradew_q), and the probability of conflict (y-axis).

This last graph shows how our two explanatory variables interact. On the one hand, we can see in the first group of bars on the left that locations far from strategic positions for maritime trade have a higher risk of conflict during periods of trade expansion (yellow bar). On the other hand, we can see that this relationship is reversed for locations close to strategic positions (group of bars on the far right).

Hence, the marginal effect of the distance to strategic positions on the probability of conflict changes as a function of trade openness. To model this effect we must use an interaction effect.

Definition of the model:

Yᵢ = β₀ + β₁ Xᵢ + β₂ Zᵢ + β₃ Zᵢ*Xᵢ + ϵᵢ

Yᵢ is a continuous dependent variable, Xᵢ and Zᵢ are independent continuous variables, and ϵᵢ an error term. It is very important to note that when we have an interaction effect, we must also include each "main" term in the model. For example, if we want to have an interaction between X and Z, we must also include X and Z alone in the regression.

Interpretation: To understand how to interpret such an effect we have to go back to the definition of the marginal effect, which is a partial derivative. In our case there are two different partial derivatives including β₃:

∂ Y / ∂ X = β₁ + β₃ Zᵢ

∂ Y / ∂ Z = β₂ + β₃ Xᵢ

Hence, in this situation, as in the polynomial situation, we have to evaluate the marginal effect for a set of meaningful values of Z or X, as the marginal effect is a function of those variables. To do so, in STATA I use the commands margins and marginsplot, in R marginaleffects, while in Python I use the following code:
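A minimal sketch of such a computation with simulated data (numpy, statsmodels, and matplotlib; the exact code may differ):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulated data with a true interaction between x and z (illustration only)
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-3, 3, size=n)
z = rng.uniform(-3, 3, size=n)
y = 0.5 + 1.0 * x + 1.0 * z + 1.0 * x * z + rng.normal(size=n)

# Regress y on x, z and their interaction
X = sm.add_constant(np.column_stack([x, z, x * z]))
fit = sm.OLS(y, X).fit()
b0, b1, b2, b3 = fit.params
print(fit.params)

# Plot the fitted regression surface over a grid of (x, z) values
gx, gz = np.meshgrid(np.linspace(-3, 3, 30), np.linspace(-3, 3, 30))
surface = b0 + b1 * gx + b2 * gz + b3 * gx * gz
ax = plt.figure().add_subplot(projection="3d")
ax.scatter(x, z, y, alpha=0.2)
ax.plot_surface(gx, gz, surface, alpha=0.5, color="red")
ax.set_xlabel("x")
ax.set_ylabel("z")
ax.set_zlabel("y")
plt.show()
```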

The figure above plots the regression plane (no longer a regression line, since Y is now a function of both X and Z). The following code will allow us to plot the marginal effect of X on Y as a function of the values of Z.
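Continuing the sketch above, the marginal effect of x as a function of z can be computed and plotted as follows:

```python
# Marginal effect of x on y as a function of z: dy/dx = beta_1 + beta_3 * z
z_grid = np.linspace(-3, 3, 100)
marginal_effect_x = b1 + b3 * z_grid

plt.plot(z_grid, marginal_effect_x)
plt.axhline(0, linestyle="--", color="grey")
plt.xlabel("z")
plt.ylabel("Marginal effect of x on y")
plt.show()
```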

We can see from the graph above that the marginal effect of x on y is negative when z is approximately lower than -1 while it becomes positive for larger values.

4.3 Interaction between two binary variables

As in the previous section, we sometimes have binary variables that interact with each other. A famous example of this is wage discrimination. Let us imagine a correlational model to calculate the average wage differences between men and women, whites and non-whites. In this model it is important to include an interaction term if one wants to test the hypothesis that wage discrimination against non-white women is different (potentially larger) than the sum of discrimination against women and non-whites.

Model:

Wageᵢ = β₀ + β₁ Womanᵢ + β₂ NonWhiteᵢ + β₃ Womanᵢ * NonWhiteᵢ + ϵᵢ

Wageᵢ is the hourly wage, Womanᵢ is a binary variable taking the value 1 if individual "i" is female, NonWhiteᵢ is a binary variable taking the value 1 if individual "i" is non-white, and ϵᵢ an error term. It is very important to note that when we have an interaction effect, we must also include each "main" term in the model. For example, if we want to get an interaction between X and Z, we must also include X and Z alone in the regression (here we include the interaction but also Woman and NonWhite separately).

Interpretation: First, let us interpret the main effects:

β₁ = E[Wageᵢ | Womanᵢ = 1, NonWhiteᵢ = 0] – E[Wageᵢ | Womanᵢ = 0, NonWhiteᵢ = 0]

β₁ captures the average wage difference between white women and white men. Note that the other terms are 0 as NonWhite is set to 0 (including the interaction).

β₂ = E[Wageᵢ | NonWhiteᵢ = 1, Womanᵢ = 0] – E[Wageᵢ | NonWhiteᵢ = 0, Womanᵢ = 0]

β₂ captures the average wage difference between non-white men and white men. Note that the other terms are 0 as Woman is set to 0 (including the interaction).

β₃ = (E[Wageᵢ | NonWhiteᵢ = 1, Womanᵢ = 1] – E[Wageᵢ | NonWhiteᵢ = 0, Womanᵢ = 1])

– (E[Wageᵢ | NonWhiteᵢ = 1, Womanᵢ = 0] – E[Wageᵢ | NonWhiteᵢ = 0, Womanᵢ = 0])

Finally, β₃ is the additional wage penalty (assuming that the coefficient is negative) for being both non-white AND a woman.
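To see these readings in practice, here is a minimal sketch with purely hypothetical wage data (statsmodels formulas; the `*` operator automatically includes the main terms alongside the interaction):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Purely hypothetical wage data, generated only to illustrate the readings above
rng = np.random.default_rng(2)
n = 2_000
woman = rng.integers(0, 2, size=n)
nonwhite = rng.integers(0, 2, size=n)
wage = 30 - 4 * woman - 3 * nonwhite - 2 * woman * nonwhite + rng.normal(scale=5, size=n)
df = pd.DataFrame({"wage": wage, "woman": woman, "nonwhite": nonwhite})

# "woman * nonwhite" expands to woman + nonwhite + woman:nonwhite (main terms included)
fit = smf.ols("wage ~ woman * nonwhite", data=df).fit()
print(fit.params)  # the woman:nonwhite coefficient (~ -2) is the additional penalty beta_3
```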

Special case: Difference-in-Difference

There is another common situation in which we use interactions between binary variables: a quasi-experimental technique widely used in econometrics called difference-in-difference. This strategy aims to measure the causal effect of a policy, for example. However, a discussion of how to assess causality in this setting is beyond the scope of this article (see Scott Cunningham’s free e-book: "Causal Inference: The Mixtape").

Let me discuss a basic hypothetical example of a Diff-in-Diff model. In 2008, the UK implemented a policy to reduce CO₂ emissions, while Ireland was not subject to this policy. To evaluate the effect of this policy on pollution we could set up the following model:

CO₂EmissionsPerCapitaᵢₜ = β₀ + β₁ UKᵢ + β₂ Postₜ + β₃ UKᵢ * Postₜ + ϵᵢₜ

i and t are indices for country and year respectively, CO₂EmissionsPerCapitaᵢₜ is self-explanatory, UKᵢ is a binary variable taking the value one if observation i is for the UK, Postₜ is a binary variable taking the value one if the observation is measured after the implementation of the policy (after 2008), and ϵᵢₜ an error term.

Interpretation:

First, let us interpret the main effects. β₁ is the average difference in CO₂ emissions per capita between the UK and Ireland over the whole period. β₂ is the average difference in CO₂ emissions per capita between the Post period (after 2008) and the Pre period (both countries together). Now the important coefficient in this setup is β₃.

β₃ = (E[CO₂EmissionsPerCapitaᵢₜ | Postₜ = 1, UKᵢ = 1] – E[CO₂EmissionsPerCapitaᵢₜ | Postₜ = 0, UKᵢ = 1])

– (E[CO₂EmissionsPerCapitaᵢₜ | Postₜ = 1, UKᵢ = 0] – E[CO₂EmissionsPerCapitaᵢₜ | Postₜ = 0, UKᵢ = 0])

β₃ is indeed a double difference. It represents the change in UK CO₂ emissions after the policy is implemented compared to the change in CO₂ emissions in Ireland (a country that did not implement the policy). Therefore, assuming Ireland is a good counterfactual, the additional difference captured by β₃ represents the effect of the policy.
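As a minimal sketch with purely hypothetical data (statsmodels; the country names and numbers are only for illustration), the difference-in-difference estimate is the coefficient on the interaction term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical two-country panel: the UK is treated from 2008, Ireland is the control
rng = np.random.default_rng(3)
rows = []
for country in ["UK", "Ireland"]:
    for year in range(2000, 2017):
        uk = int(country == "UK")
        post = int(year >= 2008)
        co2 = 10 + 1.0 * uk + 0.5 * post - 2.0 * uk * post + rng.normal(scale=0.3)
        rows.append({"country": country, "year": year, "uk": uk, "post": post, "co2": co2})
df = pd.DataFrame(rows)

# beta_3 is the coefficient on the interaction uk:post, the diff-in-diff estimate
did = smf.ols("co2 ~ uk * post", data=df).fit()
print(did.params["uk:post"])  # ~ -2: the policy's estimated effect on UK emissions
```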

