Interaction effect in multiple regression

Understanding the interaction effect and how to identify it in a data set using Python's sklearn library.

Sufyan Khot
Towards Data Science


Image source: www.pixabay.com

What is an interaction effect?

The interaction effect appears in statistics as well as in marketing, where the same concept is referred to as the synergy effect. An interaction effect means that two or more features/variables combined have a significantly larger effect on the response variable than the sum of the individual variables alone. This effect is important to understand in regression, where we study the effect of several variables on a single response variable.

A linear regression equation with two predictor variables can be expressed as follows:

Y = β₀ + β₁X₁ + β₂X₂ + ε

Here, we try to find the linear relationship between the independent variables (X₁ and X₂) and the response variable Y, where ε is the irreducible error. To check whether there is a statistically significant relationship between a predictor and the response variable, we conduct a hypothesis test. If we conduct this test for the predictor variable X₁, we have two hypotheses:

Null hypothesis (H₀): There is no relationship between X₁ and Y (β₁ = 0)

Alternative hypothesis (H₁): There is a relationship between X₁ and Y (β₁ ≠ 0)

We then decide whether or not to reject the null hypothesis based on the p-value. The p-value is the probability of observing a result at least as extreme as the one obtained from the test, given that the null hypothesis is true.

For example, the estimated value of β₁ will almost always be non-zero simply because of noise in the data. If the p-value is large, there is a high probability of obtaining a non-zero estimate of β₁ even when the null hypothesis is actually true. In such a case, we fail to reject the null hypothesis and conclude that there is no evidence of a relationship between the predictor and the response variable. But if the p-value is low (a cutoff of 0.05 is commonly used), then even a small non-zero estimate of β₁ indicates a significant relationship between the predictor and the response variable.
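As a quick illustration of this testing procedure, here is a small sketch using simulated data (the numbers are made up purely for demonstration), fitting a single-predictor model with statsmodels and reading off the p-value:

import numpy as np
import statsmodels.api as sm

#simulated data where y does NOT depend on x1 (true beta1 = 0)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y = 5 + rng.normal(size=200)

#add_constant adds the intercept term beta0
result = sm.OLS(y, sm.add_constant(x1)).fit()
print(result.params[1])   #estimated beta1: small but non-zero
print(result.pvalues[1])  #large p-value, so we fail to reject H0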

If we conclude that there is a relationship between X₁ and Y, we interpret β₁ as the amount by which Y increases or decreases for each one-unit increase in X₁. In the linear equation above, we assume that the effect of X₁ on Y is independent of X₂. This is also called the additive assumption in linear regression.

But what if the effect of X₁ on Y also depends on X₂? We see such relationships in many business problems. Consider, for example, that we want to find the return on investment (ROI) for two different investment types. The linear regression equation for this example will be:

ROI = β₀ + β₁·investment1 + β₂·investment2 + ε

In this example, there is a possibility that we would earn greater profit by splitting our money between the two investments rather than putting everything into one. For example, if we have 1000 units of money to invest, investing 500 units in each investment could lead to greater profit than investing all 1000 units in either one. In such a case, investment1's relationship with ROI depends on investment2. This relationship can be included in our equation as follows:

ROI = β₀ + β₁·investment1 + β₂·investment2 + β₃·(investment1 × investment2) + ε

In the equation above, we have included the 'interaction' between investment1 and investment2 for the prediction of the total return on investment. We can include such interactions in any linear regression equation.

Written with generic predictors X₁ and X₂, the equation above becomes:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ε

Here, β₃ is the coefficient of the interaction term. Again, to verify the presence of an interaction effect in regression, we conduct a hypothesis test and check the p-value for our coefficient (in this case β₃).
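To make this concrete, below is a minimal sketch of fitting such an interaction model. The investment data is entirely made up for illustration; the point is only the structure of the model and the p-value check for β₃:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#made-up data in which splitting money across both investments pays off
rng = np.random.default_rng(1)
df = pd.DataFrame({'investment1': rng.uniform(0, 1000, 300),
                   'investment2': rng.uniform(0, 1000, 300)})
df['roi'] = (0.05*df['investment1'] + 0.03*df['investment2']
             + 0.0001*df['investment1']*df['investment2']
             + rng.normal(0, 10, 300))

#the '*' in the formula adds both main effects and their interaction
roi_model = smf.ols('roi ~ investment1 * investment2', data=df).fit()
print(roi_model.pvalues['investment1:investment2'])  #p-value for beta3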

Finding interaction terms in a data set using sklearn

Now let us see how we can verify the presence of an interaction effect in a data set. We will be using the Auto MPG data set as our example. The data set can be downloaded from here. Let us have a look at the data set:

import pandas as pd
data = pd.read_csv('data/auto-mpg.csv')

Converting the data set to numeric and filling in the missing values

#removing irrelevant 'car name' column
data.drop('car name',axis=1,inplace=True)
#converting all columns to numeric
for col in data.columns:
    data[col] = pd.to_numeric(data[col], errors='coerce')
#replacing missing values in horsepower with its median
horse_med = data['horsepower'].median()
data['horsepower'] = data['horsepower'].fillna(horse_med)

Let us fit an OLS (Ordinary Least Squares) model on this data set. This model is available in the statsmodels library.

from statsmodels.regression import linear_model
X = data.drop('mpg', axis=1)
y = data['mpg']
model = linear_model.OLS(y, X).fit()

From this model we can get the coefficient values and check whether they are statistically significant enough to be included in the model.

model.summary()

Below is the snapshot of the model summary.

In the model summary above, we can see that every feature except acceleration has a p-value less than 0.05 and is statistically significant. Even though acceleration on its own is not helpful in predicting mpg, we are interested in finding out whether acceleration has an effect on mpg through its interactions with other variables. More generally, we want to identify all significant interaction terms.
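If you prefer to check this programmatically rather than reading the summary table, the fitted results object exposes the p-values directly:

#p-value of every feature in the main-effects model
print(model.pvalues)
#acceleration is the only feature with a p-value above 0.05
print(model.pvalues['acceleration'])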

We first need to create all possible interaction terms. This can be done in Python by using PolynomialFeatures from the sklearn library.

from sklearn.preprocessing import PolynomialFeatures

#generating interaction terms
x_interaction = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(X)
#creating a new dataframe with the interaction terms included
interaction_df = pd.DataFrame(x_interaction, columns = ['cylinders','displacement','horsepower','weight','acceleration','year','origin',
'cylinders:displacement','cylinders:horsepower','cylinders:weight','cylinders:acceleration',
'cylinders:year','cylinders:origin','displacement:horsepower','displacement:weight',
'displacement:acceleration','displacement:year','displacement:origin','horsepower:weight',
'horsepower:acceleration','horsepower:year','horsepower:origin','weight:acceleration',
'weight:year','weight:origin','acceleration:year','acceleration:origin','year:origin'])
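Writing out all 28 column names by hand is error-prone. If you are on scikit-learn 1.0 or newer, an alternative is to let the transformer generate the names itself (it joins interacting features with a space, which we can swap for a colon):

#alternative: generate the interaction column names programmatically
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
x_interaction = poly.fit_transform(X)
cols = [name.replace(' ', ':') for name in poly.get_feature_names_out(X.columns)]
interaction_df = pd.DataFrame(x_interaction, columns=cols)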

Now that the new dataframe with the interaction terms has been created, we can fit a new model to it and see which interaction terms are significant.

interaction_model = linear_model.OLS(y, interaction_df).fit()

Now we keep only those interaction terms that are statistically significant (p-value less than 0.05):

interaction_model.pvalues[interaction_model.pvalues < 0.05]
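To see the coefficient estimates alongside these p-values, one convenient option is:

#coefficients and p-values of the statistically significant terms
significant = interaction_model.pvalues < 0.05
print(pd.DataFrame({'coefficient': interaction_model.params[significant],
                    'p-value': interaction_model.pvalues[significant]}))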

As we can see, significant interaction terms are indeed present. Also, although acceleration alone is not significant, its interactions with horsepower and year prove to be very important for the prediction of mpg.

It is important to note that in the example above, the p-value of acceleration is high, yet acceleration appears in significant interaction terms. In such a case, the hierarchy principle requires us to keep the main effect of acceleration in the model, i.e. its coefficient, even though it is not statistically significant. The hierarchy principle states that if two features X₁ and X₂ appear in an interaction term, we should include both of their coefficients (β₁ and β₂) in the model even when the p-values associated with them are very high.
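As an illustration of the hierarchy principle, a reduced model would keep the acceleration main effect alongside its significant interactions. The exact set of columns below is hypothetical and depends on which terms come out significant in your run:

#hypothetical reduced model: main effects kept because of the hierarchy
#principle, even though acceleration's own p-value is high
keep = ['horsepower', 'acceleration', 'year',
        'horsepower:acceleration', 'acceleration:year']
reduced_model = linear_model.OLS(y, interaction_df[keep]).fit()
print(reduced_model.summary())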
