One of the assumptions of linear regression is the additive assumption: the effect of a predictor variable on the dependent variable does not depend on the values of the other predictors. In practice, this assumption often does not hold.
In this post, we will look at how to address violations of the additive assumption through the use of interactions in a regression model.
An interaction effect occurs when the effect of one predictor variable on the dependent variable depends on the value of another predictor. When this happens, the two variables must be considered together rather than separately, which is done by adding an interaction term to the model. An interaction term is simply the product of the two predictor variables.
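As a quick illustration (using a toy data frame, not the Carseats data yet), the sketch below shows that the interaction column R builds in the design matrix is just the elementwise product of the two predictors, and that writing x1*x2 in a formula expands to both main effects plus that product.
# Toy example: how R encodes an interaction term in the design matrix
toy <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))
# x1*x2 in a formula expands to x1 + x2 + x1:x2;
# the x1:x2 column is simply x1 * x2, i.e. 10, 40, 90
model.matrix(~ x1 * x2, data = toy)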
Let’s begin by fitting a regular regression model without any interactions. To do this we will use the “Carseats” data from the “ISLR” package to predict “Sales”. Below is the code.
library(ISLR);library(ggplot2)
data(Carseats)
saleslm<-lm(Sales~.,Carseats)
summary(saleslm)
##
## Call:
## lm(formula = Sales ~ ., data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8692 -0.6908 0.0211 0.6636 3.4115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6606231 0.6034487 9.380 < 2e-16 ***
## CompPrice 0.0928153 0.0041477 22.378 < 2e-16 ***
## Income 0.0158028 0.0018451 8.565 2.58e-16 ***
## Advertising 0.1230951 0.0111237 11.066 < 2e-16 ***
## Population 0.0002079 0.0003705 0.561 0.575
## Price -0.0953579 0.0026711 -35.700 < 2e-16 ***
## ShelveLocGood 4.8501827 0.1531100 31.678 < 2e-16 ***
## ShelveLocMedium 1.9567148 0.1261056 15.516 < 2e-16 ***
## Age -0.0460452 0.0031817 -14.472 < 2e-16 ***
## Education -0.0211018 0.0197205 -1.070 0.285
## UrbanYes 0.1228864 0.1129761 1.088 0.277
## USYes -0.1840928 0.1498423 -1.229 0.220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared: 0.8734, Adjusted R-squared: 0.8698
## F-statistic: 243.4 on 11 and 388 DF, p-value: < 2.2e-16
The results are rather strong for the social sciences: the model explains 87.3% of the variance in “Sales”. The coefficients we have so far are known as main effects, that is, the direct effect of each predictor on the dependent variable. Most regression models only include main effects.
We will now examine an interaction effect between two continuous variables. Let’s see if there is an interaction between “Population” and “Income”.
saleslm1<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
Urban+ShelveLoc+Population*Income, Carseats)
summary(saleslm1)
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population +
## Price + Age + Education + US + Urban + ShelveLoc + Population *
## Income, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8699 -0.7624 0.0139 0.6763 3.4344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.195e+00 6.436e-01 9.625 <2e-16 ***
## CompPrice 9.262e-02 4.126e-03 22.449 <2e-16 ***
## Income 7.973e-03 3.869e-03 2.061 0.0400 *
## Advertising 1.237e-01 1.107e-02 11.181 <2e-16 ***
## Population -1.811e-03 9.524e-04 -1.901 0.0580 .
## Price -9.511e-02 2.659e-03 -35.773 <2e-16 ***
## Age -4.566e-02 3.169e-03 -14.409 <2e-16 ***
## Education -2.157e-02 1.961e-02 -1.100 0.2722
## USYes -2.160e-01 1.497e-01 -1.443 0.1498
## UrbanYes 1.330e-01 1.124e-01 1.183 0.2375
## ShelveLocGood 4.859e+00 1.523e-01 31.901 <2e-16 ***
## ShelveLocMedium 1.964e+00 1.255e-01 15.654 <2e-16 ***
## Income:Population 2.879e-05 1.253e-05 2.298 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.013 on 387 degrees of freedom
## Multiple R-squared: 0.8751, Adjusted R-squared: 0.8712
## F-statistic: 226 on 12 and 387 DF, p-value: < 2.2e-16
The new contribution is at the bottom of the coefficient table: the “Income:Population” coefficient. It tells us how the effect of one of these variables on Sales changes with the level of the other. Because the coefficient is positive, each additional unit of Population slightly increases the slope of Income on Sales (and vice versa). In other words, Income and Population have a combined effect on Sales beyond their separate main effects.
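One way to see this in numbers is the rough sketch below, which uses the coefficients from saleslm1 above to compute the implied slope of Income on Sales at the lower quartile, median, and upper quartile of Population (the quartile choice is just for illustration).
# Conditional slope of Income: b_Income + b_interaction * Population
b <- coef(saleslm1)
pop_values <- quantile(Carseats$Population, c(0.25, 0.50, 0.75))
income_slope <- b["Income"] + b["Income:Population"] * pop_values
round(income_slope, 4)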
This makes practical sense as well: the larger the population, the more income is available, and vice versa. However, for our current model the improvement in the r-squared is small, and the practical effect on sales is modest (a formal comparison of the two models follows the plot code below). Below is a graph of Sales against Income (in blue) and against Population (in black). Notice how the lines cross instead of running parallel; this is what an interaction looks like visually.
ggplot(data=Carseats, aes(x=Income, y=Sales, group=1)) +
  geom_smooth(method=lm, se=F) +
  geom_smooth(aes(Population, Sales), method=lm, se=F, color="black") +
  xlab("Income and Population") +
  labs(title="Income in Blue Population in Black")
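To check whether the interaction term improves the fit beyond what the small change in r-squared suggests, the two models can be compared directly with a partial F test; a minimal sketch (for a single added term this test is equivalent to the t-test on the interaction coefficient in the summary above):
# Compare the main-effects model to the model with the Income:Population
# interaction; anova() on two nested models performs a partial F test.
anova(saleslm, saleslm1)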
We will now repeat this process but this time using a categorical variable and a continuous variable. We will look at the interaction between “US” location (categorical) and “Advertising” (continuous).
saleslm2<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
Urban+ShelveLoc+US*Advertising, Carseats)
summary(saleslm2)
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population +
## Price + Age + Education + US + Urban + ShelveLoc + US * Advertising,
## data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8531 -0.7140 0.0266 0.6735 3.3773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6995305 0.6023074 9.463 < 2e-16 ***
## CompPrice 0.0926214 0.0041384 22.381 < 2e-16 ***
## Income 0.0159111 0.0018414 8.641 < 2e-16 ***
## Advertising 0.2130932 0.0530297 4.018 7.04e-05 ***
## Population 0.0001540 0.0003708 0.415 0.6782
## Price -0.0954623 0.0026649 -35.823 < 2e-16 ***
## Age -0.0463674 0.0031789 -14.586 < 2e-16 ***
## Education -0.0233500 0.0197122 -1.185 0.2369
## USYes -0.1057320 0.1561265 -0.677 0.4987
## UrbanYes 0.1191653 0.1127047 1.057 0.2910
## ShelveLocGood 4.8726025 0.1532599 31.793 < 2e-16 ***
## ShelveLocMedium 1.9665296 0.1259070 15.619 < 2e-16 ***
## Advertising:USYes -0.0933384 0.0537807 -1.736 0.0834 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.016 on 387 degrees of freedom
## Multiple R-squared: 0.8744, Adjusted R-squared: 0.8705
## F-statistic: 224.5 on 12 and 387 DF, p-value: < 2.2e-16
Again, the interaction means that when the store is in the US you also have to consider the advertising budget. The negative “Advertising:USYes” coefficient indicates that each additional unit of advertising raises sales by less in US stores than in non-US stores, although the term is only marginally significant (p = 0.083). In practice this suggests that advertising is not as beneficial in the US as it is outside the US.
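A quick sketch of what these coefficients imply, using the fitted model saleslm2 from above: the slope of Advertising for non-US stores is the main effect alone, while for US stores the interaction term is added to it.
b2 <- coef(saleslm2)
# Advertising slope outside the US (main effect only), roughly 0.21
unname(b2["Advertising"])
# Advertising slope inside the US (main effect plus interaction), roughly 0.12
unname(b2["Advertising"] + b2["Advertising:USYes"])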
The plot below again gives a visual of the interaction effect: the lines for US = Yes and US = No cross each other.
ggplot(data=Carseats, aes(x=Advertising, y=Sales, group=US, colour=US)) +
  geom_smooth(method=lm, se=F) +
  scale_x_continuous(limits=c(0, 25)) +
  scale_y_continuous(limits=c(0, 25))
Lastly, we will look at an interaction effect for two categorical variables.
saleslm3<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
Urban+ShelveLoc+ShelveLoc*US, Carseats)
summary(saleslm3)
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population +
## Price + Age + Education + US + Urban + ShelveLoc + ShelveLoc *
## US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8271 -0.6839 0.0213 0.6407 3.4537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8120748 0.6089695 9.544 <2e-16 ***
## CompPrice 0.0929370 0.0041283 22.512 <2e-16 ***
## Income 0.0158793 0.0018378 8.640 <2e-16 ***
## Advertising 0.1223281 0.0111143 11.006 <2e-16 ***
## Population 0.0001899 0.0003721 0.510 0.6100
## Price -0.0952439 0.0026585 -35.826 <2e-16 ***
## Age -0.0459380 0.0031830 -14.433 <2e-16 ***
## Education -0.0267021 0.0197807 -1.350 0.1778
## USYes -0.3683074 0.2379400 -1.548 0.1225
## UrbanYes 0.1438775 0.1128171 1.275 0.2030
## ShelveLocGood 4.3491643 0.2734344 15.906 <2e-16 ***
## ShelveLocMedium 1.8967193 0.2084496 9.099 <2e-16 ***
## USYes:ShelveLocGood 0.7184116 0.3320759 2.163 0.0311 *
## USYes:ShelveLocMedium 0.0907743 0.2631490 0.345 0.7303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 386 degrees of freedom
## Multiple R-squared: 0.8753, Adjusted R-squared: 0.8711
## F-statistic: 208.4 on 13 and 386 DF, p-value: < 2.2e-16
In this case, the significant “USYes:ShelveLocGood” coefficient tells us that a good shelf location (compared to a bad one) boosts Sales more for US stores than for non-US stores. The plot below is a visual of this, although the pattern is harder to see because the x-axis has only two categories (a coefficient sketch follows the plot code).
ggplot(data=Carseats, aes(x=US, y=Sales, group = ShelveLoc, colour = ShelveLoc)) +
geom_smooth(method=lm,se=F)
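To put a number on this, here is a rough sketch using the coefficients from saleslm3 above: the effect of a good shelf location (relative to a bad one) is the ShelveLocGood main effect for non-US stores, plus the USYes:ShelveLocGood interaction for US stores.
b3 <- coef(saleslm3)
# Good vs. bad shelf location outside the US, roughly 4.35 units of Sales
unname(b3["ShelveLocGood"])
# Good vs. bad shelf location inside the US, roughly 5.07 units of Sales
unname(b3["ShelveLocGood"] + b3["USYes:ShelveLocGood"])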
Conclusion
Interaction effects are a great way to fine-tune a model, especially for explanatory purposes. Often the change in r-squared is too small to matter for prediction, but the interaction terms can still provide a more nuanced understanding of the relationships among the variables.