Generalized linear models (GLMs) are another way to approach linear regression. The advantage of a GLM is that it allows the error to follow many different distributions rather than only the normal distribution, which is an assumption of traditional linear regression.
GLMs are often used for response (dependent) variables that are binary or represent count data. This post provides a brief explanation of GLMs along with a worked example.
Key Information
There are three important components to a GLM:
- Error structure
- Linear predictor
- Link function
The error structure is the type of distribution you will use in generating the model. There are many different distributions in statistical modeling, such as binomial, Gaussian, and Poisson. Each distribution comes with certain assumptions that govern its use.
The linear predictor is the sum of the effects of the independent variables. Lastly, the link function determines the relationship between the linear predictor and the mean of the dependent variable. There are many different link functions, and a common way to choose among them is to prefer the one that reduces the residual deviance the most.
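As a minimal sketch of how these three components appear in R, the family objects in base R bundle an error distribution with a link function. The `eta` values below are made-up linear predictor values used only for illustration.

```r
# Default link function for three common error structures
sapply(list(gaussian(), binomial(), poisson()), function(f) f$link)

# A binomial family with the logit link (the default for binomial)
fam <- binomial(link = "logit")

# Hypothetical linear predictor values (illustration only)
eta <- c(-2, 0, 2)

# The inverse link maps the linear predictor to the mean (a probability here)
mu <- fam$linkinv(eta)
mu

# The link function maps the mean back to the linear predictor
fam$linkfun(mu)
```

A linear predictor of 0 corresponds to a probability of 0.5 under the logit link, which is why the middle value of `mu` is 0.5.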
In our example, we will try to predict whether a house has air conditioning based on the interaction between the number of bedrooms and bathrooms, the number of stories, and the price of the house. To do this, we will use the “Housing” dataset from the “Ecdat” package. Below is some initial code to get started.
library(Ecdat)
data("Housing")
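Before modeling, it can help to look at the variables involved. The sketch below assumes the “Ecdat” package is installed; the `vars` vector is just a convenience name introduced here.

```r
library(Ecdat)
data("Housing")

# Variables used in the first model of this post
vars <- c("airco", "bedrooms", "bathrms", "stories", "price")

# Structure of the selected columns
str(Housing[, vars])

# How many houses have air conditioning
table(Housing$airco)

# Level order of the response; with a factor response, a binomial glm
# treats the first level as "failure" and the other level as "success"
levels(Housing$airco)
```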
The dependent variable “airco” in the “Housing” dataset is binary. This calls for us to use a GLM. To do this we will use the “glm” function in R. Furthermore, in our example, we want to determine if there is an interaction between the number of bedrooms and bathrooms. Interaction means that the influence of the two independent variables (bathrooms and bedrooms) on the dependent variable (airco) is not additive: their combined effect is different from what you would get by simply adding their separate effects together. Below is the code for the model followed by a summary of the results.
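A quick sketch of R's formula notation may help here. In a formula, `a * b` is shorthand for both main effects plus their interaction, `a + b + a:b`. The names `y`, `a`, and `b` are placeholders, not columns of the dataset.

```r
# Term labels produced by the shorthand and by the expanded formula
labels_star <- attr(terms(y ~ a * b), "term.labels")
labels_long <- attr(terms(y ~ a + b + a:b), "term.labels")

labels_star

# Both formulas describe the same set of terms
identical(labels_star, labels_long)
```

This is why the summary for the model lists bedrooms, bathrooms, and a separate bedrooms:bathrooms row. With that notation in mind, the model call and its summary follow.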
model<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price, family=binomial)
summary(model)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
## Housing$stories + Housing$price, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7069  -0.7540  -0.5321   0.8073   2.4217
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.441e+00  1.391e+00  -4.632 3.63e-06
## Housing$bedrooms                  8.041e-01  4.353e-01   1.847   0.0647
## Housing$bathrms                   1.753e+00  1.040e+00   1.685   0.0919
## Housing$stories                   3.209e-01  1.344e-01   2.388   0.0170
## Housing$price                     4.268e-05  5.567e-06   7.667 1.76e-14
## Housing$bedrooms:Housing$bathrms -6.585e-01  3.031e-01  -2.173   0.0298
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 681.92 on 545 degrees of freedom
## Residual deviance: 549.75 on 540 degrees of freedom
## AIC: 561.75
##
## Number of Fisher Scoring iterations: 4
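One way to read the interaction term is sketched below, using point estimates copied by hand from the summary above. The coefficients are on the log-odds scale, so the effect of one extra bathroom is the bathroom coefficient plus the interaction coefficient times the number of bedrooms. The variable names are introduced here for illustration only.

```r
# Coefficients copied from the summary output above (log-odds scale)
b_bathrms     <- 1.753
b_interaction <- -0.6585
b_stories     <- 0.3209

# Effect of one extra bathroom on the log-odds, for houses with 1 to 4 bedrooms
bedrooms <- 1:4
bath_effect <- b_bathrms + b_interaction * bedrooms
data.frame(bedrooms, bath_effect)

# Odds ratio for one additional story, holding the other variables fixed
exp(b_stories)
```

On these point estimates, the bathroom effect shrinks as the number of bedrooms grows and turns negative at three bedrooms, which is what a non-additive effect looks like in practice.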
To check how good our model is, we need to check for overdispersion and compare this model to other potential models. Overdispersion means there is more variability in the data than the assumed distribution allows. A common check is to divide the residual deviance by the residual degrees of freedom. The calculation is below.
549.75/540
## [1] 1.018056
Our answer is about 1.02, which is pretty good: values close to 1 suggest no overdispersion, and ours is very close.
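The same ratio can be taken from the fitted model object instead of being typed in by hand. This is a sketch that assumes the “Ecdat” package is installed; it refits the first model using the `data` argument, which is an equivalent way of writing the same formula, and the names `fit` and `dispersion_ratio` are introduced here.

```r
library(Ecdat)
data("Housing")

# Same model as above, written with the data argument
fit <- glm(airco ~ bedrooms * bathrms + stories + price,
           family = binomial, data = Housing)

# Residual deviance divided by residual degrees of freedom
dispersion_ratio <- deviance(fit) / df.residual(fit)
dispersion_ratio
```

Using `deviance()` and `df.residual()` avoids copying rounded numbers from the printed summary.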
Now we will fit two more models and compare the results.
Model 2
#add recroom and garagepl
model2<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, family=binomial)
summary(model2)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
## Housing$stories + Housing$price + Housing$recroom + Housing$garagepl,
## family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6733  -0.7522  -0.5287   0.8035   2.4239
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.369e+00  1.401e+00  -4.545 5.51e-06
## Housing$bedrooms                  7.830e-01  4.391e-01   1.783   0.0745
## Housing$bathrms                   1.702e+00  1.047e+00   1.626   0.1039
## Housing$stories                   3.286e-01  1.378e-01   2.384   0.0171
## Housing$price                     4.204e-05  6.015e-06   6.989 2.77e-12
## Housing$recroomyes                1.229e-01  2.683e-01   0.458   0.6470
## Housing$garagepl                  2.555e-03  1.308e-01   0.020   0.9844
## Housing$bedrooms:Housing$bathrms -6.430e-01  3.054e-01  -2.106   0.0352
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 681.92 on 545 degrees of freedom
## Residual deviance: 549.54 on 538 degrees of freedom
## AIC: 565.54
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.54/538
## [1] 1.02145
Model 3
#add fullbase
model3<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl, family=binomial)
summary(model3)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
## Housing$stories + Housing$price + Housing$recroom + Housing$fullbase +
## Housing$garagepl, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6629  -0.7436  -0.5295   0.8056   2.4477
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.424e+00  1.409e+00  -4.559 5.14e-06
## Housing$bedrooms                  8.131e-01  4.462e-01   1.822   0.0684
## Housing$bathrms                   1.764e+00  1.061e+00   1.662   0.0965
## Housing$stories                   3.083e-01  1.481e-01   2.082   0.0374
## Housing$price                     4.241e-05  6.106e-06   6.945 3.78e-12
## Housing$recroomyes                1.592e-01  2.860e-01   0.557   0.5778
## Housing$fullbaseyes              -9.523e-02  2.545e-01  -0.374   0.7083
## Housing$garagepl                 -1.394e-03  1.313e-01  -0.011   0.9915
## Housing$bedrooms:Housing$bathrms -6.611e-01  3.095e-01  -2.136   0.0327
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 681.92 on 545 degrees of freedom
## Residual deviance: 549.40 on 537 degrees of freedom
## AIC: 567.4
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.4/537
## [1] 1.023091
Now we can assess the models by using the “anova” function with the “test” argument set to “Chi” for the chi-square test.
anova(model, model2, model3, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories +
## Housing$price
## Model 2: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories +
## Housing$price + Housing$recroom + Housing$garagepl
## Model 3: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories +
## Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       540     549.75
## 2       538     549.54  2  0.20917   0.9007
## 3       537     549.40  1  0.14064   0.7076
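The p-values in this table can be checked by hand. As a small sketch, each row compares the drop in residual deviance against a chi-square distribution whose degrees of freedom equal the number of added parameters. The numbers are copied from the table above; the variable names are introduced here.

```r
# Drop in deviance and degrees of freedom for each comparison
dev_drop <- c(0.20917, 0.14064)   # model 1 vs 2, model 2 vs 3
df_drop  <- c(2, 1)

# Upper-tail chi-square probability for each deviance drop
p_values <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
round(p_values, 4)
```

Both values are far above 0.05, so the extra variables do not produce a detectable improvement in fit.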
The results of the anova indicate that the models are all essentially the same, as there is no statistically significant difference between them. Since the additional variables do not improve the fit, we select a model using the measure of overdispersion. The first model has the lowest ratio (1.018 versus 1.021 and 1.023), and it also has the lowest AIC (561.75 versus 565.54 and 567.4), so it is the best by these criteria. Therefore, determining if a house has air conditioning depends on examining the number of bedrooms and bathrooms simultaneously, as well as the number of stories and the price of the house.
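The AIC values printed in the three summaries offer a second way to compare the models. As a sketch using only numbers from the output above: for a binary response, the AIC equals the residual deviance plus twice the number of estimated coefficients. The vector names are introduced here.

```r
# Residual deviance and number of estimated coefficients for each model
dev <- c(model1 = 549.75, model2 = 549.54, model3 = 549.40)
k   <- c(model1 = 6, model2 = 8, model3 = 9)

# AIC = deviance + 2 * k for a binary (0/1) response
aic <- dev + 2 * k
aic

# Model with the smallest AIC
names(which.min(aic))
```

The penalty of 2 per coefficient outweighs the tiny drop in deviance from the added variables, which is why the smaller model comes out ahead.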
Conclusion
This post explained how to use and interpret a GLM in R. GLMs are used primarily for fitting data to distributions that are not normal.