Boosting is an ensemble technique that combines many weak models into one stronger model and, along the way, weighs the predictors so you can see which are strongest. In order to do this, you tell R to use a specific base learner such as a tree or regression model. R then fits many models or trees in sequence, with each new one trying to reduce the error left by the models before it. The weight of each predictor is based on the amount of error it reduces across the models.
We will now go through an example of boosting using the “College” dataset from the “ISLR” package.
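To make the idea concrete, here is a minimal base R sketch of boosting for a numeric outcome, assuming squared error and one-split trees ("stumps"). It is an illustration only, not what the gbm package does internally; the simulated data and the names fit_stump, shrinkage, n_trees, and rel_inf are hypothetical.

```r
# Minimal boosting sketch: squared error, one-split "stumps", base R only.
# All data and names here are hypothetical, chosen for illustration.
set.seed(1)
n <- 200
x <- data.frame(x1 = runif(n), x2 = runif(n))   # x1 matters, x2 is pure noise
y <- 10 * (x$x1 > 0.5) + rnorm(n, sd = 0.5)

# Find the single split (variable and cutpoint) that most reduces squared error
fit_stump <- function(x, r) {
  best <- list(gain = -Inf)
  for (v in names(x)) {
    for (cut in quantile(x[[v]], probs = seq(0.1, 0.9, by = 0.1))) {
      left <- x[[v]] <= cut
      pred <- ifelse(left, mean(r[left]), mean(r[!left]))
      gain <- sum((r - mean(r))^2) - sum((r - pred)^2)
      if (gain > best$gain) {
        best <- list(var = v, cut = cut, gain = gain,
                     left = mean(r[left]), right = mean(r[!left]))
      }
    }
  }
  best
}

shrinkage <- 0.1                       # size of each correction step
n_trees <- 50                          # number of stumps to fit
pred <- rep(mean(y), n)                # start from the overall mean
influence <- c(x1 = 0, x2 = 0)         # error reduction credited to each predictor

for (m in seq_len(n_trees)) {
  r <- y - pred                        # residuals: what is still unexplained
  s <- fit_stump(x, r)                 # fit the next stump to the residuals
  step <- ifelse(x[[s$var]] <= s$cut, s$left, s$right)
  pred <- pred + shrinkage * step      # move predictions a little toward the fit
  influence[s$var] <- influence[s$var] + s$gain
}

# Relative influence: each predictor's share of the total error reduction
rel_inf <- 100 * influence / sum(influence)
print(round(rel_inf, 1))
```

Each stump is fit to the residuals of the current predictions, which is how later models correct the errors of earlier ones. The predictor that keeps getting chosen for splits accumulates most of the error reduction, so it ends up with the largest relative influence.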
Load Packages and Setup Training/Testing Sets
First, we will load the required packages and create the needed datasets. Below is the code for this.
library(ISLR); data("College"); library(ggplot2); library(caret)
intrain <- createDataPartition(y=College$Grad.Rate, p=0.7, list=FALSE)
trainingset <- College[intrain, ]; testingset <- College[-intrain, ]
Develop the Model
We will now create the model. For this example, we are going to use all of the variables in the dataset to predict graduation rate. To use all available variables, put the “.” symbol on the right side of the formula instead of listing every variable name. Below is the code.
Model <- train(Grad.Rate ~., method='gbm', data=trainingset, verbose=FALSE)
The method we used is ‘gbm’, which stands for gradient boosting machine. In caret, this method uses the gbm package to fit boosted decision trees, so the base learners being boosted are trees.
Once the model is created, you can check the results by using the following code.
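When method='gbm' is used, caret tunes four parameters: n.trees, interaction.depth, shrinkage, and n.minobsinnode. The sketch below shows how you might supply your own grid instead of the defaults. The candidate values, the seed, the 5-fold cross-validation, and the name TunedModel are assumptions made for illustration, not recommendations.

```r
library(ISLR); library(caret)
data("College")

set.seed(123)  # arbitrary seed, used only so the sketch is repeatable
intrain <- createDataPartition(y = College$Grad.Rate, p = 0.7, list = FALSE)
trainingset <- College[intrain, ]

# Candidate values are illustrative guesses, not tuned recommendations
grid <- expand.grid(n.trees = c(50, 100, 150),        # number of trees
                    interaction.depth = c(1, 2, 3),   # depth of each tree
                    shrinkage = 0.1,                  # learning rate
                    n.minobsinnode = 10)              # minimum observations per leaf

# 5-fold cross-validation to compare the grid rows
ctrl <- trainControl(method = "cv", number = 5)

TunedModel <- train(Grad.Rate ~ ., method = "gbm", data = trainingset,
                    tuneGrid = grid, trControl = ctrl, verbose = FALSE)

# The combination with the lowest cross-validated RMSE
TunedModel$bestTune
```

Smaller shrinkage values usually need more trees to reach the same fit, so these two parameters are normally tuned together.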
summary(Model)
The output is as follows (for the first 5 variables only).
                    var    rel.inf
Outstate       Outstate 36.1745640
perc.alumni perc.alumni 14.0532312
Top25perc     Top25perc 13.0194117
Apps               Apps  5.7415103
F.Undergrad F.Undergrad  5.7016602
These results tell you which variables are most influential in predicting graduation rate. The strongest predictor was “Outstate”, which is the out-of-state tuition charged by the college. Relative influence shows how much a variable helps the model, not the direction of the relationship, so it does not by itself say that higher tuition goes with a higher graduation rate. You can check the direction by running a correlation test between ‘Outstate’ and ‘Grad.Rate’.
The next two variables are ‘perc.alumni’, the percentage of alumni who donate, and ‘Top25perc’, the percentage of new students from the top 25% of their high school class. The relative influence values rank these predictors but do not show direction, so a correlation check is needed to confirm that higher values go with a higher grad rate.
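The correlation check for the top predictors can be sketched as follows. The names top_vars and cors are hypothetical, and Pearson correlation is only one reasonable choice here.

```r
library(ISLR)
data("College")

# Pearson correlation between each top predictor and graduation rate
top_vars <- c("Outstate", "perc.alumni", "Top25perc")
cors <- sapply(top_vars, function(v) cor(College[[v]], College$Grad.Rate))
print(round(cors, 3))

# Formal test for the strongest predictor
cor.test(College$Outstate, College$Grad.Rate)
```

A positive correlation means that higher values of the predictor tend to go with a higher graduation rate, which is the direction that relative influence alone cannot tell you.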
A Visual
We will now make a plot comparing the predicted grad rate with the actual grad rate. Below is the code that produces the plot.
qplot(predict(Model, testingset), Grad.Rate, data = testingset)
The model looks sound based on the visual inspection.
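A visual check can be backed up with a number. The sketch below refits the model so that it runs on its own, then uses caret's postResample() to compare predictions with the actual values in the testing set. The seed and the names preds and metrics are assumptions, and your values will differ from run to run without a fixed seed.

```r
library(ISLR); library(caret)
data("College")

set.seed(123)  # arbitrary seed, used only so the sketch is repeatable
intrain <- createDataPartition(y = College$Grad.Rate, p = 0.7, list = FALSE)
trainingset <- College[intrain, ]
testingset <- College[-intrain, ]

Model <- train(Grad.Rate ~ ., method = "gbm", data = trainingset, verbose = FALSE)

# Predictions on data the model has not seen
preds <- predict(Model, testingset)

# Test-set error metrics, including RMSE and R-squared
metrics <- postResample(pred = preds, obs = testingset$Grad.Rate)
print(metrics)
```

RMSE is in the same units as Grad.Rate (percentage points), so it can be read as the typical size of a prediction error.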
Conclusion
Boosting is a useful way to find out which predictors are strongest. It is an excellent way to explore a model and identify the variables worth carrying forward into later analysis.