Boosting in R

Boosting is a technique used to sort through many predictors in order to find the strongest through weighing them. In order to do this, you tell R to use a specific classifier such as a tree or regression model. R than makes multiple models or trees while trying to reduce the error in each model as much as possible. The weight of each predictor is based on the amount of error it reduces as an average across the models.

We will now go through an example of boosting use the “College” dataset from the “ISLR” package.

Load Packages and Setup Training/Testing Sets

First, we will load the required packages and create the needed datasets. Below is the code for this.

library(ISLR); data("College");library(ggplot2);
library(caret)
intrain<-createDataPartition(y=College$Grad.Rate, p=0.7,
 list=FALSE)
trainingset<-College[intrain, ]; testingset<-
 College[-intrain, ]

Develop the Model

We will now create the model. We are going to use all of the variables in the dataset for this example to predict graduation rate. To use all available variables requires the use of the “.” symbol instead of listing every variable name in the model. Below is the code.

Model <- train(Grad.Rate ~., method='gbm', 
 data=trainingset, verbose=FALSE)

The method we used is ‘gbm’ which stands for boosting with trees. This means that we are using the boosting feature for making decision trees.

Once the model is created you can check the results by using the following code

summary(Model)

The output is as follows (for the first 5 variables only).

                    var    rel.inf
Outstate       Outstate 36.1745640
perc.alumni perc.alumni 14.0532312
Top25perc     Top25perc 13.0194117
Apps               Apps  5.7415103
F.Undergrad F.Undergrad  5.7016602

These results tell you what the most influential variables are in predicting graduation rate. The strongest predictor was “Outstate”. This means that as the number of outstate students increases it leads to an increase in the graduation rate. You can check this by running a correlation test between ‘Outstate’ and ‘Grad.Rate’.

The next two variables are the percentage of alumni and top 25 percent. The more alumni the higher the grad rate and the more people in the top 25% the higher the grad rate.

A Visual

We will now make plot comparing the predicted grad rate with the actual grade rate. Below is the code followed by the plot.

qplot(predict(Model, testingset), Grad.Rate, data = testingset)

Rplot10

The model looks sound based on the visual inspection.

Conclusion

Boosting is a useful way to found out which predictors are strongest. It is an excellent way to explore a model to determine this for future processing.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s