Gradient Boosting With Random Forest Classification in R

In this blog, we have already discussed and what gradient boosting is. However, for a brief recap, gradient boosting improves model performance by first developing an initial model called the base learner using whatever algorithm of your choice (linear, tree, etc.).

What follows next is that gradient boosting looks at the error in the first model and develops a second model using what is called the loss function. The loss function calculates the difference between the current accuracy and the desired prediction whether it’s accuracy for classification or error in regression. This process is repeated with the creation of additional models until a certain level of accuracy or reduction in error is attained.

This post what provide an example of the use of gradient boosting in random forest classification. Specifically, we will try to predict a person’s labor participation based on several independent variables.

library(randomForest);library(gbm);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Data Preparation

We need to transform the ‘age’ variable by multiplying by ten so that the ages are realistic. In addition, we need to convert “lnnlinc” from the log of salary to regular salary. Below is the code to transform these two variables.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

We can now create our train and test datasets

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We now need to create our grid and control. The grid allows us to create several different models with various parameter settings. This is important in determining what is the most appropriate model which is always determined by comparing. We are using random forest so we need to set the number of trees we desire, the depth of the trees, the shrinkage which controls the influence of each tree, and the minimum number of observations in a node. The control will allow us to set the cross-validation. Below is the code for the creation of the grid and control.

grid<-expand.grid(.n.trees=seq(200,500,by=200),.interaction.depth=seq(1,3,by=2),.shrinkage=seq(.01,.09,by=.04),
                   .n.minobsinnode=seq(1,5,by=2)) #grid features
control<-trainControl(method="CV",number = 10) #control

Parameter Selection

Now we set our seed and run the gradient boosted model.

set.seed(123)
gbm.lfp.train<-train(lfp~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.lfp.train
## Stochastic Gradient Boosting 
## 
## 636 samples
##   6 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 573, 573, 571, 572, 573, 572, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
##   0.01       1                  1               200      0.6666026
##   0.01       1                  1               400      0.6823306
##   0.01       1                  3               200      0.6588637
##   0.01       1                  3               400      0.6854804
##   0.01       1                  5               200      0.6792769
##   0.01       1                  5               400      0.6823306
##   0.01       3                  1               200      0.6730044
##   0.01       3                  1               400      0.6572051
##   0.01       3                  3               200      0.6793273
##   0.01       3                  3               400      0.6697787
##   0.01       3                  5               200      0.6682914
##   0.01       3                  5               400      0.6650416
##   0.05       1                  1               200      0.6759558
##   0.05       1                  1               400      0.6508040
##   0.05       1                  3               200      0.6681426
##   0.05       1                  3               400      0.6602286
##   0.05       1                  5               200      0.6680441
##   0.05       1                  5               400      0.6570788
##   0.05       3                  1               200      0.6493662
##   0.05       3                  1               400      0.6603518
##   0.05       3                  3               200      0.6540545
##   0.05       3                  3               400      0.6366911
##   0.05       3                  5               200      0.6712428
##   0.05       3                  5               400      0.6445299
##   0.09       1                  1               200      0.6461405
##   0.09       1                  1               400      0.6634768
##   0.09       1                  3               200      0.6571036
##   0.09       1                  3               400      0.6320765
##   0.09       1                  5               200      0.6554922
##   0.09       1                  5               400      0.6540755
##   0.09       3                  1               200      0.6523920
##   0.09       3                  1               400      0.6430140
##   0.09       3                  3               200      0.6430666
##   0.09       3                  3               400      0.6447749
##   0.09       3                  5               200      0.6540522
##   0.09       3                  5               400      0.6524416
##   Kappa    
##   0.3210036
##   0.3611194
##   0.3032151
##   0.3667274
##   0.3472079
##   0.3603046
##   0.3414686
##   0.3104335
##   0.3542736
##   0.3355582
##   0.3314006
##   0.3258459
##   0.3473532
##   0.2961782
##   0.3310251
##   0.3158762
##   0.3308353
##   0.3080692
##   0.2940587
##   0.3170198
##   0.3044814
##   0.2692627
##   0.3378545
##   0.2844781
##   0.2859754
##   0.3214156
##   0.3079460
##   0.2585840
##   0.3062307
##   0.3044324
##   0.3003943
##   0.2805715
##   0.2827956
##   0.2861825
##   0.3024944
##   0.3002135
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 400,
##  interaction.depth = 1, shrinkage = 0.01 and n.minobsinnode = 3.

Gradient boosting provides us with the recommended parameters for our training model as shown above as well as the accuracy and kappa of each model. We also need to recode the dependent variable as 0 and 1 for the ‘gbm’ function.

Model Training

train$lfp=ifelse(train$lfp=="no",0,1)
gbm.lfp<-gbm(lfp~., distribution = 'bernoulli',data=train,n.trees = 400,interaction.depth = 1,shrinkage=.01,n.minobsinnode = 3)

You can see a summary of the most important variables for prediction as well as a plot by using the “summary” function.

summary(gbm.lfp)

1.png

##             var   rel.inf
## lnnlinc lnnlinc 28.680447
## age         age 27.451474
## foreign foreign 23.307932
## nyc         nyc 18.375856
## educ       educ  2.184291
## noc         noc  0.000000

Salary (lnnlinc), age and foreigner status are the most important predictors followed by the number of younger children (nyc) and last education. The number of older children (noc) has no effect. We can now test our model on the test set.

Model Testing

gbm.lfp.test<-predict(gbm.lfp,newdata = test,type = 'response', n.trees = 400)

Our test model returns a set of probabilities. We need to convert this to a simple yes or no and this is done in the code below.

gbm.class<-ifelse(gbm.lfp.test<0.5,'no','yes')

We can now look at a table to see how accurate our model is as well as calculate the accuracy.

table(gbm.class,test$lfp)
##          
## gbm.class no yes
##       no  91  39
##       yes 39  67
(accuracy<-(91+67)/(91+67+39+39))
## [1] 0.6694915

The model is not great. However, you now have an example of how to use gradient boosting to develop a random forest classification model

Leave a Reply