In this blog, we have already discussed what gradient boosting is. As a brief recap, gradient boosting improves model performance by first fitting an initial model, called the base learner, using an algorithm of your choice (linear, tree, etc.).
Gradient boosting then looks at the errors of this first model and fits a second model to them, guided by a loss function. The loss function measures the gap between the current predictions and the desired outcome, whether that is misclassification in a classification problem or error in a regression problem. This process is repeated, with additional models added, until a desired level of accuracy or reduction in error is attained.
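The idea of repeatedly fitting small models to the previous residuals can be shown in a few lines of base R. This is only a toy sketch for regression with depth-1 "stumps" and made-up data, not the model built later in this post:

```r
# Toy gradient boosting for regression in base R (illustration only)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# A depth-1 "stump": pick the split on x that most reduces squared error
fit_stump <- function(x, r) {
  splits <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  best <- NULL; best_sse <- Inf
  for (s in splits) {
    left <- r[x <= s]; right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}
predict_stump <- function(st, x) ifelse(x <= st$split, st$left, st$right)

shrinkage <- 0.1
pred <- rep(mean(y), length(y))  # base learner: the overall mean
for (i in 1:100) {
  r <- y - pred                  # residuals play the role of the errors
  st <- fit_stump(x, r)          # fit the next model to those errors
  pred <- pred + shrinkage * predict_stump(st, x)  # add a shrunken correction
}
mean((y - pred)^2)  # training error shrinks as models are added
```

Each pass fits a stump to what the current ensemble still gets wrong, and the shrinkage keeps any single stump from dominating, which is exactly the role the shrinkage parameter plays in the tuning grid later in this post.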
This post will provide an example of using gradient boosting for classification. Specifically, we will try to predict a person’s labor-force participation based on several independent variables.
library(gbm);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Data Preparation
We need to transform the ‘age’ variable by multiplying by ten so that the ages are realistic. In addition, we need to convert “lnnlinc” from the log of salary to regular salary. Below is the code to transform these two variables.
Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
We can now create our train and test datasets.
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
We now need to create our grid and control. The grid lets us fit several models with different parameter settings, which is important for finding the most appropriate model, since that is always determined by comparison. Because we are boosting trees, we need to set the number of trees we want, the depth of the trees, the shrinkage (which controls the influence of each tree), and the minimum number of observations in a terminal node. The control object sets up the cross-validation. Below is the code for creating the grid and control.
grid<-expand.grid(.n.trees=seq(200,500,by=200),.interaction.depth=seq(1,3,by=2),.shrinkage=seq(.01,.09,by=.04),
.n.minobsinnode=seq(1,5,by=2)) #grid features
control<-trainControl(method="CV",number = 10) #control
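For reference, expand.grid simply enumerates every combination of the supplied tuning values. A small self-contained sketch with the same value ranges (column names simplified, without the leading dots used above):

```r
# expand.grid enumerates every combination of the tuning values
g <- expand.grid(n.trees = seq(200, 500, by = 200),        # 200, 400
                 interaction.depth = seq(1, 3, by = 2),    # 1, 3
                 shrinkage = seq(.01, .09, by = .04),      # 0.01, 0.05, 0.09
                 n.minobsinnode = seq(1, 5, by = 2))       # 1, 3, 5
nrow(g)  # 2 * 2 * 3 * 3 = 36 candidate models
```

This is why the resampling output below lists 36 rows: each row is one combination from the grid, evaluated with 10-fold cross-validation.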
Parameter Selection
Now we set our seed and run the gradient boosted model.
set.seed(123)
gbm.lfp.train<-train(lfp~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.lfp.train
## Stochastic Gradient Boosting
##
## 636 samples
## 6 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 573, 573, 571, 572, 573, 572, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy
## 0.01 1 1 200 0.6666026
## 0.01 1 1 400 0.6823306
## 0.01 1 3 200 0.6588637
## 0.01 1 3 400 0.6854804
## 0.01 1 5 200 0.6792769
## 0.01 1 5 400 0.6823306
## 0.01 3 1 200 0.6730044
## 0.01 3 1 400 0.6572051
## 0.01 3 3 200 0.6793273
## 0.01 3 3 400 0.6697787
## 0.01 3 5 200 0.6682914
## 0.01 3 5 400 0.6650416
## 0.05 1 1 200 0.6759558
## 0.05 1 1 400 0.6508040
## 0.05 1 3 200 0.6681426
## 0.05 1 3 400 0.6602286
## 0.05 1 5 200 0.6680441
## 0.05 1 5 400 0.6570788
## 0.05 3 1 200 0.6493662
## 0.05 3 1 400 0.6603518
## 0.05 3 3 200 0.6540545
## 0.05 3 3 400 0.6366911
## 0.05 3 5 200 0.6712428
## 0.05 3 5 400 0.6445299
## 0.09 1 1 200 0.6461405
## 0.09 1 1 400 0.6634768
## 0.09 1 3 200 0.6571036
## 0.09 1 3 400 0.6320765
## 0.09 1 5 200 0.6554922
## 0.09 1 5 400 0.6540755
## 0.09 3 1 200 0.6523920
## 0.09 3 1 400 0.6430140
## 0.09 3 3 200 0.6430666
## 0.09 3 3 400 0.6447749
## 0.09 3 5 200 0.6540522
## 0.09 3 5 400 0.6524416
## Kappa
## 0.3210036
## 0.3611194
## 0.3032151
## 0.3667274
## 0.3472079
## 0.3603046
## 0.3414686
## 0.3104335
## 0.3542736
## 0.3355582
## 0.3314006
## 0.3258459
## 0.3473532
## 0.2961782
## 0.3310251
## 0.3158762
## 0.3308353
## 0.3080692
## 0.2940587
## 0.3170198
## 0.3044814
## 0.2692627
## 0.3378545
## 0.2844781
## 0.2859754
## 0.3214156
## 0.3079460
## 0.2585840
## 0.3062307
## 0.3044324
## 0.3003943
## 0.2805715
## 0.2827956
## 0.2861825
## 0.3024944
## 0.3002135
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 400,
## interaction.depth = 1, shrinkage = 0.01 and n.minobsinnode = 3.
The tuning process above gives us the recommended parameters for our model, as well as the accuracy and kappa of each candidate. Before fitting the final model, we also need to recode the dependent variable as 0 and 1, because the ‘gbm’ function with a bernoulli distribution expects a numeric 0/1 outcome.
Model Training
train$lfp=ifelse(train$lfp=="no",0,1)
gbm.lfp<-gbm(lfp~., distribution = 'bernoulli',data=train,n.trees = 400,interaction.depth = 1,shrinkage=.01,n.minobsinnode = 3)
You can see a summary of the most important variables for prediction as well as a plot by using the “summary” function.
summary(gbm.lfp)
## var rel.inf
## lnnlinc lnnlinc 28.680447
## age age 27.451474
## foreign foreign 23.307932
## nyc nyc 18.375856
## educ educ 2.184291
## noc noc 0.000000
Salary (lnnlinc), age, and foreign status are the most important predictors, followed by the number of young children (nyc) and, lastly, education. The number of older children (noc) has no effect. We can now test our model on the test set.
Model Testing
gbm.lfp.test<-predict(gbm.lfp,newdata = test,type = 'response', n.trees = 400)
Our model returns a set of probabilities. We need to convert these to a simple ‘yes’ or ‘no’, which is done in the code below.
gbm.class<-ifelse(gbm.lfp.test<0.5,'no','yes')
We can now look at a table to see how accurate our model is as well as calculate the accuracy.
table(gbm.class,test$lfp)
##
## gbm.class no yes
## no 91 39
## yes 39 67
(accuracy<-(91+67)/(91+67+39+39))
## [1] 0.6694915
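Hard-coding the cell counts works, but the accuracy can also be computed directly from the predictions, which avoids copying numbers from the table. A small self-contained sketch with made-up probabilities and labels (not the model’s actual output):

```r
# Made-up probabilities and true labels for illustration only
probs <- c(0.12, 0.58, 0.44, 0.91, 0.35)
truth <- c("no", "yes", "yes", "yes", "no")
pred_class <- ifelse(probs < 0.5, "no", "yes")
table(pred_class, truth)   # confusion matrix
mean(pred_class == truth)  # accuracy: fraction of correct predictions
```

With the real data, the same pattern would be mean(gbm.class == as.character(test$lfp)).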
The model is not great. However, you now have an example of how to use gradient boosting to develop a tree-based classification model.