In this post, we will explore the potential of bagging. Bagging is a process in which the original data is bootstrapped (resampled with replacement) to make several different datasets. Each of these datasets is used to generate a model, and the models are then combined: voting is used to classify an example, and averaging is used for numeric prediction.
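To make the idea concrete, here is a minimal base-R sketch of bagging done by hand. It is only an illustration under stated assumptions: the toy data, the `fit_stump` helper, the seed, and the choice of 25 bags are all made up for this sketch and are not the method or data used later in the post.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Toy two-class data (an assumption for illustration, not the Wages1 data)
n <- 200
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2))
y <- rep(c("a", "b"), each = n / 2)

# Hypothetical weak learner: a one-split "stump" that returns the threshold
# with the fewest training errors when predicting "b" for x > threshold
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  errs <- vapply(cuts, function(t) mean(ifelse(x > t, "b", "a") != y),
                 numeric(1))
  cuts[which.min(errs)]
}

n_bags <- 25  # odd, so a two-class vote cannot tie

# Each bag: draw n rows with replacement, then fit a stump on that resample
thresholds <- replicate(n_bags, {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Majority vote: every stump votes "b" when x exceeds its own threshold
votes_b <- rowSums(outer(x, thresholds, ">"))
bag_pred <- ifelse(votes_b > n_bags / 2, "b", "a")

mean(bag_pred == y)  # accuracy of the bagged vote on the training data
```

For numeric prediction, the voting step would be replaced by taking the mean of the individual models' predictions.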
Bagging is especially useful for unstable learners. These are algorithms that generate models that can change a great deal when the data is modified by a small amount. To complete this example, you will need to load the following packages, set the seed, and load the dataset “Wages1”. We will be using a decision tree that was developed in an earlier post. Below is the initial code.
We will now use the “bagging” function from the “ipred” package to create our model as well as tell R how many bags to make.
Next, we will make our predictions. Then we will check the accuracy of the model by looking at a confusion matrix and the kappa statistic. The “Kappa” function (note the capital K) comes from the “vcd” package.
bagPred <- predict(theBag, Wages1)
keep <- table(bagPred, Wages1$sex)
keep
##         
## bagPred  female male
##   female   1518   52
##   male       51 1673
Kappa(keep)
##             value      ASE     z Pr(>|z|)
## Unweighted 0.9373 0.006078 154.2        0
## Weighted   0.9373 0.006078 154.2        0
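As a cross-check, the accuracy and the unweighted kappa can be recomputed by hand from the confusion matrix above. The sketch below uses only base R; the object names (`cm`, `kappa_hat`, and so on) are chosen for this illustration.

```r
# Confusion matrix copied from the output above
# (rows: predicted class, columns: actual class)
cm <- matrix(c(1518, 51, 52, 1673), nrow = 2,
             dimnames = list(predicted = c("female", "male"),
                             actual    = c("female", "male")))

n <- sum(cm)                                       # 3294 cases in total
accuracy <- sum(diag(cm)) / n                      # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
# accuracy 0.9687, kappa 0.9373
```

Kappa rescales accuracy so that 0 means "no better than chance agreement" and 1 means perfect agreement, which is why it sits a little below the raw accuracy here.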
The results appear exciting, with almost 97% accuracy. In addition, the kappa was almost 0.94, indicating a well-fitted model. However, these predictions were made on the same data that was used to fit the model, so the figures may be optimistic. To check this, we can cross-validate the bagged model. We will do a 10-fold cross-validation using functions from the “caret” package. Below is the code.
ctrl <- trainControl(method = "cv", number = 10)
trainModel <- train(sex ~ ., data = Wages1, method = "treebag", trControl = ctrl)
trainModel
## Bagged CART 
## 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2965, 2965, 2965, 2964, 2964, 2964, ... 
## Resampling results
## 
##   Accuracy   Kappa       Accuracy SD  Kappa SD  
##   0.5504128  0.09712194  0.02580514   0.05233441
## 
Now the results are not so impressive. In many ways the model is terrible. The accuracy has fallen from almost 97% to about 55%, and the kappa is below 0.1. Remember that cross-validation is an indicator of future performance. This means that our current model would not generalize well to new data.
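The gap between accuracy measured on the training data and accuracy measured by cross-validation can be reproduced with a small base-R sketch. Everything here is an assumption made for illustration: the data are pure noise, and `predict_1nn` is a hypothetical one-nearest-neighbour learner that simply memorizes its training set. It is not the bagged tree used above, but it shows the same pattern in an extreme form.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Synthetic data: three noise predictors and labels unrelated to them
n <- 300
x <- matrix(rnorm(n * 3), ncol = 3)
y <- sample(c("a", "b"), n, replace = TRUE)

# Hypothetical memorizing learner: label of the single nearest training point
predict_1nn <- function(train_x, train_y, new_x) {
  apply(new_x, 1, function(row) {
    d <- colSums((t(train_x) - row)^2)  # squared distance to each training row
    train_y[which.min(d)]
  })
}

# Accuracy when predicting the same rows the learner was trained on
train_acc <- mean(predict_1nn(x, y, x) == y)

# 10-fold cross-validation: each fold is predicted by a model that never saw it
folds <- sample(rep(1:10, length.out = n))
cv_acc <- mean(vapply(1:10, function(k) {
  test <- folds == k
  pred <- predict_1nn(x[!test, , drop = FALSE], y[!test],
                      x[test, , drop = FALSE])
  mean(pred == y[test])
}, numeric(1)))

c(training = train_acc, cross_validated = cv_acc)
```

The training accuracy is perfect because every point is its own nearest neighbour, while the cross-validated accuracy hovers around chance, since there is nothing real to learn in these data.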
Bagging is not limited to decision trees and can, in principle, be used with most machine learning models. The example used in this post was one that required the least time to run. For real datasets, the processing time can be quite long on the average laptop.