Developing an Automatically Tuned Model in R

In this post, we are going to learn how to use the “caret” package to automatically tune a machine learning model. This is perhaps the simplest way to evaluate the performance of several models. In a later post, we will explore how to perform custom tuning of a model.

The model we are trying to tune is the decision tree we made using the C5.0 algorithm in a previous post. Specifically, we were trying to predict sex based on the variables available in the “Wages1” dataset from the “Ecdat” package.

To accomplish our goal, we will need to load the “caret” and “Ecdat” packages, load the “Wages1” dataset, and set the seed. Setting the seed will allow us to reproduce our results. Below is the code for these steps.

library(caret); library(Ecdat)
data(Wages1)
set.seed(1)
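
If you have not worked with this dataset before, a quick call to str() will show the available variables and their types (an optional check).

str(Wages1)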

We will now build and display our model using the code below.

tuned_model<-train(sex ~., data=Wages1, method="C5.0")
tuned_model
## C5.0 
## 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3294, 3294, 3294, 3294, 3294, 3294, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   rules  FALSE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules  FALSE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules  FALSE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules   TRUE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   tree   FALSE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree   FALSE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree   FALSE   20      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree    TRUE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE   20      0.5930511  0.1855230  0.01637060   0.03177075
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were trials = 10, model = rules
##  and winnow = TRUE.

There is a lot of information printed out. The first column is the type of model developed: either a rules-based classifier or a standard decision tree. Next is the winnow column, which indicates whether a winnowing process was used to remove poor predictor variables.

The next two columns are accuracy and kappa, which have been explained previously. The last two columns are the standard deviations of accuracy and kappa. None of the models performs particularly well, but the purpose here is instructional.
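
All of these estimates come from the 25 bootstrap resamples listed in the printout. If you would rather resample differently, train() also accepts a trainControl object. Below is a minimal sketch (not run in this post), assuming we wanted 10-fold cross-validation and kappa as the selection metric instead of the defaults.

ctrl<-trainControl(method="cv", number=10)
cv_model<-train(sex ~., data=Wages1, method="C5.0", trControl=ctrl, metric="Kappa")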

Returning to the printout, at the bottom R tells you which model was best. For us, the best model was the fifth from the top: a rules-based model with 10 trials and winnow set to “TRUE”.
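
We can also confirm this programmatically, since the train object stores the winning parameter combination in its bestTune element.

tuned_model$bestTune   # one-row data frame with trials, model, and winnow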

We will now use the best model (the caret package picks it automatically) to make predictions on the training data. We will also look at the confusion matrix of the classifications, followed by their proportions. Below is the code.

predict_model<-predict(tuned_model, Wages1)
table(predict_model, Wages1$sex)
##              
## predict_model female male
##        female    936  590
##        male      633 1135
prop.table(table(predict_model, Wages1$sex))
##              
## predict_model    female      male
##        female 0.2841530 0.1791135
##        male   0.1921676 0.3445659
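
The same table, together with accuracy, kappa, and several other statistics, can be produced in a single call with caret’s confusionMatrix() function (output not shown here).

confusionMatrix(predict_model, Wages1$sex)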

Either way, in terms of prediction the model was correct about 62% of the time (.28 + .34 = .62). If we want, we can also see the predicted probabilities for each example using the following code.

probTable<-predict(tuned_model, Wages1, type="prob")
head(probTable)
##      female       male
## 1 0.6191287 0.38087132
## 2 0.2776770 0.72232303
## 3 0.2975327 0.70246734
## 4 0.7195866 0.28041344
## 5 1.0000000 0.00000000
## 6 0.9092993 0.09070072
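
These probabilities are what determine the class predictions we saw earlier: with two classes, the predicted label is essentially the one whose probability is higher. A small sketch of that mapping:

# the label with the higher probability matches the class predictions above
head(ifelse(probTable$male > 0.5, "male", "female"))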

Conclusion

In this post, we looked at an automated way to determine the best model among many using the “caret” package. Understanding how to improve the performance of a model is a critical skill in machine learning.
