In this post, we are going to learn how to use the “caret” package to automatically tune a machine learning model. This is perhaps the simplest way to evaluate the performance of several candidate models. In a later post, we will explore how to perform custom tuning of a model.
The model we are trying to tune is the decision tree we made using the C5.0 algorithm in a previous post. Specifically, we were trying to predict sex based on the variables available in the “Wages1” dataset in the “Ecdat” package.
In order to accomplish our goal, we need to load the “caret” and “Ecdat” packages, load the “Wages1” dataset, and set the seed. Setting the seed allows us to reproduce our results. Below is the code for these steps.
library(caret); library(Ecdat)
data(Wages1); set.seed(1)
We will now build and display our model using the code below.
tuned_model<-train(sex ~., data=Wages1, method="C5.0")
tuned_model
## C5.0
##
## 3294 samples
## 3 predictors
## 2 classes: 'female', 'male'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3294, 3294, 3294, 3294, 3294, 3294, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa Accuracy SD Kappa SD
## rules FALSE 1 0.5892713 0.1740587 0.01262945 0.02526656
## rules FALSE 10 0.5938071 0.1861964 0.01510209 0.03000961
## rules FALSE 20 0.5938071 0.1861964 0.01510209 0.03000961
## rules TRUE 1 0.5892713 0.1740587 0.01262945 0.02526656
## rules TRUE 10 0.5938071 0.1861964 0.01510209 0.03000961
## rules TRUE 20 0.5938071 0.1861964 0.01510209 0.03000961
## tree FALSE 1 0.5841768 0.1646881 0.01255853 0.02634012
## tree FALSE 10 0.5930511 0.1855230 0.01637060 0.03177075
## tree FALSE 20 0.5930511 0.1855230 0.01637060 0.03177075
## tree TRUE 1 0.5841768 0.1646881 0.01255853 0.02634012
## tree TRUE 10 0.5930511 0.1855230 0.01637060 0.03177075
## tree TRUE 20 0.5930511 0.1855230 0.01637060 0.03177075
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = rules
## and winnow = TRUE.
There is a lot of information in this printout. The first column is the type of model developed. Two types of models were developed: a rules-based classifier and a standard decision tree. Next is the winnow column, which indicates whether a winnowing process was used to remove poor predictor variables.
The next two columns are accuracy and kappa, which have been explained previously. The last two columns are the standard deviations of accuracy and kappa. None of the models performs particularly well, but the purpose here is teaching.
At the bottom of the printout, caret tells you which model was best. For us, the best model was the fifth model from the top: a rules-based model with 10 trials and winnow set to “TRUE”.
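If you prefer to work with these results programmatically rather than reading the printout, the object returned by train() stores them directly. Below is a minimal sketch using standard components of a caret train object; it assumes the tuned_model object created above.
# Best combination of tuning parameters chosen by caret
tuned_model$bestTune
# Full table of resampling results for every parameter combination
tuned_model$results
# Plot accuracy across the tuning parameters
plot(tuned_model)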
We will now use the best model (the “caret” package picks it automatically) to make predictions on the training data. We will also look at the confusion matrix of the classifications, followed by their proportions. Below is the code.
predict_model<-predict(tuned_model, Wages1)
table(predict_model, Wages1$sex)
##
## predict_model female male
## female 936 590
## male 633 1135
prop.table(table(predict_model, Wages1$sex))
##
## predict_model female male
## female 0.2841530 0.1791135
## male 0.1921676 0.3445659
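As a side note, the “caret” package can also summarize these counts for us, along with accuracy and kappa, through its confusionMatrix() function. The short sketch below assumes the predict_model object created above.
# Overall accuracy computed from the proportion table
sum(diag(prop.table(table(predict_model, Wages1$sex))))
# caret's confusionMatrix() reports accuracy, kappa, sensitivity, and more
confusionMatrix(predict_model, Wages1$sex)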
In terms of prediction, the model was correct about 62% of the time (.28 + .34 = .62). If we want, we can also see the predicted probabilities for each example using the following code.
probTable<-(predict(tuned_model, Wages1, type="prob"))
head(probTable)
## female male
## 1 0.6191287 0.38087132
## 2 0.2776770 0.72232303
## 3 0.2975327 0.70246734
## 4 0.7195866 0.28041344
## 5 1.0000000 0.00000000
## 6 0.9092993 0.09070072
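These probabilities are what the model uses to assign each observation to a class. If, for example, we only wanted to classify someone as “female” when the model is at least 70% sure, we could apply our own cutoff to the probability table. This is only an illustration of how the probabilities might be used; the 0.7 threshold is an arbitrary choice, not part of the original model.
# Apply a custom probability cutoff (0.7 is arbitrary, for illustration only)
custom_pred <- ifelse(probTable$female >= 0.7, "female", "male")
head(custom_pred)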
Conclusion
In this post, we looked at an automated way to determine the best model among many using the “caret” package. Understanding how to improve the performance of a model is a critical skill in machine learning.