In this post, we will explore the use of the caret package for developing algorithms for use in machine learning. The caret package is particularly useful for processing data before the actual analysis of the algorithm.
When developing algorithms is common practice to divide the data into a training a testing subsamples. The training subsample is what is used to develop the algorithm while the testing sample is used to assess the predictive power of the algorithm. There are many different ways to divide a sample into a testing and training set and one of the main benefits of the “caret” package is in dividing the sample.
In the example we will use, we will return to the “kearnlab” example and this develop an algorithm after sub-setting the sample to have a training data set and a testing data set.
First, you need to download the ‘caret’ and ‘kearnlab’ package if you have not done so. After that below is the code for subsetting the ‘spam’ data from the ‘kearnlab’ package.
inTrain<- createDataPartition(y=spam$type, p=0.75, list=FALSE) training<-spam[inTrain,] testing<-spam[-inTrain,] dim(training)
Here is what we did
- We created the variable ‘inTrain’
- In the variable ‘inTrain’ we told R to make a partition in the data use the ‘createDataPartition’ function. I the parenthesis we told r to look at the dataset ‘spam’ and to examine the variable ‘type’. Then we told are to pull 75% of the data in ‘type’ and copy it to the ‘inTrain’ variable we created. List = False tells R not to make a list. If you look closely, you will see that the variable ‘type’ is being set as the y variable in the ‘inTrain’ data set. This means that all the other variables in the data set will be used as predictors. Also, remember that the ‘type’ variable has two outcomes “spam” or “nonspam”
- Next, we created the variable ‘training’ which is the dataset we will use for developing our algorithm. To make this we take the original ‘spam’ data and subset the ‘inTrain’ partition. Now all the data that is in the ‘inTrain’ partition is now in the ‘training’ variable.
- Finally, we create the ‘testing’ variable which will be used for testing the algorithm. To make this variable, we tell R to take everything that was not assigned to the ‘inTrain’ variable and put it into the ‘testing’ variable. This is done through the use of a negative sign
- The ‘dim’ function just tells us how many rows and columns we have as shown below.
 3451 58
As you can see, we have 3451 rows and 58 columns. Rows are for different observations and columns are for the variables in the data set.
Now to make the model. We are going to bootstrap our sample. Bootstrapping involves random sampling from the sample with replacement in order to assess the stability of the results. Below is the code for the bootstrap and model development followed by an explanation.
set.seed(32343) SpamModel<-train(type ~., data=training, method="glm") SpamModel
Here is what we did,
- Whenever you bootstrap, it is wise to set the seed. This allows you to reproduce the same results each time. For us, we set the seed to 32343
- Next, we developed the actual model. We gave the model the name “SpamModel” we used the ‘train’ function. Inside the parenthesis, we tell r to set “type” as the y variable and then use ~. which is a shorthand for using all other variables in the model as predictor variables. Then we set the data to the ‘training’ data set and indicate that the method is ‘glm’ which means generalized linear model.
- The output for the analysis is available at the link SpamModel
There is a lot of information but the most important information for us is the accuracy of the model which is 91.3%. The kappa stat tells us what the expected accuracy of the model is which is 81.7%. This means that our model is a little bit better than the expected accuracy.
For our final trick, we will develop a confusion matrix to assess the accuracy of our model using the ‘testing’ sample we made earlier. Below is the code
SpamPredict<-predict(SpamModel, newdata=testing) confusionMatrix(SpamPredict, testing$type)
Here is what we did,
- We made a variable called ‘SpamPredict’. We use the function ‘predict’ using the ‘SpamModel’ with the new data called ‘testing’.
- Next, we make matrix using the ‘confusionMatrix’ function using the new model ‘SpamPredict’ based on the ‘testing’ data on the ‘type’ variable. Below is the output
Prediction nonspam spam nonspam 657 35 spam 40 418 Accuracy : 0.9348 95% CI : (0.9189, 0.9484) No Information Rate : 0.6061 P-Value [Acc > NIR] : <2e-16 Kappa : 0.8637 Mcnemar's Test P-Value : 0.6442 Sensitivity : 0.9426 Specificity : 0.9227 Pos Pred Value : 0.9494 Neg Pred Value : 0.9127 Prevalence : 0.6061 Detection Rate : 0.5713 Detection Prevalence : 0.6017 Balanced Accuracy : 0.9327 'Positive' Class : nonspam
The accuracy of the model actually improved to 93% on the test data. The other values such as sensitivity and specificity have to do with such things as looking at correct classifications divided by false negatives and other technical matters. As you can see, machine learning is a somewhat complex experience