# Predicting with Caret

In this post, we will explore the use of the caret package for developing algorithms for use in machine learning. The caret package is particularly useful for processing data before the actual analysis of the algorithm.

When developing algorithms is common practice to divide the data into a training a testing subsamples. The training subsample is what is used to develop the algorithm while the testing sample is used to assess the predictive power of the algorithm. There are many different ways to divide a sample into a testing and  training set and one of the main benefits of the “caret” package is in dividing the sample.

In the example we will use, we will return to the “kearnlab” example and this develop an algorithm after sub-setting the sample to have a training data set and a testing data set.

First, you need to download the ‘caret’ and ‘kearnlab’ package if you have not done so. After that below is the code for subsetting the ‘spam’ data from the ‘kearnlab’ package.

```inTrain<- createDataPartition(y=spam\$type, p=0.75,
list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
dim(training)
```

Here is what we did

1. We created the variable ‘inTrain’
2. In the variable ‘inTrain’ we told R to make a partition in the data use the ‘createDataPartition’ function. I the parenthesis we told r to look at the dataset ‘spam’ and to examine the variable ‘type’. Then we told are to pull 75% of the data in ‘type’ and copy it to the ‘inTrain’ variable we created. List = False tells R not to make a list. If you look closely, you will see that the variable ‘type’ is being set as the y variable in the ‘inTrain’ data set. This means that all the other variables in the data set will be used as predictors. Also, remember that the ‘type’ variable has two outcomes “spam” or “nonspam”
3. Next, we created the variable ‘training’ which is the dataset we will use for developing our algorithm. To make this we take the original ‘spam’ data and subset the ‘inTrain’ partition. Now all the data that is in the ‘inTrain’ partition is now in the ‘training’ variable.
4. Finally, we create the ‘testing’ variable which will be used for testing the algorithm. To make this variable, we tell R to take everything that was not assigned to the ‘inTrain’ variable and put it into the ‘testing’ variable. This is done through the use of a negative sign
5. The ‘dim’ function just tells us how many rows and columns we have as shown below.
` 3451   58`

As you can see, we have 3451 rows and 58 columns. Rows are for different observations and columns are for the variables in the data set.

Now to make the model. We are going to bootstrap our sample. Bootstrapping involves random sampling from the sample with replacement in order to assess the stability of the results. Below is the code for the bootstrap and model development followed by an explanation.

```set.seed(32343)
SpamModel<-train(type ~., data=training, method="glm")
SpamModel```

Here is what we did,

1. Whenever you bootstrap, it is wise to set the seed. This allows you to reproduce the same results each time. For us, we set the seed to 32343
2. Next, we developed the actual model. We gave the model the name “SpamModel” we used the ‘train’ function. Inside the parenthesis, we tell r to set “type” as the y variable and then use ~. which is a shorthand for using all other variables in the model as predictor variables. Then we set the data to the ‘training’ data set and indicate that the method is ‘glm’ which means generalized linear model.
3. The output for the analysis is available at the link SpamModel

There is a lot of information but the most important information for us is the accuracy of the model which is 91.3%. The kappa stat tells us what the expected accuracy of the model is which is 81.7%. This means that our model is a little bit better than the expected accuracy.

For our final trick, we will develop a confusion matrix to assess the accuracy of our model using the ‘testing’ sample we made earlier. Below is the code

```SpamPredict<-predict(SpamModel, newdata=testing)
confusionMatrix(SpamPredict, testing\$type)```

Here is what we did,

1. We made a variable called ‘SpamPredict’. We use the function ‘predict’ using the ‘SpamModel’ with the new data called ‘testing’.
2. Next, we make matrix using the ‘confusionMatrix’ function using the new model ‘SpamPredict’ based on the ‘testing’ data on the ‘type’ variable. Below is the output

Reference

1. ```Prediction nonspam spam
nonspam     657   35
spam         40  418

Accuracy : 0.9348
95% CI : (0.9189, 0.9484)
No Information Rate : 0.6061
P-Value [Acc > NIR] : <2e-16

Kappa : 0.8637
Mcnemar's Test P-Value : 0.6442

Sensitivity : 0.9426
Specificity : 0.9227
Pos Pred Value : 0.9494
Neg Pred Value : 0.9127
Prevalence : 0.6061
Detection Rate : 0.5713
Detection Prevalence : 0.6017
Balanced Accuracy : 0.9327

'Positive' Class : nonspam```

The accuracy of the model actually improved to 93% on the test data. The other values such as sensitivity and specificity have to do with such things as looking at correct classifications divided by false negatives and other technical matters. As you can see, machine learning is a somewhat complex experience