Random Forest in R

Random Forest is a similar machine learning approach to decision trees. The main difference is that with random forest. At each node in the tree, the variable is bootstrapped. In addition, several different trees are made and the average of the trees are presented as the results. This means that there is no individual tree to analyze but rather a ‘forest’ of trees

The primary advantage of random forest is accuracy and prevent overfitting. In this post, we will look at an application of random forest in R. We will use the ‘College’ data from the ‘ISLR’ package to predict whether a college is public or private

Preparing the Data

First, we need to split our data into a training and testing set as well as load the various packages that we need. We have run this code several times when using machine learning. Below is the code to complete this.

library(ggplot2);library(ISLR)
library(caret)
data("College")
forTrain<-createDataPartition(y=College$Private, p=0.7, list=FALSE)
trainingset<-College[forTrain, ]
testingset<-College[-forTrain, ]

Develop the Model

Next, we need to setup the model we want to run using Random Forest. The coding is similar to that which is used for regression. Below is the code

Model1<-train(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend, data=trainingset, method='rf',prox=TRUE)

We are using 7 variables to predict whether a university is private or not. The method is ‘rf’ which stands for “Random Forest”. By now, I am assuming you can read code and understand what the model is trying to do. For a refresher on reading code for a model please click here.

Reading the Output

If you type “Model1” followed by pressing enter, you will receive the output for the random forest

Random Forest 

545 samples
 17 predictors
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 545, 545, 545, 545, 545, 545, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
  2     0.8957658  0.7272629  0.01458794   0.03874834
  4     0.8969672  0.7320475  0.01394062   0.04050297
  7     0.8937115  0.7248174  0.01536274   0.04135164

Accuracy was used to select the optimal model using 
 the largest value.
The final value used for the model was mtry = 4.

Most of this is self-explanatory. The main focus is on the mtry, accuracy, and Kappa.

The shows several different models that the computer generated. Each model reports the accuracy of the model as well as the Kappa. The accuracy states how well the model predicted accurately whether a university was public or private. The kappa shares the same information but it calculates how well a model predicted while taking into account chance or luck. As such, the Kappa should be lower than the accuracy.

At the bottom of the output, the computer tells which mtry was the best. For our example, the best mtry was number 4. If you look closely, you will see that mtry 4 had the highest accuracy and Kappa as well.

Confusion Matrix for the Training Data

Below is the confusion matrix for the training data using the model developed by the random forest. As you can see, the results are different from the random forest output. This is because this model is predicting without bootstrapping

> predNew<-predict(Model1, trainingset) > trainingset$predRight<-predNew==trainingset$Private > table(predNew, trainingset$Private)
       
predNew  No Yes
    No  149   0
    Yes   0 396

Results of the Testing Data

We will now use the testing data to check the accuracy of the model we developed on the training data. Below is the code followed by the output

pred <- predict(Model1, testingset)
testingset$predRight<-pred==testingset$Private
table(pred, testingset$Private)
pred   No Yes
  No   48  11
  Yes  15 158

For the most part, the model we developed to predict if a university is private or not is accurate.

How Important is a Variable

You can calculate how important an individual variable is in the model by using the following code

Model1RF<-randomForest(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend, data=trainingset, importance=TRUE)
importance(Model1RF)

The output tells you how much the accuracy of the model is reduced if you remove the variable. As such, the higher the number the more valuable the variable is in improving accuracy.

Conclusion

This post exposed you to the basics of random forest. Random forest is a technique that develops a forest of decisions trees through resampling. The results of all these trees are then averaged to give you an idea of which variables are most useful in prediction.

Leave a Reply