Random Forest is a machine learning approach similar to decision trees. The main difference is that with random forest, each tree is grown on a bootstrapped sample of the data, and at each node only a random subset of the variables is considered for splitting. In addition, many different trees are made, and their combined predictions are presented as the result. This means that there is no individual tree to analyze but rather a 'forest' of trees.
The primary advantages of random forest are accuracy and resistance to overfitting. In this post, we will look at an application of random forest in R. We will use the 'College' data from the 'ISLR' package to predict whether a college is public or private.
Preparing the Data
First, we need to split our data into a training and a testing set, as well as load the packages that we need. We have run this kind of code several times before when using machine learning. Below is the code to complete this.
library(ggplot2)
library(ISLR)
library(caret)
data("College")
set.seed(100)  # for reproducibility
forTrain<-createDataPartition(y=College$Private, p=0.7, list=FALSE)
trainingset<-College[forTrain, ]
testingset<-College[-forTrain, ]
Develop the Model
Next, we need to set up the model we want to run using random forest. The coding is similar to that which is used for regression. Below is the code.
Model1<-train(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend,
              data=trainingset, method='rf', prox=TRUE)
We are using 7 variables to predict whether a university is private or not. The method is ‘rf’ which stands for “Random Forest”. By now, I am assuming you can read code and understand what the model is trying to do. For a refresher on reading code for a model please click here.
Reading the Output
If you type "Model1" and press Enter, you will receive the output for the random forest.
Random Forest

545 samples
 17 predictors
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 545, 545, 545, 545, 545, 545, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
  2     0.8957658  0.7272629  0.01458794   0.03874834
  4     0.8969672  0.7320475  0.01394062   0.04050297
  7     0.8937115  0.7248174  0.01536274   0.04135164

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 4.
Most of this is self-explanatory. The main focus is on the mtry, accuracy, and Kappa.
The output shows several different models that the computer generated. Each model reports its accuracy as well as its Kappa. The accuracy states how well the model predicted whether a university was public or private. The Kappa conveys the same information, but it adjusts for the agreement that would be expected by chance or luck. As such, the Kappa should be lower than the accuracy.
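To make the relationship between accuracy and Kappa concrete, here is a short sketch that computes both by hand from a hypothetical 2x2 confusion matrix (the counts are made up for illustration and are not from this model):

n        <- 200
cm       <- matrix(c(90, 10, 20, 80), nrow = 2,
                   dimnames = list(pred = c("No", "Yes"), actual = c("No", "Yes")))
observed <- sum(diag(cm)) / n                     # raw accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # accuracy expected by chance
kappa    <- (observed - expected) / (1 - expected)
c(accuracy = observed, kappa = kappa)             # Kappa is lower than accuracy

With these counts, accuracy is 0.85 while Kappa is 0.70, showing how correcting for chance agreement pulls the number down.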
At the bottom of the output, the computer tells you which mtry was the best. For our example, the best mtry was 4. If you look closely, you will see that mtry = 4 also had the highest accuracy and Kappa.
Confusion Matrix for the Training Data
Below is the confusion matrix for the training data using the model developed by the random forest. As you can see, the results are different from the random forest output above. This is because the model is now predicting on the same data it was trained on, rather than being evaluated on bootstrapped resamples.
predNew<-predict(Model1, trainingset)
trainingset$predRight<-predNew==trainingset$Private
table(predNew, trainingset$Private)

predNew  No Yes
    No  149   0
    Yes   0 396
Results of the Testing Data
We will now use the testing data to check the accuracy of the model we developed on the training data. Below is the code, followed by the output.
pred <- predict(Model1, testingset)
testingset$predRight<-pred==testingset$Private
table(pred, testingset$Private)
pred  No Yes
  No   48  11
  Yes  15 158
For the most part, the model we developed to predict whether a university is private or not is accurate, correctly classifying 206 of the 232 universities (about 89%) in the testing set.
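As a quick sanity check (not part of the original analysis), the testing accuracy can be computed by hand from the counts in the table above:

correct <- 48 + 158              # diagonal of the confusion matrix
total   <- 48 + 11 + 15 + 158    # all testing observations
correct / total                  # roughly 0.89

This is close to the resampled accuracy estimate of about 0.897 reported by caret, which suggests the model is not badly overfit.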
How Important is a Variable
You can calculate how important an individual variable is in the model by using the following code.
library(randomForest)
Model1RF<-randomForest(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend,
                       data=trainingset, importance=TRUE)
importance(Model1RF)
The output tells you how much the accuracy of the model is reduced if the values of a variable are randomly shuffled. As such, the higher the number, the more that variable contributes to the model's accuracy.
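The idea behind this permutation-based importance can be sketched in base R. This is an illustration of the concept, not the internals of the randomForest package; the toy data and glm classifier are made up for the example, with y depending strongly on x1 and not at all on x2:

set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "Yes", "No"))
dat <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
acc <- function(d) mean((predict(fit, d, type = "response") > 0.5) == (d$y == "Yes"))

base_acc <- acc(dat)
perm_importance <- sapply(c("x1", "x2"), function(v) {
  d <- dat
  d[[v]] <- sample(d[[v]])   # shuffle one predictor
  base_acc - acc(d)          # drop in accuracy = that variable's importance
})
perm_importance

Shuffling x1 should cause a large drop in accuracy while shuffling x2 should barely matter, which is exactly what the importance() output measures for each predictor in the forest.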
Conclusion
This post exposed you to the basics of random forest. Random forest is a technique that develops a forest of decision trees through resampling. The predictions of all these trees are then aggregated to produce the final result, and the forest can also tell you which variables are most useful in prediction.