When evaluating a machine learning model, three types of data are commonly used:
- The actual classification values
- The predicted classification values
- The estimated probability of the prediction
The first two types of data (actual and predicted) are used to assess the accuracy of a model through measures such as error rate, sensitivity, and specificity.
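As a minimal sketch of how the actual and predicted values feed these measures, the base R code below builds a confusion matrix and derives the error rate, sensitivity, and specificity from it. The labels are invented for illustration and do not come from any model in this post, and "yes" is assumed to be the positive class.

```r
# Toy labels, invented for illustration only (not from the post's models)
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no", "no", "yes"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "no", "yes", "yes", "no", "yes"),
                    levels = c("no", "yes"))

# Confusion matrix: rows are actual classes, columns are predicted classes
conf_mat <- table(actual = actual, predicted = predicted)

# "yes" is treated as the positive class (an assumption of this sketch)
tp <- conf_mat["yes", "yes"]  # true positives
tn <- conf_mat["no", "no"]    # true negatives
fp <- conf_mat["no", "yes"]   # false positives
fn <- conf_mat["yes", "no"]   # false negatives

error_rate  <- (fp + fn) / sum(conf_mat)  # share of misclassified examples
sensitivity <- tp / (tp + fn)             # share of actual "yes" found
specificity <- tn / (tn + fp)             # share of actual "no" found

print(conf_mat)
print(c(error_rate = error_rate, sensitivity = sensitivity,
        specificity = specificity))
```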
The benefit of the prediction probabilities is that they measure a model’s confidence in its predictions. If you need to compare two models and one is more confident in its correct classifications of the examples, the more confident model is generally the better learner.
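One way to make this comparison concrete is to average the probability each model assigned to the class that actually occurred. The sketch below uses two hypothetical models and made-up probabilities. The names "prob_male_a", "prob_male_b", and "prob_actual" are assumptions for illustration only.

```r
# Hypothetical actual classes and predicted probabilities of "male"
# from two made-up models (values invented for illustration)
actual      <- c("male", "male", "female", "female")
prob_male_a <- c(0.90, 0.80, 0.20, 0.30)  # hypothetical model A
prob_male_b <- c(0.60, 0.55, 0.45, 0.40)  # hypothetical model B

# Probability a model assigned to the class that actually occurred
prob_actual <- function(prob_male, actual) {
  ifelse(actual == "male", prob_male, 1 - prob_male)
}

# Both models classify every example correctly at a 0.5 cutoff,
# but they differ in how confident they are in those correct calls
mean_conf_a <- mean(prob_actual(prob_male_a, actual))
mean_conf_b <- mean(prob_actual(prob_male_b, actual))

print(c(model_a = mean_conf_a, model_b = mean_conf_b))
```

Here the two models have identical accuracy, so only the probabilities reveal that model A separates the classes more decisively.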
In this post, we will look at examples of the probability predictions of several models that have been used in this blog in the past.
Prediction Probabilities for Decision Trees
Our first example comes from the decision tree we made using the C5.0 algorithm. Below is the code for calculating the predicted class probabilities for each example in the test set, followed by the output for the first six examples.
Wage_pred_prob <- predict(Wage_model, Wage_test, type="prob")
head(Wage_pred_prob)
        female      male
497  0.2853016 0.7146984
1323 0.2410568 0.7589432
1008 0.5770177 0.4229823
947  0.6834378 0.3165622
695  0.5871323 0.4128677
1368 0.4303364 0.5696636
The argument “type” is set to “prob” in the “predict” function so that R returns the estimated probability of each class for every example, rather than only the predicted class. The “head” function displays the first 6 examples, identified by their row numbers.
- For example 497, there is a 28.5% probability that this example is female and a 71.5% probability that this example is male. Therefore, the model predicts that this example is male.
- For example 1323, there is a 24% probability that this example is female and a 76% probability that this example is male. Therefore, the model predicts that this example is male.
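The reading of the rows above can be automated. The sketch below rebuilds the six rows of output by hand, since the fitted model is not available here, and then picks the class with the highest probability in each row. The names "pred_class" and "confidence" are introduced for illustration only.

```r
# Rebuild the six rows printed above as a matrix
# (typed in by hand, not produced by the C5.0 model here)
Wage_pred_prob <- matrix(
  c(0.2853016, 0.7146984,
    0.2410568, 0.7589432,
    0.5770177, 0.4229823,
    0.6834378, 0.3165622,
    0.5871323, 0.4128677,
    0.4303364, 0.5696636),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("497", "1323", "1008", "947", "695", "1368"),
                  c("female", "male"))
)

# Predicted class: the column holding the largest probability in each row
pred_class <- colnames(Wage_pred_prob)[max.col(Wage_pred_prob,
                                               ties.method = "first")]
names(pred_class) <- rownames(Wage_pred_prob)

# Confidence: the probability attached to that predicted class
confidence <- apply(Wage_pred_prob, 1, max)

print(data.frame(pred_class, confidence))
```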
Prediction Probabilities for K-Nearest Neighbor
Below is the code for finding the probabilities for the KNN algorithm.
library(class)  # provides the knn() function
College_test_pred_prob <- knn(train=College_train, test=College_test,
                              cl=College_train_labels, k=27, prob=TRUE)
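The printed output is rather long. With “prob=TRUE”, the “knn” function returns the predicted level for each example along with the proportion of neighbors that voted for that level, so you can match each predicted level with its probability by looking carefully at the output.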
- For example 1, there is a 77% probability that this example is a yes and a 23% probability that this example is a no. Therefore, the model predicts that this example is a yes.
- For example 2, there is a 71% probability that this example is no and a 29% probability that this example is yes. Therefore, the model predicts that this example is a no.
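Rather than reading the long printout, the vote proportions can be pulled out of the result with “attr”. The sketch below assumes the “class” package is installed and uses the built-in iris data, reduced to two species, as a stand-in for the College data, which is not available here. The split, the value of k, and the variable names are choices made for illustration.

```r
library(class)  # provides knn()

set.seed(1)  # make the random train/test split repeatable

# Stand-in data: iris reduced to two classes (not the College data)
two_class <- iris[iris$Species != "setosa", ]
two_class$Species <- droplevels(two_class$Species)

idx          <- sample(nrow(two_class), 70)  # 70 training rows, 30 test rows
train        <- two_class[idx, 1:4]
test         <- two_class[-idx, 1:4]
train_labels <- two_class$Species[idx]

# prob=TRUE attaches the winning class's share of the votes to the result
pred <- knn(train = train, test = test, cl = train_labels, k = 7, prob = TRUE)

# Extract the vote share of the predicted (winning) class
win_prob <- attr(pred, "prob")

# With two classes, the other class gets the remaining share
result <- data.frame(predicted  = as.character(pred),
                     win_prob   = win_prob,
                     other_prob = 1 - win_prob)
print(head(result))
```

Note that the stored value is always the share for the predicted class, which is why a prediction of no with a probability of 0.71 means 71% no and 29% yes.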
One of the primary uses of prediction probabilities is comparing various models that are derived from the same data. This information, combined with other techniques for evaluating models, can help a researcher determine the most appropriate model for an analysis.