Nearest Neighbor Classification in R

In this post, we will conduct a nearest neighbor classification using R. In a previous post, we discussed nearest neighbor classification. To summarize, nearest neighbor uses the traits of a known example to classify an unknown example. The classification is determined by the closets known example(s) to the unknown example. There are essentially four steps in order to complete a nearest neighbor classification

  1. Find a dataset
  2. Explore/prepare the Dataset
  3. Train the model
  4. Evaluate the model

For this example, we will use the college data set from the ISLR package. Our goal will be to predict which colleges are private or not private based on the feature “Private”. Below is the code for this.

library(ISLR)
 data("College")

Step 2 Exploring the Data

We now need to explore and prep the data for analysis. Exploration helps us to find any problems in the data set. Below is the code, you can see the results in your own computer if you are following along.

str(College)
prop.table(table(College$Private))
summary(College)

The “str” function gave us an understanding of the different types of variables and some of there initial values. We have 18 variables in all. We want to predict “Private” which is a categorical feature Nearest neighbor predicts categorical features only. We will use all of the other numerical variables to predict “Private”. The prop.table give us the proportion of private and not private colleges. About 30% of the data is not private and about 70% is a private college

Lastly, the “summary” function gives us some descriptive stats. If you look closely at the descriptive stats, there is a problem. The variables use different scales. For example, the “Apps” feature goes from 81 to 48094 while the “Grad.Rate” feature goes from 10 to 100. If we do the analysis with these different scales the “App” feature will be a much stronger influence on the prediction. Therefore, we need to rescale the features or normalize them. Below is the code to do this

 normal<-function(x){
 return((x-min(x))/(max(x)))
 }

We made a function called “normal” that normalizes or scales are variables so that all values are between 0-1. This makes all of the features equal in their influence. We now run code to normalize our data using the “normal” function we created. We will also look at the summary stats. In addition, we will leave out the prediction feature “Private” because we do not want to have that in the new data set because we want to predict this. In a real-world example, you would not know this before the analysis anyway.

College_New<-as.data.frame(lapply(College[2:18],normal))
summary(College_New)

The name of the new dataset is “College_New” we used the “lapply” function to make R normalize all the features in the “College” dataset. Summary stats are good as all values are between 0 and 1. The last part of this step involves dividing our data into training and testing sets as seen in the code below and creating the labels. The labels are the actual information from the “Private” feature. Remember, we remove this from the “College_New” dataset but we need to put this information in their own features to allow us to check the accuracy of our results later.

College_train<-College_New[1:677, ]
 College_Test<-College_New[678:777,]
 #make labels
 College_train_labels<-College[1:677, 1]
 College_test_labels<-College[678:777,1]

Step 3 Model Training

Train the model We will now train the model. You will need to download the “class” package and load it. After that, you need to run the following code.

library(class)
 College_test_pred<-knn(train=College_train, test=College_Test,
 cl=College_train_labels, k=25)

Here is what we did

  • We created a variable called “College_test_pred” to store our results
  • We used the “knn” function to predict the examples. Inside the “knn” function we used “College_train” to teach the model.
  • We then identify “College_test” as the dataset we want to test. For the ‘cl’ argument we used the “College_train_labels” dataset which contains which colleges are private in the “College_train” dataset. The “k” is the number of neighbors that knn uses to determine what class to label an unknown example. In this code, we are using the 25 nearest neighbors to label the unknown example The number of “k” to use can vary but a rule of thumb is to take the square root of the total number of examples in the training data. Our training data has 677 and the square root of this is about 25. What happens is that each of the 25 closest neighbors are calculated for an example. If 20 are not private and 5 are private the unknown example is labeled private because the majority of its neighbors are not private.

Step 4 Evaluating the Model

We will now evaluate the model using the following code

 library(gmodels)
 CrossTable(x=College_test_labels, y=College_test_pred, prop.chisq = FALSE)

The results can be seen by clicking here.  The box in the top left predicts which colleges are private correctly while the box in the bottom right classifies which colleges are not private correctly. For example, 28 colleges that are not private were classed as not private while 64 colleges that are not private were classed as not private. Adding these two numbers together gives us 92 (28+64) which is our accuracy rate. In the top right box, we have are false positives. These are colleges which are private but were predicted as private, there were 8 of these. Lastly, we have our false negatives in the bottom left box, which are colleges that are private but label as not private, there were 0 of these.

Conclusion

The results are pretty good for this example. Nearest neighbor classification allows you to predict an unknown example based on the classification of the known neighbors. This is a simple way to identify examples that are clump among neighbors with the same characteristics.

Advertisements

3 thoughts on “Nearest Neighbor Classification in R

  1. Pingback: Nearest Neighbor Classification in R | Educatio...

  2. Pingback: Using Probability of the Prediction to Evaluate a Machine Learning Model | educational research techniques

  3. Pingback: Receiver Operating Characteristic Curve | educational research techniques

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s