Support Vector Machines in R

In this post, we will use support vector machine analysis to look at some data available on Kaggle. In particular, we will predict which digit a person wrote by analyzing the pixel values that make up the image of the number. The file for this example is available at https://www.kaggle.com/c/digit-recognizer/data

To do this analysis you will need the ‘kernlab’ package. While playing with this dataset I noticed a major problem: running the analysis with the full data set of 42,000 examples took forever. To alleviate this problem, we are going to practice with a training set of 7,000 examples and a test set of 3,000. Below is the code for the first few steps. Remember that the dataset was downloaded separately.

#load packages
library(kernlab)
digitTrain<-read.csv(file="digitTrain.csv", header=TRUE, sep=",")
#split data 
digitRedux<-digitTrain[1:7000,]
digitReduxTest<-digitTrain[7001:10000,]
#explore data
str(digitRedux)
## 'data.frame':    7000 obs. of  785 variables:
##  $ label   : int  1 0 1 4 0 0 7 3 5 3 ...
##  $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel3  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel4  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel5  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel6  : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

From the “str” function you can tell we have a lot of variables (785). This is what slowed the analysis down so much when I tried to run the full 42000 examples in the original dataset.
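
Before converting anything, it can be useful to glance at how the ten digits are distributed in the reduced training set. This quick check is my addition and was not part of the original walkthrough.

#quick look at how many examples of each digit we have
table(digitRedux$label)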

For classification, SVM needs the outcome to be a factor variable. We are trying to predict the “label” variable, so we are going to change it to a factor, because that is what it really is. Below is the code.

#convert label variable to factor
digitRedux$label<-as.factor(digitRedux$label)
digitReduxTest$label<-as.factor(digitReduxTest$label)
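
If you want to confirm that the conversion worked, a quick optional check (my addition) is to look at the class and levels of the variable.

#label should now be a factor with levels 0 through 9
class(digitRedux$label)
levels(digitRedux$label)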

Before we continue with the analysis we need to scale our variables. Scaling puts all of the variables on a common range, which helps to equalize their influence on the model. However, we do not want to scale the “label” variable, as this is the outcome we are predicting and scaling it would make the results hard to understand. Therefore, we are going to temporarily set the “label” variable aside in both of our data sets and save it in a temporary data frame. The code is below.

#temporary data frames to hold the label values
keep<-as.data.frame(digitRedux$label)
keeptest<-as.data.frame(digitReduxTest$label)
#blank out the label variable in both datasets so it is not scaled
digitRedux$label<-NA
digitReduxTest$label<-NA

Next, we scale the remaining variables and reinsert the label variable into each data set, as shown in the code below. Note that pixels that never change (for example, pixels that are always 0) have a standard deviation of 0, so scaling them produces NaN/NA values, which we replace with 0.

digitRedux<-as.data.frame(scale(digitRedux))
digitRedux[is.na(digitRedux)]<- 0 #constant pixels scale to NaN, so replace with 0
digitReduxTest<-as.data.frame(scale(digitReduxTest))
digitReduxTest[is.na(digitReduxTest)]<- 0
#add back label
digitRedux$label<-keep$`digitRedux$label`
digitReduxTest$label<-keeptest$`digitReduxTest$label`
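
If you want a sanity check before modeling, the short sketch below (my addition) confirms the data frames still have the expected shape and that no NA values were left behind by scaling. The choice of “pixel350” is just an arbitrary example column.

#confirm dimensions and that scaling left no NA values behind
dim(digitRedux)
sum(is.na(digitRedux))
range(digitRedux$pixel350) #one arbitrary scaled pixel column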

Now we make our model using the “ksvm” function in the “kernlab” package. We set the kernel to “vanilladot”, which is a linear kernel. We will also print the results. However, the printed results do not mean much on their own; the model has to be assessed by other means, such as how well it predicts the test set. Below is the code. If you get a warning message about scaling, do not worry about it, as we scaled the data ourselves.

#make the model
number_classify<-ksvm(label~.,  data=digitRedux, 
                      kernel="vanilladot")
#look at the results
number_classify
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 2218 
## 
## Objective Function Value : -0.0623 -0.207 -0.1771 -0.0893 -0.3207 -0.4304 -0.0764 -0.2719 -0.2125 -0.3575 -0.2776 -0.1618 -0.3408 -0.1108 -0.2766 -1.0657 -0.3201 -1.0509 -0.2679 -0.4565 -0.2846 -0.4274 -0.8681 -0.3253 -0.1571 -2.1586 -0.1488 -0.2464 -2.9248 -0.5689 -0.2753 -0.2939 -0.4997 -0.2429 -2.336 -0.8108 -0.1701 -2.4031 -0.5086 -0.0794 -0.2749 -0.1162 -0.3249 -5.0495 -0.8051 
## Training error : 0
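
If you would rather pull individual pieces of information out of the fitted model than read the printed summary, kernlab provides accessor functions for this. Below is a small sketch of a few of them; it is my addition rather than part of the original analysis.

#pull individual pieces out of the fitted model
error(number_classify) #training error
nSV(number_classify)   #number of support vectors
param(number_classify) #hyperparameters such as the cost C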

We now need to use the “predict” function so that we can determine the accuracy of our model. Remember that for prediction we compare the answers in the test data to what our model guesses based on what it learned from the training data.

number_predict<-predict(number_classify, digitReduxTest)
table(number_predict, digitReduxTest$label)
##               
## number_predict   0   1   2   3   4   5   6   7   8   9
##              0 297   0   3   3   1   4   6   1   1   1
##              1   0 307   4   1   0   4   2   5  11   1
##              2   0   2 268  10   5   1   3  10  12   3
##              3   0   1   7 291   1  11   0   1   8   3
##              4   0   1   3   0 278   4   3   2   0   9
##              5   2   0   1  10   1 238   4   1  11   1
##              6   2   1   1   0   2   1 287   1   0   0
##              7   0   1   1   0   1   0   0 268   3  10
##              8   1   3   4  10   0  11   1   0 236   2
##              9   0   0   2   2   9   2   0  14   2 264
accuracy<-number_predict == digitReduxTest$label
prop.table(table(accuracy))
## accuracy
##      FALSE       TRUE 
## 0.08866667 0.91133333
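
The confusion matrix above already contains the per-digit detail. If you want an explicit per-digit accuracy, one quick way (my addition, not part of the original analysis) is to divide the diagonal of the table by its column totals.

#per-digit accuracy: correct predictions for each digit divided by how often that digit appears
confusion<-table(number_predict, digitReduxTest$label)
round(diag(confusion)/colSums(confusion), 3)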

The confusion matrix shows how many digits were classified correctly and where the misclassifications occurred, while prop.table gives the overall percentage. This particular model was highly accurate at about 91%, and it would be difficult to improve on that by much. Below is the code for a model that uses a different kernel, with results that are only slightly better. However, if you ever enter a data science competition, any improvement usually helps, even if it is not practical for everyday use.

number_classify_rbf<-ksvm(label~.,  data=digitRedux, 
                          kernel="rbfdot")
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
#evaluate improved model
number_predict_rbf<-predict(number_classify_rbf, digitReduxTest)
table(number_predict_rbf, digitReduxTest$label)
##                   
## number_predict_rbf   0   1   2   3   4   5   6   7   8   9
##                  0 294   0   2   3   1   1   2   1   1   0
##                  1   0 309   1   0   1   0   1   3   4   2
##                  2   4   2 277  12   4   4   8   9   9   8
##                  3   0   1   3 297   1   3   0   1   3   3
##                  4   0   1   3   0 278   4   1   5   0   6
##                  5   0   0   1   6   1 254   4   0   9   1
##                  6   2   1   0   0   2   6 289   0   2   0
##                  7   0   0   3   2   3   0   0 277   2  13
##                  8   2   2   4   3   0   3   1   0 253   3
##                  9   0   0   0   4   7   1   0   7   1 258
accuracy_rbf<-number_predict_rbf == digitReduxTest$label
prop.table(table(accuracy_rbf))
## accuracy_rbf
##      FALSE       TRUE 
## 0.07133333 0.92866667
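
If you wanted to push the accuracy a little further, one common next step is to tune the cost parameter C and ask ksvm for a cross-validation estimate while it fits. The sketch below is my addition; the value C = 5 is only an illustration, not a recommendation, and fitting with cross-validation will take noticeably longer.

#fit an RBF model with a different cost and 5-fold cross-validation
number_classify_tuned<-ksvm(label~., data=digitRedux,
                            kernel="rbfdot", C=5, cross=5)
cross(number_classify_tuned) #cross-validation error estimate
#evaluate on the held-out test set as before
number_predict_tuned<-predict(number_classify_tuned, digitReduxTest)
prop.table(table(number_predict_tuned == digitReduxTest$label))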

Conclusion

From this demonstration we can see the power of support vector machines with numerical data. This type of analysis can be used for things beyond conventional analysis, such as predicting handwritten digits. As such, SVM is yet another tool available to the data scientist.
