Understanding Classification Trees Using R

Classification trees are similar to regression trees except that the determinant of success is not the residual sum of squares but rather the error rate. The strange thing about classification trees is that you can you can continue to gain information in splitting the tree without necessarily improving the misclassification rate. This is done through calculating a measure of error called the Gini coefficient

Gini coefficient is calculated using the values of the accuracy and error in an equation. For example, if we have a model that is 80% accurate with a 20% error rate the Gini coefficient is calculated as follows for a single node

n0gini<- 1 - (((8/10)^2) -((2/10)^2)) 
n0gini
## [1] 0.4

Now if we split this into two nodes notice the change in the Gini coefficient

n1gini<-1-(((5/6)^2)-((1/7)^2))
n2gini<-1-(((3/4)^2))-((1/3)^2)
newgini<-(.8*n1gini) + (.2*n2gini)
newgini
## [1] 0.3260488

The lower the Gini coefficient the better as it measures purity. IN the example, there is no improvement in the accuracy yet there is an improvement in the Gini coefficient. Therefore, classification is about purity and not the residual sum of squares.

In this post, we will make a classification tree to predict if someone is participating in the labor market. We will do this using the “Participation” dataset from the “Ecdat” package. Below is some initial code to get started.

library(Ecdat);library(rpart);library(partykit)
data(Participation)
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The ‘age’ feature needs to be transformed. Since it is doubtful that the survey was conducted among 4 and 5-year-olds. We need to multiply this variable by ten. In addition, the “lnnlinc” feature is the log of income and we want the actual income so we will exponentiate this information. Below is the code for these two steps.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

We will now create our training and testing datasets with the code below.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We can now create our classification tree and take a look at the output

tree.pros<-rpart(lfp~.,data = train)
tree.pros
## n= 636 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 636 295 no (0.5361635 0.4638365)  
##     2) foreign=no 471 182 no (0.6135881 0.3864119)  
##       4) nyc>=0.5 99  21 no (0.7878788 0.2121212) *
##       5) nyc< 0.5 372 161 no (0.5672043 0.4327957)   ##        10) age>=49.5 110  25 no (0.7727273 0.2272727) *
##        11) age< 49.5 262 126 yes (0.4809160 0.5190840)   ##          22) lnnlinc>=46230.43 131  50 no (0.6183206 0.3816794)  
##            44) noc>=0.5 102  34 no (0.6666667 0.3333333) *
##            45) noc< 0.5 29  13 yes (0.4482759 0.5517241)   ##              90) lnnlinc>=47910.86 22  10 no (0.5454545 0.4545455)  
##               180) lnnlinc< 65210.78 12   3 no (0.7500000 0.2500000) * ##               181) lnnlinc>=65210.78 10   3 yes (0.3000000 0.7000000) *
##              91) lnnlinc< 47910.86 7   1 yes (0.1428571 0.8571429) *
##          23) lnnlinc< 46230.43 131  45 yes (0.3435115 0.6564885) * ##     3) foreign=yes 165  52 yes (0.3151515 0.6848485)   ##       6) lnnlinc>=56365.39 16   5 no (0.6875000 0.3125000) *
##       7) lnnlinc< 56365.39 149  41 yes (0.2751678 0.7248322) *

In the text above, the first split is made on the feature “foreign” which is a yes or no possibility. 471 were not foreigners will 165 were foreigners. The accuracy here is not great at 61% for those not classified as foreigners and 31% for those classified as foreigners. For the 165 that are classified as foreigners, the next split is by their income, etc. This is hard to understand. Below is an actual diagram of the text above.

plot(as.party(tree.pros))

1

We now need to determining if pruning the tree is beneficial. We do this by looking at the cost complexity. Below is the code.

tree.pros$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.20677966      0 1.0000000 1.0000000 0.04263219
## 2 0.04632768      1 0.7932203 0.7932203 0.04122592
## 3 0.02033898      4 0.6542373 0.6677966 0.03952891
## 4 0.01016949      5 0.6338983 0.6881356 0.03985120
## 5 0.01000000      8 0.6033898 0.6915254 0.03990308

The “rel error” indicates that our model is bad no matter how any splits. Even with 9 splits we have an error rate of 60%. Below is a plot of the table above

plotcp(tree.pros)

1.png

Based on the table, we will try to prune the tree to 5 splits. The plot above provides a visual as it has the lowest error. The table indicates that a tree of five splits (row number 4) has the lowest cross-validation error (xstd). Below is the code for pruning the tree followed by a plot of the modified tree.

cp<-min(tree.pros$cptable[4,])
pruned.tree.pros<-prune(tree.pros,cp=cp)
plot(as.party(pruned.tree.pros))

1

IF you compare the two trees we have developed. One of the main differences is that the pruned.tree is missing the “noc” (number of older children) variable. There are also fewer splits on the income variable (lnnlinc). We can no use the pruned tree with the test data set.

party.pros.test<-predict(pruned.tree.pros,newdata=test,type="class")
table(party.pros.test,test$lfp)
##                
## party.pros.test no yes
##             no  90  41
##             yes 40  65

Now for the accuracy

(90+65) / (90+41+40+65)
## [1] 0.6567797

This is surprisingly high compared to the results for the training set but 65% is not great, However, this is fine for a demonstration.

Conclusion

Classification trees are one of many useful tools available for data analysis. When developing classification trees one of the key ideas to keep in mind is the aspect of prunning as this affects the complexity of the model.

Advertisements

One thought on “Understanding Classification Trees Using R

  1. Amadeo

    It is a great post. I am a beginner in this subject and is difficult understand the rules of the tree. Can you help me with this? Thanks!

    Like

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s