Regression Tree Development in R

In this post, we will take a look at regression trees. Regression trees are built through a process called recursive partitioning, which means the data are repeatedly split on the feature and cut point that reduce the prediction error (for regression trees, the sum of squared errors) the most.

The splitting is also greedy, which means that the algorithm partitions the data at the best point it can find right now without considering how that choice will affect future partitions. Ignoring how the current split affects later splits can lead to unnecessary branches with high variance and low bias.
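To make the greedy search concrete, below is a minimal sketch of what finding a single split looks like. The function name “best.split” and the use of squared error are my own illustration of the idea, not code taken from the “rpart” package.

best.split<-function(x,y){
  #candidate cut points: midpoints between consecutive unique values of x
  cuts<-sort(unique(x))
  cuts<-(head(cuts,-1)+tail(cuts,-1))/2
  #for each cut, total squared error when each side predicts its own mean
  sse<-sapply(cuts,function(cut){
    left<-y[x<cut];right<-y[x>=cut]
    sum((left-mean(left))^2)+sum((right-mean(right))^2)
  })
  #return the cut point with the smallest total squared error
  c(cut=cuts[which.min(sse)],sse=min(sse))
}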

One of the main strengths of regression trees is their ability to deal with nonlinear relationships. However, predictive performance can suffer because every example in a terminal node is assigned the same prediction, the mean of that node. This forced assignment is a loss of information, much like turning a continuous variable into a categorical one.

In this post, we will use the “Participation” dataset from the “Ecdat” package to predict income based on the other variables in the dataset. Below is some initial code.

library(rpart);library(partykit);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

There are several things we need to do to make the results easier to interpret. The “age” variable needs to be multiplied by ten, as it is currently stored in decades and shows values such as 4.5 and 3. Common sense indicates that a four-year-old or a three-year-old is not earning an income.

In addition, we need to convert our income variable (lnnlinc) from the log of income back to regular income, which will also make the results easier to understand. Below is the code.

Participation$age<-10*Participation$age #convert age from decades to years
Participation$lnnlinc<-exp(Participation$lnnlinc) #convert log income back to actual income

The next step is to create our training and testing data sets. Below is the code.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We can now develop our model. We will also use the “print” command to view it (typing the object’s name does the same thing).

reg.tree<-rpart(lnnlinc~.,data = train)

Below is a printout of the current tree

reg.tree
## n= 636 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 636 390503700000  48405.08  
##     2) educ< 11.5 473 127460900000  43446.69  
##       4) educ< 9.5 335  70269440000  40758.25  
##         8) foreign=yes 129  10617380000  36016.12 *
##         9) foreign=no 206  54934520000  43727.84 *
##       5) educ>=9.5 138  48892370000  49972.98 *
##     3) educ>=11.5 163 217668400000  62793.52  
##       6) age< 34.5 79  34015680000  51323.86  
##        12) age< 25.5 12    984764800  34332.97 *
##        13) age>=25.5 67  28946170000  54367.01 *
##       7) age>=34.5 84 163486000000  73580.46  
##        14) lfp=yes 36  23888410000  58916.66 *
##        15) lfp=no 48 126050900000  84578.31  
##          30) educ< 12.5 29  86940400000  74425.51  
##            60) age< 38.5 8    763764600  57390.34 *
##            61) age>=38.5 21  82970650000  80915.10  
##             122) age>=44 14  34091840000  68474.57 *
##             123) age< 44 7  42378600000 105796.20 *
##          31) educ>=12.5 19  31558550000 100074.70 *

I will not interpret all of this, but here is a brief description using nodes 2, 4, and 8: if the person has less than 11.5 years of education (473 people qualify), and less than 9.5 years of education (335 of the 473 qualify), and is a foreigner (129 of the 335 qualify), then their average salary is 36,016.12 dollars.

Perhaps now you can see how some information is lost. The average salary for people in this node is 36,016.12 dollars, but probably nobody earns exactly this amount.
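We can check this ourselves. Filtering the training data with the same conditions as node 8 and taking the mean should reproduce the value the tree predicts. This check is my own addition.

#node 8: educ < 9.5 (which implies educ < 11.5) and foreign = yes
#the mean of lnnlinc in this subset should match the 36016.12 shown above
with(train,mean(lnnlinc[educ<9.5&foreign=="yes"]))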

If what I said does not make sense, here is an actual plot of the current regression tree.

plot(as.party(reg.tree))


The little boxes at the bottom are boxplots of the income values in each terminal node.

Tree Modification

We will now make modifications to the tree, beginning with an examination of the cptable. Below is the code.

reg.tree$cptable
##           CP nsplit rel error    xerror      xstd
## 1 0.11619458      0 1.0000000 1.0026623 0.1666662
## 2 0.05164297      1 0.8838054 0.9139383 0.1434768
## 3 0.03469034      2 0.8321625 0.9403669 0.1443843
## 4 0.02125215      3 0.7974721 0.9387060 0.1433101
## 5 0.01933892      4 0.7762200 0.9260030 0.1442329
## 6 0.01242779      5 0.7568810 0.9097011 0.1434606
## 7 0.01208066      7 0.7320255 0.9166627 0.1433779
## 8 0.01046022      8 0.7199448 0.9100704 0.1432901
## 9 0.01000000      9 0.7094846 0.9107869 0.1427025

The cptable shares a lot of information. First, “CP” stands for complexity parameter, and it is the column furthest to the left; this number decreases as the tree becomes more complex. “nsplit” indicates the number of splits in the tree. “rel error” is the relative error: the tree’s residual sum of squares (RSS) divided by the RSS of the root node. The “xerror” and “xstd” columns are the cross-validated average error and the standard deviation of that error, respectively.

One thing we can see from the cptable is that the full tree with 9 splits has the lowest rel error, which is expected because training error always falls as the tree grows. The lowest cross-validated error, however, occurs at 5 splits, and the tree with a single split is already within one standard error of that minimum.
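Rather than scanning the table by eye, we can pull out the row with the lowest cross-validated error directly. This small convenience is my own addition and is not required for the steps that follow.

#row of the cptable with the smallest cross-validated error
best.row<-which.min(reg.tree$cptable[,"xerror"])
reg.tree$cptable[best.row,]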

We will now make a plot of the complexity parameter to help determine at what point to prune the tree. Pruning removes splits that do not improve the model much. The plot is a visual version of the “cptable”. Below is the code.

plotcp(reg.tree)


It appears that a tree of size 2 (a single split) is the best, but this is boring. The next lowest dip in the cross-validated error comes at eight splits (row 8 of the cptable). Therefore, we will prune our tree back to eight splits. First, we need to create an object that contains the CP value for that row. Then we use the “prune” function to make the modified tree.

cp<-reg.tree$cptable[8,"CP"] #CP value for the eight-split row
pruned.reg.tree<-prune(reg.tree,cp=cp)

We will now plot our modified tree.

plot(as.party(pruned.reg.tree))


The only difference is the loss of the age node that split at 25.5 years.
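As a quick sanity check of my own, we can confirm that the pruned tree now contains eight splits by looking at its cptable.

#the largest nsplit in the pruned tree's cptable should now be 8
max(pruned.reg.tree$cptable[,"nsplit"])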

Model Test

We can now test our tree to see how well it performs.

reg.tree.test<-predict(pruned.reg.tree,newdata=test)
reg.tree.resid<-reg.tree.test-test$lnnlinc
mean(reg.tree.resid^2)
## [1] 431928030

The number we calculated is the mean squared error. This number must be compared to the error of models developed differently in order to assess the current model. By itself, it means little.
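For a rough sense of scale, one simple comparison is a naive baseline that predicts the overall training mean for every test case. This baseline is my own addition.

#naive baseline: predict the training mean income for every test observation
baseline.resid<-mean(train$lnnlinc)-test$lnnlinc
mean(baseline.resid^2)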

Conclusion

This post exposed you to regression trees. This type of tree can be used to make numeric predictions with nonlinear data. However, with the partitioning comes a loss of information, as the uniqueness of each example is lost when it is placed in a node.
