In this post, we will take a look at regression trees. Regression trees are built through recursive partitioning, which involves repeatedly splitting the data on the feature and cut point that reduce the error the most.
The splitting is also greedy, which means that the algorithm partitions the data at one point without considering how it will affect future partitions. Ignoring how a current split affects future splits can lead to unnecessary branches with high variance and low bias.
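To make the greedy idea concrete, below is a minimal sketch of how a single split on one feature could be chosen. This is not rpart’s actual implementation, and the function name “best.split” is made up for illustration; the point is that each candidate cut point is scored by the sum of squared errors it leaves behind, and the best one wins.
#minimal sketch of one greedy split (not rpart's actual code)
#each child is predicted with its mean; the cut with the lowest total SSE wins
best.split<-function(x,y){
  best<-list(point=NA,sse=Inf)
  for(point in sort(unique(x))){
    left<-y[x<point]; right<-y[x>=point]
    if(length(left)==0||length(right)==0) next
    sse<-sum((left-mean(left))^2)+sum((right-mean(right))^2)
    if(sse<best$sse) best<-list(point=point,sse=sse)
  }
  best
}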
One of the main strengths of regression trees is their ability to deal with nonlinear relationships. However, predictive performance can be hurt when a particular example is assigned the mean of its node. This forced assignment loses information, much like turning a continuous variable into a categorical one.
In this post, we will use the “Participation” dataset from the “Ecdat” package to predict income based on the other variables in the dataset. Below is some initial code.
library(rpart);library(partykit);library(caret);library(Ecdat)
data("Participation") str(Participation)
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
There are several things we need to do to make the results easier to interpret. The “age” variable needs to be multiplied by ten, as it is currently recorded in decades and shows values such as 4.5, 3, etc. Common sense indicates that four-year-olds and three-year-olds are not earning incomes.
In addition, we need to convert our income variable (lnnlinc) from the log of income back to regular income. This will also help in understanding the results. Below is the code.
Participation$age<-10*Participation$age #convert age from decades to years
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income, not the log of income
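As a quick sanity check, assuming the transformations worked, the summaries should now show ages in years and incomes on the original scale rather than in logs.
summary(Participation$age) #ages should now be in years
summary(Participation$lnnlinc) #income should now be on the original scale, not logs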
The next step is to create our training and testing data sets. Below is the code.
set.seed(502)
ind<-sample(2,nrow(Participation),replace=TRUE,prob=c(.7,.3)) #randomly assign rows to group 1 (70%) or group 2 (30%)
train<-Participation[ind==1,]
test<-Participation[ind==2,]
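As a side note, since the caret package is already loaded, an equivalent split could be made with its “createDataPartition” function. The names “train2” and “test2” below are only for illustration.
set.seed(502)
idx<-createDataPartition(Participation$lnnlinc,p=.7,list=FALSE) #70/30 split
train2<-Participation[idx,]
test2<-Participation[-idx,]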
We can now develop our model. We will also use the “print” command to view the results. Below is the code.
reg.tree<-rpart(lnnlinc~.,data = train)
Below is a printout of the current tree.
reg.tree
## n= 636
##
## node), split, n, deviance, yval
## * denotes terminal node
##
##  1) root 636 390503700000 48405.08
##    2) educ< 11.5 473 127460900000 43446.69
##      4) educ< 9.5 335 70269440000 40758.25
##        8) foreign=yes 129 10617380000 36016.12 *
##        9) foreign=no 206 54934520000 43727.84 *
##      5) educ>=9.5 138 48892370000 49972.98 *
##    3) educ>=11.5 163 217668400000 62793.52
##      6) age< 34.5 79 34015680000 51323.86
##       12) age< 25.5 12 984764800 34332.97 *
##       13) age>=25.5 67 28946170000 54367.01 *
##      7) age>=34.5 84 163486000000 73580.46
##       14) lfp=yes 36 23888410000 58916.66 *
##       15) lfp=no 48 126050900000 84578.31
##         30) educ< 12.5 29 86940400000 74425.51
##           60) age< 38.5 8 763764600 57390.34 *
##           61) age>=38.5 21 82970650000 80915.10
##            122) age>=44 14 34091840000 68474.57 *
##            123) age< 44 7 42378600000 105796.20 *
##         31) educ>=12.5 19 31558550000 100074.70 *
I will not interpret all of this, but here is a brief description using nodes 2, 4, and 8: if the person has less than 11.5 years of education (473 qualify), and less than 9.5 years of education (335 of the 473 qualify), and is a foreigner (129 of the 335 qualify), then their average salary is 36,016.12 dollars.
Perhaps now you can see how some information is lost. The average salary for people in this node is 36,016.12 dollars, but probably nobody in it earns exactly that amount.
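We can see this flattening directly. Every observation that falls into the same terminal node gets the same fitted value, so the model only ever produces a handful of distinct predictions.
table(round(predict(reg.tree))) #one distinct fitted value per terminal node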
If what I said does not make sense, here is a plot of the current regression tree.
plot(as.party(reg.tree))
The little boxes at the bottom are boxplots of the incomes in each terminal node.
Tree Modification
We will now make modifications to the tree. We will begin by examining the cptable. Below is the code.
reg.tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.11619458 0 1.0000000 1.0026623 0.1666662
## 2 0.05164297 1 0.8838054 0.9139383 0.1434768
## 3 0.03469034 2 0.8321625 0.9403669 0.1443843
## 4 0.02125215 3 0.7974721 0.9387060 0.1433101
## 5 0.01933892 4 0.7762200 0.9260030 0.1442329
## 6 0.01242779 5 0.7568810 0.9097011 0.1434606
## 7 0.01208066 7 0.7320255 0.9166627 0.1433779
## 8 0.01046022 8 0.7199448 0.9100704 0.1432901
## 9 0.01000000 9 0.7094846 0.9107869 0.1427025
The cptable shares a lot of information. First, “CP” stands for complexity parameter (used for cost-complexity pruning), and this is the column furthest to the left; this number decreases as the tree becomes more complex. “nsplit” indicates the number of splits in the tree. “rel error” is the error of the tree relative to the root node, essentially a scaled residual sum of squares (RSS). The “xerror” and “xstd” columns are the cross-validated average error and the standard deviation of that error, respectively.
One thing we can see from the cptable is that nine splits produce the lowest relative error, but the cross-validated error bottoms out at five splits, and a single split (a tree of size 2) already comes within one standard deviation of that minimum.
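Rather than reading the table by eye, the cp value of the row with the smallest cross-validated error can be pulled out programmatically. Note that this picks the single lowest “xerror”, which may not match the simpler tree we settle on below.
reg.tree$cptable[which.min(reg.tree$cptable[,"xerror"]),"CP"] #cp of the row with the lowest xerror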
We will now make a plot of the complexity parameter to determine at what point to prune the tree. Pruning removes unnecessary splits that do not improve the model much. Below is the code. The plot is a visual version of the cptable.
plotcp(reg.tree)
It appears that a tree of size 2 is the best, but this is boring. The next clear dip in the cross-validated error is at eight splits (row 8 of the cptable). Therefore, we will prune our tree back to eight splits. First, we need to create an object that contains the cp value for that row. Then we use the “prune” function to make the actual modified tree.
cp<-min(reg.tree$cptable[8,]) #grab the cp value from row 8 of the cptable
pruned.reg.tree<-prune(reg.tree,cp=cp) #prune the tree back to eight splits
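The min() trick above only works because the cp value happens to be the smallest number in its row of the cptable. A more explicit way to grab the same value is to index the “CP” column by name.
cp<-reg.tree$cptable[8,"CP"] #same value, but explicit
pruned.reg.tree<-prune(reg.tree,cp=cp)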
We will now plot our modified tree.
plot(as.party(pruned.reg.tree))
The only difference is the loss of the age node that split at 25.5.
Model Test
We can now test our tree to see how well it performs.
reg.tree.test<-predict(pruned.reg.tree,newdata=test) #predictions on the test set
reg.tree.resid<-reg.tree.test-test$lnnlinc #residuals
mean(reg.tree.resid^2) #mean squared error
## [1] 431928030
The number we calculated is the mean squared error (MSE). This number must be compared with models developed differently in order to assess the current model; by itself, it means nothing.
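As one simple benchmark, consider a “model” that ignores the features entirely and predicts the mean training income for every test case. The tree’s MSE should come in below this number.
naive.resid<-mean(train$lnnlinc)-test$lnnlinc #baseline: predict the training mean for everyone
mean(naive.resid^2)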
Conclusion
This post exposed you to regression trees. This type of tree can be used to make numeric predictions with nonlinear data. However, the partitioning comes with a loss of information, as the uniqueness of each example is lost when it is assigned the mean of its node.