In this post, we will make predictions about Titanic survivors using decision trees. The advantage of decision trees is that they split the data into clearly defined groups, and this splitting continues until the data is divided into small, relatively homogeneous subsets. These subsets are then used to make predictions.
We are assuming you have the data and have read the previous machine learning post on this blog. If not, please click here.
You need to install the ‘rpart’ package from the CRAN repository, as it contains the decision tree function. You will also need to install the following packages.
Each of these packages plays a role in developing and visualizing decision trees.
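The original list of packages is not shown. A common set for plotting ‘rpart’ trees (an assumption here, not necessarily what the original post used) is ‘rattle’, ‘rpart.plot’, and ‘RColorBrewer’:

```r
# 'rpart' builds the tree; 'rattle', 'rpart.plot', and 'RColorBrewer'
# are assumed here -- they are a common choice for nicer tree plots
install.packages(c("rpart", "rattle", "rpart.plot", "RColorBrewer"))
```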
Building the Model
Once you have installed the packages, you can build the model; the code is below. The model uses most of the variables in the data set to predict survivors.
tree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")
The ‘rpart’ function builds the classification tree, also known as a decision tree.
We now need to see the tree, which we do with the code below. The plot() function draws the tree and text() adds the labels.
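The plotting code itself is not shown in the post. A minimal sketch, using rpart's built-in kyphosis data set as a stand-in for the Titanic training data:

```r
library(rpart)

# Stand-in model: a small classification tree on rpart's built-in
# kyphosis data set (the Titanic 'train' data is not bundled here)
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

plot(tree)  # draws the tree skeleton
text(tree)  # adds the split rules and node labels
```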
You can probably tell that this is an ugly plot. To improve its appearance, we will run the following code.
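Again the code is not shown; one common way to get a cleaner rpart plot (an assumption, not necessarily what the original post used) is fancyRpartPlot() from the ‘rattle’ package:

```r
library(rpart)
library(rattle)  # provides fancyRpartPlot(); depends on rpart.plot and RColorBrewer

# Stand-in model on rpart's built-in kyphosis data set
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# a much cleaner rendering than plot() + text()
fancyRpartPlot(tree)
```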
Below is the revised plot, which looks much better.
How to Read the Tree
Here is one way to read a decision tree.
- At the top node, keep in mind that we are predicting survival. There is a 0 or 1 in each of the ‘buckets’; this number represents how the bucket voted. If more than 50% of the passengers in a bucket perish, then the bucket votes ‘0’, or no survivors.
- Still looking at the top bucket: 62% of the passengers died while 38% survived, before we even split the data. The number under the node tells what percentage of the sample is in that “bucket”. The first bucket contains 100% of the sample.
- The first split is based on sex. If the passenger is male, you look to the left. Among males, 81% died compared to 19% who survived, so the bucket votes 0 for death. This bucket contains 65% of the sample.
- For those who are not male (female), we look to the right and see that only 26% died compared to 74% who survived, so this bucket votes 1 for survival. It represents 35% of the entire sample.
- This process continues all the way down to the bottom of the tree.
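The node statistics described above (the vote, the class proportions, and the share of the sample) can also be read off the fitted object directly; a sketch, again using rpart's built-in kyphosis data as a stand-in:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# one row per node: the split rule, n, the vote, and the class probabilities
print(fit)

# fraction of the sample that reaches each node (the number under each bucket)
fit$frame$n / nrow(kyphosis)
```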
Decision trees are useful for making ‘decisions’ about what is happening in your data. For those who are looking for a simple prediction algorithm, decision trees are a good place to begin.
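To actually generate predictions from a fitted tree, predict() with type = "class" returns the vote of the bucket each observation lands in; a sketch with the same stand-in kyphosis data:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# the vote cast by the bucket each observation lands in
votes <- predict(fit, type = "class")

# the class proportions shown inside each bucket of the plot
probs <- predict(fit, type = "prob")

table(votes)  # how many observations received each predicted class
```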