Data Science Application: Titanic Survivors

This post will provide a practical application of some of the basics of data science using data from the sinking of the Titanic. In this post, in particular, we will explore the dataset and see what we can uncover.

Loading the Dataset

The first thing that we need to do is load the actual datasets into R. In machine learning, there are always at least two datasets. One dataset is the training dataset and the second dataset is the testing dataset. The training is used for developing a model and the testing is used for checking the accuracy of the model on a different dataset. Downloading both datasets can be done through the use of the following code.

url_train <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
url_test <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
training <- read.csv(url_train)
testing <- read.csv(url_test)

What Happen?

  1. We created the variable “url_train” and put the web link in quotes. We then repeat this for the test data set
  2. Next, we create the variable “training” and use the function “read.csv” for our “url_train” variable. This tells R to read the csv file at the web address in ‘url_train’. We then repeat this process for the testing variable

Exploration

We will now do some basic data exploration. This will help us to see what is happening in the data. What to look for is endless. Therefore, we will look at a few basics things that we might need to know. Below are some questions with answers.

  1. What variables are in the dataset?

This can be found by using the code below

str(training)

The output reveals 12 variables

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 354 273 16 555 516 625 413 577 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Here is a better explanation of each

  • PassengerID-ID number of each passenger
  • Survived-Who survived as indicated by a 1 and who did not by a 0
  • Pclass-Class of passenger 1st class, 2nd class, 3rd class
  • Name-The full name of a passenger
  • Sex-Male or female
  • Age-How old a passenger was
  • SibSp-Number of siblings and spouse a passenger had on board
  • Parch-Number of parents and or children a passenger had
  • Ticket-Ticket number
  • Fare-How much a ticket cost a passenger
  • Cabin-Where they slept on the titanic
  • Embarked-What port they came from

2. How many people survived in the training data?

The answer for this is found by running the following code in the R console. Keep in mind that 0 means died and 1 means survived

> table(training$Survived)
  0   1 
549 342

Unfortunately, unless you are really good at math these numbers do not provide much context. It is better to examine this using percentages as can be found in the code below.

> prop.table(table(training$Survived))

        0         1 
0.6161616 0.3838384 

These results indicate that about 62% of the passengers died while 38% survived in the training data.

3. What percentage of men and women survived?

This information can help us to determine how to setup a model for predicting who will survive. The answer is below. This time we only look percentages.

> prop.table(table(training$Sex, training$Survived), 1)
        
                 0         1
  female 0.2579618 0.7420382
  male   0.8110919 0.1889081

You can see that being male was not good on the titanic. Men died in much higher proportions compared to women.

4. Who survived by class?

The code for this is below

> prop.table(table(train$Pclass, train$Survived), 1)
   
            0         1
  1 0.3703704 0.6296296
  2 0.5271739 0.4728261
  3 0.7576375 0.2423625

Here is a code for a plot of this information followed by the plot

plot(deathbyclass, main="Passenger Fate by Traveling Class", shade=FALSE, 
+      color=TRUE, xlab="Pclass", ylab="Survived")
Rplot

3rd class had the highest mortality rate. This makes sense as 3rd class was the cheapest tickets.

5. Does age make a difference in survival?

We want to see if age matters in survival. It would make sense that younger people would be more likely to survive. This might be due to parents give their kids a seat on the lifeboats and younger singles pushing their way past older people.

We cannot use a plot for this because we would have several dozen columns on the x-axis for each year of life. Instead, we will use a box plot based on survived or died to see. Below is the code followed by a visual.

boxplot(training$Age ~ training$Survived, 
        main="Passenger Fate by Age",
        xlab="Survived", ylab="Age")

Rplot01

As you can see, there is little difference in terms of who survives based on age. This means that age may not be a useful predictor of survival.

Conclusion

Here is what we know so far

  • Sex makes a major difference in survival
  • Class makes a major difference in survival
  • Age may not make a difference in survival

There is so much more we can explore in the data. However, this is enough for beginning to lay down criteria for developing a model.

2 thoughts on “Data Science Application: Titanic Survivors

  1. Pingback: Data Science Application: Titanic Survivors | E...

  2. Pingback: Decisions Trees with Titanic | educationalresearchtechniques

Leave a Reply