Predicting the Survival of Titanic Passengers (Part 1)

This is a classic project for those who are starting out in machine learning: predicting which passengers survived the Titanic shipwreck. I will give this project a try using the training and testing data obtained from Kaggle.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

I will tackle this challenge according to the following steps:

1. Data exploration, visualization and wrangling
I’ll explore the data set to get a sense of what information it carries; visualization is a great tool to help us gain more insights about the data. Data wrangling will be performed where needed. At this stage, we will also be able to draw some preliminary conclusions on which features are correlated with a passenger’s survival.

2. Modelling the data and tuning model hyperparameters
In addition to using the default hyperparameters of the machine learning models in Scikit-learn, I’ll try to tune the hyperparameters to see if this improves the accuracy of the model. A Decision Tree Classifier will be used as an example.

3. Implementation
Instead of using a single model to predict the survival of passengers in the testing data, I’ll use a voting classifier so that we can use the majority vote of multiple models as our prediction. When all is said and done, the results will be submitted to Kaggle!

This blog post will focus on the first step, data exploration, visualization and wrangling. Click here to view the blog post on modelling the data.

1. Data exploration, visualization and wrangling
1.1 Importing libraries and data
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Configure Visualizations
%matplotlib inline
mpl.style.use('ggplot')
pylab.rcParams['figure.figsize'] = 8, 6
train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')

print(train.shape)
print(test.shape)
(891, 12)
(418, 11)

The training data contains information on 891 Titanic passengers. There are 12 columns describing features of each passenger, one of which indicates whether the passenger survived the shipwreck. The testing data contains much the same information except for the survival of the passengers. This is the information we would like to predict in the end.

1.2 Data exploration and visualization
These are the features included in the data and their definitions.
PassengerId  Unique identifier of each passenger
Survived     Whether the passenger survived (0 = No, 1 = Yes)
Pclass       Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
Name         Name of the passenger; the passenger's title is also included
Sex          Male/female
Age          Age in years
SibSp        # of siblings / spouses aboard the Titanic
Parch        # of parents / children aboard the Titanic
Ticket       Ticket number
Fare         Fare amount
Cabin        Cabin number
Embarked     Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

With these features, we can start asking questions about what were the factors affecting the survival of passengers. For example,

  • Are children and the elderly more likely to survive?
  • Are females more likely to survive than males?
  • Does social class (represented by Pclass and the title extracted from Name) affect survival?
  • Does social class affect the survival rate of each sex? Are women from the lower class more likely to survive than men from the upper class?
  • What about port of embarkation? Is it a proxy for social class or other demographic information such as age?
  • How does the number of family members on board affect survival? Does it make a difference if children are accompanied by their parents?

Then, we use the describe function to learn more about the data.

train.describe(include = 'all')
  • There are missing values in Age, Cabin and Embarked. These will have to be filled in.
  • Cabin has a lot of missing values (77%!). In this case, dropping it might be the better option, as it would be difficult to compute replacement values that are representative enough.
  • Some rows contain a Fare of 0. It seems pretty unreasonable that a ticket on the Titanic would cost nothing, so these can be treated as missing values.
  • SibSp and Parch both provide information about the number of family members accompanying the passenger aboard. We can consider summing these 2 variables to create a new feature, family size.
  • PassengerId, Name and Ticket seem to be unique identifiers that are irrelevant to a passenger’s survival. However, title can be extracted from Name as a proxy for the socioeconomic status / marital status / sex of the passenger.
  • There is some categorical data, such as Pclass and Embarked, which will need to be converted to dummy variables for mathematical analysis.
  • Similar to the training data, the testing data has missing values in Age, Embarked and Fare. There are lots of missing values in Cabin as well.
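These observations can be verified with a few pandas calls. Here is a minimal sketch on a toy frame with the same column names but made-up rows; on the real data we would run the same calls on `train` and `test`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle training data (same column names, made-up rows)
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 0.0, 7.92, 8.05],
})

missing = toy.isnull().sum()              # NaN count per column
cabin_pct = toy['Cabin'].isnull().mean()  # fraction of Cabin values missing
zero_fares = (toy['Fare'] == 0).sum()     # fares of 0, to be treated as missing

print(missing['Age'], cabin_pct, zero_fares)  # prints: 2 0.75 1
```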

Data visualization is a crucial part of data analysis. It deepens our understanding of the data, helps us identify which features are useful in predicting the survival of a passenger, and shows how best to wrangle the data. For the former, it’s important to note that differences in survival across the values of a feature/an independent variable are what the model will use to separate the target variable (survival in this case). For example, if the survival rates of the two sexes, male and female, were different, then sex could be a relevant feature for predicting whether a passenger survived. Conversely, if the survival rates of both sexes were about the same, sex would not be a good variable for our predictive model.
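As a concrete illustration of this idea: because Survived is a 0/1 column, the survival rate of each group is simply the group mean. A sketch on made-up rows (on the real data the same `groupby` would be run on `train`):

```python
import pandas as pd

# Made-up rows standing in for the training data
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the survival rate of each group
rates = toy.groupby('Sex')['Survived'].mean()
print(rates)  # clearly different rates -> Sex looks like a useful predictor
```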

Heatmaps help us understand the correlation between features as well as their correlation with the target variable. Let’s look at the heatmap of our training data.

plt.subplots(figsize = (12, 10))
cmap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(
    train.corr(),
    cmap = cmap,
    cbar_kws = {'shrink': .9}, 
    annot = True, 
    annot_kws = {'fontsize': 12}
)
  • Survival has some correlation with Pclass and Fare
  • Pclass has some correlation with Age and Fare
  • SibSp has some correlation with Parch, Age and Fare
  • Parch has some correlation with Age and Fare

(I have defined “some correlation” as a correlation greater than 0.1 in absolute value.)
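This screen can be expressed directly against the correlation matrix. A sketch on made-up numbers (only the column names match the real data):

```python
import pandas as pd

# Toy frame: real column names, invented values
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Fare':     [7.0, 71.0, 53.0, 8.0, 30.0, 9.0],
    'SibSp':    [1, 1, 0, 0, 0, 3],
})

corr = toy.corr()
# keep features whose correlation with Survived exceeds 0.1 in absolute value
related = corr['Survived'][corr['Survived'].abs() > 0.1].drop('Survived')
print(related)  # here both Fare and SibSp pass the 0.1 screen
```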

Let’s look at the features one by one. We’ll look at the numerical features – Age, Fare, Parch and SibSp first, followed by the remaining categorical features.

(i) Age
facet = sns.FacetGrid(train, hue = 'Survived', aspect = 4), 'Age', shade = True)
facet.set(xlim = (0, train['Age'].max()))
Hmm.. it looks like children are more likely to survive. What if we look into both age and sex?
facet = sns.FacetGrid(train, hue = 'Survived', aspect = 4, row = 'Sex'), 'Age', shade = True)
facet.set(xlim = (0, train['Age'].max()))

Interesting: it looks like boys are more likely to survive than men, but the opposite goes for girls and women.

According to the heatmap, Age has a rather high correlation with Pclass (-0.37), Parch (-0.19) and SibSp (-0.31). Let’s examine these features in pairs.

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = [15,5])
sns.boxplot(x='Pclass', y='Age', ax=ax1, data=train)
sns.boxplot(x='Parch', y='Age', ax=ax2, data=train)
sns.boxplot(x='SibSp', y='Age', ax=ax3, data=train)
  • Pclass: 1st class passengers were relatively older than 2nd class passengers, while 3rd class passengers were generally younger. This might be because older passengers had accumulated more wealth or achieved a higher social status than younger passengers.
  • Parch: Younger passengers traveled with 1-2 parents/children. This could be because younger passengers generally didn’t have children and were travelling with their parents, while older passengers generally either traveled alone (0) or with their children (and maybe their parents too!), giving them more parents/children on board.
  • SibSp: Younger passengers traveled with more siblings/spouses (>= 3). This is probably because younger passengers generally traveled with their siblings, i.e. a larger group, while older passengers generally traveled with their spouse (one person).

At the beginning of our analysis, we found that there are missing values for Age. We could use our observations to separate passengers into groups according to their Sex, Pclass, Parch and SibSp, and then impute their Age values depending on which group they belong to. There are 2 ways of imputing the Age value:

  1. Calculate the mean Age of each group. Passengers with missing Age values will be assigned the mean Age of their respective groups.
  2. Randomly choose the Age value of a passenger in a group and assign it to a passenger with a missing Age value in the same group.

The shortcoming of method #1 is that if a large number of passengers with missing Age values all belong to the same group, they will all be assigned the same Age. This could have a huge effect on the distribution of Age. Therefore, let’s use method #2. We will do this later in this blog post.
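The difference between the two methods can be seen on a toy Age series. This is only an illustration of the trade-off, not the exact code used later in this post:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 30.0, 40.0, np.nan, np.nan, np.nan])

# Method 1: every missing value gets the same mean -> a spike in the distribution
mean_filled = ages.fillna(ages.mean())

# Method 2: draw each replacement from the observed values -> shape is preserved
observed = ages.dropna()
random_filled = ages.copy()
n_missing = random_filled.isnull().sum()
random_filled[random_filled.isnull()] = observed.sample(
    n=n_missing, replace=True, random_state=0).values
```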

(ii) Fare
facet = sns.FacetGrid(train, hue = 'Survived', aspect = 4), 'Fare', shade = True)
facet.set(xlim = (0, train['Fare'].max()))
facet = sns.FacetGrid(train, hue = 'Survived', aspect = 4, row = 'Sex'), 'Fare', shade = True)
facet.set(xlim = (0, train['Fare'].max()))

We observe that passengers who paid a higher Fare were more likely to survive, regardless of Sex.

Previously, we noticed that there are a few missing values for Fare (counting both 0 and null). In order to impute the missing values, let’s observe the distribution of Fare across Pclass. Note that Fare and Pclass have a rather high correlation of -0.55 according to the heatmap.

print("Number of missing Fare values in training data:", train['Fare'].isnull().sum())
print("Number of missing Fare values in testing data:", test['Fare'].isnull().sum())
print("Entries with Fare value = 0 in training & testing data:", np.sum(train['Fare'] == 0) + np.sum(test['Fare'] == 0))

ax = sns.boxplot(x='Pclass', y='Fare', data=train)
Number of missing Fare values in training data: 0
Number of missing Fare values in testing data: 1
Entries with Fare value = 0 in training & testing data: 17

Since there aren’t many missing Fare values, we can replace them with the median Fare of the passengers’ Pclass. Note that the median is chosen instead of the mean due to the outliers observed, in particular in Pclass = 1.
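To see why the median is the safer choice here, consider a handful of toy fares with one extreme outlier (reminiscent of the ~512 fare seen in 1st class):

```python
import pandas as pd

fares = pd.Series([26.0, 28.0, 30.0, 512.0])  # toy 1st-class fares, one outlier

print(fares.mean())    # 149.0 -- dragged up by the outlier
print(fares.median())  # 29.0  -- robust, a sensible fill value
```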

(iii) Parch and SibSp
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = [15,5])
sns.barplot(x = 'Parch', y = 'Survived', ax = ax1, data = train)
sns.barplot(x = 'SibSp', y = 'Survived', ax = ax2, data = train)
It seems that passengers who travelled with a small group of family members (1-2 Parch or SibSp) were more likely to survive than those who travelled alone or with a large group. Let’s combine Parch and SibSp into FamilySize for further observation.
train['FamilySize'] = train['Parch'] + train['SibSp'] + 1 # +1 to include the passenger himself/herself
ax = sns.barplot(x = 'FamilySize', y = 'Survived', data = train)
Again, it looks like smaller families of 2-4 were the most likely to survive. The new feature we just created, FamilySize, has some correlation with Age (-0.30) and Fare (0.22), similar to Parch and SibSp according to the table below.


(iv) Pclass
ax = sns.barplot(x = 'Pclass', y = 'Survived', data = train).set_title("Survival of All Passengers in Different Ticket Classes")
facet = sns.FacetGrid(train, aspect = 2, size = 4, row = 'Sex'), 'Pclass', 'Survived')

1st class passengers were the most likely to survive, followed by those in 2nd class and then 3rd class. The same holds for both males and females.

(v) Sex
ax = sns.barplot(x = 'Sex', y = 'Survived', data = train)

Females were more likely to survive than males.

(vi) Embarked
ax = sns.barplot(x = 'Embarked', y = 'Survived', data = train)
ax.set_title("Survival of Passengers Embarked at Each Port")
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize = [15,15])
sns.barplot(x='Sex', y='Survived', hue = 'Embarked', ax=ax1, data=train).set_title("Survival by Sexes by Ports of Embarkation")
sns.barplot(x='Pclass', y='Survived', hue = 'Embarked', ax=ax2, data=train).set_title("Survival by Ticket Classes by Ports of Embarkation")
sns.boxplot(x='Embarked', y='Fare', ax=ax3, data=train).set_title("Fare Paid by Passengers Embarked at Each Port")
sns.boxplot(x='Embarked', y='Age', ax=ax4, data=train).set_title("Age of Passengers Embarked at Each Port")

Most passengers embarked at Port S (Southampton). The distribution of Sex, Pclass, Fare and Age didn’t seem to vary significantly across ports of embarkation. Given that Embarked has only a small number of missing values (just 2), we will fill them in with the mode, Port S.

(vii) Name
While Name is a unique identifier that does not help to predict survival, title can be extracted from Name and might offer some insight into how social rank/marital status correlates with survival.
train['Name'].head()
0       Braund, Mr. Owen Harris
1       Cumings, Mrs. John Bradley (Florence Briggs Th…
2       Heikkinen, Miss. Laina
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4       Allen, Mr. William Henry
Name: Name, dtype: object
train['Title'] = train['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())

# obtain unique Titles
train['Title'].unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess', 'Jonkheer'], dtype=object)
# Create a dictionary of more aggregated titles
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }

train['Title'] = train['Title'].map(Title_Dictionary)
Now that the Titles are extracted, let’s visualize them to see if they give us some insights.
fig, ((ax1), (ax2)) = plt.subplots(2, 1, figsize = [10,15])
sns.barplot(x = 'Title', y = 'Survived', ax = ax1, data = train).set_title("Survival of Passengers by Their Titles")
sns.barplot(x='Title', y='Survived', hue = 'Sex', ax = ax2, data=train).set_title("Survival of Passengers of Different Sexes by Their Titles")
  • Married women (Mrs) were more likely to survive than single women (Miss).
  • Passengers with higher social status were more likely to survive, as shown by Master/Royalty/Officer passengers being more likely to survive than commoners (Mr/Mrs/Miss) of the same sex.
1.3 Data Wrangling

Now that we have some idea on how the data should be cleaned up, let’s combine the training and testing data into one dataframe to process them in one go.

# dropping the new columns we created to ease processing the full dataset
train = train.drop(['FamilySize', 'Title'], axis = 1)
full = pd.concat([train, test], ignore_index = True)
1.3.1 Wrangling Numerical Data
(i) Age
We will impute the missing Age value of a passenger using the Age of a random passenger who belongs to the same Sex, Pclass, Parch and SibSp group.
# obtain the index of rows with missing Age values
index_missingage = full[full['Age'].isnull()].index.tolist()
print("There are %d missing Age values." % len(index_missingage))
There are 263 missing Age values.
for i in index_missingage:
    miss_passenger = full.iloc[i]
    # for each passenger with missing age value, find the group of passengers of the same Sex, Pclass, SibSp, Parch
    age_group = full[(full['Sex'] == miss_passenger['Sex']) & 
                     (full['Pclass'] == miss_passenger['Pclass']) &
                     (full['SibSp'] == miss_passenger['SibSp']) & 
                     (full['Parch'] == miss_passenger['Parch']) &
                     (full['Age'] > 0)]  # Age > 0 to exclude those with missing Age values
    # in case there is no match, use a broader classification with only matching Sex and Pclass
    if len(age_group) == 0:
        age_group = full[(full['Sex'] == miss_passenger['Sex']) & 
                         (full['Pclass'] == miss_passenger['Pclass']) &
                         (full['Age'] > 0)]
    # Set the Age value of that index to be equal to that of a random sample in the same group
    full.loc[i, 'Age'] = age_group['Age'].sample(n=1).iloc[0]

# check if all missing values were imputed
full['Age'].isnull().value_counts()
False    1309
Name: Age, dtype: int64

(ii) Fare

We will impute missing Fare value of a passenger using the median Fare of the corresponding Pclass.

# mark fare = 0 as NA
full.loc[full['Fare'] == 0,'Fare'] = np.nan

index_missingfare = full[full['Fare'].isnull()].index.tolist()
print("There are %d missing Fare values." % len(index_missingfare))
There are 18 missing Fare values.
fare_guess = [0,0,0]
for i in range(0, 3):
    guess_df = full[(full['Pclass'] == i+1)]['Fare'].dropna()
    fare_guess[i] = int(guess_df.median())
# assign the median fare of the respective ticket class of passengers with missing fare values
for x in index_missingfare:
    if full.loc[x, 'Pclass'] == 1:
        full.loc[x, 'Fare'] = fare_guess[0]
    elif full.loc[x, 'Pclass'] == 2:
        full.loc[x, 'Fare'] = fare_guess[1]
    else:
        full.loc[x, 'Fare'] = fare_guess[2]

# check to see if all missing Fare values were imputed
full['Fare'].isnull().value_counts()
False    1309
Name: Fare, dtype: int64

(iii) Parch & SibSp

We will create a new variable, FamilySize, by summing Parch and SibSp.

full['FamilySize'] = full['Parch'] + full['SibSp'] + 1
1.3.2 Wrangling categorical data – conversion to dummy variables
There are 2 ways to handle categorical data:

(a) Converting them into dummy variables
For example, there are 3 ports of embarkation, C = Cherbourg, Q = Queenstown and S = Southampton. When they are converted to dummy variables, 3 new columns are created, namely “Embarked_C”, “Embarked_Q” and “Embarked_S”. Passengers’ ports of embarkation are represented as follows:

Embarked_C Embarked_Q Embarked_S
Passenger embarked at Cherbourg 1 0 0
Passenger embarked at Queenstown 0 1 0
Passenger embarked at Southampton 0 0 1

(b) Converting them into ordinal variables
If categorical data is converted into ordinal variables, each category is assigned a number according to a certain order. Using port of embarkation again as an example, suppose we assign “1” to Cherbourg, “2” to Queenstown and “3” to Southampton; the categorical data will then be represented as follows:

Passenger embarked at Cherbourg 1
Passenger embarked at Queenstown 2
Passenger embarked at Southampton 3

Converting categorical data into dummy variables is the better option in this case because there is no particular order among the ports of embarkation. While the difference won’t be significant in this example given the small number of categories, for variables with a large number of categories, converting them into ordinal variables implicitly assigns an order to them (imagine having 10 categories: the last one is assigned a value of 10!). This could distort our analysis if the categories in fact do not have an ordinal relationship. To illustrate, in a linear regression, a category assigned a higher ordinal value would have a larger effect on the dependent variable. Using dummy variables (0, 1) avoids this problem.

(i) Pclass
pclass = pd.get_dummies(full['Pclass'], prefix = 'Pclass')
  Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
(ii) Sex
sex = pd.DataFrame()
sex_dict = {'male': 1, "female": 0}
sex['Sex'] = full['Sex'].map(sex_dict)
0 1
1 0
2 0
3 0
4 1
(iii) Embarked

We will first replace the missing values with the mode, then convert Embarked to dummy variables.

embarked_raw = pd.DataFrame()
embarked_raw['Embarked'] = full['Embarked'].fillna(full['Embarked'].mode()[0])

embarked = pd.get_dummies(embarked_raw['Embarked'], prefix = 'Embarked')
  Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
(iv) Name

We will first extract the Titles of passengers, then convert them to dummy variables.

title = pd.DataFrame()

# extract titles from names
title['Title'] = full['Name'].map(lambda name: name.split( ',' )[1].split( '.' )[0].strip())

# Create a dictionary of more aggregated titles
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }

title['Title'] = title['Title'].map(Title_Dictionary)
title = pd.get_dummies(title['Title'])
  Master Miss Mr Mrs Officer Royalty
0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0

Now to the last step: let’s include the dummy variables we just created in our dataset, and remove the categorical variables as well as other features, such as PassengerId and Cabin, that we decided not to include in our model.

full_numeric = full.drop(['Cabin', 'Embarked', 'Name', 'PassengerId', 'Pclass', 'Sex', 'Ticket'], axis = 1)

full_dummies = pd.concat([full_numeric, pclass, sex, embarked, title], axis = 1)


So that’s it for the first step of this project. Stay tuned for the next blog post in which I’ll try to predict the survival of Titanic passengers using machine learning models. Meanwhile, please feel free to leave your comments on this post.

This is part 1 of a two-part blog post series on the classic Kaggle competition of predicting the survival of Titanic passengers. Click here for Part 2.
