Predicting the Survival of Titanic Passengers (Part 2)

In my previous blog post, we learned a bit about what affects the survival of Titanic passengers by conducting exploratory data analysis and visualizing the data. The data was then wrangled to prepare it for modelling. In this blog post, I will use machine learning algorithms from Python’s Scikit-learn library to predict which passengers in the testing data survived. A Decision Tree Classifier is used as an example, and its hyperparameters are then tuned to see if that improves prediction accuracy. I’ll also try using an ensemble of models to predict the results.

2. Modelling the data and tuning model hyperparameters
2.1 Importing libraries
The model that we’ll use, the Decision Tree Classifier, is available in the Scikit-learn library. Other functions that will be used for validating our model and tuning its hyperparameters are also imported.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate, GridSearchCV
2.2 Preparing training and validation data

In the previous blog post, we combined the training and testing data for data wrangling. Now, let’s split them back into the training and testing sets. Our model will be fit to the training data.

# only the features will be included as "X", while Survived, the target variable, will be "Y"
full_X = full_dummies.drop('Survived', axis = 1)
full_Y = full_dummies[['Survived']]

# the first 891 rows were the training data
train_valid_X = full_X[0:891]
train_valid_Y = full_Y[0:891]

# the rest were the testing data
test_X = full_X[891:]

To avoid overfitting, the training data will be further split into a training set and a validation set. The model will first be trained on the training set and then applied to the validation set. Since the model was never trained on the validation set, if it can still predict the target (“Survived”) on this “unseen” data with a similar level of accuracy to the training set, we can be fairly confident that the model generalizes well across different passengers’ data. Otherwise, the model might be overfitted, meaning it is only useful for predicting data in the training set.

We will use shuffle split, one of the many cross validation techniques, to split the training data into training and validation sets. Here’s how it works:

(figure: illustration of how shuffle split works)
Scikit-learn allows us to specify the number of reshuffling and splitting iterations and the sizes of the training and validation sets. I will reshuffle and split the training data 10 times (n_splits=10); each time, 60% of the data will be used as the training set and 30% as the validation set (it’s called “test” in this function). The remaining 10% is intentionally left out to allow more randomness in our sampling.
cv_split = ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 )
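As a quick sanity check (my own hypothetical snippet, not part of the original workflow), we can iterate over the generated splits and confirm the training/validation sizes:

# inspect the first of the 10 splits to confirm the 60%/30% proportions
for train_index, valid_index in cv_split.split(train_valid_X):
    print("training rows: {}, validation rows: {}".format(len(train_index), len(valid_index)))
    break  # one split is enough for a sanity check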
2.3 Introduction to the Decision Tree Classifier

In this blog post, I’ll use the Decision Tree Classifier to model the data because it’s easy to interpret and provides information on the key classification features. Before we jump in and start training our model, allow me to briefly explain what a Decision Tree Classifier is.

A Decision Tree Classifier identifies the most effective feature for splitting the data into 2 subsets, where “effectiveness” is assessed by a certain criterion. The default criterion used in Scikit-learn is to minimize Gini impurity. Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. For example, if everything in the subset is “A”, then the Gini impurity would be 0 because whenever a random element is selected from the subset, we would always make the correct guess that it’s an “A”. On the other hand, if there are 3 “A”s and 3 “B”s in the subset, the Gini impurity would be 0.5 because we would make a wrong guess half of the time. In other words, with minimizing Gini impurity as an objective, a Decision Tree Classifier will select a feature which produces subsets that contain as many elements from the same class as possible. That way, the chance of making a wrong guess about the label of a randomly chosen element from the subsets is minimized.
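To make this concrete, here is a small sketch (my own illustration, not from the original analysis) that computes Gini impurity for the two example subsets described above:

# Gini impurity = 1 - sum of (proportion of each class)^2 within the subset
def gini_impurity(labels):
    n = float(len(labels))
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini_impurity(['A', 'A', 'A']))                  # 0.0 -- a "pure" subset, we always guess correctly
print(gini_impurity(['A', 'A', 'A', 'B', 'B', 'B']))   # 0.5 -- we would guess wrong half of the time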

Here is a visualized example of how a decision tree classifies 3 different Iris flower species, setosa, versicolor and virginica:

(figure: Iris decision tree; image credit: http://scikit-learn.org/stable/modules/tree.html)

  1. The decision tree identifies a feature – whether the length of the petal of an Iris flower is shorter than 2.45cm – that separates the data into 2 subsets while minimizing Gini impurity.
  2. According to the orange box, all Iris flowers with petals shorter than 2.45cm are Iris setosas (there are 50 of them, as shown by “value = [50, 0, 0]”). Since all flowers in this group are of the same species, Gini impurity is 0. In the white box, on the other hand, 50 of the flowers are Iris versicolors while the remaining 50 are Iris virginicas (“value = [0, 50, 50]”), so Gini impurity is 0.5 since there is a 50% chance of making a wrong guess, for example by guessing that a random sample from this subset is an Iris versicolor.
  3. The decision tree identifies another feature to further separate the subset in the white box: whether the width of the petal of an Iris flower is less than 1.75cm. Referring to the green box, out of all 54 flowers (“samples = 54”) with petal width less than 1.75cm, most of them (49) are Iris versicolors, whereas in the purple box, 45 out of 46 flowers with petal width more than 1.75cm are Iris virginicas.

Up to this level we could make a pretty accurate guess of the species of an Iris flower according to its petal length and width: those whose petals are shorter than 2.45cm are Iris setosas, the remaining ones whose petals are narrower than 1.75cm are mostly Iris versicolors, and the rest are mostly Iris virginicas. We can stop here if we think this level of accuracy is good enough; otherwise, we can go further down the decision tree to identify more features to further separate our data.
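If you would like to reproduce a tree like the one above, here is a minimal sketch based on the Scikit-learn tree documentation linked in the image credit (rendering the resulting .dot file requires Graphviz):

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
iris_tree = tree.DecisionTreeClassifier(random_state = 0).fit(iris.data, iris.target)

# export the fitted tree in Graphviz .dot format to produce a figure like the one above
tree.export_graphviz(iris_tree, out_file = 'iris_tree.dot',
                     feature_names = iris.feature_names, class_names = iris.target_names,
                     filled = True)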

2.4 Modelling data

After gaining some basic understanding of a Decision Tree Classifier, let’s train our model using the default hyperparameters in Scikit-learn and use shuffle split for cross validation. Then, its hyperparameters will be tuned to see if that improves the model score.

# default settings of a Decision Tree Classifier
dtree = DecisionTreeClassifier(random_state = 0)

# implement shuffle split for cross validation (return_train_score is needed to get the training scores back)
base_results = cross_validate(dtree, train_valid_X, train_valid_Y, cv = cv_split, return_train_score = True)
dtree.fit(train_valid_X, train_valid_Y)

print("Decision tree parameters (default): ", dtree.get_params())
print("-"*10)
print("Decision tree mean training score (default): {:.2f}".format(base_results['train_score'].mean()*100))
print("Decision tree mean validation score (default): {:.2f}".format(base_results['test_score'].mean()*100))
Decision tree parameters (default):  {'presort': False, 'splitter': 'best', 'min_impurity_decrease': 0.0, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'criterion': 'gini', 'random_state': 0, 'min_impurity_split': None, 'max_features': None, 'max_depth': None, 'class_weight': None}
----------
Decision tree mean training score (default): 99.19
Decision tree mean validation score (default): 76.23

It seems that our model is doing pretty well given the high model scores for both the training and validation sets. However, note that there is a rather big difference between the scores of the training and validation sets. This might be a sign of overfitting.

2.5 Tuning model hyperparameters

Next up, let’s tune the hyperparameters of our Decision Tree Classifier to see if we can improve our model’s performance and reduce overfitting. A parameter grid is used to specify the combinations of hyperparameters we would like to try on the decision tree. Then, the GridSearchCV function will be used to identify the combination of hyperparameters that has the best performance based on a specified criterion. We will define the parameter grid to try 12 different combinations of 2 hyperparameters, “criterion” and “max_depth”, e.g. criterion = gini with max_depth = 2, criterion = gini with max_depth = 4, etc., as illustrated in this table.

(random_state = 0 for every combination)

              criterion = gini    criterion = entropy
max_depth     2                   2
              4                   4
              6                   6
              8                   8
              10                  10
              None                None
  • max_depth: how many “layers” your decision tree will have at most. The depth of the Iris flower decision tree is 5.
  • criterion: the “goal” of the decision tree when separating our data into subsets. In addition to minimizing Gini impurity, we will try another criterion: minimizing entropy. To minimize entropy basically means to minimize the “messiness” within a subset of your data, i.e. to try to separate your data into subsets whose elements are as homogeneous to each other as possible. It is a similar concept to Gini impurity but the math behind it is different (a small numeric comparison follows below); here’s a page if you would like to learn more: https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md
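For a rough feel of how the two criteria compare numerically, here is a small sketch (my own illustration) for a two-class subset where a fraction p of the elements belongs to class “A”:

import math

def gini(p):
    # Gini impurity of a two-class subset
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    # entropy (in bits) of the same subset; defined as 0 for a pure subset
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p, 2) + (1 - p) * math.log(1 - p, 2))

for p in (0.0, 0.25, 0.5):
    print("p = {:.2f}: gini = {:.3f}, entropy = {:.3f}".format(p, gini(p), entropy(p)))

Both measures are 0 for a pure subset and largest for a 50/50 split, which is why they usually lead to similar trees.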

A scoring mechanism to determine which model is the best can also be specified. I’ll use the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve visualizes the true positive and false positive rates of a classifier. If a classifier is below the diagonal line (i.e. area under the curve < 0.5), it is worse than a random guess. On the other hand, a perfect classifier will be at the top left corner (i.e. area under the curve = 1), as all the “positives” are flagged correctly and none of the “negatives” are incorrectly flagged as “positives”. The upper right corner means that a classifier is flagging everything as positive and the lower left corner means that a classifier is flagging nothing as positive.

(figure: example ROC curve; image credit: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)
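As a small illustration of how this score is computed (made-up numbers in the spirit of the Scikit-learn metrics documentation, not the competition data):

from sklearn.metrics import roc_auc_score

# true labels and a classifier's predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75 -- better than a random guess (0.5), worse than perfect (1.0)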

Let’s see which are the best hyperparameters for our decision tree.

# defining the parameter grid
param_grid = {'criterion': ['gini', 'entropy'],'max_depth': [2,4,6,8,10,None],'random_state': [0]}

# search for the best combination of hyperparameters based on area under the ROC curve ('roc_auc')
tune_model = GridSearchCV(DecisionTreeClassifier(), param_grid = param_grid, scoring = 'roc_auc', cv = cv_split, return_train_score = True)
tune_model.fit(train_valid_X, train_valid_Y)

print "Decision tree parameters (tuned): ", tune_model.best_params_
print "Decision tree roc_auc score (tuned): ", tune_model.best_score_
print "Decision tree mean training score (tuned): {:.2f}". format(tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100) 
print "Decision tree mean validation score (tuned): {:.2f}". format(tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100)
Decision tree parameters (tuned):  {'random_state': 0, 'criterion': 'entropy', 'max_depth': 4}
Decision tree roc_auc score (tuned):  0.855571685826
Decision tree mean training score (tuned): 89.97
Decision tree mean validation score (tuned): 85.56

The best model for our training data is one that minimizes entropy with a max_depth of 4. Its ROC AUC score is well above 0.5, so it is better than a random guess. Compared to the training and validation scores of the default decision tree (99 and 76), the tuned model has a higher validation score and a smaller difference between the training and validation scores, i.e. less overfitting.

3. Implementation

We can repeat the above steps for other classification models and submit the one that produces the most accurate predictions to Kaggle. Alternatively, we can use the ensemble method, in which the “joint effort” of several models is used for prediction. In this example, I’ll use hard voting among 7 models, which means the final prediction will be made according to the majority prediction result; in other words, if most models predict that a particular passenger survived, the final prediction will be “Survived” = 1. A toy illustration of hard voting follows below.
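Here is what hard voting means in practice (made-up predictions, not the actual models):

import numpy as np

# each row is one model's predictions for three passengers (1 = survived, 0 = did not survive)
model_predictions = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
])

# hard voting: predict "survived" for a passenger if more than half of the models say so
majority_vote = (model_predictions.sum(axis = 0) > model_predictions.shape[0] / 2.0).astype(int)
print(majority_vote)  # [1 0 1]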

3.1 Importing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
3.2 Modeling data using ensemble method
# list of models to use
vote_est = [
    ('dtc', DecisionTreeClassifier()),
    ('gbc', GradientBoostingClassifier()),
    ('rfc', RandomForestClassifier()),
    ('lr', LogisticRegression()), 
    ('gnb', GaussianNB()),
    ('knn', KNeighborsClassifier()),
    ('svc', SVC(probability=True))
]

# specify "hard voting" will be used in the VotingClassifier
vote_hard = VotingClassifier(estimators = vote_est , voting = 'hard')
# shuffle split is used for cross validaiton
vote_hard_cv = cross_validate(vote_hard, train_valid_X, train_valid_Y, cv  = cv_split)
vote_hard.fit(train_valid_X, train_valid_Y)

# print scores of the ensemble
print("Ensemble mean training score: {:.2f}". format(vote_hard_cv['train_score'].mean()*100)) 
print("Ensemble mean validation score: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
Ensemble mean training score: 93.22
Ensemble mean validation score: 83.13

While there is a slight improvement in the training score, the validation score of the ensemble is actually lower than that of the tuned Decision Tree Classifier. Bear in mind that our ensemble used the default hyperparameters of the models. We can tune the hyperparameters of each of the models in the ensemble to further improve their performance using the GridSearchCV function, then train the data using an ensemble of tuned models, as sketched below.
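A rough sketch of that idea could look like this (the parameter grids below are placeholders for illustration, not tuned values from this post):

# hypothetical per-model parameter grids -- purely illustrative
grids = {
    'rfc': {'n_estimators': [50, 100, 300], 'max_depth': [4, 6, None]},
    'knn': {'n_neighbors': [3, 5, 7]},
}

tuned_est = []
for name, model in vote_est:
    if name in grids:
        search = GridSearchCV(model, param_grid = grids[name], scoring = 'roc_auc', cv = cv_split)
        search.fit(train_valid_X, train_valid_Y)
        tuned_est.append((name, search.best_estimator_))
    else:
        tuned_est.append((name, model))

# an ensemble built from the tuned models
vote_hard_tuned = VotingClassifier(estimators = tuned_est, voting = 'hard')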

3.3 Submission to Kaggle

Anyway, I’ll stop here for this blog post and submit the prediction results of the ensemble to Kaggle.

# using the ensemble to make predictions on the testing data
prediction = vote_hard.predict(test_X).astype(int)

# format the prediction results and export as csv for submitting to Kaggle
import pandas as pd  # pd is assumed to have been imported in Part 1; repeated here for completeness
test_results = pd.DataFrame()
test_results['PassengerId'] = test_X.index + 1
test_results['Survived'] = prediction
test_results.set_index('PassengerId', inplace=True)
test_results.to_csv('titanic_submission.csv')

(figure: my Kaggle leaderboard rank)

Voila, my first Kaggle competition submission! I ranked #2956 with a model score of around 0.79, not too bad for my first try. Comment below with your score and the model you used! 🙂


This is part 2 of a 2-series blog post on the classical Kaggle competition of predicting the survival of Titanic passengers. Click here for Part 1. Special credits to this Kaggle kernel as a useful learning guide in completing part 2 of this project.
