
Data Science Titanic Challenge Solution

Predicting Titanic Survivors using Data Science and Machine Learning

Comparison of K-nearest neighbours classifier (KNN) and Decision Tree classifier with Titanic data set

Photo by Alonso Reyes on Unsplash

Introduction

The purpose of this challenge is to predict who survived and who died in the Titanic disaster at the beginning of the 20th century. We will use two machine learning algorithms for this task: the K-nearest neighbours (KNN) classifier and the Decision Tree classifier. We will perform basic data cleaning and feature engineering and compare the results of these two algorithms.

What will you learn?

You will practice two classification algorithms here: KNN and Decision Tree. You will learn how to prepare the data to achieve the best results through data cleaning and feature engineering.

Problem definition

We have two data sets. One for training (train.csv), containing survival and death information, which we will use to train our models. One for testing (test.csv), without survival and death information, which we will use to test our models.

Step by step solution

If you don’t have your computer set up for data science, read my article How to set up your computer for Data Science.

Create a project folder

Create a folder for a project on your computer called "Titanic-Challenge".

Download train.csv and test.csv data sets from Kaggle

https://www.kaggle.com/c/titanic/data

Place these data sets in a folder called "data" in your project folder.

Start a new notebook

Enter this folder and start Jupyter Notebook by typing a command in the Terminal/Command Prompt:

$ cd "Titanic-Challenge"

then

$ jupyter notebook

Click New in the top right corner and select Python 3.


This will open a new Jupyter Notebook in your browser. Rename the Untitled project name to your project name and you are ready to start.


If you have Anaconda installed, you will already have all the libraries needed for this project.

If you are using Google Colab, open a new notebook.

Loading libraries & Setup

The first thing we usually do in a new notebook is import the different libraries we will need while working on the project.
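
A minimal set of imports for this project might look like this (a sketch; adjust it to your own notebook):

# Data handling and visualisation
import pandas as pd
import matplotlib.pyplot as plt

# Machine learning models and cross-validation utilities
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold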

Loading Titanic data

Now we need to load the data sets from the files we downloaded into variables as pandas DataFrames.
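
For example, assuming the files sit in the data folder and we call the DataFrames train and test (these names are my choice and are used in the sketches below):

# Load the Kaggle Titanic data sets into pandas DataFrames
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')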

Exploratory data analysis

It is always a good practice to look at the data.

Let’s look at the train and test data. The only difference between them is the Survived column, which indicates whether the passenger survived the disaster or not.
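
A quick preview, continuing the sketch above:

# Preview the first few rows of the train data set
train.head()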

Below is also an explanation of each field in the data set.

  • PassengerId: unique ID of the passenger
  • Survived: 0 = No, 1 = Yes
  • Pclass: passenger class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name: name of the passenger
  • Sex: passenger’s sex
  • Age: passenger’s age
  • SibSp: number of siblings or spouses on the ship
  • Parch: number of parents or children on the ship
  • Ticket: Ticket ID
  • Fare: the amount paid for the ticket
  • Cabin: cabin number
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Let’s now look at the dimensionality of the train Data Frame.
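
For example:

# Dimensions of the train data set: (rows, columns)
train.shape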

# Output
(891, 12)

We can see that the train data set has 891 records and 12 columns.

Let’s do the same for the test data set.
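
# Dimensions of the test data set: (rows, columns)
test.shape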

# Output
(418, 11)

The only differences in the test data set are the number of records, which is 418, and the number of columns, which is 11. The Survived column is missing from the test data set; this is the column we will predict with the machine learning models we are going to build.
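
A quick count of the records in both data sets could look like this (a sketch):

total = len(train) + len(test)
print(f'There are {total} passengers in both data sets.')
print(f'{len(train)} in train data set.')
print(f'{len(test)} in test data set.')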

# Output
There are 1309 passengers in both data sets.
891 in train data set.
418 in test data set.

We can also see already that we have some missing data (NaN values) in our data sets. For our classification models to work effectively, we will have to do something about the missing data. We will check this in detail and deal with it a little later, but for now let’s just look at the output of the pandas info() function to get an idea about the missing values.
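
For the train data set:

# Column overview with non-null counts and data types
train.info()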

# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

We can see that in the training data Age, Cabin and Embarked have some missing values.
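
The same overview for the test data set:

test.info()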

# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

We have a similar situation in the test data set.

We can also check for null values using the isnull() function.
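
A sketch of this check (run each line in its own cell to see both outputs):

# Count missing values per column
train.isnull().sum()
test.isnull().sum()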

# Output
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Output
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

This confirms our earlier findings.

Now let’s look at the data in a little more detail to get an idea of what is in it.

A good way to do it is to draw some charts from the data.

In our project we will use the Matplotlib library to display charts.

We are primarily interested in the characteristics of passengers who survived or did not survive.

We will now create a function that will display whether the passengers survived the Titanic disaster or not, against a specified feature.

We will be using only the train data set for this because it is the only data set with survival information.
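
A function along these lines would do the job (a sketch; the function name and chart layout are my own choices):

def display_survival(feature):
    # Count the values of the feature separately for survivors and non-survivors
    survived = train[train['Survived'] == 1][feature].value_counts()
    died = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, died], index=['Survived', 'Died'])
    df.plot(kind='bar', figsize=(8, 5), title=f'Survival by {feature}')
    plt.show()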

And now let’s build the charts for selected features.
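
For example, for the Sex column:

display_survival('Sex')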


We can see that significantly more females survived than males. The difference is even more pronounced among passengers who did not survive, where females make up a very small percentage compared to males.

Now let’s look at the passenger class.
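
Using the same sketch function:

display_survival('Pclass')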


We can see here that passengers from the 3rd class were more likely to die than passengers from the 1st class, who had a higher chance of survival.

These and other relationships between the features and the survival rate are very important to us and to our machine learning model.

Feature engineering

Once we have loaded our data sets and have a good understanding of the data we are working with, we will perform some feature engineering.

Feature engineering is the process of extracting features from the existing features in the data set in order to improve the performance of the machine learning model.

Usually, that means not only creating new features but also replacing missing values and removing features that do not contribute to the performance of the model.

Let’s have a look again at the missing values in our train data set.

# Output
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Replace missing value with the median value for the column

We have many missing values in the Age column. We will fill all the missing values in the Age column with the median value for that column. The median is “the middle” value of a column. To make the values more accurate, we will calculate the median for each sex separately. We will perform this for both the train and test data sets.
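
One way to do this, continuing the earlier sketch:

# Fill missing ages with the median age calculated per sex,
# in both the train and test data sets
for df in [train, test]:
    median_age = df.groupby('Sex')['Age'].transform('median')
    df['Age'] = df['Age'].fillna(median_age)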


We can see that all NaN values were replaced with numbers.

# Output
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

And we no longer have null values in the train data set.

We are also going to use a technique called data binning on the Age column and put people of different ages into different bins (groups). This usually improves the performance of machine learning models.

We are going to put the passengers in four age groups:

  • 1: (age <= 18)
  • 2: (age > 18 and <= 40)
  • 3: (age > 40 and <= 60)
  • 4: (age > 60)

We will perform this for train and test data.
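
A sketch of the binning step (the upper bound of 120 is just an arbitrary value above any real age):

# Bin ages into the four groups defined above
for df in [train, test]:
    df['Age'] = pd.cut(df['Age'], bins=[0, 18, 40, 60, 120],
                       labels=[1, 2, 3, 4]).astype(int)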


Because machine learning models operate only on numeric values, we need to replace the text values in the Sex column with numbers to create numeric categories. These will be our categories:

  • 0: male
  • 1: female
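
A sketch of the mapping:

# Replace text values in the Sex column with numeric categories
for df in [train, test]:
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})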


One other thing that would be useful to do is to extract the Title information (Mr., Mrs., Miss.) from the Name column and create bins (groups), similar to what we did with the Age column, and then drop the Name column.
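
A common way to extract the title is a regular expression that captures the word before the full stop (a sketch; other patterns would work too):

# Create a Title column from the Name column
for df in [train, test]:
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)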

Let’s display the created values.
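
For example (run each line in its own cell to see both outputs):

train['Title'].value_counts()
test['Title'].value_counts()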

# Output
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Sir           1
Mme           1
Ms            1
Countess      1
Capt          1
Don           1
Jonkheer      1
Lady          1
Name: Title, dtype: int64
# Output
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Dr          1
Ms          1
Dona        1
Name: Title, dtype: int64

As we can see, we really only have three major groups here: Mr, Miss and Mrs. We will create four bins (groups), putting everything else into an Other category. Our groups are going to look like this:

  • 1: Mr
  • 2: Miss
  • 3: Mrs
  • 4: everything else
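
A sketch of this mapping, where any title not in the dictionary falls into group 4:

# Map the three major titles to groups 1-3; everything else becomes 4
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3}
for df in [train, test]:
    df['Title'] = df['Title'].map(title_map).fillna(4).astype(int)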

Now let’s look at the graph of the Title data.


As we can see, people with the title Mr had a significantly lower chance of surviving, which should be useful information for our machine learning model.

Removing unnecessary features

Now, let’s remove the features we don’t think we need to train the model. In our example these will be Name, Ticket, Fare, Cabin and Embarked. We could probably still extract some additional features from them, but for now we decide to remove them and train our models without them.
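
A sketch of the clean-up:

# Remove the columns we will not use for training
drop_cols = ['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked']
train = train.drop(columns=drop_cols)
test = test.drop(columns=drop_cols)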


Now we need to prepare our training data and the target information with survivals to train our machine learning models. To do this, we need to create another data set without the Survived column and a target variable containing only the survival information. This is how machine learning models typically require the data for training: input (X, the independent variables) and output (y, the dependent variable).
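
For example:

# Input features (X) and the target variable (y)
X = train.drop('Survived', axis=1)
y = train['Survived']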


Building & training a Machine Learning model

Now we are ready to build and train our machine learning models.

We will use two different algorithms and compare the results to see which one performs better.

We are going to use the K-nearest neighbors (KNN) classifier and the Decision Tree classifier from the Scikit-learn library.

K-nearest neighbors (KNN) classifier

We will create a KNN model with 13 neighbors (n_neighbors=13) and use cross-validation to train it, shuffling the data and splitting it into k folds. Cross-validation helps to prevent unintentional ordering errors when training machine learning models.

We will end up with several scores, which we will average to obtain the final measure of the model’s performance.
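
A sketch of this step; the ten folds are inferred from the ten scores below, and the random state is an arbitrary choice:

# KNN classifier evaluated with shuffled 10-fold cross-validation
knn = KNeighborsClassifier(n_neighbors=13)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(knn, X, y, cv=k_fold, scoring='accuracy')
print(scores)
print(f'Our KNN classifier score is {scores.mean() * 100:.2f}%')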

# Output
[0.82222222 0.76404494 0.82022472 0.79775281 0.80898876 0.83146067
 0.82022472 0.79775281 0.82022472 0.84269663]
# Output
Our KNN classifier score is 81.26%

Decision Tree classifier

We will do the same with the Decision Tree model, using the same cross-validation technique.
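
Continuing the sketch, with the same folds:

# Decision Tree classifier evaluated with the same cross-validation
dtree = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(dtree, X, y, cv=k_fold, scoring='accuracy')
print(scores)
print(f'Our Decision Tree classifier score is {scores.mean() * 100:.2f}%')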

# Output
[0.8        0.79775281 0.78651685 0.78651685 0.86516854 0.78651685
 0.84269663 0.80898876 0.78651685 0.84269663]
Our Decision Tree classifier score is 81.03%

As we can see, both models achieved similar results, with quite a good accuracy of around 81% for both.

This result could probably still be improved by performing some more feature engineering on the Fare, Cabin and Embarked columns, which I encourage you to try.

Testing

Now we can run our model on the test data to predict the values.
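
A sketch using the KNN model (either model could be used here):

# Train on the full training data, then predict survival for the test data
knn.fit(X, y)
predictions = knn.predict(test)
predictions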

# Output
array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, [...], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

We can now save the results to a file.
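
A sketch in the Kaggle submission format (the file name submission.csv is my choice):

# Save PassengerId and the predicted Survived values to a CSV file
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions
})
submission.to_csv('submission.csv', index=False)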

If you want you can upload your results here (https://www.kaggle.com/c/titanic) and take part in the Kaggle Titanic competition.

To read and display your results you can use the following code.
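
For example (a sketch, assuming the file name above):

# Read the saved results back and display the first rows
results = pd.read_csv('submission.csv')
results.head()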

If you would like to learn more and experiment with Python and Data Science you can look at another of my articles Analysing Pharmaceutical Sales Data in Python, Introduction to Computer Vision with MNIST or Image Face Recognition in Python.

To consolidate your knowledge, consider completing the task again from the beginning without looking at the code examples and see what results you get. This is an excellent way to solidify what you have learned.

Full Python code in Jupyter Notebook is available on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/5-Titanic-Challenge

Happy coding!


Not subscribing to Medium yet? Consider signing up to become a Medium member. It’s only $5 a month and it will give you unlimited access to all stories on Medium. Subscribing to Medium supports me and other writers on Medium.

