Comparison of K-nearest neighbours classifier (KNN) and Decision Tree classifier with Titanic data set

Introduction
The purpose of this challenge is to predict which passengers survived the Titanic disaster at the beginning of the 20th century. We will use two machine learning algorithms for this task: the K-nearest neighbours (KNN) classifier and the Decision Tree classifier. We will perform basic data cleaning and feature engineering and compare the results of the two algorithms.
What will you learn?
You will practice two classification algorithms here, KNN and Decision Tree. You will learn how to prepare the data for the best results through data cleaning and feature engineering.
Problem definition
We have two data sets: one for training (train.csv), containing survival information, which we will use to train our models, and one for testing (test.csv), without survival information, which we will use to test them.
Step by step solution
If you don’t have your computer set up for data science read my article How to set up your computer for Data Science.
Create a project folder
Create a folder for a project on your computer called "Titanic-Challenge".
Download train.csv and test.csv data sets from Kaggle
https://www.kaggle.com/c/titanic/data
Place these data sets in a folder called "data" in your project folder.
Start a new notebook
Enter this folder and start Jupyter Notebook by typing a command in the Terminal/Command Prompt:
$ cd "Titanic-Challenge"
then
$ jupyter notebook
Click New in the top right corner and select Python 3.

This will open a new Jupyter Notebook in your browser. Rename the Untitled project name to your project name and you are ready to start.

If you have Anaconda installed, you will already have all the libraries needed for this project.
If you are using Google Colab, open a new notebook.
Loading libraries & Setup
The first thing we usually do in a new notebook is import the libraries we will need while working on the project.
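A minimal set of imports for this project might look like the sketch below (all of these ship with Anaconda):

# Data handling and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Machine learning models and evaluation tools from Scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold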
Loading Titanic data
Now we need to load the data sets from the files we downloaded into variables as Pandas Data Frames.
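Assuming the files were placed in the data folder as described above, loading them could look like this (the variable names train and test are my choice and are used throughout this article):

# Load the Titanic data sets into Pandas Data Frames
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')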
Exploratory data analysis
It is always a good practice to look at the data.
Let’s look at the train and test data. The only difference between them is the Survived column which indicates if the passenger survived the disaster or not.
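For example:

# Run each line in its own notebook cell to display the table
train.head()
test.head()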
Below is also an explanation of each field in the data set.

- PassengerId: unique ID of the passenger
- Survived: 0 = No, 1 = Yes
- Pclass: passenger class 1 = 1st, 2 = 2nd, 3 = 3rd
- Name: name of the passenger
- Sex: passenger’s sex
- Age: passenger’s age
- SibSp: number of siblings or spouses on the ship
- Parch: number of parents or children on the ship
- Ticket: Ticket ID
- Fare: the amount paid for the ticket
- Cabin: cabin number
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Let’s now look at the dimensionality of the train Data Frame.
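The shape attribute gives us the dimensions:

# Rows and columns in the train data set
train.shape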
# Output
(891, 12)
We can see that the train data set has 891 records and 12 columns.
Let’s do the same for the test data set.
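# Rows and columns in the test data set
test.shape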

# Output
(418, 11)
The only differences in the test data set are the number of records, which is 418, and the number of columns, which is 11: we are missing the Survived column. This is the column we will be predicting with the machine learning models we are going to build.
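The counts below could be produced with something like this sketch:

# Count passengers in both data sets
print(f'There are {len(train) + len(test)} passengers in both data sets.')
print(f'{len(train)} in train data set.')
print(f'{len(test)} in test data set.')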
# Output
There are 1309 passengers in both data sets.
891 in train data set.
418 in test data set.
We can also already see that we have some missing data (NaN values) in our data sets. For our classification models to work effectively, we will have to do something about the missing data. We will examine this in detail and deal with it a little later, but for now let's use the Pandas info() function to get an idea about the missing values.
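For the train data set:

# Column overview: non-null counts and data types
train.info()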
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
We can see that in the training data Age, Cabin and Embarked have some missing values.
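And for the test data set:

# Same overview for the test data set
test.info()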
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
We have a similar situation in the test data set.
We can also check for null values using the isnull() function.
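For example:

# Count missing values per column (run each line in its own cell)
train.isnull().sum()
test.isnull().sum()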
# Output
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# Output
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
This confirms our earlier findings.
Now let's look at the data in a little more detail to get an idea of what is in it.
A good way to do this is to draw some charts.
In our project we will use the Matplotlib library to display charts.
We are primarily interested in the characteristics of passengers who survived and of those who did not.
We will now create a function that displays whether passengers survived the Titanic disaster or not, plotted against a specified feature.
We will use only the train data set for this, because it is the only one that contains survival information.
And now let’s build the charts for selected features.
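A sketch of such a function (the name plot_survival is my own; it stacks survival and death counts for each value of the given feature):

def plot_survival(feature):
    # Count survivors and non-survivors for each value of the feature
    survived = train[train['Survived'] == 1][feature].value_counts()
    died = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, died], index=['Survived', 'Died'])
    df.plot(kind='bar', stacked=True, figsize=(8, 4), title=feature)
    plt.show()

plot_survival('Sex')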

We can see that significantly more females survived than males. The result is even more striking for passengers who did not survive, where females make up a very small percentage compared to males.
Now let’s look at the passenger class.
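Using the same function:

plot_survival('Pclass')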

We can see here that passengers from the 3rd class were more likely to die than passengers from the 1st class, who had a higher chance to survive.
These and other relationships between the features and the survival rate are very important to us and to our machine learning model.
Feature engineering
Once we have loaded our data sets and have a good understanding of the data we are working with, we will perform some feature engineering.
Feature engineering is the process of extracting features from the existing features in the data set in order to improve the performance of the machine learning model.
Usually, that means not only creating new features but also replacing missing values and removing features that do not contribute to the performance of the model.
Let’s have a look again at the missing values in our train data set.
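Checking once more:

# Missing values per column in the train data set
train.isnull().sum()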
# Output
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Replace missing values with the median value for the column
We have many missing values in the Age column. We will fill all of them with the median value for that column. The median is "the middle" value of a column. To make the values more accurate, we will calculate the median for each sex separately. We will perform this for both the train and test data sets.
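A sketch, assuming the train and test Data Frames from earlier:

# Fill missing ages with the median age of each sex
for df in [train, test]:
    df['Age'] = df['Age'].fillna(df.groupby('Sex')['Age'].transform('median'))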

All NaN values in the Age column should now be replaced with numbers. Let's confirm by checking for null values again.
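# Check for null values again
train.isnull().sum()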
# Output
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
And we no longer have null values in the Age column of the train data set.
Another thing we are going to do with age is use a technique called data binning and put people of different ages into different bins (groups). This usually improves the performance of machine learning models.
We are going to put the passengers in four age groups:
- 1: (age <= 18)
- 2: (age > 18 and <= 40)
- 3: (age > 40 and <= 60)
- 4: (age > 60)
We will perform this for train and test data.
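One way to create these bins is with Pandas' cut function (a sketch; the upper bound of 120 is my assumption to cover all ages):

# Bin ages into the four groups defined above
for df in [train, test]:
    df['Age'] = pd.cut(df['Age'], bins=[0, 18, 40, 60, 120],
                       labels=[1, 2, 3, 4]).astype(int)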


Because machine learning models operate only on numeric values, we need to replace the text values in the Sex column with numbers to create numeric categories. These will be our categories:
- 0: male
- 1: female
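A simple mapping does the job (a sketch):

# Replace text values in the Sex column with numeric categories
for df in [train, test]:
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})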
One other thing that would be useful to do is extract the title information (Mr., Mrs., Miss.) from the Name column, create bins (groups) similar to what we did with the Age column, and then drop the Name column.
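One way to extract the title is with a regular expression that takes the word followed by a dot in the Name column (a sketch; the column name Title is my choice):

# Extract the title (a word followed by a dot) from the Name column
for df in [train, test]:
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)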
Let’s display the created values.
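# Run each line in its own cell to display the counts
train['Title'].value_counts()
test['Title'].value_counts()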
# Output
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Sir           1
Mme           1
Ms            1
Countess      1
Capt          1
Don           1
Jonkheer      1
Lady          1
Name: Title, dtype: int64
# Output
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Dr          1
Ms          1
Dona        1
Name: Title, dtype: int64
As we can see, we really only have three major groups here: Mr, Miss and Mrs. We will create four bins (groups), one for each of these plus an Other category for everything else. Our groups are going to look like this:
- 1: Mr
- 2: Miss
- 3: Mrs
- 4: everything else
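A sketch of the mapping (titles not in the map fall into group 4):

# Map the three major titles; everything else goes into group 4
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3}
for df in [train, test]:
    df['Title'] = df['Title'].map(title_map).fillna(4).astype(int)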
Now let’s look at the graph of the Title data.
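Using our plotting function again:

plot_survival('Title')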
As we can see, people with the title Mr. have a significantly lower chance to survive, which should be useful information for our machine learning models.
Removing unnecessary features
Now, let's remove the features we don't think we need to train the model: Name, Ticket, Fare, Cabin and Embarked. We could probably still extract some additional features from them, but for now we will remove them and train our models without them.
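A sketch:

# Drop features we are not using to train the models
for df in [train, test]:
    df.drop(['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
            axis=1, inplace=True)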

Now we need to prepare our training data and the target with survival information to train our machine learning models. To do this, we create a data set without the Survived column and a target variable containing only the survival information. This is how machine learning models typically require the data for training: input (X, the independent variables) and output (y, the dependent variable).
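A sketch (I also drop PassengerId from the input, since it is only a row identifier and carries no signal):

# Input features (X) and target (y)
X = train.drop(['Survived', 'PassengerId'], axis=1)  # PassengerId dropped as it is just an identifier
y = train['Survived']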

Building & training a Machine Learning model
Now we are ready to build and train our machine learning models.
We will use two different algorithms and compare the results to see which one performs better.
We are going to use the K-nearest neighbors (KNN) classifier and the Decision Tree classifier from the Scikit-learn library.
K-nearest neighbors (KNN) classifier
We will create a KNN model with 13 neighbors (n_neighbors=13) and use the cross-validation technique to train our model, shuffling the data and splitting it into k folds. Cross-validation helps to prevent unintentional ordering errors when training machine learning models.
We will end up with one score per fold, and we will average them to get the final measure of the model's performance.
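A sketch with Scikit-learn, using a shuffled 10-fold split (the fold count matches the ten scores below; the random_state value is my assumption):

# Shuffled 10-fold cross-validation of the KNN model
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

knn = KNeighborsClassifier(n_neighbors=13)
score = cross_val_score(knn, X, y, cv=k_fold, scoring='accuracy')
print(score)
print(f'Our KNN classifier score is {round(np.mean(score) * 100, 2)}%')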
# Output
[0.82222222 0.76404494 0.82022472 0.79775281 0.80898876 0.83146067
0.82022472 0.79775281 0.82022472 0.84269663]
# Output
Our KNN classifier score is 81.26%
Decision Tree classifier
We will do the same with the Decision Tree model, using the same cross-validation technique.
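The same procedure with a Decision Tree (the random_state for reproducibility is my addition):

# Same cross-validation with a Decision Tree classifier
tree = DecisionTreeClassifier(random_state=0)
score = cross_val_score(tree, X, y, cv=k_fold, scoring='accuracy')
print(score)
print(f'Our Decision Tree classifier score is {round(np.mean(score) * 100, 2)}%')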
# Output
[0.8 0.79775281 0.78651685 0.78651685 0.86516854 0.78651685
0.84269663 0.80898876 0.78651685 0.84269663]
# Output
Our Decision Tree classifier score is 81.03%
As we can see, both models achieved similar results, with a good accuracy of around 81%.
This result can probably still be improved by performing some more feature engineering on the Fare, Cabin and Embarked columns, which I encourage you to do.
Testing
Now we can run our model on the test data to predict the values.
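A sketch, fitting the KNN model on the full training data (either model could be used here):

# Train on the full training data and predict survival for the test data
knn.fit(X, y)
predictions = knn.predict(test.drop('PassengerId', axis=1))
predictions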
# Output
array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, [...], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
We can now save the results to a file.
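A sketch of writing the predictions in the format Kaggle expects (the file name submission.csv is my choice):

# Combine passenger IDs with predictions and save to a CSV file
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions
})
submission.to_csv('submission.csv', index=False)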
If you want you can upload your results here (https://www.kaggle.com/c/titanic) and take part in the Kaggle Titanic competition.
To read and display your results you can use the following code.
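For example (assuming the submission.csv file created above):

# Read the saved results back and display the first rows
results = pd.read_csv('submission.csv')
results.head()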
If you would like to learn more and experiment with Python and Data Science you can look at another of my articles Analysing Pharmaceutical Sales Data in Python, Introduction to Computer Vision with MNIST or Image Face Recognition in Python.
To consolidate your knowledge, consider completing the task again from the beginning without looking at the code examples, and see what results you get. This is an excellent way to solidify what you have learned.
Full Python code in Jupyter Notebook is available on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/5-Titanic-Challenge
Happy coding!
Not subscribing to Medium yet? Consider signing up to become a Medium member. It’s only $5 a month and it will give you unlimited access to all stories on Medium. Subscribing to Medium supports me and other writers on Medium.