A walkthrough of Logistic Regression and Naive Bayes.

The year was 1912, and the mighty Titanic set sail on her maiden voyage. Jack, a "20 year old" "third class" "male" passenger, won a hand of poker and his ticket to the land of the free. In the last hour of April 14th, Titanic struck an iceberg, and its fate was sealed. Will Jack survive this wreckage?
(Yes, I know he died in the movie, but if he were a real person, would he have survived?)
This is a binary classification problem because we’re trying to predict one of two outcomes: survive or not survive. There are many algorithms for classification; some work better than others depending on the data.
I’m going to train two different machine learning models to answer this question and do an in-depth comparison of the two algorithms.
I’m using the Titanic dataset from Kaggle.
This dataset has a lot of information about a passenger: name, age, sex, class (1st, 2nd or 3rd), fare, number of siblings on board, etc.
Which of these features should we pick to predict Jack’s fate?
The art and science of feature selection deserves its own article. For now, let’s just apply some reasoning. Sex and age probably matter (remember in the movie they were like "women and children first"). Passenger class probably matters too. Let’s pick these 3 features.
The first algorithm I’ll use is Logistic Regression.
Algorithm 1: Logistic regression
Logistic regression predicts the likelihood of one outcome versus the other. In this case, we use the model to predict the probability that Jack lives. Because the model computes a probability, the output of the model is always between 0 and 1.
In 2D, the model is a logistic curve that best fits the dataset. In the plot below, each blue dot is a passenger, the x-axis is the age, and the y-axis is whether they survived or not: 1 means survived and 0 means did not survive. The model is the red curve.

There are several ways of finding the function of best fit. Gradient descent is one of them; Newton’s method is another. For an in-depth look at the implementation, read this article.
Given a new input point on the x-axis (such as age = 39), we look at where the curve lands on the y-axis to see what the probability of survival is.
Note: this plot is NOT representative of the real dataset, it’s here for illustrative purposes only.
In 3D+, the model is a hyperplane that best fits the dataset.
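To make the shape of the model concrete, here is a minimal sketch of the logistic (sigmoid) function with age as the only feature. The coefficient and intercept below are made-up values for illustration, not the fitted model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficient for age and intercept, for illustration only
w, b = -0.05, 1.0
age = 39
# the curve's height at age 39 is the predicted probability of survival
print(sigmoid(w * age + b))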

Let’s do some predictions!
Before we can train a model, we need to do some data processing first.
- Split data into a training set and a test set. The training set is used to train the model, the test set is used to test the accuracy of the model.
- Transform categorical variables into a binary format via one-hot encoding.
A categorical variable is a variable that has two or more categories with no intrinsic ordering to the categories. Sex in this example is a categorical variable with two categories (male and female).
We have 2 categorical variables (sex and class). We can’t use the values of these variables in their original forms for training. In other words, we can’t pass the format [male, 3rd class] into the training model. We have to convert them using one-hot encoding.
Through one-hot encoding, each category of a variable (e.g. male and female for the sex variable) becomes its own binary column in the input vector. The column has a value of 1 if the passenger belongs to that category and 0 if not. In total, we’ll end up with 6 columns: 1 for age, 2 for sex, and 3 for class.
The format of the input vector would be:
[age, female?, male?, 1st class?, 2nd class?, 3rd class?]
The age is just a number. The second column female? has 1 if female and 0 if male. The third column male? has 1 if male and 0 if female, etc.
Jack’s datapoint [20, male, 3rd class] becomes [20, 0, 1, 0, 0, 1].
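As a quick sanity check, here is a tiny sketch (my own illustration, not part of the original pipeline) that builds Jack’s input vector by hand in the assumed column order [age, female?, male?, 1st class?, 2nd class?, 3rd class?].
def encode(age, sex, p_class):
    # one binary column per category, 1 if the passenger belongs to it
    sex_columns = [1 if sex == 'female' else 0, 1 if sex == 'male' else 0]
    class_columns = [1 if p_class == c else 0 for c in (1, 2, 3)]
    return [age] + sex_columns + class_columns

print(encode(20, 'male', 3))  # [20, 0, 1, 0, 0, 1]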
To train my model, I used a library called scikit-learn. It’s a great machine learning tool and offers many learning algorithms.
import pandas as pd
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# read the data from the csv file
data = pd.read_csv('train.csv').T.to_dict()

X_categorical = []
X_age = []
y = []
for idx in data:
    info = data[idx]
    sex = info['Sex']
    p_class = info['Pclass']
    survived = info['Survived']
    age = info['Age']
    # don't use data if age is absent
    if not math.isnan(age):
        X_categorical.append([sex, p_class])
        X_age.append([age])
        y.append(survived)

# one-hot encoding to transform the categorical data
enc = OneHotEncoder()
enc.fit(X_categorical)
features = enc.transform(X_categorical).toarray()

# combine the age vector with the transformed matrix
X = np.hstack((X_age, features))

# split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)

# logistic regression to fit the model
clf = LogisticRegression().fit(X_train, y_train)

# print the predicted class (0 = did not survive, 1 = survived)
print(clf.predict([[20,0,1,0,0,1]]))
# print the probabilities, ordered as [P(did not survive), P(survived)]
print(clf.predict_proba([[20,0,1,0,0,1]]))
To see the code in its prettier original form, grab my Jupyter notebook on Github.
Would Jack survive? Most likely no. This model predicted the probability of surviving to be 0.1078528, and thus probability of dying to be 0.8921472.
Let’s now look at another learning algorithm, Naive Bayes.
Algorithm 2: Naive Bayes
Naive Bayes is a classification algorithm built on Bayes’ Theorem. Bayes’ Theorem finds the probability of A given B:
P(A | B) = P(B | A) · P(A) / P(B)
In our example, we want the probability of survival given male, 20, and 3rd class.
The mathematical representation is:
P(survive | 20, male, 3rd class) = P(20, male, 3rd class | survive) · P(survive) / P(20, male, 3rd class)
P(survive) is just the number of passengers that survived divided by the total number of passengers.
From the chain rule and the naive independence assumption, the numerator factorizes into per-feature terms:
P(20, male, 3rd class | survive) · P(survive) ≈ P(20 | survive) · P(male | survive) · P(3rd class | survive) · P(survive)
Check out this article where I explain how to compute the numerator and denominator in more detail.
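To get a feel for where these numbers come from, here is a rough counting sketch of the pieces of the formula. This is my own illustration, reusing the X_categorical and y lists built earlier; age is left out since it’s continuous.
# estimate P(survive) and the conditional probabilities by counting
survivors = [row for row, label in zip(X_categorical, y) if label == 1]
p_survive = len(survivors) / len(y)
p_male_given_survive = sum(1 for sex, _ in survivors if sex == 'male') / len(survivors)
p_3rd_given_survive = sum(1 for _, p_class in survivors if p_class == 3) / len(survivors)
# under the naive independence assumption, the numerator is approximately
# P(male | survive) * P(3rd class | survive) * P(survive)
print(p_survive, p_male_given_survive, p_3rd_given_survive)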
scikit-learn also has Naive Bayes. The training data X_train and labels y_train are the same as in the Logistic Regression example above. After fitting the model, we use it to predict Jack’s fate. Again, Jack’s datapoint after one-hot encoding is [20, 0, 1, 0, 0, 1].
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)
print(nb.predict([[20,0,1,0,0,1]]))
print(nb.predict_proba([[20,0,1,0,0,1]]))
Would Jack survive? Probably nein. This model predicted the probability of surviving to be 0.10072895, and thus probability of dying to be 0.89927105.
I’d conclude that realistically, Jack indeed would have died.
So how reliable are our models?
There are many metrics for evaluating a model’s performance. Let’s look at two of them: accuracy and F1 score.
Accuracy is simply the percentage of correct predictions.
accuracy = (number of correct predictions) / (total number of predictions)
F1 score is the "Harmonic Mean between precision and recall".
Say what?
Let’s break it down.
The harmonic mean of two numbers (precision and recall) is:
harmonic mean = 2 · precision · recall / (precision + recall)
Precision is the number of correctly predicted survivors over all survivors predicted by the model.
precision = (correctly predicted survived) / (all predicted survived)
Recall is the number of correctly predicted survivors over all actual survivors in the dataset.
recall = (correctly predicted survived) / (all actual survived)
The higher the F1 score, the more robust the model.
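Here’s a tiny worked example showing how the three quantities fit together. The counts below are hypothetical, not from the Titanic models.
# hypothetical confusion-matrix counts, for illustration only
true_positives, false_positives, false_negatives = 60, 20, 30
precision = true_positives / (true_positives + false_positives)   # 0.75
recall = true_positives / (true_positives + false_negatives)      # about 0.667
f1 = 2 * precision * recall / (precision + recall)                # about 0.706
print(precision, recall, f1)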
Note: these metrics do have caveats; they work well when the data is balanced (i.e. an approximately equal number of survived and died in the training samples). In the Kaggle data, about 59% of passengers died, so the classes are relatively evenly spread.
We can use scikit-learn and the test dataset to get the accuracy and F1 score.
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
y_pred_logistic_reg = clf.predict(X_test)
y_pred_naive_bayes = nb.predict(X_test)
print(f'logistic regression accuracy: {accuracy_score(y_test, y_pred_logistic_reg)}')
print(f'logistic regression f1 score: {f1_score(y_test, y_pred_logistic_reg)}')
print(f'naive bayes accuracy: {accuracy_score(y_test, y_pred_naive_bayes)}')
print(f'naive bayes f1 score: {f1_score(y_test, y_pred_naive_bayes)}')
The metrics are:

Logistic regression scored higher than Naive Bayes.
Can we get the model to be more accurate with different features? What if we included fare or number of siblings on board? Are there better algorithms for this data than Logistic Regression? The answer is most likely yes. I will follow up on these open questions in later articles.
Now, would YOU have survived aboard the Titanic? Stay tuned for my web app where you can input your information and get your probability of survival!
You can grab the code for these examples from my Github repo.