
A Guide for Your Very First Machine Learning Project Part 2


A comprehensive tutorial on the Titanic regression dataset for beginners in Machine Learning

Image by Michael Dziedzic

This is a follow-up to the Iris dataset article, which you can find here; that article gives an introductory guide to a classification project, where the provided data is used to determine whether new data belongs to class 1, 2, or 3. In this article, we will go through the other type of Machine Learning project: regression. A regression, or predictive, Machine Learning project predicts a future outcome by learning from the historical dataset given to the machine. Here, we will go through in depth how to build a regression model for first-timers.

The most suitable dataset for starters in regression is the "Titanic – Machine Learning from Disaster" dataset, which can be downloaded from Kaggle. This makes a good second project because, compared to the Iris dataset from the previous article, this dataset has more features and a bigger size, which makes it more challenging but gives us more chances to polish our data wrangling, data cleaning, and data visualisation skills. This project will be done through Google Colab, which can be accessed [here](https://colab.research.google.com/notebooks/intro.ipynb#recent=true).

To create a new notebook, simply open the link above; a pop-up window will appear, and you only need to click the New Notebook button in the bottom right corner, as shown in the picture on the left. Alternatively, you can click File and then choose New Notebook.

Similar to Jupyter Notebook, we still need to import the libraries that we will need in this project. Although we are on a different platform, the libraries behave the same in Jupyter Notebook and Google Colab. Let’s import the following libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Import Data Files

Next up, we need to import the dataset files. In Jupyter, we can simply use Pandas to locate the files locally and read them; however, because Colab runs in the cloud, we need to upload our files by clicking the file icon at the bottom left and then choosing the "Upload to session storage" button on the far left of the Files menu. Alternatively, the code below can be used to upload the files:

from google.colab import files
uploaded = files.upload()

This will show the Choose Files and Cancel upload options, as shown in the picture below:

And then we can let Colab read the files using code similar to Jupyter’s:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Data Analysis

Now let’s analyse the data by using the same methods as in the Iris dataset project discussed in the previous article. Firstly, we need to analyse the columns of the training dataset:

print(train.columns.values)

Will give an output of:

Try printing out the first and last five rows of the training data.

train.head()
train.tail()

Which will give an output of:

Types of Data

What does this set of columns tell us? Unlike the Iris dataset, which only contains sepal and petal length and width (all continuous data), these columns contain four types of data: nominal, ordinal, continuous, and discrete. Nominal and ordinal data fall under categorical data, while continuous and discrete data are branches of numerical data. The difference between categorical and numerical data is that the former describes characteristics without inherent numeric meaning, for instance black or white, or primary school and secondary school, while the latter consists of numeric values, for instance test scores of 70, 80, and 90.

To dig even deeper, we can define nominal data as data that has no ordering at all, for instance languages such as English, German, Spanish, and Italian, or round versus square tables. Ordinal data, although it has no proper ratio between values, still has levels between the data, for instance primary school, secondary school, high school, and college, or spiciness level 1, 2, or 3.

Meanwhile, discrete data is data that can be counted but not measured, for instance the number of students in a class or the number of glasses in a cabinet. Conversely, continuous data is data that can be measured but not counted, for example length, weight, and storage size.

From the output given by the head and tail functions from pandas, we can conclude that the train data can be categorised as follows (a quick way to verify this is sketched after the list):

Nominal data: Survived, Sex, and Embarked

Ordinal data: Pclass

Continuous: Age, Fare

Discrete: SibSp, Parch
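If you want to double-check this categorisation yourself, a quick sketch using standard pandas calls lists each column's data type and number of unique values, which helps separate categorical from numerical features:

# object columns are usually categorical; int/float columns are numerical
print(train.dtypes)
# a small number of unique values hints at nominal or ordinal data,
# a large number at continuous or discrete data
print(train.nunique())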

Data Analysis – Continued

Further things that need to be analysed in a dataset are the number of empty cells in each column and the data types (float, integer, or string).

train.info()
print(' ')
test.info()

Which will give an output of:

From this information, we can see that in the training dataset the Age, Embarked, and Cabin features contain empty cells, while in the test dataset the Age, Fare, and Cabin features contain empty cells that need to be corrected. Furthermore, there are 2 floats, 5 integers, and 5 strings in the training dataset, and 2 floats, 4 integers, and 5 strings in the test dataset.
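If you prefer a more compact view than .info() provides, a small sketch like the one below counts the empty cells in every column directly:

# number of missing (NaN) cells per column in each dataset
print(train.isnull().sum())
print(test.isnull().sum())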

We can analyse further the data from statistics description by using the following code:

train.describe()

Which will give an output of:

From these descriptions, we can understand that there are 891 passengers in the training data. Since Survived is nominal data (0 for did not survive, 1 for survived), a mean of 0.38 indicates that only 38% of the people in this training dataset survived.
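As a quick sanity check of that 38% figure, we can also count the survival labels directly:

# share of passengers per Survived label (0 = did not survive, 1 = survived)
print(train['Survived'].value_counts(normalize=True))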

We will now describe the statistics for categorical features by using the following code:

train.describe(include=['O'])

Which will give an output of:

From this output, we can understand that everyone has a unique name because the count and the number of unique values are the same. In terms of gender, almost 65% of the passengers are male according to the training dataset (a frequency of 577 out of 891 records).

Analyse Further Through Pivoting Features

What does pivoting features mean? Let’s assume this data has too many rows and too many features (columns). After the first analysis above we have a rough idea about the data, but we still want to analyse a more specific group of it. For example, we might assume that children, women, and first-class passengers are more likely to survive; however, this is just an assumption, and we need to analyse these groups specifically. This is where pivoting has its advantage: it groups similar values together and drops the rest of the features that we do not want to analyse at this point.

In this case, we want to analyse the survival rate of first-, second-, and third-class passengers by using the following code:

train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Pclass', ascending=True)

In this code, we show only the Pclass and Survived features, group the rows that share the same Pclass value, take the mean of the remaining feature (in this case Survived), and sort the output by Pclass in ascending order, as shown in the table below:

As we can see from the output, our assumption is right: first-class passengers have a much higher survival rate, 62.96%, than second and third class.

Aside from getting the average, we can also find out how many passengers there were in total and how many survived in each class by using the sum or count function, as sketched below.
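A minimal sketch of that idea, counting passengers and summing survivors per class in a single call:

# count = passengers per class, sum = survivors per class (since Survived is 0/1)
train[['Pclass', 'Survived']].groupby(['Pclass']).agg(['count', 'sum'])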

Next, we can then pivot the gender feature as shown below:

train[["Sex", "Survived"]].groupby(['Sex']).mean()

Now we can see the data split by gender only, where females have a much higher survival rate, 74.2%, compared to 18.89% for males.

Now, let’s pivot the features for the survival rate of people who travelled with parents or children:

train[["Parch", "Survived"]].groupby(['Parch']).mean().sort_values(by='Survived', ascending=False)

This data is rather confusing because, regardless of the number of parents or children involved, the survival rate seems unaffected. However, let’s also try the data for people who brought a spouse or siblings along:

train[["SibSp", "Survived"]].groupby(['SibSp']).mean().sort_values(by='Survived', ascending=False)

Similar to the parents/children data, the spouse/siblings data does not seem to affect the survival rate. Therefore, we can conclude that neither the number of siblings/spouses nor the number of parents/children a passenger brought along correlates strongly with the survival rate.
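If you want a numeric confirmation of this, a quick sketch computing the linear correlation of these features with Survived can help; values close to zero indicate little linear relationship:

# correlation of SibSp and Parch with Survived (values near 0 = weak linear relationship)
print(train[['SibSp', 'Parch', 'Survived']].corr()['Survived'])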

Data Visualisation

Now we would like to analyse a feature with a larger number of groups, for example the Age feature. Unlike the previous features, which have 2–9 unique values, the Age feature has far more than twenty, so it is more efficient to analyse it by visualising the data as shown below:

ages = sns.FacetGrid(train, col='Survived')
ages.map(plt.hist, 'Age', bins=20)

We are using the seaborn library because FacetGrid is well suited to splitting a plot by a categorical feature (here, Survived); the histogram itself is drawn with matplotlib’s hist function, where bins defines how many intervals the age data is divided into. For example, if the age range is 0–80 and bins is set to 40, each bar covers a 2-year range. In this case we use 20 bins, so each bar counts the number of people in a 4-year age range, as shown below:

As we can see from the histogram above, many people aged 15–25 did not survive, but babies aged 0–4 mostly survived. Interestingly, the only 80-year-old passenger survived. From this data visualisation, we should keep the Age feature in our training dataset.

Combined Data Visualisation

Through FacetGrid, we are not only able to compare 2 features, but 3, by adding the row argument. This is useful if we want to compare the age groups with another feature, say the Pclass feature, which can be done with the following code:

grid = sns.FacetGrid(train, col='Survived', row='Pclass')
grid.map(plt.hist, 'Age', color='green', bins=20)
grid.add_legend();

Which will give a histogram output of:

As we can see, most people are in third class, but most of them did not survive. All infants in second class survived, and only a small percentage of first-class infants failed to survive. Once again, there seems to be no clear relationship between age and the number of passengers in each class. Even so, we still need to consider Pclass when training the model.

Now we can compare gender, embarkation port, and class by using seaborn’s pointplot below:

grid = sns.FacetGrid(train, row='Embarked', height=2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex')
grid.add_legend()

Which will give us an insight of:

We can see a pattern: except for passengers who embarked at port C, males have a lower survival rate than females regardless of class. In this case, the Embarked feature needs to be included in the training set.

Next, how about the correlation between the fare price and other features, such as the embarkation port and gender?

grid = sns.FacetGrid(train, row='Embarked', col='Survived')
grid.map(sns.barplot, 'Sex', 'Fare', color='orange')
grid.add_legend()

As we can see from the plot above, passengers who paid a higher fare have a higher chance of survival. Another interesting insight is that most people who embarked from port Q paid the lowest fares and have an average survival rate. We should consider banding the Fare feature.

Data Wrangling

The first step in data wrangling is to drop the features with many missing values and no clear correlation to survival, such as the Ticket and Cabin features, which can be done similarly to the Iris dataset project’s method:

train = train.drop(['Ticket', 'Cabin'], axis=1)
test = test.drop(['Ticket', 'Cabin'], axis=1)

And then combine both DataFrames into a single list so that the cleaning steps can be looped over both of them:

merge = [train, test]

Creating New Feature

In this part, instead of dropping a feature, we would like to create a new feature called Title. As we can see from the Name feature, the title is embedded in the name, so we can use a regular expression to extract it by matching the word that ends with a dot (.), hence a regex pattern such as ' ([A-Za-z]+)\.'. We can then see the frequency of each title by using Pandas’ crosstab function, as shown below:

for dataset in merge:
    # extract the word that ends with a dot (the title) from the Name column
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train['Title'], train['Sex'])

Which will give an output of:

Data Cleaning

Some titles are associated with mostly surviving and some are not, and several appear only a handful of times; therefore, we need to modify and combine the titles with low frequency as follows:

for dataset in merge:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Distinct')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

For each dataset in merge, we focus on the newly created Title feature and replace rare variants with more common titles, grouping several unique titles under a single "Distinct" title.

Now, we can analyse the data better by pivoting the feature as earlier:

train[['Title', 'Survived']].groupby(['Title']).mean().sort_values(by='Survived', ascending=False)

And we can see the result as:

With only a small number of categories left in Title, we can map the titles to ordinal data, for example Mr as 1, Miss as 2, and so on, as follows:

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Distinct": 5}
for dataset in merge:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

The fillna function replaces empty cells with a given value, which in this case is 0. Let’s print out the result:

train.head()

We can see that the title has been converted into ordinal data in the far-right column. This means we no longer need the Name feature, and it is wiser to drop any feature that is not correlated with survival to reduce processing time during training.

train = train.drop(['Name', 'PassengerId'], axis=1)
test = test.drop(['Name'], axis=1)
merge = [train, test]

Convert Categorical Data into Ordinal Data

Since Machine Learning models can only process numerical data, it is also important to convert the other categorical data into numbers. In this case that is the Sex feature, mapping female to 1 and male to 0:

for dataset in merge:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

Now the data is cleaner with numerical values:

Another categorical feature that needs to be converted into ordinal data is Embarked, whose values are S, C, and Q. Based on the .info() analysis, there are two missing values in the training set, which can be filled with the most common value found using the mode() function:

freq_port = train.Embarked.dropna().mode()[0]

Which will give port S as the output. Therefore, we will replace the missing values with port S.

for dataset in merge:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train[['Embarked', 'Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False)

And then convert the categorical data into ordinal data.

for dataset in merge:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

Which will transform the embarked feature data into 0, 1, and 2:

Estimating Empty Values

As we can see from the .info() outputs, several features have missing values in their cells. We can fill these missing values by generating random values based on the feature’s mean and standard deviation.

Fill Missing Values of Age Feature and Modify the Data into Ordinal Data

Firstly, we need to fill the missing values in the Age feature using the method above, as shown in the code below:

for dataset in merge:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # draw random integer ages within one standard deviation of the mean
    age_null_random_values = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    # place the random values into the empty Age cells
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_values
    dataset['Age'] = dataset['Age'].astype(int)

After the missing values have been filled and the float data type converted into integer using .astype, we can create a new feature that splits the ages into 5 bands:

train['AgeBand'] = pd.cut(train['Age'], 5)
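Before replacing the ages, it is worth inspecting the bands pd.cut produced along with their survival rates; the thresholds used in the next code block come from these band edges. A minimal check looks like this:

# survival rate per age band; the band edges define the thresholds used below
train[['AgeBand', 'Survived']].groupby(['AgeBand']).mean()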

From these age bands, we can then replace each age with its band index, as shown below:

for dataset in merge:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

Which will change the age’s continuous data into ordinal data:

Since we no longer need the AgeBand feature, we can drop it.

train = train.drop(['AgeBand'], axis=1)
merge = [train, test]

Create New Feature by Combining the Existing Features

We can also create a new feature by combining the values of existing features. For instance, we can create a FamilySize feature by adding the Parch and SibSp values as follows:

for dataset in merge:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train[['FamilySize', 'Survived']].groupby(['FamilySize']).mean().sort_values(by='Survived', ascending=False)

Which will give us an output of:

From this new feature, we can create another ordinal feature indicating whether the passenger is alone or with family, as follows:

for dataset in merge:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train[['IsAlone', 'Survived']].groupby(['IsAlone']).mean()

With the more compact ordinal IsAlone feature representing FamilySize, SibSp, and Parch, we no longer need the rest.

train = train.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test = test.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
merge = [train, test]

Create New Feature by Multiplying the Existing Features Data

We can also create an artificial feature from the multiplication of Age and Pclass to obtain a new feature that captures the interaction between the two:

for dataset in merge:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

Continuous Data Conversion Into Ordinal Data for Fare Feature

As we learned from the info() results, the Fare feature in the test dataset has a missing value. Therefore, we can replace it with the median as follows:

test['Fare'].fillna(test['Fare'].dropna().median(), inplace=True)

Then, we can convert the continuous data into ordinal data with the same procedure as for the Age feature, which is by creating bands and replacing the data based on those bands.

train['FareBand'] = pd.qcut(train['Fare'], 4)
train[['FareBand', 'Survived']].groupby(['FareBand']).mean().sort_values(by='FareBand', ascending=True)

These bands can then be converted into ordinal data as follows:

for dataset in merge:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train = train.drop(['FareBand'], axis=1)
merge = [train, test]

Final Output of Data Cleaning and Data Wrangling

After modifying the data feature by feature and handling the missing values, the final output of the training data is as follows:

On the other hand, the final result of the test data is as follows:
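If you are following along, you can reproduce these tables by printing the first rows of each DataFrame (run the two lines in separate cells, or wrap them in print(), so both outputs are shown):

# inspect the final cleaned training and test data
train.head(10)
test.head(10)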

Data and Label

Similar to the classification type of Machine Learning, a regression model is also supervised learning; therefore, it requires training data, training labels, test data, and test labels. However, in this project we will not be using test labels, so we can split the data into training data, training labels, and test data as follows:

X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test  = test.drop("PassengerId", axis=1).copy()

Prediction Using Regression Models

Many algorithms can be used for both regression/prediction projects and classification projects. In this case, we are going to use algorithms that are suitable for both, which are:

  • Logistic Regression
  • Gaussian Naive Bayes
  • Support Vector Machines
  • k-Nearest Neighbors
  • Random Forest
  • Perceptron (a simple artificial neural network)

We will use the sklearn library, as in the Iris dataset project, to predict the people who will survive the Titanic disaster based on the data we have prepared so far. But first, we need to import the algorithms from sklearn as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron

Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

Which gives us a training accuracy of 80.7%.

Naive Bayes Classifier

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

Which gives us a training accuracy of 76.09%.

Support Vector Machine (SVM)

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

Which gives us a training accuracy of 82.38%.

K-Nearest Neighbour (KNN)

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

Which gives us a training accuracy of 84.85%.

Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron

Which gives us a training accuracy of 77.89%.

Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

Which gives us a training accuracy of 84.85%.

From these results, we can rank the performance of each algorithm based on their accuracy:

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              ]})
models.sort_values(by='Score', ascending=False)

Which will give an output of:

As we can see, random forest has the highest score; therefore, we will use the random forest predictions, which are already stored in Y_pred since it was the last model we fitted, to print out the result:

pd.set_option('display.max_rows', 500)
output = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": Y_pred
    })
print(output)

Which will give an output of:

Due to the long record, only part of the output can be displayed here.

This output can be compared to the gender_submission.csv file to get a rough sense of the predictions; note that gender_submission.csv is Kaggle's example baseline rather than the true labels, so the real evaluation happens when you submit the predictions to Kaggle. Even so, random forest is a reliable algorithm with a good reputation for producing high-accuracy predictions.
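If you would like to keep these predictions or submit them to Kaggle, a minimal sketch along these lines works, assuming gender_submission.csv has also been uploaded to the Colab session like the other files (since it is only the example baseline, the comparison is just a rough reference):

# save the predictions in the format Kaggle expects
output.to_csv('submission.csv', index=False)

# rough comparison against Kaggle's example baseline file
baseline = pd.read_csv('gender_submission.csv')
agreement = (output['Survived'].values == baseline['Survived'].values).mean()
print('Agreement with the gender_submission baseline: {:.2%}'.format(agreement))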

Wrap It Up

Through this project we have learned many techniques, not only in data analysis but also in data cleaning and data wrangling, to make sure we can provide high-quality data before it is processed by the Machine Learning algorithms.

Congratulations on making it this far! If you have followed this project to the very end, I believe you have the capacity to take on regression projects. I wish you all the best for your next project.

