Kaggle Titanic Competition in SQL

Exploratory Data Analysis & Feature Engineering

Do Lee
Towards Data Science

--

Introduction

There is nothing more powerful than learning something new or taking a skill to the next level by simply doing. In this article, I’ll use SQL (Postgres) to conduct my exploratory analysis and create a transformed feature training set for my machine learning model.

Although Python or R is the current de facto language for data science and machine learning, having a solid foundation in SQL, at times, will help you iterate faster and allow you to explore your data quickly. At the end of the day, most of the data we work with lives in a relational database. Having the ability to write SQL queries fluidly will allow you to tackle data with ease and quickly get you started on your analysis.

Goal

Although Python would be my preferred approach, I wanted to see if I could do all my exploratory analysis and feature engineering in SQL. My goal is to show how powerful and complementary SQL can be to one’s workflow. Although one obvious limitation of SQL is that it can’t visualize data natively, you’re still able to get big gains by quickly writing queries and outputting results. Let’s get started!

My Setup

  • macOS Mojave Version 10.14.6
  • Postgres (PostgreSQL) 12.2
  • pgAdmin 4.21 (SQL web querying tool)
  • psql (Terminal access to databases and tables)

In my follow-up article, I complete my supervised classification model with Python and share my highest score achieved on Kaggle’s Public Leaderboard. Check it out here!

Exploration

My research into the sinking of the Titanic revealed that women and children were prioritized over others in transporting passengers into lifeboats. This notion will play a big role in how I group and analyze the Kaggle dataset.

Additionally, there are discussions of whether to keep training and test datasets separate or combine them for feature engineering and selection. In most cases, the former will make sense to avoid data leakage.

The common practice is putting aside the test dataset, forgetting about it, and only using the training dataset to impute and engineer features. For this exercise, I have decided to combine the training and test sets so that I have more data to work with when predicting survival. This is a fixed universe, and we are not expecting new data to flow into the prediction pipeline.

To get started, I downloaded the train.csv and test.csv files from Kaggle and imported the files to two tables I created in the Postgres database. Next, I combined the two tables to create my first working table (titanic_train_test_raw).

To insert records into my tables, I opened psql on my terminal by typing “psql postgres.” psql is PostgreSQL’s command-line tool. I ran the following two psql copy commands to load the CSV files into my tables.

\copy titanic_train_raw from 'file/path/train.csv' delimiter ',' csv HEADER;
\copy titanic_test_raw from 'file/path/test.csv' delimiter ',' csv HEADER;
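For anyone following along, a minimal sketch of the table definitions and the combine step might look like the following. The column names and types are my assumptions based on the Kaggle CSV schema; the exact DDL isn’t shown in this article.

-- Raw training table; column names and types assumed from the Kaggle CSV schema.
CREATE TABLE titanic_train_raw (
    passengerid INT,
    survived    INT,
    pclass      INT,
    name        TEXT,
    sex         TEXT,
    age         NUMERIC,
    sibsp       INT,
    parch       INT,
    ticket      TEXT,
    fare        NUMERIC,
    cabin       TEXT,
    embarked    TEXT
);

-- The test table has the same columns minus Survived.
CREATE TABLE titanic_test_raw (LIKE titanic_train_raw);
ALTER TABLE titanic_test_raw DROP COLUMN survived;

-- After loading both tables with the \copy commands above, stack them into one
-- working table; test rows get a NULL Survived.
CREATE TABLE titanic_train_test_raw AS
SELECT passengerid, survived, pclass, name, sex, age,
       sibsp, parch, ticket, fare, cabin, embarked
FROM titanic_train_raw
UNION ALL
SELECT passengerid, NULL::int AS survived, pclass, name, sex, age,
       sibsp, parch, ticket, fare, cabin, embarked
FROM titanic_test_raw;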

To better understand the data, I calculated the survival_rate for every feature available in the training set. I’m not sharing the entire output here because I’ll share it piece by piece throughout the article. However, this exercise allowed me to develop a mental image of which features might be worth digging into and to get a feel for each feature’s relative importance.

1. Missing values

Let’s take a closer look at missing values and see which features need estimating or even dropping due to sparsity. The missing value calculations are done on the combined train and test data. Because the test data does not have the Survived target feature, the combined Survived column has 31.93% of its values missing. Also, Age, Cabin, Fare, and Embarked are missing 20.09%, 77.46%, 0.08%, and 0.15% of their values, respectively.
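A query along these lines produces those percentages; this is a sketch that assumes the combined table and the lowercase column names from the DDL sketch above.

-- Percentage of missing values per column in the combined train + test table.
SELECT
    round(100.0 * count(*) FILTER (WHERE survived IS NULL) / count(*), 2) AS survived_missing_pct,
    round(100.0 * count(*) FILTER (WHERE age      IS NULL) / count(*), 2) AS age_missing_pct,
    round(100.0 * count(*) FILTER (WHERE cabin    IS NULL) / count(*), 2) AS cabin_missing_pct,
    round(100.0 * count(*) FILTER (WHERE fare     IS NULL) / count(*), 2) AS fare_missing_pct,
    round(100.0 * count(*) FILTER (WHERE embarked IS NULL) / count(*), 2) AS embarked_missing_pct
FROM titanic_train_test_raw;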

2. Overall survival

From the training dataset, I calculated the survived % versus not survived %. Out of the 891 passengers in the training data, only 38.38% of them survived. Also, this analysis allowed me to see that the target class is not too imbalanced — it’s approximately a 60%/40% split.
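For reference, a sketch of the class-balance query against the training table only:

-- Survived vs. not survived split in the training data.
SELECT
    survived,
    count(*)                                            AS passengers,
    round(100.0 * count(*) / sum(count(*)) OVER (), 2)  AS pct
FROM titanic_train_raw
GROUP BY survived
ORDER BY survived;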

3. Survival by sex and age

Female passengers had the highest probability of survival versus male passengers. By sex alone, the training data tells us that the survival rate is approximately 74.2% for women versus 18.89% for men.

Using pure SQL, I created 10 bins (age_cohort) to group different ages. I’m not going for any precise division of age groups, so the number of bins is arbitrary, but I created enough to reveal relevant patterns. One of the most critical pieces of information emerges from this view: while the female survival rate remained high across all age cohorts, males nine years of age or younger had a survival rate of almost 60% (line 9 in Data Output). By sex alone, men’s survival was under 20%, but slicing the data by age cohort shows that a subgroup of males had a relatively higher survival rate: children.
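One way to build such a view is with width_bucket; this is a sketch, and the article’s exact bin edges may differ.

-- Ten age cohorts (ages 0-80) with survival rate by sex.
SELECT
    sex,
    width_bucket(age, 0, 80, 10)     AS age_cohort,
    min(age)                         AS age_min,
    max(age)                         AS age_max,
    count(*)                         AS passengers,
    round(avg(survived) * 100.0, 2)  AS survival_rate
FROM titanic_train_raw
WHERE age IS NOT NULL
GROUP BY sex, width_bucket(age, 0, 80, 10)
ORDER BY sex, age_cohort;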

4. Survival by fare and average fare per passenger

To quickly get a sense of fare correlating with survival, I used the ntile window function to evenly bucket passengers into 6 bins and created statistics for closer examination. This is a quick and dirty binning method because the same fare amounts can fall into different bins, as shown by fare_min and fare_max.
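A sketch of that binning, assuming the column names from earlier (the article’s version also breaks the bins out by sex to compute the survival_ratio mentioned below):

-- Quick-and-dirty fare bins with ntile(6), plus per-bin statistics.
WITH fare_bins AS (
    SELECT
        survived,
        fare,
        ntile(6) OVER (ORDER BY fare) AS fare_grouping
    FROM titanic_train_raw
    WHERE fare IS NOT NULL
)
SELECT
    fare_grouping,
    min(fare)                        AS fare_min,
    max(fare)                        AS fare_max,
    round(avg(fare), 2)              AS fare_mean,
    round(stddev(fare), 2)           AS fare_stddev,
    round(avg(survived) * 100.0, 2)  AS survival_rate
FROM fare_bins
GROUP BY fare_grouping
ORDER BY fare_grouping;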

At a glance, survival_rate looks to tick up as fares become more expensive. At the same time, fare_mean and fare_stddev jump pretty dramatically from bin 5 to 6. The highest fare_max price is $512.3292, while the lowest fare_min is $0, so there is some noise in the fare column. What continues to hold is that female survival is much higher than men’s. Based on the fare_grouping attribute, female survival is 2x to 9x higher than men’s (survival_ratio).

I soon discovered that the fare represents the ticket’s total cost and not the fare per passenger. For example, on ticket PC 17755, there are four passengers: the Cardezas, a wealthy family traveling in first class, along with their two employees, Miss Ward and Mr. Lesurer.

To get a sense of the fare cost per passenger, I divided the fare amount by the total number of passengers on each ticket and grouped the averages by Pclass. When examining the Ticket feature, I saw families, mixed groups, and individuals attached to individual tickets. As a result, rather than using [SibSp (siblings & spouses) + ParCh (parents & children) + 1 (PassengerId)], commonly used by other Kaggle folks to calculate the family_size feature, I used the number of passengers on each ticket. This generated a smaller standard deviation and gave me confidence that it would provide a relatively more accurate average fare per passenger.
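Here is a sketch of that calculation: count the passengers on each ticket, divide the fare by that count, and average by Pclass.

-- Average fare per passenger by class, using passengers-per-ticket as the divisor.
WITH ticket_size AS (
    SELECT ticket, count(*) AS passengers_on_ticket
    FROM titanic_train_test_raw
    GROUP BY ticket
)
SELECT
    t.pclass,
    round(avg(t.fare / ts.passengers_on_ticket), 2)     AS fare_per_passenger_mean,
    round(stddev(t.fare / ts.passengers_on_ticket), 2)  AS fare_per_passenger_stddev
FROM titanic_train_test_raw t
JOIN ticket_size ts USING (ticket)
WHERE t.fare IS NOT NULL
GROUP BY t.pclass
ORDER BY t.pclass;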

On average, the first-class fare per passenger came out to about $32, the second-class fare per passenger to about $12, and the third-class fare to about $8. This averaging removed quite a bit of the noise built into the original Fare feature. And fare_mean looked to be highly correlated with Pclass. Thus, having both features, Pclass and fare_per_passenger, might be redundant; later on, we might want to drop one of them during the model build.

5. Survival based on Title from original Name feature

There are 17 unique titles, and some have high survival rates; however, most suffer from infrequency. There are only two passengers in the training data with the title Major, and one passenger with Sir. To make each grouping relevant and impactful for modeling, I rolled up infrequent titles into four (Mr, Mrs, Miss, and Master), creating a title_grouping feature. For example, Mlle, which is an abbreviation for Mademoiselle, rolled up into Miss.
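A sketch of the extraction and rollup is below. The Name format is “Surname, Title. Given names”, so the title sits between the comma and the first period; the CASE mapping here is an approximation, and the article’s exact rollup may differ.

-- Extract the title from Name and roll up rare titles into four groups.
SELECT
    CASE
        WHEN raw_title IN ('Mlle', 'Ms')  THEN 'Miss'
        WHEN raw_title = 'Mme'            THEN 'Mrs'
        WHEN raw_title IN ('Mr', 'Mrs', 'Miss', 'Master') THEN raw_title
        -- remaining rare titles (Dr, Rev, Major, Sir, Lady, ...) fall back by sex
        WHEN sex = 'female'               THEN 'Mrs'
        ELSE 'Mr'
    END AS title_grouping,
    count(*)                         AS passengers,
    round(avg(survived) * 100.0, 2)  AS survival_rate
FROM (
    SELECT survived,
           sex,
           split_part(split_part(name, ', ', 2), '.', 1) AS raw_title
    FROM titanic_train_raw
) t
GROUP BY 1
ORDER BY survival_rate DESC;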

The title Master is interesting because it has a relatively high survival rate, and there are quite a few of them in the training data. During this period, boys were given this title until they reached adult age. Now we have a way to identify boys using the title_grouping while all girls and women are rolled up into Mrs. or Miss.

I calculated the survival rate for each title_grouping, and the numbers are very telling. By carving out the boys from the male group, men’s survival rate drops. In contrast, boys’ survival is nearly 60%. I’ll encode this feature into numeric values during the final training dataset prep work.

6. Survival based on being a woman or a child

I came across Chris Deotte’s article about a feature he created called “woman-child-groups” based on grouping passengers with family names (surnames) and titles. Essentially, it focuses on group survival. With that in mind, I created a binary feature flagging if each passenger is a woman or a boy. The is_woman_child = 1 captures all women and children while is_woman_child = 0 captures all adult men. This flag might be redundant with title_grouping and potentially contribute to overfitting the model, but let’s see how this works out.
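A sketch of the flag, assuming boys are identified by the Master title as above:

-- 1 for all women and boys (title Master), 0 for adult men.
SELECT
    passengerid,
    CASE
        WHEN sex = 'female' THEN 1
        WHEN split_part(split_part(name, ', ', 2), '.', 1) = 'Master' THEN 1
        ELSE 0
    END AS is_woman_child
FROM titanic_train_test_raw;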

7. Survival based on cabin level (a.k.a. deck)

Some of the reading I did for this showed that the majority of the third-class passengers were located on cabin levels F and G. With that, it made sense to look at cabin levels more closely, although 77% of the values were missing. I hypothesized that lower cabin levels decreased a passenger’s chance of survival.

Cutaway diagram of RMS Titanic, midships

After doing some analysis, I observed a decrease in survival rate when passengers were in a lower class and on lower cabin levels. I created logic to approximate each passenger’s cabin level using Pclass and Embarked, which I’ll dive deeper into in the feature engineering section.
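Where the Cabin value is present, the deck is simply its first letter; here is a sketch of the survival-by-deck view.

-- Survival rate by Pclass and cabin level (deck) for passengers with a known Cabin.
SELECT
    pclass,
    left(cabin, 1)                   AS cabin_level,
    count(*)                         AS passengers,
    round(avg(survived) * 100.0, 2)  AS survival_rate
FROM titanic_train_raw
WHERE cabin IS NOT NULL
GROUP BY pclass, left(cabin, 1)
ORDER BY pclass, cabin_level;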

Embarked alone doesn’t quite tell a story of survival, as shown below, but I believe it can be used to create new features with relatively higher predictive power. The Titanic departed from Southampton, England, then stopped at Cherbourg, France, and finally at Queenstown, Ireland, before sailing off to New York City.

On a quick side note, there are two passengers with missing Embarked values, but I looked up these passengers, and they both boarded the ship at Southampton (S), England.

I assumed a correlation between the embarked location and the deck (cabin_level) where the cabin was located. In other words, every boarding location was for a specific set of cabins, and there was some form of organization around which cabins would be filled first based on the boarding location. Thus, if you had a third-class ticket whose cabin started with F, which indicates the deck the cabin resided on, you most likely boarded the ship at Southampton, England, for example.

Feature Transformation and Engineering

Now that I understand the original training features and have better insight into passenger survival, it’s time to consolidate these findings and start putting together the optimized training dataset for model fitting.

1. is_one_family / is_mix_group / is_alone

Assumption: I am assuming that the same ticket number defines passengers traveling together. Furthermore, I’m assuming that if an individual is the only passenger attached to a ticket, this passenger is a solo traveler while potentially having other family members or relatives traveling on different tickets. Therefore, I grouped passengers based on ticket number and surname extracted from the original Name feature.

There is a trade-off with this assumption, but I’m assuming that this captures most of the group scenarios. And the main goal of partitioning groups in this manner is to generalize the ship’s population.

  • If a given ticket has a single surname with multiple passengers, that is considered one family, where 1 is True, and 0 is False (is_one_family). This is primarily to flag families traveling together. Thus, if a passenger was part of a single ticket family, this flag is marked as 1.
  • If a given ticket has multiple surnames and multiple passengers, this is flagged as a mixed group (is_mix_group). For example, ticket number 1601 has 7 different surnames. There are a total of 67 tickets with two or more surnames attached. I’m only showing the top 12 tickets here.
  • If a given ticket has a single surname and a single passenger, this is a passenger traveling alone (is_alone). Taking a closer look at the male population, adult male passengers traveling alone had the lowest survival rate (15.57%) compared to adult males with families (17.46%) and boys with families (57.5%).

At a glance, adult male passengers without families or part of a mixed group had the lowest chance of survival. We do see some interesting patterns in this output. Male passengers traveling with their families had the highest rate of survival. Most likely, these male passengers were boys. I’ll verify this below.

When I carved out the boys from the male population, the picture is much clearer. Boys survived at a higher rate than adult men.
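Here is a minimal sketch of how the three flags can be derived from passengers-per-ticket and surnames-per-ticket; it assumes the surname is everything before the comma in Name.

-- Group flags based on ticket number and surname.
WITH ticket_stats AS (
    SELECT
        ticket,
        count(*)                                  AS passengers_on_ticket,
        count(DISTINCT split_part(name, ',', 1))  AS surnames_on_ticket
    FROM titanic_train_test_raw
    GROUP BY ticket
)
SELECT
    t.passengerid,
    t.ticket,
    split_part(t.name, ',', 1) AS surname,
    CASE WHEN ts.surnames_on_ticket = 1 AND ts.passengers_on_ticket > 1 THEN 1 ELSE 0 END AS is_one_family,
    CASE WHEN ts.surnames_on_ticket > 1 AND ts.passengers_on_ticket > 1 THEN 1 ELSE 0 END AS is_mix_group,
    CASE WHEN ts.passengers_on_ticket = 1                               THEN 1 ELSE 0 END AS is_alone
FROM titanic_train_test_raw t
JOIN ticket_stats ts USING (ticket);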

2. family_size

To determine each passenger's family size, it’s logical to add SibSp, ParCh, and the current passenger (+1). However, rather than defining this at the passenger level, it made more sense to determine the maximum family size based on ticket number and surname.

I’m sticking with the assumption that families traveled together and, in this case, on the same ticket. As a result, the sum of SibSp, ParCh, and current passenger, which represents the family size, is matched with a ticket number and surname. I created a mapping SQL subquery where if you had the same ticket number and surname, then the family size (SibSp + ParCh + 1) would be mapped to the passenger.
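A sketch of that mapping subquery follows; family size is taken as the largest SibSp + ParCh + 1 observed for each ticket + surname combination.

-- Map a family_size to every passenger sharing the same ticket and surname.
WITH family_sizes AS (
    SELECT
        ticket,
        split_part(name, ',', 1)  AS surname,
        max(sibsp + parch + 1)    AS family_size
    FROM titanic_train_test_raw
    GROUP BY ticket, split_part(name, ',', 1)
)
SELECT
    t.passengerid,
    t.ticket,
    fs.surname,
    fs.family_size
FROM titanic_train_test_raw t
JOIN family_sizes fs
  ON fs.ticket = t.ticket
 AND fs.surname = split_part(t.name, ',', 1);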

Once the logic was put together, I looked at the average survival rate by family size and Pclass. Relatively large families had a lower rate of survival. It can be loosely inferred that if you were part of a big family, you died. Also, you died if you were alone. Since there are only a handful of tickets with four or more passengers, it’s hard to generalize large groups' survival. However, based on available data, it is worth seeing if the family size has some relative importance in predicting survival.

3. Build the first consolidated feature table: titanic_train_test_raw_v2

I am now putting together my first consolidated table based on all my findings above. In this step, I’m incorporating new features and transforming existing features for better model consumption. Here is a list of all the additions and enhancements, with a partial sketch after the list.

  • Encoding Sex feature as female = 1 and male = 0
  • Extract family_name (surname) from Name feature
  • Adding title_grouping feature that I shared earlier in the post
  • Adding is_woman_child feature
  • Extracting the cabin levels (deck) from the Cabin feature
  • Encoding Embarked feature as S = 0, C = 1, and Q = 2
  • Adding family_size feature based on ticket number and family_name (surname)
  • Adding is_one_family, is_mix_group, and is_alone features
  • Adding fare_per_passenger feature by dividing fare by count of PassengerId’s found on each ticket
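A partial sketch of how this table could be assembled is below. It covers only a subset of the listed features (the encodings, surname and deck extraction, fare_per_passenger, and the ticket-based flags), and the exact expressions in the article may differ.

-- Partial sketch of the first consolidated table.
CREATE TABLE titanic_train_test_raw_v2 AS
WITH ticket_stats AS (
    SELECT ticket,
           count(*)                                 AS passengers_on_ticket,
           count(DISTINCT split_part(name, ',', 1)) AS surnames_on_ticket
    FROM titanic_train_test_raw
    GROUP BY ticket
)
SELECT
    t.passengerid,
    t.survived,
    t.pclass,
    CASE WHEN t.sex = 'female' THEN 1 ELSE 0 END        AS sex,          -- female = 1, male = 0
    t.age,
    t.sibsp,
    t.parch,
    t.ticket,
    split_part(t.name, ',', 1)                          AS family_name,  -- surname
    left(t.cabin, 1)                                    AS cabin_level,  -- deck letter, NULL if Cabin missing
    CASE t.embarked WHEN 'S' THEN 0 WHEN 'C' THEN 1 WHEN 'Q' THEN 2 END AS embarked,
    t.fare / ts.passengers_on_ticket                    AS fare_per_passenger,
    CASE WHEN ts.surnames_on_ticket = 1 AND ts.passengers_on_ticket > 1 THEN 1 ELSE 0 END AS is_one_family,
    CASE WHEN ts.surnames_on_ticket > 1 AND ts.passengers_on_ticket > 1 THEN 1 ELSE 0 END AS is_mix_group,
    CASE WHEN ts.passengers_on_ticket = 1                               THEN 1 ELSE 0 END AS is_alone
FROM titanic_train_test_raw t
JOIN ticket_stats ts USING (ticket);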

4. Build the second consolidated feature table: titanic_train_test_raw_v3

Using some of the features created earlier, I created a second consolidated table (v3) to add more features and make additional changes. It’s better to build multiple tables in this fashion when using SQL because it helps to keep things organized and makes SQL more readable by having smaller chunks.

  • Age missing values: Fill in missing Age values (line 11 & lines 44–52). I grouped the data by Pclass, Sex, and title_grouping to calculate each grouping's average age. I went with the mean rather than the median because the two sets of numbers are similar. I also calculated the standard deviation of the mean to examine the variability.
  • Age_bucket: Additionally, I created a feature called age_bucket (line 12). I decided to take the continuous Age variable and put it into 7 bins. Because the Age feature contains estimated averages, it might be better to convert it into a categorical feature to minimize the noise in the continuous feature.
  • Fare_bucket & fare missing values: I calculated the average fare per passenger per Pclass during my exploratory analysis, and it was clear that prices were tied to class. I filled the missing fare values with the fare_per_passenger tied to each passenger’s Pclass. At the same time, I created 5 bins to create a categorical feature.
  • Title_grouping: The four text string labels in this feature were converted to numeric values. (Master = 0, Mrs = 1, Miss = 2, and Mr = 3)
  • Cabin_level & cabin level missing values: My approach to filling out the missing cabin_level values is to count all passengers by Pclass, Embarked, and cabin_level where Cabin is not null. Next, I ranked in descending order the number of passengers attached to Pclass, Embarked, and cabin_level, and for each combination of Pclass + Embarked, I grabbed the cabin_level with the highest number of passengers attached to it.
This logic is implemented as a subquery when building out the titanic_train_test_raw_v3 table; a sketch follows the mapping below.

If Pclass = 1 and Embarked = 0, then cabin_level = C.

If Pclass = 1 and Embarked = 1, then cabin_level = C.

If Pclass = 1 and Embarked = 2, then cabin_level = C.

If Pclass = 2 and Embarked = 0, then cabin_level = F.

If Pclass = 3 and Embarked = 0, then cabin_level = G.

And so on!
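Here is a sketch of that ranking subquery, run against the raw combined table for illustration (in the article, Embarked is already encoded as 0/1/2 at this point).

-- For each Pclass + Embarked combination, pick the deck with the most known cabins.
SELECT pclass, embarked, cabin_level
FROM (
    SELECT
        pclass,
        embarked,
        left(cabin, 1) AS cabin_level,
        count(*)       AS passengers,
        row_number() OVER (PARTITION BY pclass, embarked
                           ORDER BY count(*) DESC) AS cabin_rank
    FROM titanic_train_test_raw
    WHERE cabin IS NOT NULL
    GROUP BY pclass, embarked, left(cabin, 1)
) ranked
WHERE cabin_rank = 1
ORDER BY pclass, embarked;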

5. Building “women-child-group” modified logic

I went ahead and tried to re-create Chris Deotte’s feature with some modifications in SQL (a sketch follows the list below). I’ll go over what I did and, in my follow-up article, I’ll be testing the feature for impact and importance.

  • First, each passenger’s title is labeled as man, woman, or boy. Simultaneously, all men’s surnames are labeled as ‘noGroup’ because the priority for survival is given to women and children.
  • Next, I calculated the surname frequency for each passenger. This frequency looks at the whole dataset and not at the ticket level. Any passenger with a surname frequency of one or less is also labeled as ‘noGroup.’ Thus, women and children in families are prioritized over women who are traveling alone.
  • At this stage, the probability of survival is calculated using the training data’s Survived target feature based on the updated surname feature. At this point, most surnames have changed to ‘noGroup,’ and thus, all passengers’ survival will be generalized using this updated surname feature.
  • Finally, using the title and surname_survival, I created a new binary flag, title_surname_survival. This logic was incorporated when the final training and test datasets were created. I created two tables to capture each — titanic_train_ml_features_v0 and titanic_test_ml_features_v0.
  • ML Training Dataset: titanic_train_ml_features_v0
  • ML Test Dataset: titanic_test_ml_features_v0 (Identical to training dataset except Survived feature is missing.)
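Below is a sketch of this logic in SQL. The surname extraction and man/woman/boy labeling follow the steps above; the final >= 0.5 threshold on surname_survival is only illustrative, since the article doesn’t spell out the exact cutoff used for the title_surname_survival flag.

-- Sketch of the modified woman-child-group logic.
WITH labeled AS (
    SELECT
        passengerid,
        survived,
        CASE
            WHEN split_part(split_part(name, ', ', 2), '.', 1) = 'Master' THEN 'boy'
            WHEN sex = 'female'                                           THEN 'woman'
            ELSE 'man'
        END AS title,
        CASE
            WHEN sex = 'male'
                 AND split_part(split_part(name, ', ', 2), '.', 1) <> 'Master'
            THEN 'noGroup'                       -- adult men are excluded from groups
            ELSE split_part(name, ',', 1)        -- surname for women and boys
        END AS surname
    FROM titanic_train_test_raw
),
grouped AS (
    SELECT
        l.*,
        CASE
            WHEN count(*) OVER (PARTITION BY surname) <= 1 THEN 'noGroup'  -- singletons drop out
            ELSE surname
        END AS surname_group
    FROM labeled l
),
surname_survival AS (
    SELECT surname_group,
           avg(survived) AS surname_survival     -- training rows only (Survived not NULL)
    FROM grouped
    WHERE survived IS NOT NULL
    GROUP BY surname_group
)
SELECT
    g.passengerid,
    g.title,
    g.surname_group,
    s.surname_survival,
    CASE
        WHEN g.title IN ('woman', 'boy') AND coalesce(s.surname_survival, 0) >= 0.5 THEN 1
        ELSE 0
    END AS title_surname_survival
FROM grouped g
LEFT JOIN surname_survival s USING (surname_group);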

Summary

There is quite a lot of information presented in this article. I conducted my exploratory analysis and feature engineering purely in SQL. I leveraged GROUP BY, window functions, aggregate functions, subqueries, WITH clauses, HAVING clauses, and other SQL techniques to slice and dice the data.


The next step is to export the finalized training and test datasets for analysis using Python. In my follow-up article, I’ll use Python to test different ML models, understand feature importance, and tune the chosen model using RandomizedSearchCV and GridSearchCV. Finally, I’ll submit my predictions to Kaggle and see how I rank among other entries!

Here’s an overview of what I cover in the follow-up article:

Title: “Kaggle Titanic Competition: Model Building & Tuning in Python”

  • Import Libraries
  • Prepare Train and Test Data Frames
  • Correlation Coefficient Matrix
  • Create Helper Function: Output Model Stats
  • Multiple Fitted Models and Best Fit Model
  • Create Helper Function: Output RF Feature Importance Ranking
  • Feature Selection with Random Forest Feature Importance, Permutation Importance, and Hierarchical Clustering
  • RandomizedSearchCV: Random Forest Classifier
  • GridSearchCV: Random Forest Classifier
  • Conclusion: Latest Results & Final Thoughts

If you have any questions, comments, or feedback, please let me know. Thanks!
