Kaggle Titanic Competition: Model Building & Tuning in Python

Best Fitting Model, Feature & Permutation Importance, and Hyperparameter Tuning

Do Lee
Towards Data Science


Photo by Paul Biondi on Unsplash

Background

I conducted my initial exploratory analysis and feature engineering in SQL. In my previous article, I demonstrated how powerful SQL can be in exploring data in relational databases. For more context, it might be worthwhile checking it out before reading this article, although it’s not required. You can find the article here!

I'll be using the train/test datasets prepared earlier in the “Kaggle Titanic Competition in SQL” article to predict passenger survival.

Overview

  • Import Libraries
  • Prepare Train and Test Data Frames
  • Correlation Coefficient Matrix
  • Create Helper Function: Output Model Stats
  • Multiple Fitted Models and Best Fit Model
  • Create Helper Function: Output RF Feature Importance Ranking
  • Feature Selection with Random Forest Feature Importance, Permutation Importance, and Hierarchical Clustering
  • RandomizedSearchCV: Random Forest Classifier
  • GridSearchCV: Random Forest Classifier
  • Conclusion: Latest Results & Final Thoughts

Import Libraries

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, roc_auc_score

Prepare Train & Test Data Frames

Using Pandas, I imported the CSV files as data frames. The output of train_df.info() should look familiar if you read my “Kaggle Titanic Competition in SQL” article. For model training, I started with 15 features, as shown below, excluding Survived and PassengerId.

train_df = pd.read_csv('file/path/data-train.csv')
test_df = pd.read_csv('file/path/data-test.csv')

Correlation Coefficient Matrix

As a first step, I created a pairwise correlation matrix using the corr function built into Pandas and visualized it with Seaborn. By default, corr calculates Pearson correlation coefficients (linear relationships). I also used the Spearman and Kendall methods, which are both available in pandas.DataFrame.corr.

All the results looked similar across the board. Relatively speaking, Spearman’s rank-order correlation method, which measures monotonic relationships, might be the best here without diving deep into the concept of correlation and association for different types of features. One caveat is that Spearman will treat nominal features as ordinal features.

Just as a side note, at this point, all my features have been converted to numerical values composed of binary (dichotomous), ordinal (categorical and ordered), nominal (categorical and not ordered), and continuous features. I wanted to quickly see what features were correlated with each other and the magnitude beyond the obvious ones.

If I were to dive deeper into the specifics of this exercise, we would also need to talk about associations between categorical features versus correlations among binary and continuous features. To measure the association between two nominal features, we would have to dive into Cramer’s V or Pearson’s chi-square test. However, in my opinion, this approach should be good enough to get an initial baseline read. If I missed anything here, feel free to let me know.


I generated the correlation coefficient heatmap and paid attention to the absolute values in the 0.8 to 1.0 correlation range. These correlation thresholds are arbitrary, and I’ll be looking at various thresholds to determine what works best later on.

train_corr = train_df.drop(columns=["survived", "passengerid"]).corr(method='pearson')
plt.figure(figsize=(18, 12))
sns.set(font_scale=1.4)
sns.heatmap(train_corr,
            annot=True,
            linecolor='white',
            linewidths=0.5,
            cmap='magma');
Pearson

After diving a bit deeper into the numbers using 0.8 as my Pearson correlation threshold, I found these pairs (shown below; output from df_corr) to be highly correlated. Later on, I’ll leverage Spearman with other methods to select important features for my final model.

# Pearson's correlation analysis using an arbitrary correlation threshold (absolute values)
corr_threshold = 0.8
train_corr_abs = train_corr.abs()
feature_1 = []
feature_2 = []
corr_coeff = []
for col in train_corr_abs:
    for idx, val in enumerate(train_corr_abs[col]):
        if (val >= corr_threshold) and (val < 1):
            feature_1.append(col)
            feature_2.append(train_corr_abs[col].index[idx])
            corr_coeff.append(val)
df_corr = pd.DataFrame({'feature_1': feature_1,
                        'feature_2': feature_2,
                        'corr_coeff': corr_coeff})
df_corr
Results from df_corr; Output contains duplicates.
Duplicates are removed, and six highly correlated pairs remain.

Having identified the highly correlated pairs, I can use this analysis later when dealing with regression or other linear models. High multicollinearity makes coefficient estimates sensitive to small changes in the model, and it can also impact non-linear models.

The bottom line is that multicollinear features can make a model less effective and can skew our understanding of feature importance. I’m not going to focus my energy on pairs with mild multicollinearity. For now, I’ll keep this on the back burner and address the issue directly when the time comes.

train_corr = train_df.drop(columns=["survived", "passengerid"]).corr(method='spearman')
plt.figure(figsize=(18, 12))
sns.set(font_scale=1.4)
sns.heatmap(train_corr,
            annot=True,
            linecolor='white',
            linewidths=0.5,
            cmap='magma')
Spearman
train_corr = train_df.drop(columns=["survived", "passengerid"]).corr(method='kendall')
plt.figure(figsize=(18, 12))
sns.set(font_scale=1.4)
sns.heatmap(train_corr,
            annot=True,
            linecolor='white',
            linewidths=0.5,
            cmap='magma')
Kendall

Create Helper Function: Output Model Stats

To start, I trained nine different models by fitting X_train and y_train. To expedite my workflow, I created a function to output model performance and diagnostic metrics to quickly see the numbers and determine what model might work best. These metrics are listed in the function’s docstring.

Additionally, I defined a pipeline object containing a scaler and an instance of an estimator. I do not scale X_train and X_test in a few cases, such as random forest, because scaling is not necessary for tree-based models.
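
The original helper isn't reproduced here, but a minimal sketch of what output_model_stats might look like is below, relying on the imports above; the exact metric set and column names are assumptions based on the description in this section.

def output_model_stats(models, X_train, y_train):
    """Fit each model and return a data frame of performance/diagnostic metrics.

    A sketch: `models` maps a model name (str) to an estimator instance.
    """
    rows = []
    for name, model in models.items():
        # Random forest doesn't need feature scaling; everything else gets a scaler
        if isinstance(model, RandomForestClassifier):
            estimator = model
        else:
            estimator = make_pipeline(StandardScaler(), model)
        estimator.fit(X_train, y_train)
        cv_scores = cross_val_score(estimator, X_train, y_train, cv=10, scoring='accuracy')
        y_cv_pred = cross_val_predict(estimator, X_train, y_train, cv=10)
        precision = precision_score(y_train, y_cv_pred)
        recall = recall_score(y_train, y_cv_pred)
        rows.append({'model': name,
                     'accuracy': estimator.score(X_train, y_train),
                     'accuracy_cv_score': cv_scores.mean(),
                     'accuracy_cv_stddev': cv_scores.std(),
                     'precision': precision,
                     'recall': recall,
                     'f1_score': 2 * precision * recall / (precision + recall),
                     'roc_auc': roc_auc_score(y_train, y_cv_pred)})
    return pd.DataFrame(rows).set_index('model')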

Multiple Fitted Models & Best Fit Model

The helper function has three parameters. First, it needs a dictionary with model names (strings) as keys and model class instances as values. Second, it needs the feature training dataset (X_train) and, lastly, the target class data (y_train). An example call is shown below. Let’s examine the results!
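
As an illustration, the call might look like the following; the nine estimators match the ones imported above, but the instantiation arguments (and the assumption that X_train and y_train hold all 15 features at this point) are mine.

models = {'logistic_regression': LogisticRegression(max_iter=1000),
          'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
          'decision_tree': DecisionTreeClassifier(random_state=0),
          'knn': KNeighborsClassifier(),
          'svm': SVC(),
          'linear_svc': LinearSVC(),
          'perceptron': Perceptron(),
          'sgd': SGDClassifier(),
          'gaussian_nb': GaussianNB()}

df_model_stats = output_model_stats(models, X_train, y_train)
df_model_stats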

Immediately, random forest and decision tree stood out from the rest with an accuracy of 98.54%. I know decision trees tend to overfit, so I wasn’t too surprised. On the other hand, random forest is an ensemble of decision trees designed to minimize overfitting by taking random subsets of features and rows to build a forest of trees and voting on the prediction outcomes. This random process generates a better model with slightly higher bias and lower variance.

On closer look, the accuracy scores using 10-fold cross-validation were more realistic: 84.07% for random forest and 81.3% for decision tree. Other models that also stood out were KNN, SVM, logistic regression, and linear SVC, all with respectable scores. A high standard deviation is indicative of a model that might not generalize well to new data, so I paid attention to this as well.

Let’s take a closer look at precision and recall. In this case, it made sense to maximize both precision and recall, and a high F1 score would be indicative of that. Although there is a precision-recall tradeoff, relatively high precision [TP/(TP+FP)] gives me the accuracy of positive predictions, while relatively high recall [TP/(TP+FN)] gives me the percentage of actual positives correctly detected by the model. Recall is also known as the True Positive Rate (TPR) or sensitivity. I also wanted a high area under the ROC curve (AUC). As a result, I narrowed down my list to four models: random forest, KNN, logistic regression, and linear SVC.

In the end, I decided on random forest, although other models had slightly better scores. The slight differences seemed negligible in my eyes. The goal is to leverage random forest’s impurity-based feature importance and permutation importance for the feature selection process.

Create Helper Function: Output RF Feature Importance Ranking

To quickly output a feature importance ranking from random forest, I created a helper function. It returns a Pandas data frame with feature names and their corresponding feature importance scores in ranked order.
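
The helper itself isn't shown here, but given the call rf_feature_ranking(n, importance_scores) used below, a minimal sketch might look like this, assuming the feature names come from X_train in the surrounding scope.

def rf_feature_ranking(n, importance_scores):
    """Return the top-n features ranked by random forest importance score."""
    ranking = pd.DataFrame({'feature': X_train.columns,
                            'importance_score': importance_scores})
    ranking = ranking.sort_values('importance_score', ascending=False)
    return ranking.head(n).reset_index(drop=True)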

Feature Selection with RF Feature Importance, Permutation Importance, & Hierarchical Clustering

Iteration 1

Going back to the correlation coefficient matrix, there were five pairs flagged as highly correlated or associated with one another. With all the features as defined by X_train and X_test as shown below, I examined the results of RF’s feature and permutation importance. I also used hierarchical clustering and Spearman’s correlation matrix to assist in feature selection.

X_train = train_df.drop(["survived", "passengerid"], axis=1)
y_train = train_df["survived"]
X_test = test_df.drop(["passengerid"], axis=1)
rf_base = RandomForestClassifier(n_estimators=100, random_state=0)
rf_base.fit(X_train, y_train)
n = len(X_train.columns)
importance_scores = rf_base.feature_importances_
rf_feature_ranking(n, importance_scores)
Random Forest Feature Importance

RF’s impurity-based feature importance is a solid starting point for gauging which features matter, but it does not always give a definitive view and can be misleading. The underlying mechanism is biased: it tends to overestimate the importance of certain features, such as continuous or high-cardinality categorical features. In a nutshell, the decrease in impurity is averaged for each feature across the forest of decision trees, and the features are then ranked by this averaged value. There are several good reads on this topic.

Collinear features will be an issue as well. Therefore, relying only on RF’s mean-decrease-in-impurity feature importance won’t show you the whole picture; it should be just one of several tools for understanding feature importance.

My concern at this point is collinear features, so I’m leveraging permutation importance and hierarchical clustering to determine which features are relevant. Conceptually, permutation importance first computes a baseline accuracy for the model on a held-out (validation) set. It then shuffles all the values of a single column/feature and measures the change from the baseline accuracy. This is repeated for every feature.

In my research, I came across an informative article on this particular topic on Scikit-Learn’s website. I’m going to leverage the code found here and complete my feature selection process. Also, I’ll rely on domain knowledge and guided trial-and-error to see what combinations of features will have the best outcome.
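
Here is a condensed sketch of the permutation importance computation, roughly along the lines of the Scikit-Learn example; the validation split and n_repeats settings are my assumptions.

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hold out a validation set so permutation importance isn't measured on training data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25,
                                            random_state=0, stratify=y_train)

rf_perm = RandomForestClassifier(n_estimators=100, random_state=0)
rf_perm.fit(X_tr, y_tr)

# Shuffle each feature several times and record the mean drop in validation accuracy
perm_result = permutation_importance(rf_perm, X_val, y_val, n_repeats=10,
                                     random_state=0, n_jobs=-1)

perm_ranking = pd.DataFrame({'feature': X_val.columns,
                             'importance_mean': perm_result.importances_mean,
                             'importance_std': perm_result.importances_std})
perm_ranking.sort_values('importance_mean', ascending=False)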

Permutation importance is relatively more reliable than impurity-based feature importance, although it is also influenced by collinear features, which can understate the importance of the affected features. One point the two initial views agreed on was that is_alone, is_mix_group, and is_one_family do not add much value to the model.

For hierarchical clustering, the y-axis of the dendrogram represents dissimilarity: the closer a linkage is to 0, the smaller the distance between the clustered features, indicating stronger correlation/association. In this iteration, I examined all clusters falling under 1.5, an arbitrary threshold. The code chunk used here is also available on Scikit-Learn’s website.
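
Below is a sketch of that clustering step, adapted from the Scikit-Learn example referenced above; the plotting details are approximations of the figure described next.

from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Spearman correlation across all features, converted into a distance matrix
corr = spearmanr(X_train).correlation
corr = (corr + corr.T) / 2            # enforce symmetry
np.fill_diagonal(corr, 1)
distance_matrix = 1 - np.abs(corr)
dist_linkage = hierarchy.ward(squareform(distance_matrix))

# Dendrogram on the left, Spearman's correlation matrix on the right
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))
dendro = hierarchy.dendrogram(dist_linkage, labels=list(X_train.columns),
                              ax=ax1, leaf_rotation=90)
dendro_idx = np.arange(0, len(dendro['ivl']))
ax2.imshow(corr[dendro['leaves'], :][:, dendro['leaves']], cmap='magma')
ax2.set_xticks(dendro_idx)
ax2.set_yticks(dendro_idx)
ax2.set_xticklabels(dendro['ivl'], rotation='vertical')
ax2.set_yticklabels(dendro['ivl'])
fig.tight_layout()
plt.show()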

The dendrogram on the left and Spearman’s correlation matrix on the right

Using the dendrogram, I took a closer look at smaller clusters with two features (e.g., pclass and cabin_level) and heuristically determined which features might need to be dropped. I decided to drop age because it is highly collinear with age_bucket and is a continuous feature. In addition, I decided to drop sex and fare because their correlated counterparts would contribute equally to the model. As a result, a total of six features were dropped for the next iteration: age, sex, fare, is_alone, is_mix_group, and is_one_family.

Calculate feature and permutation importance
Generate hierarchical clusters and Spearman’s correlation matrix

Iteration 2

Let’s take a look at the updated results. In this iteration, I worked with 9 features, which represent the new X_train and X_test.
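
One possible way to build the updated data frames, assuming the dropped column names match those listed at the end of Iteration 1 (y_train is unchanged):

drop_cols = ['age', 'sex', 'fare', 'is_alone', 'is_mix_group', 'is_one_family']
X_train = train_df.drop(columns=['survived', 'passengerid'] + drop_cols)
X_test = test_df.drop(columns=['passengerid'] + drop_cols)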

X_train.info()

I used the output_model_stats function to compare the performance metrics against the original random forest metrics. The iteration 2 metrics are attached to the “rf_base — iteration 2” index. Overall, the model performed slightly worse than the model with all the features. Most likely, the first random forest model was overfitting (relatively low bias and high variance), so this drop should not be a source of concern at this stage.

Keep in mind that X_train only contains 9 features, as shown above.

Based on the updated feature and permutation importance ranking, embarked was very close to zero in both. I decided to drop this feature in the next iteration.

I also decided to drop fare_bucket and age_bucket. In my mind, the average fare_per_passenger calculation used to create the fare buckets most likely did not have enough data to generalize well, and fare outliers were still lingering in the training data.

Similar reasoning applied to age_bucket after observing how high it ranked relative to pclass and is_woman_child under permutation importance. At the same time, fare_bucket and age_bucket correlated with one another. My intuition told me that keeping these features would most likely decrease bias and increase variance, reducing the model’s ability to generalize.

As the dendrogram illustrates, the first cluster from the left under 0.5 is composed of pclass and cabin_level. The distance between pclass and cabin_level is very small, and Spearman’s correlation matrix shows that they are correlated. I was hesitant to drop either pclass or is_woman_child because both features showed a high correlation with passenger survival during my exploratory analysis.

Random forest’s randomness in selecting root nodes and split features to create internal and leaf nodes reduces the effects of collinear features, but never eliminates them completely. In my opinion, it does not hurt to keep both collinear features in this scenario with random forest.

Iteration 3

The updated X_train and X_test datasets contain 5 features at this stage, which I have deduced to be most relevant.

X_train.info()

Next, I outputted the performance metrics (“rf_base — iteration 3”) and compared them with earlier results. The accuracy_cv_score was higher than in the previous two iterations. Precision improved quite a bit, which meant more true positives (TP) were found within all positives (TP + FP) detected by the model. This was a good sign. Recall dropped slightly, as expected, which meant the model’s ability to detect true positives (TP) among actual positives (TP + FN) had fallen a bit. The F1 score improved from iteration 2, and the area under the ROC curve moved up to a new high.

After taking a closer look at feature importance, permutation importance, and dendrogram accompanied by Spearman’s correlation matrix, I decided these features would generate my first Kaggle submission file. At this point, I was not concerned about one pair of correlated features.

I outputted the predicted y_test (y_pred_base) using the trained model with X_test. I created my submission file and submitted it to Kaggle. I was able to score 80.382% accuracy with this submission. I think this is a pretty respectable score.

rf_base = RandomForestClassifier(n_estimators=100, random_state=0)
rf_base.fit(X_train, y_train)
y_pred_base = rf_base.predict(X_test)
y_pred_df = pd.DataFrame(y_pred_base, columns=['Survived'])  # Kaggle expects a Survived column
df_output = pd.concat([test_df['passengerid'], y_pred_df], axis=1, sort=False)
df_output = df_output.rename(columns={'passengerid': 'PassengerId'})
df_output.to_csv('gender_submission.csv', index=False)

RandomizedSearchCV: Random Forest Classifier

A good explanation of RandomizedSearchCV can be found on Scikit-Learn’s documentation page. It helps to know Python’s approach to OOP: the model class objects in Scikit-Learn contain parameters, attributes, and methods.

“The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. It is highly recommended to use continuous distributions for continuous parameters.”

I’m focusing on the 6 hyperparameters listed under the rs_grid variable; a sketch of the full setup follows the two parameter lists below. For a detailed overview of all parameters, the documentation page contains a plethora of information.

Random Forest Hyperparameters

  • n_estimators: The number of decision trees in the forest. The default value is n_estimators=100.
  • criterion: The function that measures the quality of a split. For example, age could be the feature chosen for the root node; based on the split criterion, the tree keeps splitting into internal and leaf nodes until a branch’s impurity can no longer be reduced or a stopping condition is reached (the default is gini, and the second option is entropy for information gain).
  • max_features: The number of the random subset of features to consider when looking for the best split.
  • max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split: The minimum number of samples required to split an internal node.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

RandomizedSearchCV Parameters

  • estimator: model class instantiation used to conduct a randomized search for the best parameters.
  • param_distributions: Dictionary with parameters names (str) as keys and distributions or lists of parameters to try.
  • n_iter: Number of parameter settings that are sampled.
  • cv: Determines the cross-validation splitting strategy.
  • verbose: Controls the verbosity; the higher the verbosity number, the longer the output/log messages.
  • random_state: Pseudo-random number generator state used for random uniform sampling from lists of possible values instead of scipy.stats distributions.
  • n_jobs: Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
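
The original rs_grid gist isn't reproduced here, but a sketch of the setup might look like the following; the specific value lists are assumptions sized to match the 8,232 combinations (7*2*2*7*7*6) discussed in the next paragraph.

rs_grid = {'n_estimators': [100, 200, 300, 400, 500, 700, 1000],   # 7 values (assumed)
           'criterion': ['gini', 'entropy'],                        # 2 values
           'max_features': ['sqrt', 'log2'],                        # 2 values (assumed)
           'max_depth': [None, 4, 6, 8, 10, 12, 14],                # 7 values (assumed)
           'min_samples_split': [2, 4, 6, 8, 10, 12, 14],           # 7 values (assumed)
           'min_samples_leaf': [1, 2, 4, 6, 8, 10]}                 # 6 values (assumed)

rf_rs = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                           param_distributions=rs_grid,
                           n_iter=200,      # sample 200 of the 8,232 settings
                           cv=5,            # 5 stratified folds -> 1,000 fits
                           verbose=3,
                           random_state=0,
                           n_jobs=-1)
rf_rs.fit(X_train, y_train)
rf_rs.best_params_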

The randomized search took about 5 minutes using all my processor power (n_jobs=-1). The param_distributions contain 8,232 combinations of settings (7*2*2*7*7*6). For my randomized search, I set cv to 5, which is the number of stratified K-folds. Therefore, if I had started with GridSearchCV, a total of 41,160 parameter settings or fits would have been tried, which would take a very long time to run. With RandomizedSearchCV, however, only n_iter=200 of the possible settings are sampled, lowering the number of fits to 1,000 in this case. Here are the best hyperparameter values from this randomized search.

best_params_ attribute of RandomizedSearchCV
By setting verbose=3, you get a nice output of what’s going on.

Let’s compare these results with my previous runs. The accuracy_cv_score increased by approximately 1.1%, and accuracy_cv_stddev went down to about 4%. The precision score improved as well (91.47%), while recall went down a bit. The overall F1 score improved, and the AUC score held steady at 89%. Using what I have learned thus far, the next goal is to run a grid search on a smaller set of settings and measure the incremental gain.

GridSearchCV: Random Forest Classifier

GridSearchCV is similar to RandomizedSearchCV, except it conducts an exhaustive search over the defined set of model hyperparameters (GridSearchCV’s param_grid). In other words, it would go through all 41,160 fits from above. However, I’m leveraging the learnings from earlier and reducing the list of values for each hyperparameter; a sketch of the narrowed setup follows the quote below. For a detailed overview of GridSearchCV’s parameters, take a look at the documentation page.

“The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.”
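
The exact grid isn't shown in this excerpt, so here is a hedged sketch of a narrowed grid informed by the randomized search; the specific values are illustrative assumptions, not the tuned settings.

gs_grid = {'n_estimators': [200, 300, 400],
           'criterion': ['gini'],
           'max_features': ['sqrt'],
           'max_depth': [6, 8, 10],
           'min_samples_split': [8, 10, 12],
           'min_samples_leaf': [2, 4, 6]}

rf_gs = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                     param_grid=gs_grid,
                     cv=5,
                     verbose=3,
                     n_jobs=-1)
rf_gs.fit(X_train, y_train)
rf_gs.best_params_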

best_params_ attribute from GridSearchCV
verbose=3

Overall, there wasn’t a whole lot of improvement, except AUC went up slightly. Either way, this demonstrated that tuning the right set of hyperparameters and covering a wide range of settings can improve model predictions. Therefore, learning to leverage RandomizedSearchCV and GridSearchCV becomes an important part of the machine learning workflow. However, you also need to know how much time and energy you want to put into the tuning process because the gain can be minimal.

As the last step, I generated the updated submission file and submitted it to Kaggle. Even with hyperparameter tuning, my score stayed at 80.382% for this set of features and this set of optimized hyperparameters.

Conclusion

Latest Results

Through trial-and-error and expanding the hyperparameter settings, I reached the current standing score of 80.861%, which, according to Kaggle, falls within the top 6%. In my opinion, this is a pretty solid score.

Final thoughts

  • This project reinforced the idea that feature engineering is powerful. Five out of the six features used to predict survival were engineered features, which captured information that the original features could not.
  • Additionally, having enough context on the subject matter (reading about the Titanic) helped during the exploratory analysis stage.
  • Spending enough time exploring (slicing and dicing) the data built intuition, which paid off when engineering features and when aggregating and grouping the dataset.
  • As these pieces fell in place, techniques like grid search and randomized search helped improve the model incrementally.
  • Lastly, I believe we can’t easily dismiss the significance of domain knowledge and guided trial-and-error in machine learning. These can be turned into model inputs, and the models can help quantify their importance.

Please feel free to share your comments, feedback, and/or questions. If you’re interested in exploratory analysis and feature engineering, check out my first article — “Kaggle Titanic Competition in SQL.” Thanks for reading!

Resources

I recommend these three books if you are looking for good references on machine learning, data analysis, and improving your Python programming skills.
