Data Storytelling with Python

Visualizing Statistics with Python — Telling Stories with Matplot

Crafting a narrative for the Titanic Dataset

Tonichi Edeza
Towards Data Science
6 min read · Feb 4, 2021


Photo by Clay Banks on Unsplash

One of the most important skills a data scientist can learn is how to craft a convincing story using the given data. Though we tend to think of our work as being objective and technical, encapsulated in the adage “Numbers don’t lie”, we should also be aware of its more subjective aspects. We should not fall into the trap of viewing our work as completely detached from our own impressions and preconceived notions of the world.

In this article I will go over the many ways the same data set can be used to craft different (and sometimes conflicting) narratives about the topic.

Let’s begin!

As always let us import the Python libraries we shall use.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

Excellent, let us now import the data we shall be using.

df = pd.read_csv('titanic_data.csv')
df.head()
First 5 Rows of the DataFrame

As you may have noticed, we are using the classic Titanic data, which can be found on Kaggle. I encourage new data scientists to take a look at it, as it is an excellent way to begin learning about basic statistics.

Let’s go for the obvious feature first, the survivor count.

fig, ax = plt.subplots(1, figsize=(5,5))
plt.suptitle('Frequency of Survival', fontsize=15)
# Sort by outcome code so the bars are always ordered 0 (died), 1 (survived)
survival_counts = df['Survived'].value_counts().sort_index()
ax.bar(survival_counts.index, survival_counts.values,
       color = ['darkblue', 'darkorange'])
ax.set_xticks(range(0, 2))
ax.set_xticklabels(['Died','Survived'], fontsize = 14);
Countplot of Deaths vs Survivals (Image generated by Code)
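The bar heights can also be read off as plain numbers, which makes it easier to quote shares rather than eyeball them. Below is a minimal sketch, assuming a `Survived` column coded 0 for died and 1 for survived as in the Kaggle file. The helper name `survival_summary` and the tiny `toy` frame are made up for illustration and are not the Titanic records.

```python
import pandas as pd

def survival_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of each outcome in the 'Survived' column."""
    # sort_index() fixes the order to 0 (died), 1 (survived) regardless of frequency
    counts = frame['Survived'].value_counts().sort_index()
    shares = frame['Survived'].value_counts(normalize=True).sort_index()
    summary = pd.DataFrame({'count': counts, 'share': shares})
    summary.index = summary.index.map({0: 'Died', 1: 'Survived'})
    return summary

# Toy data for illustration only (NOT the Titanic records)
toy = pd.DataFrame({'Survived': [0, 0, 0, 1, 1]})
summary = survival_summary(toy)
print(summary)
```

With the real data, calling `survival_summary(df)` should give the same two rows that the bar chart displays.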

As we can see, the records show that more people died than survived. Though this is an interesting statistic, there is not much of a narrative we can mine from it. Let us now incorporate another descriptive feature, say gender.

fig, ax = plt.subplots(2, 2, figsize=(15,10))
plt.suptitle('Survival by Gender', fontsize=25)
ind = np.arange(2)
width = 0.45
# value_counts() orders by frequency, so sort by outcome code to guarantee
# that index 0 is 'died' and index 1 is 'survived' for both genders
male_stats = df[df['Sex'] == 'male']['Survived'].value_counts().sort_index().values
female_stats = df[df['Sex'] == 'female']['Survived'].value_counts().sort_index().values

# Top left: survival status on the x-axis, genders stacked
ax[0][0].set_title('Stacked Graphs', fontsize = 18)
ax[0][0].bar(ind, male_stats, label = 'Male', color = 'darkblue')
ax[0][0].bar(ind, female_stats, label = 'Female',
             color = 'darkorange', bottom = male_stats)
ax[0][0].set_xticks(ind)
ax[0][0].set_xticklabels(['Died','Survived'], fontsize = 14)

# Top right: survival status on the x-axis, genders side by side
ax[0][1].set_title('Side-by-Side Graphs', fontsize = 18)
ax[0][1].bar(ind, male_stats, width, label='Male',
             color = 'darkblue')
ax[0][1].bar(ind + width, female_stats, width, label='Female',
             color = 'darkorange')
ax[0][1].set_xticks(ind + width / 2)
ax[0][1].set_xticklabels(['Died', 'Survived'], fontsize = 14)

# Bottom left: gender on the x-axis, outcomes stacked
ax[1][0].set_title('Stacked Graphs', fontsize = 18)
ax[1][0].bar('Male', male_stats[0], label = 'Died',
             color = 'darkblue')
ax[1][0].bar('Male', male_stats[1], label = 'Survived',
             color = 'darkorange', bottom = male_stats[0])
ax[1][0].bar('Female', female_stats[0], color = 'darkblue')
ax[1][0].bar('Female', female_stats[1], color = 'darkorange',
             bottom = female_stats[0])
ax[1][0].tick_params(axis = 'x', labelsize = 14)

# Bottom right: gender on the x-axis, outcomes side by side
ax[1][1].set_title('Side-by-Side Graphs', fontsize = 18)
ax[1][1].bar(0, male_stats[0], width, label='Died',
             color = 'darkblue')
ax[1][1].bar(0 + width, male_stats[1], width,
             label='Survived', color = 'darkorange')
ax[1][1].bar(1, female_stats[0], width, color = 'darkblue')
ax[1][1].bar(1 + width, female_stats[1], width,
             color = 'darkorange')
ax[1][1].set_xticks(ind + width / 2)
ax[1][1].set_xticklabels(['Male', 'Female'], fontsize = 14)

# Same legend style and y-axis range on every panel so they are comparable
for axi in ax.ravel():
    axi.legend(fontsize = 14)
    axi.set_ylim(0, 800)
fig.tight_layout()
plt.show()
Chart Comparisons (Image generated by Code)

Our analysis becomes more interesting now that we have incorporated gender. By design, we made four different versions of essentially the same graph. Notice how each slight change highlights a different aspect of the data (and hence calls for a different narrative).

The two graphs on top put survival status on the x-axis, and so give the impression that one should focus on the gender disparity in passenger deaths: far more men than women died during the sinking of the Titanic. We may then be tempted to jump straight to conclusions about how this was brought about by “women and children first” policies. However, looking at the bottom set of graphs unveils something a little more subtle.

Unlike the top graphs, the bottom graphs do not emphasize the gender disparity in the death count. With gender on the x-axis, what stands out is the outcome within each group and the size of each group: in the standard Kaggle training file, most of the men died while most of the women survived, and the “Male” bars are much taller overall.

“But how much of the gap is real?” you may ask. Men clearly account for more of the deaths, but is that because men were less likely to survive, or simply because there were more men aboard?

Part of the answer lies in another graph.

fig, ax = plt.subplots(1, figsize=(5,5))
plt.suptitle('Gender Split', fontsize=15)
sex_counts = df['Sex'].value_counts()
ax.bar(sex_counts.index, sex_counts.values,
       color = ['darkblue', 'darkorange'])
ax.set_xticks(range(0, 2))
# Take the labels from the data so they cannot drift out of order with the bars
ax.set_xticklabels([label.capitalize() for label in sex_counts.index],
                   fontsize = 14);
Countplot of Passengers by Gender (Image generated by Code)

We can see that men vastly outnumber women: there are roughly 84% more men than women (577 versus 314 in the standard Kaggle training file). It therefore makes sense that men account for a larger share of those who perished, as there were more of them in the first place.

Of course this is not to say that there is no bias that emphasizes the protection of women and children over adult men. However, when looking for evidence of that particular bias we must be cognizant of the “right” data to use: raw counts mix up how many people were in each group with how each group fared. We also have to recognize that the exact same data can look extremely different depending on what was chosen as the primary classifier (in our case survivor status vs gender). If we were to rely on the first set of graphs alone, we would risk overstating the level of gender disparity in the death count.
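One way to separate group size from group outcome is to put the raw counts next to the within-group rates. The sketch below uses `pd.crosstab` with `normalize='index'` for the rates and computes the relative difference in head count by hand. The helper name `outcome_by_group` and the `toy` frame are hypothetical and chosen only so the arithmetic is easy to follow; they are not the Titanic records.

```python
import pandas as pd

def outcome_by_group(frame, group_col, outcome_col='Survived'):
    """Return (raw counts, within-group rates) for outcome_col split by group_col."""
    counts = pd.crosstab(frame[group_col], frame[outcome_col])
    # normalize='index' divides each row by its own total, giving rates per group
    rates = pd.crosstab(frame[group_col], frame[outcome_col], normalize='index')
    return counts, rates

# Toy data for illustration only: 6 men (4 died) and 4 women (1 died)
toy = pd.DataFrame({
    'Sex': ['male'] * 6 + ['female'] * 4,
    'Survived': [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
})
counts, rates = outcome_by_group(toy, 'Sex')

# Relative difference in head count: how many more men than women, in percent
head_count = toy['Sex'].value_counts()
pct_more_men = (head_count['male'] - head_count['female']) / head_count['female'] * 100

print(counts)
print(rates.round(2))
print(f'{pct_more_men:.0f}% more men than women')
```

Running `outcome_by_group(df, 'Sex')` on the real data gives the numbers behind both rows of the four-panel figure: `counts` corresponds to the bar heights, while `rates` is what a claim about mortality rates should be checked against.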

But now let us step away from viewing the data from the lens of gender and view it from the lens of age.

fig, ax = plt.subplots(figsize=(10,7))
# Age has missing values in this data set; drop them explicitly before binning
age_died = df[df['Survived']==0]['Age'].dropna()
age_survive = df[df['Survived']==1]['Age'].dropna()
n, bins, patches = plt.hist(x = [age_died, age_survive],
                            stacked = True, bins='auto',
                            color=['darkblue', 'darkorange'],
                            alpha=0.65, rwidth=0.95)
plt.grid(axis='y', alpha=0.55)
plt.xlabel('Passenger Age', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.title('Age Distribution of Passengers', fontsize = 22)
plt.legend(['Died','Survived'], fontsize = 15);
Survival by Passenger Age (Image generated by Code)

We can see from the above graph that there is definitely an advantage to being young: the only ages at which the survival count is greater than the death count are below 10. Additionally, we see that a single octogenarian was able to survive the disaster; perhaps some of the passengers took pity on them and allowed them into a lifeboat.
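Because a histogram with `bins='auto'` picks its own bin edges, a reading such as "below the age of 10" is easier to verify with explicit age bands. Here is a minimal sketch using `pd.cut`. The band edges, the helper name `survival_rate_by_age_band` and the `toy` frame are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def survival_rate_by_age_band(frame, edges=(0, 10, 20, 40, 60, 90)):
    """Count passengers and compute the survival rate within each age band."""
    known = frame.dropna(subset=['Age'])  # ignore passengers with no recorded age
    # right=False makes each band include its lower edge, e.g. [0, 10)
    bands = pd.cut(known['Age'], bins=list(edges), right=False)
    grouped = known.groupby(bands, observed=False)['Survived']
    return grouped.agg(passengers='count', survival_rate='mean')

# Toy data for illustration only (NOT the Titanic records)
toy = pd.DataFrame({
    'Age': [2, 5, 8, 25, 30, 35, 50, None],
    'Survived': [1, 1, 0, 0, 0, 1, 0, 1],
})
by_age = survival_rate_by_age_band(toy)
print(by_age)
```

A band with a survival rate above 0.5 is one where survivors outnumber deaths, which is the same comparison the stacked histogram makes visually. Bands with very few passengers, such as the oldest one, should be read with caution.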

In Conclusion

We have seen how subtle changes in a graph can have drastic effects on the overall narrative of the data. In actual data science projects I find that there is a temptation to simply make a bar chart and accept the first output of Matplotlib or Seaborn. Though this is useful to expedite the process, we must be careful when we initially view the data. We must try our best to visualize it in multiple ways to ensure that we are not getting the wrong impression about the data. In future articles we will discuss more advanced statistical concepts and the best ways to visualize them. For now I hope this article has helped you realize that our work can occasionally be rather subjective.
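As one possible habit for viewing the data in more than one way, the sketch below draws the same two-way table twice: once as raw counts and once as shares within each group. It relies on pandas' built-in `DataFrame.plot.bar`, which uses Matplotlib underneath. The function name `plot_counts_and_rates` and the `toy` frame are hypothetical, and the `Survived` column is assumed to be coded 0 and 1.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_counts_and_rates(frame, group_col, outcome_col='Survived'):
    """Plot raw counts and within-group shares of the outcome side by side."""
    labels = {0: 'Died', 1: 'Survived'}
    counts = (pd.crosstab(frame[group_col], frame[outcome_col])
                .rename(columns=labels).rename_axis(columns='Outcome'))
    rates = (pd.crosstab(frame[group_col], frame[outcome_col], normalize='index')
               .rename(columns=labels).rename_axis(columns='Outcome'))

    colors = ['darkblue', 'darkorange']
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    counts.plot.bar(ax=axes[0], stacked=True, color=colors, rot=0)
    axes[0].set_title('Raw counts')
    rates.plot.bar(ax=axes[1], stacked=True, color=colors, rot=0)
    axes[1].set_title('Share within each group')
    axes[1].set_ylim(0, 1)  # shares always lie between 0 and 1
    fig.tight_layout()
    return fig, axes

# Toy data for illustration only (NOT the Titanic records)
toy = pd.DataFrame({
    'Sex': ['male'] * 6 + ['female'] * 4,
    'Survived': [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
})
fig, axes = plot_counts_and_rates(toy, 'Sex')
```

The left panel answers "how many?", while the right panel answers "how likely?". Looking at both before settling on a narrative is a cheap safeguard against the kind of misreading discussed above.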
