DATA ANALYSIS

Ever heard a co-worker confidently declare something like "The longer I lose at roulette, the closer I am to winning"? Or had a boss who demanded that you not overcomplicate things and just provide "one number", ignoring your attempts to explain why such a number doesn’t exist? Maybe you’ve even shared a birthday with a colleague, and everyone in the office commented on what a bizarre cosmic coincidence it must be.
These moments are typical examples of water cooler small talk – a special kind of small talk, thriving around break rooms, coffee machines, and of course, water coolers. It is where employees share all kinds of corporate gossip, myths, and legends, inaccurate scientific opinions, outrageous personal anecdotes, or outright lies. Anything goes. So, in my Water Cooler Small Talk posts, I discuss strange and usually scientifically invalid opinions I have overheard in the office, and explore what’s really going on.
So, here’s the water cooler opinion of today’s post:
The descriptive statistics match perfectly, so the datasets are basically the same. No need to dig deeper.
Sure, that might make some sense – if you’ve never taken a statistics class. 🙃 But in reality, very different datasets can share identical descriptive statistics, such as the mean, variance, or standard deviation. In other words, while descriptive statistics describe a dataset, they don’t define it. You always need to plot the data in order to get the full picture.
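To see just how little the summary numbers pin down, here is a minimal toy sketch (the two arrays are invented for illustration, not taken from any real dataset): two samples with the same mean and standard deviation but completely different shapes.

```python
import numpy as np

# An evenly spread sample vs. a sample with all its mass at the extremes
a = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
b = np.array([-np.sqrt(5), 0.0, 0.0, 0.0, np.sqrt(5)])

# Both have mean 0 and (population) standard deviation ~1.414
print(a.mean(), a.std())
print(b.mean(), b.std())
```

Same two numbers, very different data – and this toy case is exactly what Anscombe scaled up to a full quartet.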
Anyways, one of the first people to demonstrate this was Francis Anscombe with his now-famous Anscombe’s quartet.
🍨 DataCream is a newsletter offering data-driven articles and perspectives on data, tech, AI, and ML. If you are interested in these topics subscribe here.
What about Anscombe’s quartet?
So, Anscombe’s quartet is a set of four datasets with almost the same descriptive statistics that nevertheless look very different when visualized. The dataset is in the Public Domain and is conveniently available in the seaborn library, allowing us to play around and explore what is happening. We can easily load the dataset by:
import seaborn as sns
data = sns.load_dataset("anscombe")
Then, we can create a visualization of the four datasets using seaborn and matplotlib:
import matplotlib.pyplot as plt

sns.relplot(
    data=data,
    x="x", y="y",
    col="dataset", hue="dataset",
    kind="scatter",
    palette="deep",
    height=4, aspect=1
)
plt.suptitle("Anscombe's Quartet", y=1.05)
plt.show()

In particular, all four datasets consist of 11 (x, y) points:
- dataset I seems to be a simple linear relationship
- dataset II clearly is a parabolic curve
- dataset III clearly is a linear relationship, but with one large outlier
- and dataset IV is a vertical line, but again, distorted by one large outlier
All four datasets are very different from one another; however, when we calculate their descriptive statistics, we are in for a plot twist – all four datasets share almost identical descriptive statistics. More specifically, we can calculate the descriptive statistics as follows:
import pandas as pd
from scipy.stats import linregress

def calculate_statistics(group):
    mean_x = group['x'].mean()
    var_x = group['x'].var()
    mean_y = group['y'].mean()
    var_y = group['y'].var()
    correlation = group[['x', 'y']].corr().iloc[0, 1]
    slope, intercept, r_value, p_value, std_err = linregress(group['x'], group['y'])
    r_squared = r_value ** 2
    return pd.Series({
        "Mean of x": mean_x,
        "Variance of x": var_x,
        "Mean of y": mean_y,
        "Variance of y": var_y,
        "Correlation of x and y": correlation,
        "Linear regression line": f"y = {intercept:.2f} + {slope:.2f}x",
        "R2 of regression line": r_squared
    })

statistics = data.groupby("dataset").apply(calculate_statistics)

for dataset, stats in statistics.iterrows():
    print(f"Dataset: {dataset}")
    for key, value in stats.items():
        print(f"  {key}: {value}")
    print()

Identical numbers, wildly different visuals. 🤷‍♀️ Crazy, right? This is why it is so important to always visualize the data, no matter what the numbers suggest. In Anscombe’s own words, a common but misguided belief among statisticians is that "numerical calculations are exact, but graphs are rough".
Anscombe’s quartet is such a powerful example for highlighting the importance of data visualization, because the visualizations are not just different, but rather clearly different – with just one glance, one can understand that the datasets are completely distinct from each other. In other words, the visualizations immediately provide us with meaningful information, that the descriptive statistics fail to incorporate.
No one knows exactly how Anscombe came up with these datasets in the first place, but this 2017 study presents a method for creating such datasets from scratch. This allows us to produce endless examples of very different datasets with almost identical descriptive statistics – my favorite by far is the Datasaurus Dozen 🦖. Similarly to Anscombe’s quartet, the Datasaurus Dozen includes thirteen – a dozen plus the Datasaurus – very different datasets with almost identical descriptive statistics.
The datasets are available in the datasauRus R package under the MIT License, which permits commercial use.
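The core trick behind generating such datasets can be sketched in a few lines: random-walk the points, but accept a move only if every summary statistic stays within a tolerance of its starting value. The snippet below is a heavily simplified sketch of the idea from the 2017 paper (no simulated annealing toward a target shape, and all function names are my own):

```python
import numpy as np

rng = np.random.default_rng(42)

def summary(pts):
    # The statistics we want to preserve: means, stds, and correlation
    x, y = pts[:, 0], pts[:, 1]
    return np.array([x.mean(), y.mean(), x.std(), y.std(), np.corrcoef(x, y)[0, 1]])

def perturb_preserving_stats(pts, steps=5000, tol=0.01, jitter=0.1):
    # Randomly nudge one point at a time; reject any nudge that lets
    # a summary statistic drift more than `tol` from its initial value
    target = summary(pts)
    pts = pts.copy()
    for _ in range(steps):
        i = rng.integers(len(pts))
        old = pts[i].copy()
        pts[i] = pts[i] + rng.normal(0.0, jitter, size=2)
        if np.any(np.abs(summary(pts) - target) > tol):
            pts[i] = old  # reject: this move drifts the statistics
    return pts

start = rng.normal(size=(100, 2))
cloud = perturb_preserving_stats(start)
```

The full method additionally biases accepted moves toward a target shape (a dinosaur, a star, a circle), which is how the Datasaurus Dozen was drawn.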


Again, all of these datasets have almost identical descriptive statistics, yet they are strikingly different. Clearly, skipping the visualization step would result in a serious loss of information.
When a plot could have helped
Anscombe’s quartet or the Datasaurus may seem like fun, simplistic examples, aiming to teach us about the importance of data visualization. But don’t be fooled; the lesson – the importance of data visualization – is not a theoretical concept, but rather very real, with tangible implications in the real world.
1. 2008 financial crisis
Take for instance the 2008 financial crisis. The models used by banks and investment firms back then heavily relied on aggregate risk metrics like Value at Risk (VaR). More specifically, VaR provides an estimate of the potential maximum loss of a portfolio over a given time frame at a specified confidence level (e.g. 95% or 99%). For instance, a 99% VaR of $100 means that under normal market conditions, losses are expected to exceed $100 only 1% of the time.
VaR calculations are based on a bunch of assumptions, such as the commonly used assumption that market returns are normally distributed. Nevertheless, as you may have already figured out by now, ‘normally distributed’ is usually a rather sloppy approximation of the real world – real life is in most cases much more nuanced and not so straightforward.
Anyways, in the case of the 2008 financial crisis, the assumed ‘normal distribution’ failed to represent reality and account for the ‘fat tails’ appearing in the actual data of market returns. A ‘fat tail’ in a distribution represents a higher-than-expected (relative to the normal distribution) probability of extreme events – here, extreme losses. In other words, extreme losses occurred much more frequently than the models assumed.
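To make the fat-tail point concrete, here is a small simulation with entirely hypothetical returns (a Student’s t distribution with 3 degrees of freedom, not real market data): a VaR computed under the normality assumption visibly understates the loss threshold actually observed in the data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical fat-tailed daily returns: Student's t with 3 degrees of freedom
returns = rng.standard_t(df=3, size=100_000) * 0.01

# 99% VaR under a normality assumption: the 1% quantile of a fitted normal
var_normal = -(returns.mean() + norm.ppf(0.01) * returns.std())

# Historical 99% VaR: the actual 1st percentile of the simulated returns
var_historical = -np.percentile(returns, 1)

print(f"Normal-assumption 99% VaR: {var_normal:.4f}")
print(f"Historical 99% VaR:        {var_historical:.4f}")
```

The historical VaR comes out noticeably larger than the normal-assumption one – and a simple histogram of the returns would reveal the heavy tails at a glance, which is exactly the kind of check the aggregate number hides.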

[Figure created with scipy.stats]
Nonetheless, focusing on aggregate metrics like VaR does not allow us to identify those insights. On the contrary, a detailed visualization of historical returns can provide deeper insight into what is really happening. Individuals and institutions that used visualizations and detailed analysis did identify the risks.
Ultimately, VaR is a great example of how often people tend to feel confident and secure when they are presented with a number – a calculation – irrespective of whether that number is pulled out of thin air. At the core of the 2008 financial crisis, overreliance on quantitative models led institutions to greatly underestimate risks. Of course, an abundance of other factors contributed to the outcome – say, for instance, the systemic risk of interrelated portfolios – nevertheless, blind trust in aggregate numbers remains a strong factor.
2. Challenger space shuttle disaster
Even in the Challenger space shuttle disaster, some data visualization could have helped. For reference, on January 28, 1986, the Space Shuttle Challenger broke apart 73 seconds after takeoff due to a failure of the O-rings sealing the joints between sections of the solid rocket booster. The O-rings failed because of the extremely low temperatures (2°C / 35°F at launch, reaching -8°C / 17°F the night before). For a long period prior to the launch, engineers at Morton Thiokol – the NASA contractor that manufactured the boosters – were concerned about the performance of the O-rings at low temperatures, and even explicitly recommended against the launch. According to the accident report, they presented the relevant data to their company’s management as shown in the following picture. Nonetheless, their concerns were ignored by the management, which eventually recommended to NASA that it was ok to launch.

Said illustration of the O-ring failures in relation to temperature contains a lot of data; however, the point of the illustration is not clearly communicated. Edward Tufte, in his book ‘Visual Explanations’, argues that a simple, clear graphical representation of the correlation between temperature and O-ring failure could have gone a long way, possibly even resulting in a different decision about approving the launch. Allegedly, Morton Thiokol also presented NASA with a scatter plot of O-ring failure incidents in relation to temperature – but the plot omitted the flights with no failures, so it didn’t make much sense. A better plot, which also includes the flights with no O-ring incidents, is presented in the ‘Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident’. Apparently, this is a much more useful plot, allowing one to immediately suspect the correlation between O-ring failure and temperature. Tufte’s critique emphasizes how this omission contributed to poor decision-making.
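The effect of that omission is easy to reproduce with made-up numbers. Below is an illustrative sketch (the temperatures and incident counts are invented for demonstration, not the actual flight records): correlating O-ring incidents with temperature using all flights versus only the flights that had incidents.

```python
import numpy as np

# Hypothetical launch data: temperature (°F) and O-ring incident count per flight
temps     = np.array([53, 57, 58, 63, 66, 67, 70, 72, 75, 76, 78, 79, 81])
incidents = np.array([ 3,  1,  1,  2,  0,  0,  1,  0,  0,  0,  0,  0,  0])

# Correlation using every flight, including the zero-incident ones
r_all = np.corrcoef(temps, incidents)[0, 1]

# Correlation using only the flights that had at least one incident,
# mimicking the plot that dropped the failure-free flights
mask = incidents > 0
r_incidents_only = np.corrcoef(temps[mask], incidents[mask])[0, 1]

print(f"All flights:           r = {r_all:.2f}")
print(f"Incident flights only: r = {r_incidents_only:.2f}")
```

Dropping the zero-incident flights weakens the apparent relationship considerably – the cold-weather pattern only jumps out when the full picture is plotted.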

Be that as it may, with today’s historical and technological distance from those events, it’s easy to find fault with cases of the past. Nonetheless, it is heartbreaking to realize that in many cases disaster can be foreseen, but cannot be effectively communicated to decision makers in time for them to make the right decisions.
3. 1854 Broad Street cholera outbreak
A remarkable use case of effective data visualization is the 1854 Broad Street cholera outbreak. In particular, this outbreak occurred in Soho, London, near Broad Street, during the 1846–1860 cholera pandemic, and killed 616 people.
During this outbreak, Dr. John Snow mapped the cholera cases and realized that they clustered around a water pump on Broad Street. This led to the discovery that the Broad Street water pump was contaminated and was essentially the source of the outbreak. The simple act of removing the pump’s handle led to the cessation of the epidemic.

This was a huge discovery that not only stopped the epidemic, but also revolutionized the understanding of disease transmission and epidemiology. Undeniably, a great example of how a simple visualization can go a long way and unlock impactful insights.
On my mind
Ultimately, the lesson from datasets like Anscombe’s quartet or the Datasaurus Dozen is that our deep-rooted notion that ‘numerical calculations are exact, but graphs are rough’ is flawed. Both visualization and numerical calculation are essential in order to extract meaningful insights from data. In the end, interpreting data in a meaningful way may be more of an art than an exact science, as there is no single, one-size-fits-all calculation approach that should be followed. Data visualization is much more than pretty pictures – it is a necessity for avoiding misinterpretation and poor decisions in Data Analysis…
…cause a Datasaurus may be lurking somewhere in the data.
🦖
Data problem? 🍨 DataCream can help!
- Insights: Unlocking actionable insights with customized analyses to fuel strategic growth.
- Dashboards: Building real-time, visually compelling dashboards for informed decision-making.
Got an interesting data project? Need data-centric editorial content or a fancy data visual? Drop me an email at 💌 [email protected] or contact me on 💼 LinkedIn.
💖 Loved this post?
Let’s be friends! Join me on 💌 Substack 💼 LinkedIn ☕ Buy me a coffee!
or, take a look at my other Water Cooler Small Talks: