
Never Skip This Step in Your Exploratory Data Analysis (EDA)!

How Descriptive Statistics Alone Can Mislead You

DATA SCIENCE. STATISTICS. EDA. PYTHON.

Photo by Lucas Sankey on Unsplash

INTRODUCTION

If you are new to data science and have taken a course on preliminary data analysis, chances are that one of the first steps you were taught in exploratory data analysis (EDA) is to view the summary (descriptive) statistics. But what do we really intend to accomplish with this step?

Summary statistics matter because they report two things that guide modeling: location and scale. Location parameters describe where the data is centered, with the mean as the most common example. Comparing the mean against the median gives a first hint of skewness, which in turn helps you judge whether treating the data as roughly normal is reasonable when making modeling decisions.

For example, if the data is approximately normally distributed, modeling techniques like Ordinary Least Squares (OLS) may be sufficient and powerful enough for predicting outcomes. In addition, the way we do our initial data cleaning, such as the handling of missing data, may depend on whether the data exhibits normal behavior.

The scale parameter refers to the amount of dispersion in the data (e.g., standard deviation and variance). The larger this parameter, the more spread out the distribution is.
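
As a rough illustration of these two ideas, the sketch below computes location and scale for a small made-up sample using only Python's standard library. The sample values are hypothetical and were chosen to be right-skewed, so the mean lands above the median.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values, not from the article's data)
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 23]

# Location: where the data is centered
mean = statistics.mean(sample)
median = statistics.median(sample)

# Scale: how spread out the data is
std = statistics.stdev(sample)          # sample standard deviation
variance = statistics.variance(sample)  # sample variance

print(f"mean={mean:.2f}, median={median:.2f}")
print(f"std={std:.2f}, variance={variance:.2f}")

# A mean well above the median hints at a long right tail
right_skew_hint = mean > median
```

Here a single large value pulls the mean away from the median, which is the kind of signal that location statistics can surface before any modeling starts.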

So, looking at descriptive statistics is essential, particularly for modeling and research design purposes. Thinking that descriptive statistics are enough for EDA, however, may be one of the most costly assumptions a data professional could make.

To see this, let’s work through a visualization exercise using a popular dataset known as the Datasaurus dataset.

THE DATASAURUS DATASET

It is commonly thought that visualization takes center stage only when we are reporting or communicating data, insights, or results.

The importance of never skipping the visualization part of EDA, however, is made apparent by the Datasaurus dataset created by Alberto Cairo. The dataset can be found in this link.

Justin Matejka and George Fitzmaurice of Autodesk extended the idea into 13 datasets with virtually identical summary statistics, known as the Datasaurus Dozen, which can also be downloaded through R’s datasauRus package. (Permission to use the data was obtained through a personal email to the article’s author.)

Let’s proceed with generating the descriptive statistics of the dataset.

DESCRIPTIVE STATISTICS

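# Preliminaries
import pandas as pd
import numpy as np
# Assumption: the Datasaurus Dozen file is saved locally; the path below is a placeholder
df = pd.read_csv('DatasaurusDozen.tsv', sep='\t')
# List the sub-datasets
df.dataset.unique()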
These are the 13 sub-datasets within the datasaurus dozen dataset.
unique = df.dataset.unique()
for i in unique:
    temp = df[df.dataset == i]
    print(f"---Summary Statistics for '{i}' dataset---")
    # Mean and standard deviation of x and y
    print(np.round(temp[['x', 'y']].describe().loc[['mean', 'std']], 2))
    # Pearson correlation between x and y
    print(f"\nCorrelation: {np.round(temp['x'].corr(temp['y']), 2)}\n")
Image by the Author

Generating the descriptive statistics for our 13 datasets, in particular the location and scale parameters along with the correlation coefficient, yields virtually identical values.

Running descriptive statistics alone would therefore lead us to conclude that they are virtually the same dataset and can be modeled similarly.
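
To see how matching statistics can hide different shapes, here is a simplified one-dimensional sketch (the Datasaurus data is two-dimensional, so this is only an analogy). The two samples and the helper names `rescale` and `middle_count` are made up for illustration. Both samples are rescaled to the same mean and standard deviation, yet one is evenly spread while the other sits in two clusters.

```python
import statistics

def rescale(values, target_mean, target_std):
    """Linearly shift and stretch values to a chosen mean and population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [target_mean + (v - mean) * target_std / std for v in values]

def middle_count(data, mean=50, half_width=5):
    """Count points that lie within half_width of the mean."""
    return sum(1 for v in data if abs(v - mean) <= half_width)

# Two hypothetical samples with very different shapes
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
two_clusters = [1, 1, 1, 1, 2, 9, 10, 10, 10, 10]

a = rescale(evenly_spread, target_mean=50, target_std=10)
b = rescale(two_clusters, target_mean=50, target_std=10)

for name, data in [("evenly_spread", a), ("two_clusters", b)]:
    print(name,
          "mean:", round(statistics.fmean(data), 2),
          "std:", round(statistics.pstdev(data), 2),
          "points near the mean:", middle_count(data))
```

The mean and standard deviation agree for both samples, but only the evenly spread one has any points near the center. A summary table alone would not reveal that difference.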

But let’s visualize them and see how accurate that conclusion is.

VISUALIZING THE DATASETS

We develop a simple application to view the thirteen datasets interactively using Panel and HoloViews:

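import panel as pn
import holoviews as hv
# Load the Panel extension once
pn.extension(comms='default')

def visualize_scatter(dataset):
    # Scatter plot of x vs. y for the selected sub-dataset
    temp = df[df['dataset'] == dataset]
    scatter = hv.Scatter(temp, 'x', 'y')
    # `s` (marker size) is a matplotlib-backend option; the bokeh backend uses `size`
    scatter.opts(color='#cb99c9', s=50)
    return scatter

# Dropdown of dataset names; the plot re-renders when the selection changes
select = pn.widgets.Select(name='dataset', options=list(unique))
dmap = hv.DynamicMap(pn.bind(visualize_scatter, dataset=select))
app = pn.Row(pn.WidgetBox('## Dataset', select),
             dmap.opts(framewise=True)).servable()
app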
Interactive Visualization of the Datasaurus Dataset by the Author

Amazing! The thirteen datasets, while sharing similar descriptive statistics, look wildly different when plotted. Modeling them all the same way is therefore risky and could have serious repercussions for decision-making.

FINAL REMARKS

Now that we have seen that data visualization matters beyond the reporting stage, let’s make it a rule to include it in our EDA.

This article is not meant to discourage generating and using descriptive statistics, but it may be good to adopt this habit: visualize first, then compute descriptive statistics. While unconventional, visualizing the dataset first gives the data scientist some preliminary direction on how to approach the problem being tackled.
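
A first look does not need to be elaborate. As a dependency-free sketch, the hypothetical helper `ascii_scatter` below draws points on a coarse character grid. The X-shaped points are made up for illustration, and the function assumes that `xs` and `ys` each contain at least two distinct values.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Render points on a coarse character grid and return the plot as a string."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Map each coordinate onto a column/row index of the grid
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is drawn higher
    return "\n".join("".join(line) for line in grid)

# Hypothetical points forming an X shape
xs = list(range(0, 21)) * 2
ys = list(range(0, 21)) + list(range(20, -1, -1))

plot = ascii_scatter(xs, ys)
print(plot)
```

Even at this resolution the crossing pattern is visible, which a mean, standard deviation, and correlation would not convey.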

The full code is on my GitHub page.

Let me know what you think!

REFERENCES

Justin Matejka and George Fitzmaurice. 2017. Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1290–1294. DOI: https://doi.org/10.1145/3025453.3025912

