(Source: PxHere)

Pytolemaic series

Data Exploration vs Insights

Why you should stop doing data exploration and switch to automatic insight generation.

Orion Talmi
8 min read · May 27, 2020


Motivation

All data projects start with data, and the work starts with Data Exploration. Through data exploration, the explorer aims to become familiar with the data and to understand it, discovering interesting facts and trends such as corrupted records, correlations, and class imbalance.

However, the final goal of the data explorer is not interesting facts or trends — it’s actionable Insights. These data insights are ready for use, be it for cleaning up the data, for improving model performance, or for supporting business KPIs.

It’s not an easy process. It’s quite hard to understand even a small dataset — a table of a few thousand samples and a mere dozen features — and it’s even harder to sift it for insights. In this post, I will argue why data exploration is the wrong approach for discovering useful data insights, and suggest an automatic insight generation approach to replace it. I will share some results of this insight-generation approach, as well as my conclusions.

Data Exploration, and why you shouldn't do it

One often starts the process of data exploration by looking at some samples, which is nice but limited to a few hundred samples at most. Going onward, they would look at some statistics (e.g. min, max, freq, count) and plot some histograms, searching for interesting insights. Depending on one’s creativity and curiosity, it’s quite easy to produce a dozen statistics and graphs for every feature in the dataset, which can accumulate to hundreds and even thousands of graphs.

For example, let’s look at UCI’s Adult dataset (link). It’s a small dataset — about 50K instances and 15 features — for which this Kaggler has trained a simple Logistic Regression model, but not before plotting more than 20 statistics tables and graphs.

Can’t see the forest for the trees

Histogram graphs for all features in the Adult dataset (source: author)

Our meager human mind cannot handle that much information. Faced with too many statistics and graphs, we end up going over them without actually seeing them.

Finding needles in the Data Exploration haystack

Maximal value for all features in the Adult dataset (source: author)

Putting aside the work required to create a statistics table or graph, it takes a substantial effort to read and understand what’s in it. For instance, calculating the maximal value for each feature is trivial, but noticing that a specific feature (e.g. ‘capital-gain’ or ‘hours-per-week’ in the Adult dataset) has a suspicious maximal value — this is almost impossible.

Information is useful only if you can act upon it

Data exploration aims to obtain information about the data through various functions and techniques. However, do we need all this information? The answer is no, we do not. Any information that won’t trigger an action is useless, and we don’t need useless information, however interesting it may be.

For instance, let’s look at a histogram plot for the Occupation feature in the Adult dataset. We can see how many people there are in each occupation.

The histogram for feature Occupation in the Adult dataset (source: author)

If that was your dataset — would you care that it has 994 samples of fishermen and 3650 samples of salesmen? It could’ve been 1515 and 2317 respectively, right? So why look at a histogram in the first place, if a 50% difference in a bin’s height has no impact and won’t trigger a second thought?

Insights, and how to get them

The answer is quite trivial — we do data exploration because we are searching for interesting information we can act upon; in other words, Insights. Looking at the example above, we can find some interesting facts:

  • “There are only 9 people with the occupation ‘Armed-Forces’.”
  • “There are 1500 people with occupation unknown.”
  • “There are 4 times more managers than fishermen.”

However, only the first two sentences should be considered insights, since they will trigger an action.

  • “There are only 9 people with the occupation ‘Armed-Forces’.” → triggers a “drop these samples from the dataset” action, or a “merge this occupation class with another class” action.
  • “There are 1500 people with occupation unknown.” → triggers a “verify data correctness” action followed by a “deal with missing values” action.

If you first plot a graph, and then ask yourself what you can learn from it — then you are doing it wrong. Instead, first decide what you want to learn, then plot the graph.

Plotting an interesting graph and then trying to discover insights is what most people do. It’s challenging and fun, but not very effective. A better approach would be to define a ‘question’ (Q) or ‘hypothesis’ (H) and a corresponding ‘action’ (A) based on the answer to the question. For example:

  • H: Number of outliers outside the 5-sigma range should be ≤1.
    A: Drop samples.

In this approach, insights (I) are easily discovered, as they are simply the facts that trigger an action: an answer to a question (Q), or a hypothesis (H) that was not fulfilled. Some examples based on the Adult dataset:

  • H: Number of outliers outside the 5-sigma range should be ≤1.
    I: 709 outliers are outside the 5-sigma range for feature ‘capital-gain’.
    I: 244 outliers are outside the 5-sigma range for feature ‘capital-loss’.
    A: Drop samples.
  • H: Maximal value for a numerical feature is not of the format 10**k-1.
    I: There are samples for which the ‘capital-gain’ is 99999.
    I: There are samples for which the ‘hours-per-week’ is 99.
    A: Verify the correctness of these values.
  • Q: Are there any negative values in positive only features?
    I: None (There are no such values).
    A: If there are, replace with N/A.

Let’s see how to put it into code:

Code examples for 3 types of insights
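A minimal sketch of the three checks above, using plain pandas (the function names and thresholds are my own, not Pytolemaic’s API):

```python
import numpy as np
import pandas as pd

def outlier_insights(df: pd.DataFrame, sigma: float = 5) -> list:
    """H: the number of samples outside the sigma-range should be <= 1."""
    insights = []
    for col in df.select_dtypes(include=np.number):
        mu, std = df[col].mean(), df[col].std()
        if std == 0:
            continue
        n_outliers = int(((df[col] - mu).abs() > sigma * std).sum())
        if n_outliers > 1:
            insights.append(f"{n_outliers} outliers are outside the "
                            f"{sigma}-sigma range for feature '{col}'")
    return insights

def suspicious_max_insights(df: pd.DataFrame) -> list:
    """H: the maximal value of a numerical feature is not of the form 10**k - 1."""
    insights = []
    for col in df.select_dtypes(include=np.number):
        max_value = df[col].max()
        log = np.log10(max_value + 1) if max_value > 0 else -1
        if log > 0 and log == round(log):  # catches 9, 99, 99999, ...
            insights.append(f"There are samples for which '{col}' is {max_value}")
    return insights

def negative_value_insights(df: pd.DataFrame, positive_only: list) -> list:
    """Q: are there negative values in positive-only features?
    A (if so): replace them with NaN."""
    insights = []
    for col in positive_only:
        n_negative = int((df[col] < 0).sum())
        if n_negative > 0:
            insights.append(f"{n_negative} negative values found in "
                            f"positive-only feature '{col}'")
    return insights
```

Applied to the Adult dataset, the first two functions should reproduce the ‘capital-gain’/‘capital-loss’ and 99999/99 insights listed above.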

Insights are not limited to data

In the previous examples, we’ve focused on insights originating from the dataset. However, we might find additional insights if we define some ML hypotheses. For example:

  • H: The test-set and the train-set have a similar data distribution.
  • H: There are no features with 0 importance in the dataset.
  • H: If the primary metric shows good results, then other metrics also have reasonable values.
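A sketch of how the first two hypotheses might be checked automatically (the shift threshold and helper names are my own illustrative choices):

```python
import numpy as np
import pandas as pd

def distribution_shift_insights(train: pd.DataFrame, test: pd.DataFrame,
                                threshold: float = 0.5) -> list:
    """H: test-set and train-set have a similar data distribution.
    A crude check: flag features whose test mean deviates from the train
    mean by more than `threshold` train standard deviations (a two-sample
    KS test would be a more rigorous choice)."""
    insights = []
    for col in train.select_dtypes(include=np.number):
        mu, std = train[col].mean(), train[col].std()
        if std == 0:
            continue
        shift = abs(test[col].mean() - mu) / std
        if shift > threshold:
            insights.append(f"feature '{col}' shifted by {shift:.1f} "
                            f"train-sigmas between train and test")
    return insights

def zero_importance_insights(feature_names, importances) -> list:
    """H: there are no features with 0 importance in the dataset."""
    return [f"feature '{name}' has zero importance"
            for name, imp in zip(feature_names, importances) if imp == 0]
```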

In most cases, it’s quite easy to answer these questions or to check these hypotheses without using graphical means, and even without a human in the loop. Thus, the process can often be automated, and since many of the hypotheses are not dataset-dependent, it’s not hard to scale. One may even carry this concept forward and define a suite of unit-tests for the dataset.

Automatic insight generation

I’ve taken it upon myself to try and implement such an auto-insight generator as part of the Pytolemaic package (read more here). I’ve defined and implemented a set of generic hypotheses, which I then tried on a couple of datasets. Before discussing my conclusions, let’s look at the results of those trials.

Insights examples

  • Titanic dataset — 900 samples X 11 features
Auto-generated insights for the Titanic dataset

Insights tell us: There are 5-sigma outliers — need to check. Also, a couple of features (Name/Ticket/Cabin) are not used correctly.

  • California house prices dataset
Auto-generated insights for the California house prices dataset

Insights tell us: There are 5-sigma outliers — need to check.

  • Adult dataset — 32K samples X 14 features
Auto-generated insights for the Adult dataset

Insights tell us: There are 5-sigma outliers — need to check. Also, there are suspicious values for hours-per-week (99.0) and capital-gain (99999.0).

  • Sberbank house market dataset
Auto-generated insights for the Sberbank house market dataset

Insights tell us: There are ~250 insights, most of them are 5-sigma outliers — need to check. Also, small classes in sub_area should be merged into an ‘other’ class.

The code for generating the Titanic insights using Pytolemaic package is given below, in case you wish to try it yourself.

My conclusions

Implementing these insights was very interesting, and not as hard as I’d expected. It showed me that it’s possible to implement generic auto-generated insights and obtain useful information about the data. The requirements for such insight generation are experience, creativity, and coding skills — a combination of skills that any senior data scientist has. However, there were several challenges I had to face while creating the insight module.

  • Usefulness
    Making the insights useful is not much of a challenge due to the process involved in defining the question/hypothesis. Since the action is part of the definition, any insight obtained is actionable, and thus, useful. However, actions should be part of the insight module’s output since the reader may not know what action to perform.
  • Short, clear, clean and easy to understand
    By looking at these auto-generated insights, it’s quite obvious that choosing the right format is crucial. For some insights, I had difficulty choosing a good format, and it shows. Insights that are hard to read or hard to understand will be overlooked and ignored, and the information they contain will be missed.
    For instance, in the examples above, the insight analysis was performed on a dataset whose categorical features had already passed ordinal encoding — making sentences like “class ‘1.0’ has only 3 representations” very cryptic, as one will have a hard time figuring out what exactly class ‘1.0’ is.
  • Full information
    Insights should contain all the relevant information the data scientist needs in order to take action. For example, the ‘# of outliers’ insight does not provide the maximal and minimal values encountered. As a result, it’s up to the reader to look up these numbers (i.e. write some code) for each feature with an outliers insight — and there are many of those. Note, however, that providing additional information is not as simple as it seems if one wishes to keep the insights short and clear.
  • Sophistication
    The quality and usefulness of the insights strongly depend on the effort put into the analysis. For instance, the ‘# of outliers’ check assumes a normal distribution, which may not hold for many features (e.g. ‘Age’, ‘hours-per-week’). More advanced hypotheses may be defined to unravel hidden insights, such as a high correlation between features, anomalies in the data, temporal drifts, and many more.
  • Comprehensiveness
    Since the number of possible issues in a dataset/model is close to infinity, it’s rather impossible to create a comprehensive suite of insights. Comprehensiveness can only be pursued by ongoing improvements from lessons learned.
  • Too many insights
    After overcoming all the challenges above, one major challenge remains — how to avoid generating too many insights. When there are too many, the user will have a hard time going through them all; the Sberbank example exhibits such behavior. I believe we will need algorithms that help us sieve the more important insights out of the list.
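As an illustration of the sophistication point above, a hypothesis about highly correlated feature pairs takes only a few lines (the 0.95 threshold is an arbitrary choice of mine):

```python
import numpy as np
import pandas as pd

def correlation_insights(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """H: no pair of numerical features is (almost) perfectly correlated."""
    corr = df.select_dtypes(include=np.number).corr().abs()
    cols = list(corr.columns)
    insights = []
    # scan the upper triangle of the correlation matrix
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                insights.append(f"features '{a}' and '{b}' are highly "
                                f"correlated ({corr.loc[a, b]:.2f})")
    return insights
```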

To summarize

Data exploration cannot scale, due to our limited human capabilities. As a result, we need a different approach — automatic insight generation. In this approach, we define a set of question-action or hypothesis-action pairs. Each question/hypothesis is then applied to the dataset, and the answer it produces is an actionable insight.

Through a couple of examples, we saw the potential of this approach, as well as the challenges. I believe that in the near future the world will go beyond manual data exploration, replacing it with powerful algorithms capable of providing actionable insights in a scalable way.
