First, We Must Discover. Then, We Can Explore.

A case for a structured data-discovery approach

Viyaleta Apgar
Towards Data Science

--

Photo by Clarisse Meyer on Unsplash

Back in the slammin’ 70s, John Tukey published Exploratory Data Analysis, through which he championed the idea of playing around with our datasets before jumping into hypothesis testing. Tukey argued that in doing so, one can uncover new information about the data and the topic in question and develop new hypotheses that might lead to some interesting results. He wrote¹,

It is important to measure what you CAN DO before you learn to measure how WELL you seem to have DONE IT.

Since then, exploratory data analysis (EDA) has grown in popularity, and today it would be quite difficult to find a Kaggle challenge submission notebook that does not start with an EDA.

If you’ve ever been curious enough to read Exploratory Data Analysis (and you have done so recently), you’d probably find it to be filled with lots of outdated techniques, like how to easily find logs of numbers and how to plot data points by hand. But if you’ve braved the book and either scoffed or pondered about how far we’ve come from such prehistoric tutorials, you’d find lots of useful examples and ideas. Primarily, you’d see that EDA is not a specific set of instructions one can execute; it is a way of thinking about the data and a way of practicing curiosity.

However, while Tukey brilliantly describes EDA and its techniques, his book omits the first and frequently overlooked step in data analysis: understanding the topic. While this idea may be intuitive to some, not everyone actually does it in practice. And although it may be fun to jump straight into programming or into making beautiful visualizations, EDA can be misleading if we do not understand what our dataset represents. So, before beginning our data exploration, we should also be curious about the topic or process our data attempts to describe.

Specifically, we should ask ourselves two questions:

  1. What do we know?
  2. What do we not know?

In attempting to answer these questions, we should be able to build a frame of reference which we will use to pursue our analysis.

Knowledge is Power

When attempting to solve a math problem, a good strategy is to first and foremost write down everything that is known about the problem. Similarly in data analysis, if we already have a dataset we plan to analyze, it is natural to want to know what the data represents. If we don’t yet have the dataset, it is just as natural to ask questions about our topic in order to collect appropriate requirements for the dataset and to understand the end-goal. In this section, I propose a structured way to collect the facts about our analysis. In fact, the question “what do we know?” can be divided into three separate “what” questions.

What is the subject matter?

While subject matter expertise can be left to the experts, a proficient data analyst should investigate the subject and learn everything they can about the topic. The reason to do so extends beyond pure curiosity. Understanding the subject matter can help identify what information is needed for the analysis and can help gather specific requirements. When working with an existing dataset, it can help during the EDA. Moreover, it can help analysts avoid doing redundant work.

For example, knowing that a company announces quarterly earnings to the public could help explain why its stock price undergoes sudden changes on a quarterly basis. When analyzing the company’s stock price fluctuations, an analyst can add this to the list of known facts and save time digging up information during EDA. Moreover, the analyst could request quarterly financial statements as an additional data requirement.
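As a quick sketch of how this piece of subject-matter knowledge becomes an EDA shortcut, we could flag daily price changes that fall on known earnings dates; the dates and prices below are made up for illustration:

```python
# A quick sketch: flag daily price moves that fall on known earnings dates.
# The dates and prices below are made up for illustration.
import pandas as pd

prices = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-24", "2023-04-25", "2023-04-26", "2023-04-27"]),
    "close": [150.2, 151.0, 143.5, 144.1],
})
earnings_dates = pd.to_datetime(["2023-04-26"])  # known from public announcements

# Large moves that coincide with an announcement already have an explanation,
# so they need not be chased down again during EDA.
prices["pct_change"] = prices["close"].pct_change() * 100
prices["near_earnings"] = prices["date"].isin(earnings_dates)
print(prices)
```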

What are the definitions?

Before proceeding with the analysis, it’s important to put together a dictionary of definitions and known terms. Having such a dictionary available could help uncover certain nuances in the analysis, understand the logic involved in various calculations, and communicate with stakeholders. Compiling a dictionary could also raise a number of additional questions and hypotheses to help the analysis.

If you are given a wine quality dataset (like this one) and asked to predict the quality of wine, you could just import the dataset, import scikit-learn, and run a model. But if you took the time to build a dictionary of terms, you’d understand that “volatile acidity”, for example, is a measure of low-molecular-weight fatty acids in wine and is associated with wine spoilage. So if your model suggests that it contributes positively to wine quality, it may be time to revisit your model, or to be prepared to justify this outcome to your stakeholders.
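To make this concrete, here is a minimal sketch, assuming the red wine quality CSV has been downloaded locally as winequality-red.csv (a hypothetical path) with a semicolon separator and a quality column, in which we fit a simple linear model and check the sign of the coefficient for volatile acidity:

```python
# A minimal sketch, assuming the wine quality data has been downloaded locally
# as "winequality-red.csv" (hypothetical path), semicolon-separated, with a
# "quality" column as the target.
import pandas as pd
from sklearn.linear_model import LinearRegression

wine = pd.read_csv("winequality-red.csv", sep=";")
X = wine.drop(columns=["quality"])
y = wine["quality"]

model = LinearRegression().fit(X, y)

# Inspect the sign of each coefficient; a positive coefficient for
# "volatile acidity" would clash with its definition as a marker of spoilage
# and is worth questioning before presenting results to stakeholders.
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature:>20}: {coef:+.3f}")
```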

What are the underlying processes?

The last step in collecting the knowns is to understand what underlying processes govern the subject of your analysis. Typically, this is done through a systems analysis, which identifies those processes and should help raise and answer various questions about the data. It may also serve as a tool to guide the analysis.

Let’s say that you are tasked with identifying factors that contribute to your company’s growth. Developing a systems diagram of growth paths can be a good starting point for understanding what data should be collected and for testing various hypotheses. For example, a tree diagram could highlight that there are three ways your company could grow its income: sales to new clients, increased revenue from existing clients, or a decrease in the number of churned clients. From those factors, you can begin to build a detailed picture. For one, you could identify the various ways in which your company acquires clientele and the data needed to verify whether those factors are important.
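As an illustrative sketch, such a tree of growth paths can be captured in a simple nested structure and walked to list the data each branch would require; the branch names and data items below are hypothetical, not drawn from any particular company:

```python
# An illustrative sketch: a hypothetical tree of growth drivers, mapped to the
# data each branch would require in order to test its importance.
growth_tree = {
    "income growth": {
        "sales to new clients": ["marketing spend", "lead counts", "win rates"],
        "increased revenue from existing clients": ["upsell history", "usage data"],
        "reduced client churn": ["churn dates", "support tickets", "contract terms"],
    }
}

def list_data_requirements(tree, path=()):
    """Walk the tree and print the data needed to evaluate each growth path."""
    for node, children in tree.items():
        if isinstance(children, dict):
            list_data_requirements(children, path + (node,))
        else:
            print(" -> ".join(path + (node,)), "| data needed:", ", ".join(children))

list_data_requirements(growth_tree)
```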

Ignorance is Bliss

Once we have a better understanding of everything we do know about the subject in question, it is time to ask ourselves, “what do we not know?” While in some instances we could collect additional data, the true goal of this question is to understand our limitations and the requirements we cannot satisfy. It is also necessary to understand the biases and assumptions under which we will be making recommendations.

Is the information complete?

To answer this question, let’s take a look at the data we are able to get or the data we already have and assess the following criteria:

  1. Does the data portray the entire picture or only part of it? For example, if we are tasked with analyzing a stroke data set from Electronic Patient Records (EPR), we are only looking at individuals who have a record and not everyone who may have had a stroke. What does that say about our analysis and any recommendations we plan to make?
  2. Are there missing data points? It is typical for a dataset to have missing records or columns. Frequently, an analyst has to devise a strategy for dealing with the missing data points; however, that strategy may change depending on why a data point is missing.
  3. Is the data of good quality? In many instances, data transformations, errors in data collection, or manual inputs can lead to poor data quality. Before using the given information, an analyst should first confirm that the data is correct, cross-referencing existing sources where possible, and second, develop a strategy to mitigate poor quality and uncertainty. A quick first pass at both checks is sketched after this list.
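As a minimal sketch of such a first pass, a small pandas profile can surface missing values and duplicate rows before any strategy is chosen; the DataFrame and column names below are hypothetical and not taken from any real patient record system:

```python
# A minimal completeness and quality check; the frame below uses hypothetical
# columns and values, not real patient data.
import numpy as np
import pandas as pd

def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise missingness and basic quality signals for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique_values": df.nunique(dropna=True),
    }).sort_values("missing_pct", ascending=False)

records = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "age": [54, 61.0, 61.0, np.nan],
    "stroke": [1, 0, 0, np.nan],
})

print(profile_dataframe(records))
print("duplicate rows:", records.duplicated().sum())
```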

Let’s say you are tasked with analyzing product reviews on a website (like these) in order to derive trends and insights and to influence future product stocking. Such a dataset relies on individuals who have entered their product reviews; however, not all individuals do that. Is this dataset, then, representative of the entire population? If we perform natural language processing on it, do we expect the quality of the reviews to be good, or should we do some preprocessing to make some entries more legible? If we don’t do preprocessing, are we missing certain information? Can we get any additional data or results to improve our analysis?
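As a rough sketch of the kind of preprocessing such reviews might need before any natural language processing is applied, we could normalize the raw text; the column name review_text and the example strings are hypothetical:

```python
# A rough preprocessing sketch for free-text reviews; the column name
# "review_text" and the example strings below are hypothetical.
import re
import pandas as pd

def clean_review(text: str) -> str:
    """Lowercase a review, strip HTML remnants, and collapse extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                     # drop stray HTML tags
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text.lower())    # keep basic characters
    return re.sub(r"\s+", " ", text).strip()                 # collapse whitespace

reviews = pd.DataFrame({"review_text": [
    "Great product!!!   <br> Would buy again :)",
    "DIDN'T WORK ... returned it",
]})
reviews["clean_text"] = reviews["review_text"].map(clean_review)
print(reviews)
```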

What assumptions are we making?

In situations where the data is incomplete or of poor quality and it is not possible to obtain additional information, an analyst may have to accept their fate and design strategies for continuing with the analysis. However, it is important to note the assumptions that must be made, as they will set the boundaries and scope for interpreting the analysis results. Making assumptions comes with risk, and a decision maker must be informed of the risk they are taking when they act on the results of the analysis.

For example, if we are looking to analyze a dataset of happiness scores across the globe (like this one) and we cannot obtain any additional information, we must work under several assumptions. Primarily, if we want to make inferences about the general population of the represented countries, we must assume that the results for individual countries represent their respective populations. This means that the individuals who participated in the survey represent the other people in their respective nations (which is a hefty assumption). We must also assume that the methodology employed by the Gallup World Poll, which conducted the survey, is not biased towards specific populations or methods (which is also a hefty assumption). Another assumption we must make is that the translation of the survey into different languages did not distort the results.

Are we biased?

The final question, which will determine the strategy for the analysis and the interpretation of its results, deals with bias: an inclination or prejudice towards a point of view. Biases arise from the way people perceive the world and from their experiences and interactions with others. Without understanding the types of biases and checking our understanding of information through an unbiased lens, the insights we derive may not represent reality. Instead, they may be prejudiced toward a specific point of view. Biased analysis not only misrepresents reality, it can also be unfair and unethical.

For this last example, let’s take a look at a dataset of US presidential debate transcripts (like this one). An analysis of this data may be affected by recency bias, in which an individual places more importance on the most recent debates and forgoes taking into account that culture changes and evolves through history. Additionally, an analyst with a strong political point of view may fall into confirmation bias, ignoring evidence that does not support their point of view.

This post makes a case for a structured approach to data discovery, a process which must happen before exploratory data analysis can take place. The goal of this process is two-fold: to understand what information is known and to evaluate what information is not known. Answering the questions that serve this goal allows us to develop a perspective and a lens through which the results will be interpreted. It also helps us understand whether our insights can be used to make decisions and what kinds of decisions we are limited to.

Before jumping into the exploration of unknown lands, we should discover those lands, understand them, survey them, and prepare ourselves for the journey. Only then will we be able to explore safely and with confidence.

I hope you enjoyed the read! I can’t wait to read your thoughts in the comments section.

References

[1] Tukey, J. (1977) Exploratory Data Analysis. Addison-Wesley Publishing Company Inc.
