
Strategic Data Analysis (Part 3): Diagnostic Questions

Deep dive into the approach for answering "why" questions

This is part of a series on Strategic Data Analysis.

  • Strategic Data Analysis (Part 1)
  • Strategic Data Analysis (Part 2): Descriptive Questions
  • → Strategic Data Analysis (Part 3): Diagnostic Questions
  • Strategic Data Analysis (Part 4): Predictive Questions ← Coming soon!
  • Strategic Data Analysis (Part 5): Prescriptive Questions ← Coming soon!


Answering "why" questions can be difficult for any data analyst. A lack of subject matter expertise, a thin technical repertoire, or the absence of a strategic approach can all keep decision makers from finding the right answer. However, with a solid foundation and a clear direction, diagnostic questions can be tackled by anyone.

Diagnostic questions frequently follow the answers to descriptive questions. In asking a diagnostic question, the decision maker aims to understand how some piece of information came about or what caused something to happen. Thus, when we think about diagnostic questions, we often think about causal inference, and it is good to be familiar with its general principles.

In this article:

  1. Introduction to Causal Inference
  2. Strategy for Answering Diagnostic Questions
  3. A Case Study
  4. A Few Final Notes

Introduction to Causal Inference

Causal inference aims to uncover how interventions (or changes to the status quo) affect outcomes. In causal inference, we suppose that causality happens when some intervention, called "a treatment", is applied to some unit and causes a change in that unit’s outcome. If we could compare the outcome of a unit with and without the treatment, we would be able to observe the effect of the treatment (i.e. causality).

For example, if we wanted to know if painting our house exterior prior to listing it for sale would make it sell faster, the most ideal scenario would require us to compare time-to-sale with and without painting the house simultaneously. Here, the house is our unit, painting the exterior is our treatment, and time-to-sale is our outcome. However, it is impossible to both paint and not paint the same house simultaneously. Thus, "we can never observe the same unit with and without treatment" [1].

This is where causal inference comes in. Instead of directly measuring the treatment effect on a specific unit, we can instead measure association and bias. Association is the average difference in outcome between all units that received treatment and all units that did not. Bias is what distinguishes association from causation: it captures all the other factors that make the outcomes of the two groups different.

In our house sale example, we could compare all of the houses that were painted against all of the houses that were not painted and note their time-to-sale. The difference in time-to-sale between the two groups of houses is called "association". If there were no bias, we could conclude that painting the house prior to sale causes it to sell faster.
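As a quick illustration (with made-up time-to-sale numbers), the association is simply the difference in average outcomes between the two groups:

```python
import numpy as np

# Hypothetical time-to-sale in days (numbers invented for illustration)
painted = np.array([30, 25, 40, 35, 28])    # houses painted before listing
unpainted = np.array([45, 50, 38, 60, 42])  # houses left unpainted

# Association: difference in average outcome between treated and untreated
association = painted.mean() - unpainted.mean()
print(f"Association: {association:.1f} days")  # → Association: -15.4 days
```

A negative association suggests painted houses sell faster on average, but as the next paragraph shows, bias can make an association misleading.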

However, it is possible that most of the original owners who decide to paint their house prior to sale can also afford to do so, since they live in a nicer neighborhood; and the houses in the nicer neighborhood tend to sell faster. Hence, the bias could be that the houses are selling faster not only because of the fresh coat of paint, but also because they are in a nicer neighborhood. If we can eliminate this bias (among others), we can determine if painting the house prior to selling it causes it to sell faster.

This is the gist behind causal inference. For a deep dive, I would highly recommend a book by Matheus Facure Alves: Causal Inference for the Brave and True, which covers the topic in great detail. The basics of causal inference frame the strategy for answering diagnostic questions, so let’s dive into it in more detail.

Strategy for Answering Diagnostic Questions

The reason diagnostic questions can be difficult to answer is that they can require significant knowledge of the subject matter. The general strategy for uncovering why something has happened or is happening requires an understanding of all possible causes and biases, followed by rigorous technical approaches to assess their effect. Understanding all possible causes takes effort and time to investigate, so most of the time spent answering diagnostic questions goes into research. Unfortunately, research can sometimes lead an analyst into various rabbit holes and dead ends. Employing a strategic approach and rigor can help keep the process on track.

In general, the approach to answering diagnostic questions includes:

  1. Identifying the outcome
  2. Identifying probable causes and potential biases
  3. Assessing causality

Before getting started, it is important to note that in nearly all cases we will not be able to identify the exact root cause of something. Instead, we can identify the most probable causes and assess the likelihood of their effect.

It is important not only to understand this, but also to develop a communication strategy so that the decision maker is aware of this caveat well before we commit to answering their diagnostic question. In looking for an answer to a diagnostic question, the decision maker takes on a risk: the less certain the answer, the more risk it carries. Therefore, the decision maker has to know that this risk must be weighed when making a decision based on the answer provided.

With this caveat out of the way, let’s take a look at the strategy in detail.

Step 1: Identify the Outcome

The outcome in the question is the dependent variable that experienced the effect of some potential cause. Generally, diagnostic questions should only have a single dependent variable. It is important to identify the outcome in order to clearly define it and to verify that it can be measured. If the question has more than one dependent variable, the question should be broken out into separate questions.

For example, in the question from Part 1 "what caused the heat wave", the outcome is the heat wave, which can be defined as a sudden and dramatic increase in temperature. In the question "why are our clients canceling their subscriptions", the outcome we want to investigate is subscription cancellation. If we were posed a question like "why are the housing prices increasing and rent prices decreasing", we should instead answer two separate diagnostic questions: "why are the housing prices increasing" and "why is rent decreasing."

Step 2: Identify Probable Causes and Potential Biases

Once we have identified the outcome in question, we have to list out all possible things that could explain it and help us answer the "why". In general, this process can be broken out into identifying three things: causes, biases, and mechanisms of causality. Graphical causal models should be constructed to aid the identification process.

Potential causes can be determined through research, expertise, interviews, and association. Without proper subject matter expertise or access to an expert, this is very difficult to achieve. Therefore, it is necessary to gather as much knowledge as possible about the subject (check out my article First We Must Discover. Then, We Can Explore for more details on why it’s important to build knowledge).

A great tool to utilize when coming up with a list of potential causes is brainstorming. An effective approach to brainstorming is an iterative process: first, list as many causes as possible without judging their validity; second, go through the list and ensure that the listed causes are sound and logical.

For example, in order to answer a question from Part 1, "why are our clients canceling their subscriptions", we can first perform research and understand whether our churned clients have reported a reason for their cancellation. We can interview our client success team to understand what complaints they frequently receive from clients. Then, we can come up with any additional causes through a brainstorming session with our decision makers.

Potential biases can be even harder to uncover than potential causes but will have a significant impact on the answer. Just like causes, biases can be determined by building subject matter expertise. However, unlike potential causes that require mostly knowledge, bias identification usually requires creative and constructive thinking.

A good starting point is to become familiar with common bias types that present themselves in Data Analysis and infer if they present themselves in your use-case. Some common bias types include confirmation bias, selection bias, historical bias, survivorship bias, availability bias, and outlier bias [2] (check out this article on Metabase for more info).

A very prominent example of survivorship bias involves the work done by Abraham Wald during World War II. As part of the Statistical Research Group based at Columbia University, Wald and his team were tasked with optimizing the amount of shielding that warplanes should carry: if the planes carry too much shielding, they won’t fly due to their weight; if the planes carry too little shielding, they will not be protected. After analyzing the planes that had returned safely but had bullet holes, Abraham Wald recommended that shielding be added to the places on the plane that did not have bullet holes (as opposed to shielding the locations of bullet holes). Why? Since the analysis included only planes that survived, it is probable that the planes which did not survive had bullet holes in some critical zones. If those critical zones took a hit, the planes did not make it back, so it makes sense to put shielding over the critical zones [3]. Learn the whole story in this article by Alessandro Bazzi.

Mechanisms of causality describe how potential causes affect the outcome. Without a mechanism of causality, it would be difficult to discern a cause from a coincidence. This plays an important role when selecting a model to infer causality.

A great example of coincidence is the correlation between divorce rates in Maine and consumption of margarine (see the original article). The two trends may be parallel, but there is no sound mechanism that would explain why one causes the other. Therefore, we cannot consider an increase in divorce rates in Maine to cause an increase in margarine consumption, or vice-versa.
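A quick sketch of why parallel trends are not causation: two independently generated series that both happen to trend upward will correlate strongly, even though neither causes the other (all numbers below are simulated, not the real Maine data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two completely unrelated series that both trend upward over time
years = np.arange(2000, 2020)
divorce_rate = 4.0 + 0.05 * (years - 2000) + rng.normal(0, 0.02, len(years))
margarine_lbs = 8.0 + 0.10 * (years - 2000) + rng.normal(0, 0.05, len(years))

# High correlation despite no causal mechanism linking the two
corr = np.corrcoef(divorce_rate, margarine_lbs)[0, 1]
print(f"Correlation: {corr:.2f}")  # strong positive correlation, no causation
```

Any two variables that share a trend over time will look associated; only a plausible mechanism can turn that association into a causal claim.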

Graphical causal models should be developed to help identify causes and biases as well as the mechanisms that constitute causality. In essence, these models are directed graphs that include all causes and outcomes. Developing a graphical model to understand causality can also help increase our understanding of the subject and can be employed to aid our communications with decision makers.

For example, graphical causal models can help us uncover confounding bias. The variables among our causes and biases do not necessarily impact only the outcome; they can also affect each other. If some variable impacts both our potential cause and our outcome, then we are dealing with confounding bias. In order to resolve this, we should control for all common underlying causes.

Let’s assume that we are investigating whether painting a house prior to listing it for sale impacts the time-to-sell. We can hypothesize that having more income may affect whether or not owners decide to paint the house prior to a sale. However, we can recognize that greater income means that the owner also has access to resources that could decrease the time-to-sell. This is an example of confounding bias and we should control for income in our final model.
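As a sketch, a graphical causal model can be represented as a simple adjacency mapping, which makes confounders (variables with arrows into both the treatment and the outcome) easy to spot programmatically. The variable names below are illustrative:

```python
# A minimal graphical causal model as an adjacency mapping: an entry
# A -> [B, ...] means "A causally influences B". Names are illustrative.
causal_graph = {
    "income": ["paint_before_sale", "time_to_sell"],  # common cause
    "paint_before_sale": ["time_to_sell"],            # treatment -> outcome
    "time_to_sell": [],                               # outcome
}

def find_confounders(graph, treatment, outcome):
    """Variables with edges into both the treatment and the outcome."""
    return [
        node for node, children in graph.items()
        if treatment in children and outcome in children
    ]

print(find_confounders(causal_graph, "paint_before_sale", "time_to_sell"))
# → ['income']  — income should be controlled for in the final model
```

Even for a handful of variables, writing the graph down this explicitly forces us to state a mechanism for every edge we draw.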

Step 3: Assess Causality

Now that we have an outcome, causes and biases, and the mechanisms that constitute our dependencies, we can assess the causality. This final step requires us to verify if our hypothesized ideas are probable. Depending on the situation and the resources available to us, there are two ways we can achieve this: 1. by performing a randomized experiment and comparing the outcomes or 2. by developing a statistical model using historical data to measure causality.

Performing a randomized experiment with treatment and control groups can help us reduce bias by ensuring that the two (or more) groups in our experiment have similar representation of the population. If the groups are similar in their composition and our sample sizes are sufficient, we should be able to compare the outcomes between the groups and identify if the differences in outcomes are significant.

In our house sale example, we could sample two groups of home sellers (ensuring that both groups are equally representative of the home owner population). We could ask one of the groups to paint their home prior to listing it and we could ask the other group to leave their exterior paint untouched. Then, we would compare the distribution of time-to-sell between the two groups. Using a statistical test, we can see if there is a significant difference in the time-to-sell metric.
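A minimal sketch of such a comparison, using simulated (made-up) time-to-sell data and a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated experiment: time-to-sell in days for two randomized groups
# (distributions are invented purely for illustration)
painted = rng.normal(loc=35, scale=8, size=50)    # treatment group
unpainted = rng.normal(loc=45, scale=8, size=50)  # control group

# Two-sample t-test: is the difference in mean time-to-sell significant?
t_stat, p_value = stats.ttest_ind(painted, unpainted)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in time-to-sell is statistically significant.")
```

If the groups are not normally distributed, a non-parametric alternative (such as a Mann-Whitney U test) would be the more appropriate choice.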

In practice, this would be difficult to achieve for many reasons, some of which include getting volunteer home owners to participate in our experiment, ensuring sufficient funding for the experiment, and ensuring that our samples are random and representative of the home selling population. However, if we cannot put together such an experiment, we still have options.

Building a statistical model using historical data can help us control for confounding causes and biases and estimate the impact of direct causes on our outcome. Using a technique like regression, we can assign a weight to each cause and to a generalized bias metric. We can estimate the parameters of our model (the weights in the model) by training it on historical data. The final result should help us understand the causal effect of the variables on our outcome. "Even if we can’t use randomized controlled trials to keep other factors equal between treated and untreated, regression can do this by including those same factors in the model, even if the data is not random!" [1]
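A small simulation can illustrate the point. Below, income confounds both the decision to paint and the time-to-sell, so the naive difference in means overstates the effect, while a regression that includes income recovers an estimate close to the true causal effect. All numbers, including the "true" effect of -5 days, are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated data: income confounds painting and time-to-sell
income = rng.normal(100, 20, n)  # household income in $1000s (made up)
paint = (income + rng.normal(0, 20, n) > 100).astype(float)  # richer owners paint more
# True causal effect of painting here is -5 days; income also speeds up the sale
time_to_sell = 80 - 5 * paint - 0.3 * income + rng.normal(0, 5, n)

# Naive estimate: raw difference in group means (biased by income)
naive = time_to_sell[paint == 1].mean() - time_to_sell[paint == 0].mean()

# Regression controlling for income recovers an estimate near -5
X = np.column_stack([np.ones(n), paint, income])
coef, *_ = np.linalg.lstsq(X, time_to_sell, rcond=None)
print(f"Naive estimate: {naive:.1f} days, adjusted estimate: {coef[1]:.1f} days")
```

The naive estimate mixes the paint effect with the income effect; including income as a regressor separates the two, which is exactly what "controlling for" a confounder means.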

However, independent of the technique we choose to measure causality, it is important to note that our model alone cannot determine causality. We can include hundreds of features in a regression model, but just because they are included and have some weight associated with them does not guarantee that they are a cause of the outcome. Therefore, it is important to capture the possible mechanisms of causality in the graphical causal model so that we can avoid including irrelevant features and ensure that we obtain a sound result.

A Case Study

Let’s continue with the case study from Part 2, where I put together a strategy for answering the descriptive question about train lateness. Suppose that our decision maker now wants to know "why the trains are running late?" Following the steps outlined in this article, we can put together the following strategy for answering the question:

Identify the outcome. The outcome in the question "why the trains are running late" is the train lateness (which we defined as "a binary flag set to 1 if the difference between train actual and expected arrival times is greater than 1 minute")

Identify potential causes and biases.

  1. In order to identify potential causes, we can set up interviews and brainstorming sessions with our decision maker, observe trains on a platform and take a train ride, and talk to train conductors and passengers. Examples of potential causes include delayed platform unloading and loading times, track construction, a lack of dedicated tracks (which causes train meet-and-pass delays), hazards (like leaves, ice, and snow), train age, and train technical issues. For each cause, we should also identify a mechanism by which the cause has an effect on the outcome.
  2. In order to identify potential biases, we can familiarize ourselves with the types of bias and assess if any of them apply to our use case. For example, selection bias may not necessarily cause an issue for us because we can include all trains in our study, not a select subset of the trains. On the other hand, we may have a case of survivorship bias since some train mechanical issues may cause the train to never arrive and it would, therefore, be excluded from the late trains dataset.
  3. In order to identify potential mechanisms of causality, we should identify how each potential cause could influence or impact the outcome. For example, a hazard (like leaves or snow) could cause the train to be late by making it slow down for the hazard. We can hypothesize that train age impacts train lateness because older trains are slower. But is it true? Collecting relevant data and performing exploratory data analysis could help us verify if this mechanism of causality is legitimate.

We can put together a graphical causal model in order to assess our proposed causes and biases in relation to the outcome and in order to outline a potential mechanism for each cause. At this point, we can also perform some more exploratory data analysis in order to uncover hidden associations between our causes and select the final potential causes to include in our model. For example, if we find that trains that had a technical issue were mostly older trains, we don’t need to include train age as a model parameter since it is already implied through the technical issue parameter.

Assess Causality. Finally, we are ready to assess causality. Unfortunately for our case, it would be difficult and costly to conduct a series of experiments to test each potential cause. However, since we have detailed records of train schedules, train issues, as well as weather and track conditions, we should aim to construct a regression model in order to verify probable causes. In our case, we can construct a logistic regression model using our probable causes in order to predict if the train was in fact late. After the model is trained, the weights associated with our model parameters should indicate an effect that each cause had on the outcome.
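A hedged sketch of what such a model could look like, using simulated data and hypothetical feature names (hazard, construction, train age). In this toy data, hazards and construction genuinely drive lateness while train age does not, and the fitted weights reflect that:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2000

# Hypothetical per-run features (names and data are illustrative)
hazard = rng.integers(0, 2, n)        # hazard on track (leaves, ice, snow)
construction = rng.integers(0, 2, n)  # track construction along the route
train_age = rng.uniform(0, 30, n)     # train age in years

# Simulated "ground truth": hazards and construction cause lateness;
# train age has no effect in this toy data
logit = -2.0 + 1.5 * hazard + 1.0 * construction + 0.0 * train_age
late = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([hazard, construction, train_age])
model = LogisticRegression(max_iter=1000).fit(X, late)
for name, w in zip(["hazard", "construction", "train_age"], model.coef_[0]):
    print(f"{name}: {w:+.2f}")
```

The near-zero weight on train age illustrates the earlier point: a feature only earns a place in our answer if it carries both a meaningful weight and a plausible mechanism of causality.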

After selecting the causes with non-zero weights, we can present our findings to the decision maker and answer their original question, "why are the trains running late?"

A Few Final Notes

This article is lengthy but I hope it illuminates a complex topic and makes it simpler to approach. A few notes:

  • We may not be able to identify exact causes of past or current events. In most cases, we can identify most probable or likely causes. Therefore, a decision maker takes on a risk and should be made aware of this.
  • Graphical causal models can be a great communication tool with decision makers and can help reveal associations among potential causes as well as help identify biases.
  • Without a mechanism of causality, a potential cause with non-zero association to the outcome may just be a coincidence.
  • As an analyst, it is important to employ critical thinking skills, especially when dealing with diagnostic questions. These questions can have a lot of twists and turns that can send you on a wrong path.

Thanks for reading! In my next post, I will do a deep dive of predictive questions so stay tuned and let me know your thoughts in the comments!

Sources

[1] https://matheusfacure.github.io/python-causality-handbook/01-Introduction-To-Causality.html

[2] https://www.metabase.com/blog/6-most-common-type-of-data-bias-in-data-analysis

[3] https://www.cantorsparadise.com/survivorship-bias-and-the-mathematician-who-helped-win-wwii-356b174defa6

