This is part of a series on Strategic Data Analysis.
- Strategic Data Analysis (Part 1)
- → Strategic Data Analysis (Part 2): Descriptive Questions
- Strategic Data Analysis (Part 3): Diagnostic Questions
- Strategic Data Analysis (Part 4): Predictive Questions ← Coming soon!
- Strategic Data Analysis (Part 5): Prescriptive Questions ← Coming soon!
In Part 1, I discussed the four types of questions that data analysts attempt to answer and ways to identify each question type. If you recall, when we ask descriptive questions, we attempt to acquire an understanding of something. These questions generally start with "what/is/does" and refer to the present or the past. Now, let’s dive into the strategy for answering these questions.
Strategy for Answering Descriptive Questions
Descriptive questions tend to come up the most for data analysts, and their answers tend to provide a foundation for follow-up questions. Typically, seasoned analysts already have a strategy (or at least some guidelines) that they use to answer descriptive questions. The specifics vary with the question, the industry, personal preferences and knowledge, and so on. However, the skeleton for any strategy should include the following:
- Assessing the intent of the question
- Identifying the variables in question
- Defining the analytical goal of the question
These steps should guide you in choosing the best methodology and providing the most appropriate answer. Let’s take a deeper look.
Step 1: Assess the intent of the question
Before applying any technique to answer the question posited by the decision-maker, we must first understand why the question is being asked. This can significantly influence our strategy and the final approach that we choose. Some of the considerations within the intent include:
- How the answer will be interpreted,
- What decisions our answer will inform, and
- Our audience’s technical or statistical literacy
One of my favorite examples of intent awareness (and the one I share the most with my peers) is an article written by Tyler Buffington, PhD: Mean or median? Choose based on the decision, not the distribution. In this stellar review of choosing the right methodology, Tyler argues that the skewness of the distribution should not determine the choice between mean and median as a metric for "the average." Instead, an analyst should focus on how this metric will be used for inference by the decision-maker.
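To make the mean-versus-median point concrete, here is a minimal sketch. The scenario and the order values are made up for illustration and are not taken from Tyler's article. The mean scales up to a total, while the median describes a typical single order, so the better choice depends on which of those the decision-maker cares about.

```python
from statistics import mean, median

# Hypothetical order values in dollars; one large order skews the data.
order_values = [20, 22, 25, 27, 30, 31, 35, 400]

# The median answers "what does a typical order look like?"
typical_order = median(order_values)

# The mean answers "how much does each order contribute to the total?"
# (mean * number of orders == total revenue)
average_order = mean(order_values)

print(f"median order value: {typical_order}")
print(f"mean order value:   {average_order}")
print(f"total revenue:      {sum(order_values)}")
```

If the decision is about forecasting total revenue, the mean is likely the more useful number here. If it is about what a typical customer spends, the median probably is.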
The intent of the question can also lead us to choose the correct data points. Let’s take a look at an example: "what were our sales during the second quarter of this year?" Our answer can be either the sum of gross sales (count of units sold times the price per unit) or the sum of net sales (gross sales minus discounts and promotions). In some situations, our decision-maker may not know this difference, so we should either educate them or get clarity on how the value will be used before deciding which one to report.
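As a quick illustration of how far apart the two answers can be, here is a small sketch. The ledger rows and field names are hypothetical, and it assumes discounts are recorded per row as a dollar amount.

```python
# Hypothetical Q2 sales ledger rows (field names are assumptions).
ledger = [
    {"units": 10, "unit_price": 25.0, "discount": 20.0},
    {"units": 4, "unit_price": 100.0, "discount": 0.0},
    {"units": 6, "unit_price": 50.0, "discount": 30.0},
]

# Gross sales: units sold times the price per unit.
gross_sales = sum(row["units"] * row["unit_price"] for row in ledger)

# Net sales: gross sales minus discounts and promotions.
net_sales = gross_sales - sum(row["discount"] for row in ledger)

print(f"gross sales: {gross_sales}")
print(f"net sales:   {net_sales}")
```

Both numbers are valid answers to "what were our sales?", which is why the intent has to settle which one we report.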
Another consideration, which is a part of the intent, is the audience. If we are trying to answer a question that calls on us to compare distributions among groups, it may not be wise to show a complex visualization like a box plot to a decision-maker who does not know how to read box plots. A simple statistic may be the best alternative, especially to business partners who make hundreds of decisions per day and don’t have the time to review a complex chart (like C-suite executives, for example). On the other hand, if we want to present information to statistically-literate data scientists, a box plot may just be the way to go.
Step 2: Identify the variables in question
The next step is to identify and clarify the variables in the question that we want to describe in some way and to ensure that those variables have representative data.
For example, in "what were our sales during the second quarter of this year?" the single variable is obvious – it is the sales during the second quarter of this year and we can easily obtain the data from the sales ledger.
However, if the question lacks obvious variables, then the question should be restated such that it pertains to variables that are clear and can be represented with data.
For example, the variable in "is there any gender bias in our clinical patient care?" is "gender bias" but "gender bias" is not necessarily a data point per se. However, "difference in outcomes among genders" or "patient satisfaction among genders" are potential candidates for measures of "gender bias." So, we can restate our question as "is there any difference in patient outcomes among different genders in our clinical patient care?"
It is also important to cut through a question’s complexity. Some questions may feature several nouns but ask us to find one specific variable, and we should isolate this variable from the question.
For example, "tourists from which city tend to stay at our hotel longer?" includes tourists, cities, and hotels, but the variable we are looking for is the origin city of tourists. For the question "was there any change in hold time after we hired more call center representatives?" the two variables are (1) the time series (to help us infer information before and after the change) and (2) the amount of time the customer was on hold.
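For the hold time question, one way to turn the time series into a usable variable is to label each call as "before" or "after" the hiring change. This is only a sketch: the call records, the hire date, and the helper name `period` are all hypothetical.

```python
from datetime import date
from statistics import mean

# Hypothetical call records: (call date, hold time in seconds).
calls = [
    (date(2024, 3, 1), 300),
    (date(2024, 3, 15), 280),
    (date(2024, 4, 2), 200),
    (date(2024, 4, 20), 180),
]

# Assumed date on which the new representatives started.
hire_date = date(2024, 4, 1)


def period(call_date):
    """Label a call as happening before or after the hiring change."""
    return "before" if call_date < hire_date else "after"


# Variable 1 (time) becomes a before/after category;
# variable 2 (hold time) is grouped by that category.
hold_by_period = {"before": [], "after": []}
for call_date, hold_seconds in calls:
    hold_by_period[period(call_date)].append(hold_seconds)

mean_hold = {label: mean(values) for label, values in hold_by_period.items()}
print(mean_hold)
```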
Step 3: Define the analytical goal of the question
Having identified the variables in our question, we can now categorize the goal of the question. This can be achieved by rephrasing it into a directive and categorizing that directive. Identifying the goal can help us narrow down some appropriate quantitative techniques so that we can answer the original question.
Keep in mind: the analytical goal and the intent of the question are different. The intent of the question identifies what the decision-maker plans to do with the answer or how they plan to interpret the analytical results. The analytical goal of the question determines what we want to do with our variables once we identify them.
There are three types of goals that the descriptive question may be looking to achieve and these goals depend on the variables we identified earlier:
- Describe a variable: If the goal of the question is to describe a single variable, then the answer will require us to find some parameter or a set of parameters that describe our subject. If we can restate our question using the keyword "find" followed by the subject of our question, then the goal of the question is to describe the variable. For example: "what were our sales during the second quarter of this year?" has a goal of getting a value to represent all of the sales; therefore, it asks us to find the sum of sales. As a directive, we can restate the question as "find the sum of sales during the second quarter of this year". Common techniques for answering these questions include calculating descriptive statistics (like sum, mean, mode, ranges, etc.) and using visualization tools (like histograms or kernel density estimation plots). However, more advanced techniques exist, depending on the nature of the question.
- Compare groups or variables: If the goal of the question is to compare groups within a variable or to compare different variables, then our question can be rephrased using the "compare" keyword. These questions can also include comparisons in time, which may require us to create a variable from the time series to serve as time categories (like groups of time represented by "before/after", hour, month, etc.). In the example "is there any gender bias in our clinical patient care?", the question aims to compare patient care between the gender groups and can be rephrased as a directive: "compare clinical patient care among all genders." There are many techniques that can help with comparing groups or variables. Visualization tools like bar charts or pie charts can assist with comparing groups, histograms and density plots can help compare distributions of values between two variables, line charts can help with comparisons of values in time, and scatterplots can help compare individual points. Descriptive statistics and statistical comparison tests (like t-tests or ANOVA) can be employed to compare two or more distributions [1].
- Identify trends or relationships: If the goal of the question is to identify patterns in a series (like time) or patterns among two or more variables, then we can rephrase a descriptive question into a directive using the keyword "identify a connection / correlation." It’s important to note that relationships do not imply causation but simply try to establish a connection between variables; causation is addressed in diagnostic questions. For example: "how did our revenue change this year?" aims to identify a trend of revenue in time. We can rephrase this into a directive: "identify a connection between revenue and time." The question "are air temperature and temperature of sea water related?" aims to find a relationship between the two temperatures. We can rephrase this as "identify a correlation between air and sea temperatures." For identifying relationships between variables, scatter plots, bubble plots, and heat maps can assist visually, while statistical methods like Pearson or Spearman correlation can help identify whether the variables share a connection. Trends in time or in another ordered series are usually easiest to spot visually with line charts, and time series models like ARIMA can help quantify them.
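The three goals above can be sketched with one simple technique each. All of the data below is made up, the group labels are placeholders, and each technique is just one of several that could fit: a sum for "find", a difference in group means for "compare", and a hand-written Pearson correlation for "identify a correlation".

```python
from math import sqrt
from statistics import mean

# Goal 1, describe a variable: "find the sum of sales during Q2".
q2_monthly_sales = [1200.0, 950.0, 1430.0]  # hypothetical values
total_sales = sum(q2_monthly_sales)

# Goal 2, compare groups: "compare patient satisfaction among groups".
# Hypothetical satisfaction scores on a 1-10 scale.
satisfaction = {
    "group_a": [8, 7, 9, 8],
    "group_b": [6, 7, 5, 6],
}
group_means = {group: mean(scores) for group, scores in satisfaction.items()}
mean_gap = group_means["group_a"] - group_means["group_b"]

# Goal 3, identify a relationship:
# "identify a correlation between air and sea temperatures".
air_temp = [10.0, 15.0, 20.0, 25.0, 30.0]  # hypothetical readings
sea_temp = [8.0, 11.0, 15.0, 18.0, 22.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(air_temp, sea_temp)

print(f"find:     total Q2 sales = {total_sales}")
print(f"compare:  gap in mean satisfaction = {mean_gap}")
print(f"identify: Pearson r = {r:.3f}")
```

A gap in group means or a high correlation is only a description. Whether the gap is meaningful, or what causes it, is a separate question.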
A Case Study
Let’s take a look at a question from Part 1: "do the trains run late?" In order to find the correct and effective technique to answer this question, let’s follow through the strategy steps outlined above.
Assess the Intent: Suppose that this question came from the VP of the train operations company. Through a conversation with her, we found out that the VP wants to know whether any action should be taken to remediate the current train schedule if the trains are in fact running late. If the trains do not actually run late, she still wants to establish lateness as a KPI and continue to monitor it. Furthermore, the VP told us that she considers the "trains to run late" if most of the trains run late by over a minute.
Identify the Variables: The concept of interest in the question "do the trains run late" is train lateness, but what is the correct variable or set of variables that can represent this concept? Through analysis of the question as well as the intent, we can determine several options for our variable selection:
- Two variables: train expected arrival time and train actual arrival time
- One variable: difference between train actual and expected arrival times
- One variable: a binary flag set to 1 if the difference between train actual and expected arrival times is greater than 1 minute
Our variable selection should depend on the intent of the question and will definitely influence how we identify the goal of the question. From the intent, we know that the VP considers the trains to run late if most of the trains run late. So really – we only need a binary flag identifying if each train is, in fact, late. This is the simplest bit of information we can provide that will help us understand the overall train lateness and can help our decision-maker determine her next step.
Define the Analytical Goal: Given that we have identified the intent and the variable in question, we can now define the analytical goal and select a technique. Since we are working with a single variable, a binary "late train" flag, we know that the goal of the question is to describe the variable. The intent of the question was to identify if most trains are late. Therefore, one of the techniques we can choose is to calculate the percentage of all trains that ran late and check whether it exceeds 50%. We can relay the final information to our VP so she can decide what to do next.
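Here is a minimal sketch of that calculation. The arrival records are hypothetical, and the one-minute threshold and the "more than 50%" rule come from the VP's definition above.

```python
from datetime import datetime, timedelta

# Hypothetical arrival records: (expected arrival, actual arrival).
arrivals = [
    (datetime(2024, 5, 1, 8, 0), datetime(2024, 5, 1, 8, 0, 30)),
    (datetime(2024, 5, 1, 8, 30), datetime(2024, 5, 1, 8, 34)),
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 2)),
    (datetime(2024, 5, 1, 9, 30), datetime(2024, 5, 1, 9, 29)),
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 5)),
]

# The VP's definition: a train is late if it arrives over a minute late.
LATE_THRESHOLD = timedelta(minutes=1)

# Binary flag: 1 if the train was late, 0 otherwise.
late_flags = [
    int(actual - expected > LATE_THRESHOLD) for expected, actual in arrivals
]

# Describe the variable: percentage of trains that ran late.
pct_late = 100 * sum(late_flags) / len(late_flags)
most_trains_late = pct_late > 50

print(f"late flags: {late_flags}")
print(f"percent late: {pct_late}")
print(f"do the trains run late? {most_trains_late}")
```

The same `pct_late` value could also serve as the KPI the VP wants to monitor, recalculated for each reporting period.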
This strategy would differ significantly if the intent of the question or the audience were different. If our decision-maker wanted to understand the distribution of train lateness, we would instead use the difference between actual and expected arrival times as our variable and choose a visual technique like a histogram to communicate that distribution.
A Few Final Notes
You are welcome to use the strategy described above in the way that suits you, but here are a few tips for making it work for you:
- Keep things simple and work up toward complexity as necessary.
- The strategic process should come intuitively but it’s never a bad idea to write down the intent, variables, and goal so that you have clarity around the task or develop discipline in your approach.
- Be flexible – your strategy may change or even evolve over time. This document is a good start but don’t let this limit your creativity and thinking.
- Don’t forget to analyze! Some questions are not as intuitive as others and require us to think and perform analysis to understand and find the best answer.
Thanks for reading! In my next post, I will do a deep dive into diagnostic questions, so stay tuned and let me know your thoughts in the comments!
Sources: