How to properly tell a story with data — and common pitfalls to avoid.

Outlier AI
Towards Data Science
8 min read · May 29, 2017

One of the most important, yet least appreciated, aspects of working with data is how you communicate what the data tells you. Even if you follow all of our advice here on the Data Driven Daily, communicating your insights poorly or misleading your audience can be worse than not using data at all.

Some of the worst business decisions in history were data driven! When the rise of Pepsi was threatening Coke’s dominance in the soft drink category, Coke took a data driven approach to the problem. They formulated New Coke and found in taste tests across the country that it beat Pepsi hands down. Armed with conclusive data, the company bet their future on New Coke.

Unfortunately, part of the story of New Coke was missing. New Coke was extremely sweet, so it would easily win a taste test where the customer drank a single sip, but it was distasteful in a full cup or bottle. Additionally, the testers failed to measure the emotional attachment customers had to the old flavor of Coke, which they had grown up with as children. Without that context for the test data, the fact that customers did not like New Coke was lost on the decision makers. After a disastrous launch the company had to desperately pivot back to their original formula. [1]

In this article I’ll show how storytelling can lead people to view the same data in different ways. We will tell three different stories about a single set of data, each implying a different conclusion. Here is the data we will use, which is monthly revenue data for a company with four products:

There are no obvious conclusions to draw from the raw data, so we’ll need to do our work as data experts to explore the data and tell the stories it is hiding. What stories will we tell about this data? Specifically, we will cover:

  • Part 1 — Stories that Lie
  • Part 2 — Stories that Mislead
  • Part 3 — Stories that Inform
  • Part 4 — General Storytelling

Outlier monitors your business data and notifies you when unexpected changes occur. We help Marketing/Growth & Product teams drive more value from their business data. Schedule a demo today.

- Outlier was the Strata+Hadoop World 2017 Audience Award Winner.

Part 1. Stories that Lie

First, I’m going to tell you a story about our data (provided above) that sounds like great news, but only because I am purposefully lying.

Revenue is growing quickly, with December jumping significantly over November. You can see this in our revenue chart below:

This growth was led by a 1,150% jump in revenue for Product D! If the trend continues we would expect overall revenue to exceed $100,000 in January for the first time. This is an exciting time for the business, and this growth is a direct result of the strategic decisions I made in June of last year.

Why is this Story Good?

It’s not, it’s a lie.

Why is this Story Bad?

I’ve used some very common tricks in data storytelling to get you to read more into the data than is really there. Specifically:

  • Manipulating Scale. When looking at a chart, you are used to comparing the size of bars to determine relative size. By having the y-axis start at an arbitrary value, instead of zero, I changed the relative size to make it seem like there is significant growth where there isn’t.
  • Cherry picking. I’ve selected only a few data points that make my case (growth rate of Product D), instead of considering the full breadth of the data and what it really conveys.
  • Jumping to conclusions. I’ve jumped to conclusions based on the data despite there being no clear logical basis. Why would the decisions I made last June have affected revenue in December?
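To make the scale trick concrete, here is a minimal Python sketch of how far a truncated y-axis can distort a comparison between two bars. The revenue figures and the `apparent_ratio` helper are hypothetical, chosen only for illustration; they are not the article's data.

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights for b versus a when the y-axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (b - baseline) / (a - baseline)


# Hypothetical monthly revenue figures (assumed for illustration only).
november = 90_000
december = 93_000

true_ratio = apparent_ratio(november, december)            # y-axis starts at zero
skewed_ratio = apparent_ratio(november, december, 89_000)  # y-axis starts at 89,000

print(f"Zero baseline:   December bar is {true_ratio:.2f}x the November bar")
print(f"89,000 baseline: December bar is {skewed_ratio:.2f}x the November bar")
```

With these assumed numbers, a roughly 3% increase is drawn as a bar four times as tall once the axis starts just below the smaller value.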

In this case I was being malicious and purposefully lying.

Part 2. Stories that Mislead

Next, I’m going to tell you a story about our data that is true but misleading:

Overall, revenue for all products has been fairly even. There is a notable exception of a big jump in revenue for Product D in December, which jumped over 1,000% month over month.

However, a single product’s revenue dominates all others — Product C comprises more than 50% of the total revenue. This means that the revenue from Product C is larger than all three other products combined.

Why is this Story Good?

This story is true, which is a good start. By using a relative scale in the first chart, the significant deviation of Product D becomes very clear. The story attempts to break down the overall trend with some insight into the composition of products, but falls short.

Why is this Story Bad?

It is very misleading. I’ve made some common mistakes in communicating the data which make it easy for you, as the audience, to reach inaccurate conclusions.

  • Lack of Clarity. I haven’t labeled the axes of my charts or explained them in my discussion! In the first chart the y-axis is growth rate, while the pie chart shows revenue in December. However, since there is no way for you to know that, you can jump to all sorts of different conclusions.
  • Confusing Charts. The two charts switch which colors represent which products. Yellow represents Product D in the first chart and Product C in the second chart! If you read this quickly, you might think the fast growing product is the largest, which is not true.
  • No Context. I mention that Product D grew by over 1,000%, but not what the absolute value is! For example, 1,000% growth from $1 is $11, a change of $10, but 1,000% growth from $100 is $1,100, a change of $1,000. Avoid using relative measurements without providing the proper context.
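The arithmetic behind the "No Context" point can be sketched in a few lines of Python. The `describe_growth` helper is a hypothetical name introduced here; it simply reports a percentage change together with the absolute values it was computed from, using the two examples from the bullet above.

```python
def describe_growth(before: float, after: float) -> str:
    """Report relative growth alongside the absolute values that give it context."""
    if before <= 0:
        raise ValueError("percentage growth is undefined for a non-positive starting value")
    change = after - before
    pct = change / before * 100
    return (f"{pct:,.0f}% growth "
            f"(${before:,.0f} -> ${after:,.0f}, a change of ${change:,.0f})")


# The same 1,000% growth describes very different absolute changes.
small = describe_growth(1, 11)
large = describe_growth(100, 1_100)

print(small)
print(large)
```

Reporting both numbers in the same sentence costs little and removes the ambiguity that the percentage alone leaves open.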

Finally, I used a pie chart, which many people in data visualization consider a big mistake.

Part 3. Stories that Inform

Let’s finally tell the true story of our data, and in doing so highlight how bad these past two stories really were.

Overall, revenue is trending downwards over the past year (dotted line). There is a clear peak in the summer and dips in the spring and fall, and based on some prior years of data the expected range (green bar) reinforces this as a seasonal trend. Below is a chart of overall revenue that shows this trend:

As you can see, April and May were slightly anomalous months in this trend, but overall the pattern of revenue is very consistent.

Hidden under this overall trend is a shift in revenue composition. Product A has been declining over the course of the year, while Product C has been increasing. In December, a significant jump in Product C revenue resulted in the first month where Product C revenue was higher than Product A.

If this trend continues, we can expect revenue to increase into the following year.

Why is this Story Good?

This story follows a lot of the best practices of data storytelling:

  • Start with the big picture. Frame all of your data communication within a bigger picture. In this case, the story first states that overall revenue is trending downward and looks seasonal.
  • Show context. By highlighting the expected range and linear trend of the overall revenue in the first chart, it is easier to grasp the larger pattern hidden in the data. These visual guides are important context for interpreting the raw data.
  • Highlight important drivers. By identifying the two important drivers of overall revenue (Products A and C) and describing their relationship, an insight that was otherwise hidden by the top chart becomes clear. This is different from cherry picking data because we are highlighting meaningful and important drivers of the data.
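The article does not say how its trend line or expected range were computed, so the following is only one plausible sketch: an ordinary least-squares line for the trend, and a band of the per-period mean plus or minus two standard deviations across prior years for the expected range. All figures and the helper names `linear_trend` and `expected_range` are hypothetical, and the series is shortened to four periods to keep the example small.

```python
from statistics import mean, stdev


def linear_trend(values):
    """Least-squares slope and intercept of values against their index 0..n-1."""
    xs = list(range(len(values)))
    x_bar, y_bar = mean(xs), mean(values)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def expected_range(prior_years, k=2.0):
    """Per-period (low, high) band: mean +/- k standard deviations across prior years."""
    band = []
    for period_values in zip(*prior_years):
        m, s = mean(period_values), stdev(period_values)
        band.append((m - k * s, m + k * s))
    return band


# Hypothetical revenue (in thousands) for four periods; not the article's data.
current = [100, 98, 96, 94]
prior = [
    [98, 97, 95, 93],
    [100, 99, 96, 95],
    [102, 101, 97, 97],
]

slope, intercept = linear_trend(current)
band = expected_range(prior)

# Periods where the current value falls outside the expected band.
anomalies = [i for i, (v, (lo, hi)) in enumerate(zip(current, band)) if not lo <= v <= hi]

print(f"trend: {slope:+.1f} per period, starting near {intercept:.0f}")
print("anomalous periods:", anomalies)
```

The choice of two standard deviations is an assumption; a narrower or wider band changes which periods get flagged, so it is worth stating the rule alongside the chart.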

Most importantly, in this case I let the data tell the story instead of trying to tell a story with the data. This is an important point because if you try to fit your data to an existing story you are going to fail to tell the entire story.

Why is this Story Bad?

It’s not, although I could have gone into more detail. Overall I’m pretty happy with it.

Like most honest and straightforward data stories, the conclusion is not nearly as clear as when we were lying or misleading. This is because real world data rarely gives us a single, clear message. You will need to rely on your expertise and your audience’s judgment to reach conclusions.

Part 4. General Storytelling

We’ve reviewed good stories and bad stories, all about the same data. Since you will be telling stories about your own data, let’s review the best practices around communicating your own data insights.

Do…

  • Start with the big picture. Frame all of your data communication within a bigger picture. This makes everything easier for your audience to understand and gives you a strong start to your story.
  • Show context. The more context you provide, both visual and verbal, the less likely your audience will be to jump to mistaken conclusions.
  • Highlight hidden insights. Many of the most important insights in your data will be hidden below the surface. Highlight these and contrast them with the overall metrics to make a powerful story.

Don’t…

  • Manipulate Scale. Be clear about the scale of data and the units, and be wary of charts with multiple axes. It is better to overcommunicate context about your charts than to have the audience misunderstand.
  • Cherry pick data. Include the full breadth of data in your communications, not just the data points that help make your case.
  • Be inconsistent. Use the same colors, labels and conventions across all the statements and visualizations in your story. Doing so creates a natural language for your data, one that your audience can learn.
  • Lie. Seriously, just don’t.

Overall, let the data tell the story. Avoid trying to use data to justify an existing decision or tell an existing story, as those paths are bound to lead to mistakes and errors. If you let the data lead you, the story should tell itself.

[1] Snopes has a great breakdown of the New Coke disaster if you’d like to read more.
