Photo by Caleb Woods on Unsplash

What are the Ingredients of a Terrible Data Story?

How ignoring the grammar of graphics can destroy even legendary data visualizations

--

“Lions are mammals belonging to the cat family” “eating is the ingestion of food to provide an organism with energy” “giraffes are the tallest living animals”.

Does the above sentence make any sense? To describe the eating habits of lions, I forced myself to use preset phrases or sentences instead of composing one the conventional way. The same idea, when composed naturally, would come out as:

“Lions eat giraffes”

Now that’s concise and makes a lot more sense. We are so used to this way of composing sentences in any language that we take it for granted in our day-to-day communication. But, why should data consumption be any different?

Why do we artificially impose constraints by communicating through monolithic charts? While presenting analytical insights, data science practitioners think in terms of templatized dashboards and readymade charts. Sales trends become bar charts in a ‘KPI section’ at the top, while product mix is a pie chart below, and so on.

This is just the same as forcing communication using preset sentences, rather than stringing together the right set of words, to convey an elegant message. Unfortunately, this practice is way too common today. Every time this happens, an information designer’s heart bleeds.

Not convinced yet?

Just in case you’re still wondering, let's try an interesting experiment. We’ll pick an elegant visual, widely considered to be an all-time best. We’ll then find out what would happen if we use a templatized approach to presenting the very same data.

Let’s subject ‘Napoleon’s Russian Campaign’ created by Charles Minard to this test. This is a classic, timeless visualization, which Edward Tufte said may well be the best statistical graphic ever drawn. This graphic, hand-drawn in 1869, was composed by freely letting the data tell its own story, just the right way. With this trial, I hope Minard doesn’t turn in his grave!

I first picked the data provided below. This was recreated by Leland Wilkinson from the visual and is published in his book. There are three datasets below: the first one shows the cities that Napoleon’s army marched through, along with their lat-long.

The second set shows the longitude of places and their temperatures on the quoted dates. This highlights the harsh winter during the army’s retreat. The third dataset shows the soldier count at each lat-long, their direction of march (forward or retreat) and whether it was the main group or a splintered one.

Now, let's assume that this data was emailed to us from the campaign frontline. Our manager then tasked us with creating a visualization overnight. An immediate impulse is to feed this into the corporate dashboarding assembly line. Data goes in from one end on a conveyor belt, and robot-arms use preset molds of dashboard layouts and readymade charts to force-fit it. And out comes a shiny, interactive, colorful piece-of-junk from the other end. Well, almost.

I tried applying a crude templatization approach like the above, to “Napoleon’s March”. Let’s see how the same data loses all elegance and ends up with a corporate-style dashboard using pre-built charts. Here is the final result:

How we killed Charles Minard’s storytelling using the modern dashboarding assembly line

We’ve managed to destroy the narrative, and the legendary data story has been reduced to a petty, ineffective dashboard. I used Tableau Public to put this together, but as a strong disclaimer, the issue is NOT with the tool. Tableau is a great tool and so are many others in the market. The issue is always with the methodology adopted and a creator’s unimaginative treatment.

A syntax for graphics?

Yes, graphics do have a syntax. It is possible to pick the right set of underlying elements to compose elegant visuals. We need not fumble with rigid, pre-built charts. When we embrace these key entities of information design, it endows us with the power to construct any visual.

Leaning on the excellent foundation established by Leland Wilkinson in his book, The Grammar of Graphics, we will understand the fluid construction of elegant graphics. Using a simple example, we will see how to build a superior visual with data elements, layer by layer. We’ll also prove that not all charts need to have standard names.

Grammar makes language expressive. A language that has words and no grammar expresses only as many ideas as there are words. — Leland Wilkinson

What works for English grammar?

For quick context, let's look at how we intuitively construct sentences in the English language. John’s actions on the playground, as in “John hit the ball”, are communicated by bringing in the various parts of speech and stringing them together.

Any simple rewording can totally alter the structure. For instance, if we swapped the last two words with the first, the sentence turns into “The ball hit John”. A cosmetic change, but the result is not quite the same anymore!

Introducing the Grammar of Graphics

To make graphics or visual representations expressive, one must understand their underlying syntactical structure as well. The grammar of graphics provides a standard set of guidelines on converting data into effective visualizations that tell their story.

Let's assume we have the following data to be presented. It shows the sales performance across 6 cities in the US.

There are 7 components, or grammatical elements, in the grammar of graphics. Let’s look at each, starting from the underlying ones and moving upwards. This concept is best illustrated with examples. We’ll use ggplot2, a high-level charting package for R, which was also inspired by the same book.

If you don’t understand code, don’t worry. The snippets of code shown below are only for illustrative purposes. One doesn’t need to know programming to follow. Just glance at the tags and see how the visual changes when each word is incrementally added. This needs no more than plain English understanding.

Components 1–2–3: Data — Aesthetics — Geometries

Data is the base component, with the elements to be plotted. Aesthetics provides the axes and encoding elements for data, while Geometries holds the shapes that can be used to represent the data.

Here is a simple command to plot the sales against price for each of the cities, using the 3 components shown above. Note how each is explicitly called out: data is mapped to the input data frame, aesthetics associates the columns with the x-y axes, and geometry asks for the shapes to be shown as points.

ggplot(data, aes(x=Price, y=Sales)) + geom_point()

No, this is not a syntax to create a scatter plot. We’ll see how one can play with these 3 components by encoding more elements. Let’s now color the points by the region that the city belongs to (left plot). Then, we differentiate the cities by showing the sales volume as the point’s size (right plot). Note that there are just 2 additions to the command: color=Region and size=Volume.

ggplot(data, aes(x=Price, y=Sales, color=Region, size=Volume)) + geom_point()
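To try the commands above, a data frame named data is needed. The sketch below is only a stand-in: the column names come from the commands in this article, but every city name and number is invented for illustration and is not the article's actual table.

```r
library(ggplot2)

# Hypothetical stand-in for the sales table: all city names and values
# below are invented for illustration only.
data <- data.frame(
  City   = c("City A", "City B", "City C", "City D", "City E", "City F"),
  Region = c("East", "East", "West", "West", "South", "South"),
  Price  = c(10, 10, 20, 20, 30, 30),
  Sales  = c(100, 140, 90, 110, 60, 80),
  Volume = c(500, 700, 450, 550, 300, 400)
)

# Data + aesthetics + geometry, as in the first command.
p_basic <- ggplot(data, aes(x = Price, y = Sales)) +
  geom_point()

# The same plot with two extra aesthetic mappings:
# point colour encodes Region, point size encodes Volume.
p_encoded <- ggplot(data, aes(x = Price, y = Sales, color = Region, size = Volume)) +
  geom_point()
```

Typing p_basic or p_encoded at the R console draws the corresponding plot. Nothing about the data or the geometry changed between the two; only the aesthetic mapping grew.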

Component 4: Facets

We now add a fourth component called ‘Facets’. As the name implies, this splits the data into subsets and draws each one in its own subplot. At times it is helpful to split and compare plots side by side, to highlight the differences more clearly.

To the same command from above, we ask for the visual to be split apart based on ‘regions’, rather than showing everything in a single chart. This needs one simple addition as shown below:

ggplot(data, aes(x=Price, y=Sales, color=Region, size=Volume)) + geom_point() + facet_wrap(~Region)

Component 5: Statistics

The fifth component is ‘Statistics’, which provides a way to introduce statistical models and summaries such as mean, median, distributions. It's often useful to show the underlying statistics.

Let’s say we wanted to compute the average sales at each of the price points. We can add this dynamically by including just one more element in the command. This causes cities that fall in the same price bin to be aggregated.

ggplot(data, aes(x=Price, y=Sales)) + stat_summary_bin(fun = "mean", geom = "bar")
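What the statistics component computes can be checked by hand in base R. The sketch below uses invented numbers (not the article's data) and groups by exact price, which is a simplification: stat_summary_bin first groups prices into bins and then averages within each bin.

```r
# Hypothetical data: all values are invented for illustration only.
data <- data.frame(
  Price = c(10, 10, 20, 20, 30, 30),
  Sales = c(100, 140, 90, 110, 60, 80)
)

# Average Sales for each distinct Price, using base R's formula interface.
avg_sales <- aggregate(Sales ~ Price, data = data, FUN = mean)
```

The result, avg_sales, has one row per price point, which is the table the bars would be drawn from.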

Component 6: Coordinates

We may have a need to change the coordinate system for plotting. Default cartesian coordinates or x-y plots shown above can be modified. We can change this to polar coordinates, which is the base for the (un)popular pie or donut charts.

One single addition to the command with an intuitive naming transforms the entire visual without having to modify any of the base components. Though this visual is not very appropriate for our data, this shows how it can be done. Is the below a variant of ‘spider or radar chart’ or ‘bubble on circular plot’? We’re already inventing representations!

ggplot(data, aes(x=Price, y=Sales, color=Region, size=Volume)) + geom_point() + facet_wrap(~Region) + coord_polar()
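The claim that polar coordinates underlie pie charts can be seen directly: a single stacked bar, redrawn in polar coordinates, becomes a pie. This is a minimal sketch with invented numbers, not the article's data.

```r
library(ggplot2)

# Hypothetical data: values are invented for illustration only.
data <- data.frame(
  Region = c("East", "West", "South"),
  Sales  = c(240, 200, 140)
)

# One stacked bar: every region sits at the same x position.
stacked_bar <- ggplot(data, aes(x = "", y = Sales, fill = Region)) +
  geom_col(width = 1)

# Same data, aesthetics and geometry; only the coordinate system changes.
# Mapping y to the angle (theta) turns the stacked bar into a pie.
pie <- stacked_bar + coord_polar(theta = "y")
```

No ‘pie chart’ function was called here. The pie is just a bar geometry viewed through a different coordinate system.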

Component 7: Theme

The final component in the grammar is ‘Theme’ which can be used for any non-data ink. Examples include chart or axes title, labels, background-color schemes and the like. This layer is where stories can be annotated by blending in non-data ink along with the data-ink.

As with other components, adding a single call, theme_bw(), transforms the default gray background seen earlier into a black-on-white theme. Equally easy ways exist to add titles, labels, margins or lines.

ggplot(data, aes(x=Price, y=Sales, color=Region, size=Volume)) + geom_point() + theme_bw()
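As one possible illustration of adding titles and labels, the sketch below uses labs() for the text and theme() for a layout tweak. The data values and the title wording are invented for illustration.

```r
library(ggplot2)

# Hypothetical data: all values are invented for illustration only.
data <- data.frame(
  Region = c("East", "East", "West", "West", "South", "South"),
  Price  = c(10, 10, 20, 20, 30, 30),
  Sales  = c(100, 140, 90, 110, 60, 80),
  Volume = c(500, 700, 450, 550, 300, 400)
)

p_themed <- ggplot(data, aes(x = Price, y = Sales, color = Region, size = Volume)) +
  geom_point() +
  # Non-data ink: title and axis labels.
  labs(title = "Sales versus price by city", x = "Price", y = "Sales") +
  # Complete black-on-white theme first...
  theme_bw() +
  # ...then individual tweaks, so theme_bw() does not overwrite them.
  theme(legend.position = "bottom")
```

The order matters in one place: a complete theme such as theme_bw() resets earlier theme() settings, so fine-tuning goes after it.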

Thus, we’ve seen how a syntax for graphics can help seamlessly compose data onto the most appropriate elements. These components come together to form a layer, and a plot may have multiple layers woven together.
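A plot with more than one layer might look like the sketch below, where points and a fitted straight line share the same data and axes. The numbers are invented, and the choice of a linear trend line is only an example of a second layer.

```r
library(ggplot2)

# Hypothetical data: all values are invented for illustration only.
data <- data.frame(
  Region = c("East", "East", "West", "West", "South", "South"),
  Price  = c(10, 10, 20, 20, 30, 30),
  Sales  = c(100, 140, 90, 110, 60, 80)
)

p_layers <- ggplot(data, aes(x = Price, y = Sales)) +
  # Layer 1: one point per city, coloured by region.
  geom_point(aes(color = Region)) +
  # Layer 2: a straight-line fit over all cities (no confidence band).
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Each geom adds its own layer on top of the shared data and aesthetics, so further layers can be stacked the same way.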

If the intent is to compare sales of two products, plot them as the length aesthetic of the bar-shaped geometry. If you want to see how the growth of these products varies, bring this in as the width aesthetic. No, please don’t think bar charts yet! Instead, visualize how you can make the data shine.

Want to see the margins of these products as well? Encode them as the color aesthetic. Want to compare the products across companies? Facet the plot to split the view side-by-side, and compare the two companies easily. Before you share it with your users, add the necessary text using the theme component.

The issue with thinking up charts is that, as requirements are added, the thought process is invariably stalled. A mind working only with rigid charts soon runs dry of versatile representations.

Summary

Grammar of Graphics: A layered approach to elegant visuals

We’ve looked at the fundamental building blocks for a flexible presentation of data. The real power of this concept lies in uncaging your data from the confines of monolithic charts. Set them free and let them tell their own expressive story.

Though many visualization tools today don’t adopt a grammar of graphics approach in its entirety, that seems to be the way forward. Meanwhile, there are opportunities for people to start putting this into practice. This is so important that it should be mandatory education for anyone working with data, whether they are data translators, analysts, designers, data scientists or journalists.

What’s your experience working with a fixed set of charts?


Article References:

  1. Book titled “Grammar of Graphics” by Leland Wilkinson
  2. Book titled “Semiology of Graphics: Diagrams, networks, maps” by Jacques Bertin
  3. The Wilke Lab classes on ggplot2: Lecture 3 - Spring 2017
  4. ggplot2 package for graphics in R
  5. Bret Victor’s Stanford HCI Seminar titled ‘Drawing Dynamic Visualisations’ on designing a system that uses data-driven visualizations.

--


Co-founder & Chief Decision Scientist @Gramener | TEDx Speaker | Contributor to Forbes, Entrepreneur | gkesari.com