The Design of Statistical Graphics

A glimpse into the world of informative graphical beauties

Aman Saxena
Towards Data Science

Today we will learn about the human brain: the occipital lobe is one of the four major lobes of the cerebral cortex in the mammalian brain.

No?
Why are we learning biology in an article that is supposed to be about “Statistical Graphics”?

Well, the occipital lobe is the visual processing center: the part of the cerebral cortex that processes visual information and helps us interpret it. It is indirectly what we are talking about when we discuss depicting complex ideas, unusual fluctuations, upward and downward trends, undulating structures and so on with clarity, precision and orderly coherence.
Those are exactly the qualities that graphical structures, or statistical graphs as we call them, should have.

A graphical display of data gives us a clear picture, presents many ideas and numbers in a small space, helps us avoid distorting what the data has to say, and ultimately serves a reasonably clear purpose: description, exploration, tabulation or decoration.

Statistical graphics, just like calculations, are only as good as what goes into them. A badly defined graph, an ill-fitting model or an undernourished data set cannot be rescued by a graphic, no matter how fancy or attractive it may be.

We need to understand, learn and adopt the practice of graphical excellence. In other words, the motivation should be to use the correct graph in the correct place at the correct time, so that quantitative ideas are communicated flawlessly.

To convey that idea, to understand the art of achieving graphical excellence and, in turn, to appreciate how beautifully these graphs can represent data, we will look at the following fundamental graphical designs and see just how good and fascinating they can be:

1. Data Maps
2. Time Series
3. Space Time Narrative Designs
4. Relational Graphics

Data Maps

Data maps combine cartographic representation with statistical skill, and they are widely used in today’s visualizations.

To understand data maps better, we will take two examples and analyze each briefly. Each of these maps portrays a huge amount of data that can only be taken in with the help of a picture. Carrying huge volumes of data in a compact space is what data maps aim to do, and they do so successfully. Furthermore, as we will see, these maps make visual analysis much easier than it could ever be with a raw data set of that size.

For the first example, this map represents death rates due to stroke among adults aged 35 and over in the United States of America from 2000 to 2006.
The color scheme represents the age-adjusted average annual death rate per 100,000 people.

Source: www.cdc.gov

Now, we can clearly make out that the rate of deaths by stroke is extremely high in the south and southeast of America. The states of Alabama, Mississippi, Louisiana, Arkansas, Tennessee, Texas, South Carolina and North Carolina are the worst affected.
States like Wyoming in the west, Arizona and New Mexico in the southwest, and New York, Vermont, New Hampshire, Rhode Island and Connecticut in the northeast are the least affected.
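
To make the phrase "age-adjusted rate per 100,000" a little more concrete, here is a minimal Python sketch. It assumes direct standardization, where each age group's death rate is weighted by that group's share of a fixed standard population; this article does not say exactly how the map's source computed its figures. The age groups, counts and weights below are all invented for illustration.

```python
# Hypothetical age groups: (deaths per year, population, standard-population weight).
# Every number here is invented for illustration; the weights must sum to 1.
age_groups = {
    "35-54": (50, 100_000, 0.6),
    "55-74": (200, 50_000, 0.3),
    "75+": (750, 25_000, 0.1),
}

PER = 100_000  # rates are reported per 100,000 people


def crude_rate(groups):
    """Total deaths divided by total population, per 100,000."""
    deaths = sum(d for d, _, _ in groups.values())
    population = sum(p for _, p, _ in groups.values())
    return deaths / population * PER


def age_adjusted_rate(groups):
    """Weight each group's own rate by its share of the standard population."""
    return sum(d / p * w for d, p, w in groups.values()) * PER


crude = crude_rate(age_groups)
adjusted = age_adjusted_rate(age_groups)
print(f"crude rate:        {crude:.1f} per 100,000")
print(f"age-adjusted rate: {adjusted:.1f} per 100,000")
```

The two numbers differ because the made-up region has proportionally more elderly residents than the assumed standard population. Adjusting for age lets regions with different age structures be compared on the same color scale.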

Let us now look at another example: the division of America according to the fan bases of the 32 NFL teams.

Source: hermanhissjewelers.com

With a simple visual analysis of this map we can tell which areas mostly follow which team. In terms of geographic reach, the Cowboys really are America’s Team. Their fan empire stretches from California to central Florida to southern Virginia to northern Montana. The Steelers have another large fan empire that claims almost all of Pennsylvania and stretches across Appalachia and both Carolinas.
The Texans, by contrast, have very little reach: their fan base only covers the Houston metro area. The rest of Texas goes for the Cowboys, and western Louisiana goes for the Saints. As for the Panthers, their fan base covers the core of western North Carolina around Charlotte, but it is surrounded by Steelers fans on all sides.

The point of this quick analysis was to show how quickly and accurately a data map directs our attention towards the points or patterns in a data set that matter more than others, patterns that would not be visible otherwise.

Despite all the flaws data maps can have, no other method is as powerful for displaying statistical information.

Time Series

Time series methods deal with data points taken over time that have some internal structure, such as correlation, seasonal variation or a trend, with one dimension marching along to the rhythm of seconds, minutes, hours, days, weeks, months, quarters or years. The strength and efficiency of this graphical design come from the natural ordering of the time scale, which is used to make predictions or simply to depict a trend over time.
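
As a small illustration of trend and seasonal variation, here is a hedged Python sketch. It builds a hypothetical quarterly series from a straight-line trend plus a repeating seasonal pattern, then uses a simple moving average over one full period to smooth the seasonality away. The series, the seasonal pattern and the function name `moving_average` are all my own illustrative choices.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical quarterly data: a linear trend plus a seasonal pattern
# that repeats every 4 points and sums to zero over one period.
seasonal = [3, -1, -2, 0]
series = [10 + 2 * t + seasonal[t % 4] for t in range(12)]

# Averaging over exactly one seasonal period cancels the seasonal part.
smoothed = moving_average(series, window=4)

# Step between consecutive smoothed values: the slope of the trend.
steps = [b - a for a, b in zip(smoothed, smoothed[1:])]

print("raw series:", series)
print("smoothed:  ", smoothed)
print("steps:     ", steps)
```

The raw series zigzags, while the smoothed one rises by the same amount at every step, which is the underlying trend. Real data is rarely this clean, so treat this as a sketch of the idea and not as a forecasting method.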

For instance, let us look at a time series graphic that shows New York City’s weather summary for 1980. This graph has 1,888 data points, and the daily high and low temperatures are shown in relation to a long-term average. The path of the normal temperature also provides a forecast of expected change over the year. For example, in the middle of February the people of New York can look forward to warming at the rate of about 1.5 degrees per week all the way to July, the yearly peak. This distinguished graphic successfully organizes a large collection of numbers, makes comparisons between different parts of the data, and tells a story.

Source: The Visual Display of Quantitative Information

However, the passage of time alone is rarely a good explanatory variable. There are occasional exceptions, when the data provides us with a clear mechanism that drives the Y variable. Otherwise we need another approach: smuggling additional variables into the graphic design to give a causal explanation.

Space Time Narrative Design

Let me introduce this as an effective enhancement of the time series display. A space-time design adds a spatial dimension to the time series graphic and moves the data over space in multiple dimensions, without the viewer even realizing that the graphic is multidimensional.

We will try to understand this with the help of an example: the classic work of the French engineer Charles Joseph Minard (1781–1870), who shows the terrible fate of Napoleon’s army in Russia in his brutal and eloquent design.

Source: Google Images

What we see here is an astounding combination of data map and time series, portraying the sequence of losses suffered in Napoleon’s Russian campaign of 1812. Six variables are plotted: the size of the army, its location on a two-dimensional surface (which counts as two variables), the direction of the army’s movement (the brown and black bands), and the temperature on various dates during the retreat from Moscow.

At the left we can see the thick flow line that represents the size of the Grand Army (422,000 men) as it invaded Russia. The width of this line indicates the size of the army at each place on the map. By the time the army reached Moscow in September, its numbers were devastatingly smaller, with only about 100,000 men.

The path of Napoleon’s retreat is shown by the black line, which is linked to the temperature scale and dates at the bottom of the chart. On the march out of Russia many soldiers froze to death, which is why the black line visibly thins.

Towards the end of the black line we can see a division. That division portrays the crossing of the Berezina river. It was a complete disaster, as we can see from how suddenly the black line thins. The army finally struggled back into Poland with only 10,000 men remaining.
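
The three army sizes quoted above are enough to work out how much of the army survived each stage, and how wide a flow line would be if its width were proportional to army size. Here is a minimal Python sketch. The stage labels and the 20-unit maximum width are my own illustrative choices and are not taken from Minard's chart.

```python
# Army sizes as quoted in the text above.
army_size = {
    "invasion": 422_000,
    "Moscow": 100_000,
    "return to Poland": 10_000,
}

MAX_WIDTH = 20.0  # hypothetical drawing width for the full-strength army

start = army_size["invasion"]
summary = {}
for stage, men in army_size.items():
    share = men / start          # fraction of the original army remaining
    width = MAX_WIDTH * share    # line width proportional to army size
    summary[stage] = (round(share * 100, 1), round(width, 2))
    print(f"{stage:>17}: {men:>7,} men, {share:.1%} remaining, width {width:.2f}")
```

On these figures, a little under a quarter of the army reached Moscow and about 2.4% made it back, which is why the line shrinks to almost nothing by the end.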

Minard’s graphic tells a rich and coherent story with its multivariate data, far more enlightening than a set of numbers or line charts showing how the size of the army changed over time.

Relational Graphics

The birth of data graphics required replacing the latitude-longitude coordinates of the map with more abstract measures. The transition was a big move, and it took time and effort for the various kinds of charts to emerge. Analogies to the physical world led to the development of the early time series. Physical analogies continued to drive graphical design for quite a while, until they were finally left behind by the early 1800s with the help of great works by Lambert and Playfair. This meant that any variable quantity could be placed in relationship to any other variable quantity, provided both were measured on the same units of observation. Once this idea was realized and accepted, data graphics were no longer tied to geographic or time coordinates. Being relational, they became relevant to all quantitative inquiry.

Today a lot of published graphics are of the relational form: pie charts, line graphs, bar graphs, scatter plots (the greatest of all designs) and so on.

Here, for example, is a chart of the waiting time between eruptions against the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. This chart suggests there are generally two types of eruptions: short-wait-short-duration, and long-wait-long-duration.

Source: Wikipedia
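
The two groups that the scatter plot suggests can also be sketched numerically. The Python snippet below splits a handful of made-up (duration, waiting time) pairs at an assumed duration cut-off and compares the average wait in each group. The data points and the 3-minute threshold are illustrative assumptions, not the real Old Faithful measurements.

```python
from statistics import mean

# Made-up (eruption duration, waiting time) pairs, both in minutes.
# These are NOT real Old Faithful measurements.
eruptions = [
    (1.8, 54), (4.5, 80), (2.0, 51),
    (4.3, 79), (1.9, 56), (4.7, 84),
]

THRESHOLD = 3.0  # assumed cut-off between "short" and "long" eruptions

# Collect the waiting times on each side of the cut-off.
short_waits = [wait for duration, wait in eruptions if duration < THRESHOLD]
long_waits = [wait for duration, wait in eruptions if duration >= THRESHOLD]

mean_short_wait = mean(short_waits)
mean_long_wait = mean(long_waits)

print(f"short eruptions: {len(short_waits)}, mean wait {mean_short_wait:.1f} min")
print(f"long eruptions:  {len(long_waits)}, mean wait {mean_long_wait:.1f} min")
```

A fixed threshold is the crudest possible way to separate the groups. The point is only that a pattern the eye picks up instantly from a scatter plot takes explicit assumptions to reproduce in numbers.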

In another example we have the relationship between temperature and the thermal conductivity of copper. The chart shows how thermal conductivity varies with temperature from 4 K to about 300 K. The curves represent different grades of pure copper, rated by Residual Resistivity Ratio (RRR), with RRR = 2000 being the highest purity.

We observe that the thermal conductivity of copper increases as the temperature falls, down to about 15–30 K. We can also see that the thermal conductivity is relatively flat from 300 K down to 200 K, with only a slight increase.

So, to sum up, what is graphical excellence?
It is whatever gives the viewer the greatest number of ideas in the shortest time and the smallest space. It also requires telling the truth about the data.

“As to the propriety and justness of representing sums of money, and time, by parts of space, tho’ very readily agreed by most men, yet a few seem to apprehend that there may possibly be some deception in it, of which they are not aware ….”
- William Playfair, The Commercial and Political Atlas (1786)