
The Most Amazing Chart!

Why density charts make everything clearer

I wrote in a previous post that any attempt to interpret large amounts of data by inspection is a doomed effort. Ideally, we would explore the entire data set and comprehend every single data point, but that is of course impossible. So we fall back on summary descriptors (averages, percentiles, variance, etc.) or charts to synthesize and make sense of large data sets. I argued that this is always an imperfect endeavor, and often a misleading one in the case of averages. It is like trying to condense a great literary work like War and Peace into a 300-word summary: it may give someone the illusion of having read the book, but it is only an illusion.

I – The density chart

There is, however, one chart that comes close to looking at the entire data set, almost like reading the whole novel instead of the summary: the density chart. A density chart plots ALL the data points, so no information is lost to aggregation. While the naked eye cannot distinguish every individual point, the chart shows where points are concentrated through color shading (e.g. warmer colors for higher concentrations and cooler colors for sparser regions, as in the example below).

This applies to any number of large data sets with a continuous measure whether it’s average sales per basket in a retail store, time spent on a website per user, amount of rain per day per county, # of books sold per title, etc.

Here’s an example of the chart built from a large number of calls to a call center. Each point represents one call, positioned by its call time.

Illustration of a density chart — each "point" represents a call so this chart effectively represents all the data points (Image by Author)

I like to say that ANYTHING we look at is technically a distribution, and the beauty of density charts is that they show us the actual shape of the distribution. The color / density replaces the 3rd dimension (height) that would have made this an actual probability density chart.

How to create this chart:

This is easily created in a tool like Tableau by…

  • Choosing the "density" mark,
  • Making sure you are actually plotting all individual data points (individual calls in this case)
  • Adding the measure you want (call time in mins in this case)
  • Choosing an appropriate color scale (I like the "temperature" scale as I find its interpretation to be more universal)
  • Using a log scale on the y axis, as in the chart above. It helps visibility by spreading the dots: in large data sets with many outliers, a linear scale risks "squishing" the majority of the data points at the bottom. It comes with a warning though: distances are not linear. Equal distances on the axis represent equal ratios, so with a base-10 scale each step between major gridlines is a 10x increase, and the untrained eye can be deceived.
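
Outside Tableau, the same idea can be sketched in Python. This is a minimal illustration and not the method used for the chart above: the call times are synthetic (drawing them from a log-normal distribution is an assumption), and the names `simulate_call_minutes`, `log_bin_counts` and `plot_density_strip` are hypothetical. The sketch counts calls in bins that are equally wide on a log axis, which is roughly what the color shading encodes, and it can optionally draw every call as a translucent dot if matplotlib is installed. Overlapping translucent dots stand in here for Tableau's color scale.

```python
import math
import random


def simulate_call_minutes(n, seed=0):
    """Synthetic call durations in minutes (the log-normal shape is an assumption)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(1.5, 0.9) for _ in range(n)]


def log_bin_counts(values, bins_per_decade=5):
    """Count values in bins of equal width on a log10 axis.

    Returns a dict mapping each bin's lower edge (in minutes) to its count.
    """
    counts = {}
    for v in values:
        index = math.floor(math.log10(v) * bins_per_decade)
        counts[index] = counts.get(index, 0) + 1
    return {10 ** (i / bins_per_decade): counts[i] for i in sorted(counts)}


def plot_density_strip(values):
    """Draw every call as a translucent dot on a log-scaled y axis."""
    import matplotlib.pyplot as plt  # optional dependency, imported lazily

    rng = random.Random(1)
    jitter = [rng.uniform(-0.3, 0.3) for _ in values]  # spread dots sideways
    fig, ax = plt.subplots(figsize=(3, 6))
    ax.scatter(jitter, values, s=8, alpha=0.05)  # overlap darkens dense areas
    ax.set_yscale("log")
    ax.set_xticks([])
    ax.set_ylabel("Call time (mins)")
    return fig


calls = simulate_call_minutes(10_000)
for lower_edge, count in log_bin_counts(calls).items():
    print(f"from {lower_edge:8.2f} min: {count:6d} calls")
```

On this axis, the step from 1 to 10 minutes takes the same height as the step from 10 to 100 minutes, which is why equal-looking gaps should be read as ratios rather than differences.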

Note as well that the density chart is essentially a different spin on the histogram or, as the name suggests, another way to represent the probability density function. The equivalent histogram is shown below.

Equivalent histogram showing the count of calls in each "bucket". I excluded the extreme values here (>50 mins) to avoid skewing the data (Image by Author)

I like to combine the first chart with a histogram as it gives me an idea of the volume represented by each color.

Creating a histogram in Tableau can be easily done by creating "bins" on the required measure (call time in this case) and counting the number of points in each bin.
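
For readers without Tableau, the binning step can be sketched in a few lines of Python. The data below is made up, the 5-minute bin width is an assumption, and `histogram_counts` is a hypothetical helper. The 50-minute cut-off mirrors the exclusion of extreme values mentioned in the histogram caption.

```python
from collections import Counter

BIN_WIDTH = 5.0   # minutes per bin (assumed for illustration)
MAX_VALUE = 50.0  # calls longer than this are left out of the histogram


def histogram_counts(values, bin_width=BIN_WIDTH, max_value=MAX_VALUE):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    kept = [v for v in values if v <= max_value]
    counts = Counter(int(v // bin_width) * bin_width for v in kept)
    return dict(sorted(counts.items()))


call_minutes = [0.5, 2.0, 3.5, 4.9, 5.0, 7.2, 12.0, 48.0, 75.0]  # made-up data
for lower_edge, count in histogram_counts(call_minutes).items():
    print(f"{lower_edge:>4.0f}+ min | {'#' * count}")
```

Each key is the lower edge of a bin, so a 7.2-minute call lands in the bin starting at 5 minutes, and the 75-minute call is dropped before counting.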


II — Introducing descriptors

We can now of course add the required summary descriptors: Median, Average, Percentiles (from 10% to 90%), the min and the max. With this visualization in context, these descriptors are now a more powerful indicator of what’s happening than if they were given in a vacuum. It is interesting to notice:

  • As in all large data sets, outliers exist and are meaningful. The chart puts them in perspective: you can see where 90% of the points lie and where the min / max fall in comparison. Keep in mind that all data points belong to a distribution, and depending on which distribution your data follows, outliers may have more or less influence (e.g. in power-law / Pareto-style distributions, characterized by their fat tails, extreme values have a significant influence).
  • The median reflects how the "center" of the population behaves. Remember the important difference between average and median: averages are strongly impacted by outliers whereas median values are not — which explains the difference in the chart.
Adding descriptors to the density chart (Image by Author)

You can add these descriptors in Tableau by adding "Reference lines" on the axis. You need to add one reference line for the Median, one for the average and one for the percentiles.
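
The same descriptors can be computed with Python's standard `statistics` module. This is a small sketch on made-up call times; `describe` is a hypothetical helper, and the "inclusive" quantile method is one of several reasonable choices, not necessarily the one Tableau uses.

```python
import statistics


def describe(values):
    """Return min, max, mean, median and the 10% to 90% percentiles."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    summary = {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }
    for i, cut in enumerate(deciles, start=1):
        summary[f"p{i * 10}"] = cut
    return summary


call_minutes = [2, 3, 3, 4, 4, 5, 5, 6, 7, 120]  # made-up data, one long call
for name, value in describe(call_minutes).items():
    print(f"{name:>6}: {value:.1f}")
```

In this toy data set, the single 120-minute call pulls the average up to 15.9 minutes while the median stays at 4.5 minutes, which is the gap between the two reference lines that the chart makes visible.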


III — Introducing dimensions

It gets more interesting when you introduce a dimension to compare the data against. This can be a time element (e.g. the trend of call time per month), a category (e.g. are calls in Spanish longer than those in English?), or an agent (do certain agents take longer than others?).

Adding a dimension for comparison (Image by Author)

To add dimensions in Tableau, simply drag the required dimension to the columns.
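
In code, adding a dimension amounts to grouping the individual points before summarizing them. The sketch below uses made-up records; the field names `month` and `minutes` and the helper `summarize_by` are assumptions for illustration only.

```python
import statistics
from collections import defaultdict


def summarize_by(records, dimension, measure):
    """Group records by a dimension and summarize the measure per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[dimension]].append(record[measure])
    return {
        key: {
            "count": len(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
        }
        for key, values in groups.items()
    }


calls = [  # made-up data
    {"month": "Jan", "minutes": 4.0},
    {"month": "Jan", "minutes": 6.0},
    {"month": "Feb", "minutes": 3.0},
    {"month": "Feb", "minutes": 5.0},
    {"month": "Feb", "minutes": 4.0},
    {"month": "Feb", "minutes": 90.0},
]
for month, stats in summarize_by(calls, "month", "minutes").items():
    print(month, stats)
```

Here February's average (25.5 minutes) looks far worse than January's (5.0 minutes), yet its median (4.5 minutes) is slightly lower. Keeping the count next to each group also shows how many points sit behind each comparison.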

Conclusion

Compare these density charts to the "poor" information provided by looking at averages alone:

Image by Author

Averages alone give no indication whatsoever of:

  • The number of data points involved (the density chart above shows that there are fewer observations in January and February)
  • The behavior of the extremes (July and August are actually seeing an improvement for the bottom 10%, so the averages are misleading)
  • The behavior of the "typical" / most frequent call: the average is strongly affected by the high outliers, and the median is a much better indicator

The density chart is more convenient than the histogram because it’s more condensed: imagine having to develop a histogram per month to compare the various months!

Use cases include ANY distribution where you are tempted to use an average to compare options, or where you want to explore the effect of a particular input on the outcome you are measuring without running a regression model. Remember that one of the key advantages of this chart is that it quickly shows you the outliers and puts averages and medians in perspective, something that models are not always great at. Here are just a few examples:

  • Time it takes to process a task (customer call, website loading time, API call, manufacturing a widget, etc.). You can then compare the progress over time (months), across people (are certain people naturally faster than others?), or across times of day (are we overloaded at certain times?)
  • In retail, basket value per customer: each data point is, for example, the total basket value of a customer in a supermarket. You can compare against the known dimensions (time of day, type of customer, etc.)
  • Salespeople efficiency: each data point is an individual sale made by a sales rep to a particular customer. You can then compare sales reps against each other or customers against each other. You can quickly see, for instance, whether the high value for a particular sales rep is driven by a few outlier sales or not.
