
The Achilles Heel of Scatter Plots

Visualizing large datasets with hidden trends using an alternative to scatter plots

Photo by Luke Chesser on Unsplash

Think about this statement: Any time you have x and y data, the easiest and most useful way to visualize it is in a scatter plot.

Is that true? False? Mostly true? What are situations where it’s not useful or even confusing? Does your plot convey the story or message that you’re trying to communicate without any ambiguity? These are some questions you want to ask when you make a data visualization.

In this article, I want to show you one of the neatest little tricks that I’ve learned. As a data scientist, you’re likely handling data constantly and in high volumes, and visualization becomes key to communicating your findings. While a scatter plot is good at showing trends and correlations, more data also means more outliers. In a scatter plot, every single point is represented equally; outliers show up just as clearly as points that contribute to the trend, and if you have enough of them, they can completely obscure the important data.

As a data scientist, you may be thinking that the first option to clear things up is to filter everything through some ML algorithm and plot the results rather than the raw data. While that’s certainly useful, it isn’t conducive to efficient data exploration. Not only that, but getting an idea of what data you have is important to choosing the right ML model in the first place. Is it clustered, or is there some kind of trendline? And what type of clustering is it?

Let’s start with an example so we can really see the point that I’m trying to make. You can find the raw data on my GitHub, as well as the code. Take the data from data.csv and load it into a dataframe. What do you notice? It has an x and y column, so our first thought for visualization is typically "use a scatter plot." Let’s go ahead and see what that looks like.
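Here is a minimal sketch of that first step, assuming pandas and Matplotlib are installed and that data.csv sits in your working directory. The helper names load_points and plot_scatter are my own for illustration, and the code in the repo may be organized differently.

```python
import matplotlib.pyplot as plt
import pandas as pd


def load_points(path="data.csv"):
    """Read the CSV into a DataFrame; the file is expected to have x and y columns."""
    return pd.read_csv(path)


def plot_scatter(df):
    """Draw every row as one marker and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(df["x"], df["y"])
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Scatter plot of raw data")
    return fig, ax


# Usage, once data.csv is available locally:
#   df = load_points()
#   print(df.shape)        # number of rows and columns
#   fig, ax = plot_scatter(df)
#   plt.show()
```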

Scatter plot of raw data. Plot by author.

Now you’re probably thinking "that looks useless, time to move on." Thinking of data exploration in machine learning, would this look like a useful feature or combination of features for anything? Would you imagine using a clustering algorithm? My first thought is that it’s useless data with no correlation or grouping. That’s because scatter plots aren’t always the best way to visualize a 2-dimensional dataset! I’m sure you’ve figured out by now that there’s a secret correlation hidden in here somewhere. What if you could somehow highlight the trend without doing any kind of filtering?

First off, I want you to notice the size of the dataset. 473,111 datapoints is decently large, and you’ve probably seen larger. Even with 0.1% outliers, that’s still nearly 500 outlier points, each of which takes up several pixels. Meanwhile, if you have 100 datapoints all close together, their pixels overlap, so a dense cluster can look no heavier than a few stray points. Maybe you could blow this plot up to a larger screen, but that’s an impractical way to counter what turns out to be a fairly common problem.
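To put rough numbers on that, here is a small sketch using only the standard library. It assumes a hypothetical 640×480 canvas and a simplified model where each point lights exactly one pixel; real markers are larger, so treat the counts as illustrative rather than as a measurement of the actual plot.

```python
import random

n_points = 473_111
outlier_fraction = 0.001  # 0.1%
n_outliers = round(n_points * outlier_fraction)


def distinct_pixels(points, width=640, height=480):
    """Count distinct pixels hit when each point in the unit square lights one pixel."""
    hit = set()
    for x, y in points:
        # Scale to pixel coordinates and clamp to the canvas.
        col = min(max(int(x * width), 0), width - 1)
        row = min(max(int(y * height), 0), height - 1)
        hit.add((col, row))
    return len(hit)


rng = random.Random(0)
# Outliers spread uniformly over the whole canvas.
scattered = [(rng.random(), rng.random()) for _ in range(n_outliers)]
# The same number of points packed into one tight cluster.
clustered = [(rng.gauss(0.5, 0.002), rng.gauss(0.5, 0.002)) for _ in range(n_outliers)]

scattered_pixels = distinct_pixels(scattered)
clustered_pixels = distinct_pixels(clustered)
print(n_outliers, scattered_pixels, clustered_pixels)
```

The scattered points each claim their own patch of the canvas, while the clustered ones pile onto a much smaller set of pixels, which is exactly why the eye overweights the outliers.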

To keep the outliers from dominating, we can cut the scatter plot up into a grid and count the number of datapoints that fall in each square. Then we map the count in each square to a grayscale value or dot size. It would look roughly something like this:

Process outline for turning a scatter plot into gridded data. Image by author.
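If you wanted to do the gridding by hand, it might look like the sketch below. This is plain Python with hypothetical helper names (grid_counts, to_gray) and a handful of made-up points, just to show the counting and the count-to-gray mapping.

```python
from collections import Counter


def grid_counts(xs, ys, n_bins):
    """Count how many points fall in each cell of an n_bins x n_bins grid."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Guard against a zero-width range (all values identical).
    x_span = (x_max - x_min) or 1.0
    y_span = (y_max - y_min) or 1.0
    counts = Counter()
    for x, y in zip(xs, ys):
        # Scale to [0, n_bins] and clamp so the maximum lands in the last cell.
        col = min(int((x - x_min) / x_span * n_bins), n_bins - 1)
        row = min(int((y - y_min) / y_span * n_bins), n_bins - 1)
        counts[(col, row)] += 1
    return counts


def to_gray(counts):
    """Map each occupied cell to a gray level: 0.0 (black) for the densest cell,
    getting lighter toward 1.0 as the count falls."""
    densest = max(counts.values())
    return {cell: 1.0 - n / densest for cell, n in counts.items()}


# Five illustrative points: four bunched near the origin, one far away.
xs = [0.0, 0.1, 0.15, 0.2, 1.0]
ys = [0.0, 0.1, 0.05, 0.2, 1.0]
counts = grid_counts(xs, ys, n_bins=2)
shades = to_gray(counts)
print(counts, shades)
```

The lone far-away point still gets a cell, but it is drawn much lighter than the cell holding the bunch.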

Sounds like a lot of work, but there’s a very convenient type of plot that does this for us. We’ll use hist2d from Matplotlib, and start with a 10×10 grid.
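A minimal sketch of that call is below. The wrapper name plot_density and the "Greys" colormap are my choices for illustration; the x and y inputs are assumed to be the two columns of the dataframe.

```python
import matplotlib.pyplot as plt


def plot_density(x, y, bins=10):
    """Draw a 2-D histogram and return the figure plus the per-cell counts."""
    fig, ax = plt.subplots(figsize=(6, 6))
    # hist2d bins the points and colors each cell by its count.
    counts, x_edges, y_edges, image = ax.hist2d(x, y, bins=bins, cmap="Greys")
    fig.colorbar(image, ax=ax, label="points per cell")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig, counts


# Usage, assuming df holds the x and y columns:
#   fig, counts = plot_density(df["x"], df["y"], bins=10)
#   plt.show()
```

With a single integer, bins applies to both axes, so bins=10 gives the 10×10 grid. Passing a larger value refines the grid.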

2-D histogram plot of data, showing a much more interesting picture. Plot by author.

Neat! Already we see a much clearer picture of something interesting happening in the data. Maybe this is enough to paint a picture of what’s going on…but in our case, there might be more. We can see if the trend clears up by increasing the number of bins. Let’s try 100:

2-D histogram plot with more bins, showing a much more complete picture. Plot by author

That’s a clearer picture…literally. It may seem like a manufactured example with an actual picture, but you’ll be amazed at how often you’ll find ways to use this technique. Are you trying to plot stock prices of hundreds of companies in a given industry over time, and it’s hard to see if there’s a trend? Or what about solar irradiance trends? Sunlight on a given day can vary wildly, but year over year, we start to get a good idea of what’s normal and abnormal. All of these very real-world trends are deceptively messy if you put them in a regular scatter plot or line plot, but become quite clear and interesting if you use the binning method for large datasets.

Before I wrap up, just a quick warning: as the number of bins grows very large, each cell ends up holding only a point or two, and you’ll be right back to a useless plot where noise is just as significant as the trend, just as we saw in the scatter plot. When you use this method, be sure to try out several grid sizes. Also, I know of a few other ways you could accomplish the same thing, but I wanted to introduce this primarily to get you thinking outside the box of always using scatter plots.
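One rough way to see that warning in numbers is to check, for each grid size, what share of the occupied cells hold exactly one point. The sketch below uses NumPy’s histogram2d on synthetic stand-in data (not the article’s dataset), and single_point_share is a hypothetical helper name.

```python
import numpy as np


def single_point_share(x, y, bins):
    """Fraction of occupied cells that hold exactly one point."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    occupied = counts > 0
    return float((counts == 1).sum() / occupied.sum())


# Synthetic stand-in data: a noisy linear trend, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=5_000)
y = x + rng.normal(scale=0.3, size=5_000)

# Sweep a few grid sizes and compare.
shares = {bins: single_point_share(x, y, bins) for bins in (10, 100, 1000)}
for bins, share in shares.items():
    print(f"{bins:>5} bins per axis: {share:.0%} of occupied cells hold one point")
```

When nearly every occupied cell holds a single point, the grid is no longer summarizing anything, and the plot is effectively a scatter plot again.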

I hope you find this as useful as I have. Now that you know this trick, I’m sure you’ll find plenty of opportunities to use it, and you should be able to make much more impressive plots that paint a much clearer picture. I’d love to hear what techniques you use for clearer data visualization, and whether you find other use cases. As always, feel free to connect on LinkedIn, or see my other articles on case studies and useful tricks I’ve learned. If you want to run this code on your own, or upload your own picture to turn into a plot, check out my GitHub repo.

