
Effective Data Visualization

Tips for building effective data visualizations condensed into 3 simple steps

Data has been growing at an exponential rate, for reasons that include our smartphone addictions. I don’t mean to guilt trip anybody, since I am just as guilty – the thing is… digitalization has connected humans across the globe like never before. We depend heavily on our smartphones to keep us informed – if it were not for Twitter, I would not know that Thursdays at 8pm are for NHS appreciation, which is why we clap for 1 minute. As a result, data is mass produced daily, which is why the 21st century is being heavily tipped as the Data era.

"Data is the oil of the 21st century"

As the volume of data grows, it becomes excruciatingly challenging to identify key insights from increasingly large batches of data, whether in an Excel spreadsheet, in text or in images. The human brain tries to identify patterns to help it learn and store information; however, in a large corpus of text (and other forms of data), finding key information can be extremely difficult without some visual representation of the data. The growth of Machine Learning, which allows systems to learn automatically and improve from experience without explicit programming, has simplified the task of detecting patterns in data, and we can create helpful visualizations of the patterns found through predictive analysis. Furthermore, before we begin to build predictive models or make decisions, understanding the data that we have at our disposal is vital, even beyond the fields of Data Science and Data Analysis.

On that note, in this article I am going to provide 3 tips for building visualizations that effectively convey the message you are aiming to express. To do this I use the House Prices dataset from Kaggle, which you can download from the Data section of the House Prices competition. The code that I used to generate the figures is available on my GitHub (link below).

kurtispykes/demo

Figure 1: Photo by NASA on Unsplash

What is Data Visualization?

Before we move on, it is important to understand what data visualization is and the purpose behind it if we want to build effective visualizations, so without further ado… Data visualization is the representation of data in a visual format. The purpose of visualizing data is to summarize and present it in easily understandable visualizations that highlight a key message (or messages) from the data for the reader.

From the above description of the purpose of data visualization, we can extract 2 significant points that, if remembered during any data visualization task, will instantly make our visualizations better:

Point 1: The visualizations should summarize and present easily understandable information. The word "easily" is key here, as it says nothing about the complexity of the visualization. This is important because some figures may need to be more complex in order to be easily understandable, while other visualizations serve the same purpose much better when kept simple. Distinguishing between when to use a simple or a complex visualization is something that you should be capable of doing by the end of this article.

Point 2: Data is to be presented! In some instances the visualization may be only for its creator to see, for instance when you are working on a Machine Learning classification task on your own and want to understand what type of errors your predictive model is making. But almost always we eventually end up presenting our findings to an audience. Because of this, visualizations must be tailored to a specific audience, meaning that the insights derived must be interesting to the reader or they will just get bored. Using the earlier example, presenting visualizations of the type of errors that your predictive model is making to the CEO is not effective and will probably help you lose trust amongst the C-suite. Instead, it would be better to present visualizations that allow actionable steps to be taken to bring business value.

"There are over 99 ways to be distracted and interesting visualizations ain’t 1" – Nobody

Fundamentally, the reason that we do data visualization is to assist readers in seeing patterns or trends in the data that is being analyzed. With that in mind, I will now provide a simple step-by-step guide for producing effective visualizations.

Tip#1 Be specific about the question that the visualization should answer

A visualization that distorts critical messages hinders the ability to make meaningful or accurate decisions. One way this can occur is through a busy graph that tries to answer many questions at once. Having a specific question that you want your visualization to answer prevents overkill and stops you from trying to do too much in one figure. Figure 2 aims to identify the number of missing values within the dataset and uses a traffic-light-like system to highlight the volume of missing values in any one feature. This plot could be further enhanced by adding a legend that explains the meaning of each colour, for instance Red = features with > 75% missing data.
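A table like the one in Figure 2 can be approximated with a few lines of pandas. The sketch below is an illustration rather than the code behind Figure 2: the function name `missing_value_table`, the toy values and the green/amber thresholds are assumptions, and only the red cut-off (more than 75% missing) comes from the legend suggested above.

```python
import numpy as np
import pandas as pd


def missing_value_table(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per feature and attach a traffic-light flag."""
    n_missing = df.isna().sum()
    pct_missing = 100 * n_missing / len(df)
    table = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})

    # Assumed thresholds: (0, 25] green, (25, 75] amber, (75, 100] red.
    # Features with 0% missing fall outside the bins and get no flag.
    table["flag"] = pd.cut(
        table["pct_missing"],
        bins=[0, 25, 75, 100],
        labels=["green", "amber", "red"],
    )

    # Keep only features that actually have missing values, worst first.
    return table[table["n_missing"] > 0].sort_values("pct_missing", ascending=False)


# Toy data (made up) that borrows a few column names from the House Prices dataset.
toy = pd.DataFrame(
    {
        "PoolQC": [np.nan] * 9 + ["Gd"],
        "LotFrontage": [65.0, np.nan, 80.0, np.nan, np.nan, 70.0, 60.0, 75.0, 85.0, 90.0],
        "Electrical": ["SBrkr"] * 9 + [np.nan],
        "SalePrice": [208500, 181500, 223500, 140000, 250000,
                      143000, 307000, 200000, 129900, 118000],
    }
)

table = missing_value_table(toy)
print(table)
```

Because the function answers exactly one question (how much is missing, and where), the resulting table stays small enough to read at a glance; the `flag` column can then drive the cell colours in whichever plotting library you prefer.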

Figure 2: Table displaying features, the number of missing values in each feature and the percentage relative to the number of instances.

Tip#2 Choose the correct Charts

"A picture is worth a thousand words"

This will sound simple, but can be extremely difficult… The chart should reveal what you want to tell the reader. If you are trying to show the reader the distribution of the target feature (or dependent variable) in the collected data, then plots such as histograms, violin plots, QQ plots, etc. get the job done effectively. The plot in Figure 3 combines the 3 (histogram, violin plot and QQ plot) and leaves us in no doubt that the distribution of the data is positively skewed, which is important to know when it comes to selecting the models that we will use for our problem. To improve this plot, I would move the title to the center, add a kernel density estimate to the histogram and use different colours to bring out the points I am making – note that these improvements would only make the chart prettier, as it already does a good enough job of informing us that the feature is skewed.
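As a rough sketch of how the three views of a distribution can be put together, the code below computes the sample skewness and the points of a normal QQ plot, and offers an optional plotting helper. It uses matplotlib rather than the library used for Figure 3, the data is a synthetic right-skewed stand-in for SalePrice, and the names `qq_points` and `plot_distribution` are hypothetical.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd


def qq_points(values):
    """Return (theoretical normal quantiles, sorted sample values) for a QQ plot."""
    sample = np.sort(np.asarray(values, dtype=float))
    n = len(sample)
    # Plotting positions (i - 0.5) / n, one common convention among several.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, sample


def plot_distribution(values, title="Distribution of the target variable"):
    """Draw a histogram, violin plot and QQ plot side by side."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it

    theoretical, sample = qq_points(values)
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].hist(sample, bins=30)
    axes[0].set_title("Histogram")
    axes[1].violinplot(sample, showmedians=True)
    axes[1].set_title("Violin plot")
    axes[2].scatter(theoretical, sample, s=8)
    axes[2].set_title("QQ plot")
    fig.suptitle(title)
    return fig


# Synthetic, right-skewed stand-in for SalePrice (not the real Kaggle data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

skewness = prices.skew()  # positive for a positively (right) skewed sample
theoretical, sample = qq_points(prices)
print(f"skewness: {skewness:.2f}")
```

Calling `plot_distribution(prices)` returns a figure with the three panels. On a QQ plot, a positively skewed sample bends upwards away from a straight line at the high end, which is the same message the histogram and violin plot give in their own way.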

Figure 3: Distribution of the target variable

Tip#3 Emphasize the most important information

Using colour, size, scale, shapes and labels directs the attention of the reader to the key message that you want them to identify – Figure 2 does this quite well with the traffic light system. We want to reveal the most important information to our reader – in the case of Figure 2, the reader may be the machine learning engineer who will engineer the features and build the model. The traffic lights instantly draw our attention to which features may prove a challenge and which may need to be removed from the data. Significantly, the visualization allows the Machine Learning engineer or Data Scientist to delay the difficult task of thinking, because it instantly shows us what is wrong with our data and simultaneously generates ideas that can be implemented to improve it. The engineering used to do this does not need to be what is used for the final model; the point is that it helps us to make decisions that can be acted upon, i.e. the ideas generated can be used to design a quick baseline model.

Figure 4: A heatmap displaying the correlation of features.

Another example of emphasis on the important information can be seen in Figure 4. The heatmap shows the correlation between features and uses brighter colours to emphasize stronger relationships between two features. This makes it easy to pick out correlated features and do further analysis on them. Such information is beneficial to the rest of the team in a Data Science project, as it helps determine what kind of approach is feasible for the task. To improve this plot, we can mask the cells below (or above) the x=y line, because the correlation matrix is symmetric and one half is redundant.
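The masking idea can be sketched with pandas and NumPy as follows. The data here is synthetic (made up, with column names borrowed from the House Prices dataset), and hiding the diagonal along with the upper triangle is a choice made for this example, not something Figure 4 does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; SalePrice is constructed to correlate with GrLivArea.
rng = np.random.default_rng(42)
area = rng.normal(1500, 400, size=500)
df = pd.DataFrame(
    {
        "GrLivArea": area,
        "YrSold": rng.integers(2006, 2011, size=500),
        "SalePrice": 100 * area + rng.normal(0, 20000, size=500),
    }
)

corr = df.corr()

# True on and above the diagonal: those cells either mirror the lower triangle
# or are the trivial self-correlation of 1.0, so they are hidden.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)  # masked cells become NaN

print(lower.round(2))
```

If you plot with seaborn, `heatmap` accepts the same boolean array through its `mask` argument, so `corr` and `mask` can be passed in directly instead of building `lower` first.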


Though I have not emphasised the tools that can be used in this article, in some ways the tool that we use can aid us in achieving our purpose with data visualization. Data Scientists will know of tools such as Matplotlib and Seaborn, but in this post the tool I used to generate my figures was Plotly.py. This framework allows for interactive charts, which makes our visualizations that much more engaging for the reader. A good article to learn more about this framework is The Next Level of Data Visualization in Python by Will Koehrsen. In that post, he explains his reasoning for switching from traditional visualization tools to modern ones and gives nice demonstrations for getting started.

Conclusion

The pivotal thing to remember when visualizing data is that the purpose of data visualization is to summarize and present data in an easily understandable manner. Therefore, we want to ensure that we are answering a specific question with our visualizations and choosing the right charts to display our data. Last but not least, we want to make sure we emphasise the key message that we want our reader to grasp, rather than confusing them with superfluous information.

There are plenty of ways to visualize data and the best way to improve this skill is with practice. The GitHub code that I used to generate the figures for this article is far from optimal, under-utilizes the capabilities of Plotly.py and lacks depth – there is a lot more that can be explored in this data. Therefore, there is a great opportunity for you to practice (and I’ve given you a slight head start) – simply fork my work (link to GitHub below), download the data from Kaggle, install the requirements and then get to work.

kurtispykes/demo

If you want to share how you have used this article to improve my work (or work that you have done), which would be truly exciting for me, or you’d just like to reach me about anything to do with Data Science (e.g. something to write about next), you can reach me on LinkedIn @KurtisPykes, or simply leave a comment on this article. Additionally, I am super interested to hear your feedback on this article, so do not hesitate to get in contact with me!

Thank you so much for your time!

Other useful resources on this topic…

Suraj Thatte – Tips for effective Data Visualization

Jonathon Lau – Effective Data Visualization for other Humans

Georgin Lau and Lei Pan, PhD – A 5-step guide to data visualization

Musum Rumi – A Detailed Regression with House-Pricing

Pedro Marcelino – Comprehensive data exploration with Python

