One Bubble Chart, Comparing 10 Data Visualization Tools

Published in

Towards Data Science

7 min read · Oct 23, 2017

Introduction

For anyone wanting to learn data analysis and visualization, there is no shortage of “best tools” articles online advising you what to choose. I won’t attempt to create a list, as there are simply too many tools to enumerate. However, I do want to show you a number of tools and programming languages for data visualization that I have encountered, and let you compare them. Most of them are easy to learn, highly flexible, and free or available in a free version (this is how I choose which tools to learn), so you can get your hands dirty in no time. Let’s get started.

Let’s try to reproduce Hans Rosling’s famous bubble chart, which tells the story of the wealth and health of nations. It is a scatter plot with a third, bubbly dimension, so it lets you compare three variables at once. One is on the x-axis (GDP per capita), one is on the y-axis (life expectancy), and the third is represented by the area of each bubble (population).
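To make the "area encodes population" idea concrete, here is a minimal Python sketch of the conversion from a value to a bubble radius. The helper name `bubble_radius` and the two population figures are hypothetical and only serve as an illustration.

```python
import math


def bubble_radius(value):
    """Return the radius of a circle whose area equals `value`."""
    return math.sqrt(value / math.pi)


# Hypothetical populations, chosen so that one is four times the other.
populations = {"small": 1_000_000, "large": 4_000_000}
radii = {name: bubble_radius(pop) for name, pop in populations.items()}

# Area grows with the square of the radius, so a population that is
# four times larger gets a radius that is only twice as large.
ratio = radii["large"] / radii["small"]
print(round(ratio, 6))
```

Scaling the radius directly by population would instead exaggerate large countries, because the eye compares areas rather than radii.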

Data

The data comes from the latest World Bank Indicators. The download process is as follows:

library(WDI)
indicator2 <- WDI(country="all", indicator=c("NY.GDP.PCAP.CD", "SP.POP.TOTL", "SP.DYN.LE00.IN"), start=2015, end=2015, extra = TRUE)
drops <- c("iso2c","iso3c", "capital", "longitude", "latitude", "income", "lending")
indicator2 <- indicator2[ , !(names(indicator2) %in% drops)]
colnames(indicator2) <- c("country","year", "GDP_per_capita", "population_total", "life_expectancy", "region")
indicator2 <- indicator2[-c(1, 2, 3, 4, 5, 6, 19, 66, 67, 159, 178, 179, 180, 181, 182, 201, 202, 203, 204, 205, 206, 207, 225, 226, 227, 228, 236, 237, 238, 239, 240, 241, 242, 243, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 262, 263), ]

That’s it, this is our data!
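The Python sections later in this article read the same table from a file named indicator2.csv. As a rough sketch, assuming that file is simply an export of the data frame above with the same six column names, you could check in pandas which rows have all three plotted variables. The three rows below are made-up placeholders, not World Bank figures.

```python
import io

import pandas as pd

# Hypothetical stand-in for indicator2.csv; every value here is invented.
csv_text = """country,year,GDP_per_capita,population_total,life_expectancy,region
Country A,2015,1000.0,5000000,60.0,Region X
Country B,2015,,2000000,70.0,Region Y
Country C,2015,30000.0,80000000,80.0,Region X
"""

indicator2 = pd.read_csv(io.StringIO(csv_text))

# A bubble needs an x value, a y value, and a size, so keep only the
# rows where none of the three plotted columns is missing.
plot_columns = ["GDP_per_capita", "population_total", "life_expectancy"]
complete = indicator2.dropna(subset=plot_columns)

print(len(indicator2), "rows loaded,", len(complete), "rows plottable")
```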

Visualization

1. Base R

One of the most powerful functions of R is its ability to produce a wide range of graphics to quickly and easily visualize data, with just a few commands.

The symbols() function draws symbols on a plot. We will plot circles at a specified set of x (GDP per capita) and y (life expectancy) coordinates, and set each circle’s radius to the square root of population divided by π, so that the circle’s area is proportional to population.

options(scipen=999)
radius <- sqrt( indicator2$population_total/ pi )
symbols(indicator2$GDP_per_capita, indicator2$life_expectancy, circles=radius, inches=0.35, fg="white", bg="red", xlab="GDP per Capita(USD)", ylab="Life Expectancy(years)", main = "GDP per Capita vs. Life Expectancy 2015", ylim = c(0, 100))

Gives this plot:

2. ggplot2

ggplot2 has become the most popular plotting package in the R community in recent years. It allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner. Groupings can be represented by colors, symbols, size, and transparency, but be prepared for a steep learning curve.

library(ggplot2)
library(ggthemes)
ggplot(indicator2, aes(x = GDP_per_capita, y = life_expectancy, size = population_total/1000000, fill = region)) +
  geom_point(shape = 21) +
  ggtitle("GDP per Capita vs. Life Expectancy") +
  labs(x = "GDP per Capita(USD)", y = "Life Expectancy(years)") +
  scale_y_continuous(limits = c(0, 90)) +
  scale_size(range = c(1, 10)) +
  labs(size = "Population(Millions)", fill = "Region") +
  theme_classic()

Gives this plot:

One of the many great things about ggplot2, compared with base R, is that we don’t have to build legends by hand, since ggplot2 generates them for us.

3. ggvis

The ggvis package is used to make interactive data visualizations. It combines shiny’s reactive programming model and dplyr’s grammar of data transformation, making it a useful tool for data scientists.

library(ggvis)
indicator2 %>%
  ggvis(~GDP_per_capita, ~life_expectancy, fill=~factor(region)) %>%
  layer_points(size= ~population_total/1000000,opacity:=0.6) %>%
  add_legend(scales = "size", properties = legend_props(legend = list(y = 200))) %>%
  scale_numeric("y", domain = c(0, 90), nice = FALSE) %>%
  add_axis("x", title = "GDP per Capita(USD)") %>%
  add_axis("x", orient = "top", ticks = 0, title = "GDP per Capita vs. Life Expectancy 2015",
           properties = axis_props(
             axis = list(stroke = "white"),
             labels = list(fontSize = 0)))

Gives this plot:

Unlike ggplot2, ggvis does not by default combine scales based on the same underlying variable into a single legend. Instead, we must set the legends manually.

4. Plotly for R

The R package plotly is a high-level interface to the open source JavaScript graphing library plotly.js. With Plotly, R users can easily create interactive, publication-quality graphs online using just a few lines of code.

library(plotly)
p <- plot_ly(indicator2, x = ~GDP_per_capita, y = ~life_expectancy,
  color = ~region, size = ~population_total) %>%
  layout(yaxis = list(range = c(0, 90)))
p

Gives this plot:

5. Default Pandas plot

Pandas’ built-in plotting method, DataFrame.plot(), has various chart types available (line, scatter, hist, etc.). It is a useful exploratory tool for quick throwaway plots if you are comfortable with pandas.

import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.savefig below
indicator2 = pd.read_csv('indicator2.csv')
indicator2.plot(kind='scatter', x='GDP_per_capita', y='life_expectancy', s = indicator2['population_total']/1000000, 
                title = 'GDP per Capita vs. Life Expectancy 2015', ylim=(0,90))
plt.savefig('base_pandas.png')

Gives this plot:

6. Matplotlib

Matplotlib is a Python plotting library that produces publication-quality figures. However, it can be a frustrating tool for a new user because it takes a great deal of work to get reasonable-looking graphs. You will see what I mean.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import numpy as np
indicator2 = pd.read_csv('indicator2.csv')
plt.figure(figsize=(10,8))
colors = ("red", "green", "blue", "yellow", "orange", "black", "gray")
def attribute_color(region):
    colors = {
        'East Asia & Pacific (all income levels)':'red',
        'Europe & Central Asia (all income levels)':'green',
        'Latin America & Caribbean (all income levels)':'blue',
        'Middle East & North Africa (all income levels)':'yellow',
        'North America':'orange', 
        'South Asia':'black', 
        'Sub-Saharan Africa (all income levels)':'gray'
    }
    return colors.get(region, 'white')
color_region = list()
qty_states = len(indicator2['region'])
 
for state in range(qty_states):
    color_region.append(attribute_color(indicator2['region'][state]))
plt.scatter(x = indicator2['GDP_per_capita'],
            y = indicator2['life_expectancy'],
            s = indicator2['population_total']/1000000,
            c = color_region,
            alpha = 0.6)
plt.title('GDP per Capita vs. Life Expectancy 2015', fontsize=20)
plt.xlabel('GDP per Capita(USD)')
plt.ylabel('Life Expectancy(years)')
plt.ylim(0, 100)
regions = ['East Asia & Pacific (all income levels)', 'Europe & Central Asia (all income levels)', 'Latin America & Caribbean (all income levels)', 
           'Middle East & North Africa (all income levels)', 'North America', 'South Asia', 'Sub-Saharan Africa (all income levels)']
legend1_line2d = list()
for step in range(len(colors)):
    legend1_line2d.append(mlines.Line2D([0], [0],
                                        linestyle='none',
                                        marker='o',
                                        alpha=0.6,
                                        markersize=6,
                                        markerfacecolor=colors[step]))
legend1 = plt.legend(legend1_line2d,
                     regions,
                     numpoints=1,
                     fontsize=10,
                     loc='lower right',
                     shadow=True)
legend2_line2d = list()
legend2_line2d.append(mlines.Line2D([0], [0],
                                    linestyle='none',
                                    marker='o',
                                    alpha=0.6,
                                    markersize=np.sqrt(10),
                                    markerfacecolor='#D3D3D3'))
legend2_line2d.append(mlines.Line2D([0], [0],
                                    linestyle='none',
                                    marker='o',
                                    alpha=0.6,
                                    markersize=np.sqrt(100),
                                    markerfacecolor='#D3D3D3'))
legend2_line2d.append(mlines.Line2D([0], [0],
                                    linestyle='none',
                                    marker='o',
                                    alpha=0.6,
                                    markersize=np.sqrt(1000),
                                    markerfacecolor='#D3D3D3'))
 
legend2 = plt.legend(legend2_line2d,
                     ['1', '10', '100'],
                     title='Population (in million)',
                     numpoints=1,
                     fontsize=10,
                     loc='lower left',
                     frameon=False, 
                     labelspacing=3,
                     handlelength=5,
                     borderpad=4  
                    )
plt.gca().add_artist(legend1)
 
plt.setp(legend2.get_title(),fontsize=10)
plt.savefig('matplotlib.png')

Gives this plot:

As you can see, the plotting commands for Matplotlib are verbose, and obtaining a legend is cumbersome. It is probably too much work for most of us.
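For the region legend at least, there is a shorter route than building Line2D handles by hand: draw one scatter call per region and pass `label=`, then let Matplotlib assemble the legend. The sketch below is not the code used for the plot above; it uses a handful of made-up rows with hypothetical region names, and it selects the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no window is opened
import matplotlib.pyplot as plt

# Hypothetical rows: (region, GDP per capita, life expectancy, population in millions).
rows = [
    ("Region X", 1000, 60, 50),
    ("Region X", 3000, 65, 200),
    ("Region Y", 20000, 78, 80),
    ("Region Z", 45000, 82, 10),
]

fig, ax = plt.subplots(figsize=(10, 8))
regions = sorted({row[0] for row in rows})

# One scatter call per region: each call gets its own default color
# and contributes one labelled entry to the legend.
for region in regions:
    subset = [row for row in rows if row[0] == region]
    ax.scatter(
        [row[1] for row in subset],
        [row[2] for row in subset],
        s=[row[3] for row in subset],
        alpha=0.6,
        label=region,
    )

legend = ax.legend(loc="lower right", title="Region")
legend_labels = [text.get_text() for text in legend.get_texts()]
fig.savefig("matplotlib_labels.png")
```

This only covers the color legend. A legend for bubble sizes still needs extra work, so the approach trades some control for brevity.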

7. Seaborn

If you’re looking for a simple way to plot reasonable looking charts in Python, then you’ll love Seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,8))
sns.set_context("notebook", font_scale=1.1)
sns.set_style("ticks")
sizes = [10, 60, 90, 130, 200] 
marker_size = pd.cut(indicator2['population_total']/1000000, [0, 100, 200, 400, 600, 800], labels=sizes) 
sns.lmplot(x='GDP_per_capita', y='life_expectancy', data=indicator2, hue='region', fit_reg=False, scatter_kws={'s':marker_size})
plt.title('GDP per Capita vs. Life Expectancy 2015')
plt.xlabel('GDP per Capita(USD)')
plt.ylabel('Life Expectancy(years)')
plt.ylim((0, 100))
plt.savefig('sns.png')
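The `pd.cut` call in the snippet above turns a continuous population (in millions) into one of five marker sizes. Here is a small standalone sketch of that binning step, using the same bin edges and sizes but four made-up population values. It also shows one thing worth checking in your own data: a value above the last bin edge gets no size at all.

```python
import pandas as pd

sizes = [10, 60, 90, 130, 200]
bin_edges = [0, 100, 200, 400, 600, 800]

# Hypothetical populations in millions; the last one exceeds the top edge.
population_millions = pd.Series([5, 150, 350, 1300])

# Each interval (0, 100], (100, 200], ... is replaced by its size label.
marker_size = pd.cut(population_millions, bin_edges, labels=sizes)

# Values outside the outermost edges come back as missing (NaN).
out_of_range = marker_size.isna()
print(marker_size.tolist())
print(out_of_range.tolist())
```

If your data contains values beyond the last edge, extending the final edge (or adding another bin and size) keeps those points from losing their marker size.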

8. Tableau

Tableau has become so popular that many organizations require Tableau on your resume to even apply for their data analyst positions. The good news is that Tableau is extremely easy to learn. It’s such an intuitive tool that you can pick it up quickly. However, if you want to be good at it, you will need to practice, practice, practice, and dive into the thousands of Tableau workbooks that are out there (on Tableau Public) to study what others have done.

9. Power BI

Power BI is Microsoft’s entry into the self-service business intelligence (BI) space. It is quickly gaining popularity among data science professionals as a cloud-based service that helps them easily visualize and share insights from their organizations’ data.

10. MicroStrategy

A few years ago, MicroStrategy launched a free data visualization tool called “MicroStrategy Analytics Desktop”. It was designed to compete with other increasingly popular self-service, data-discovery desktop visualization tools offered by Tableau and others. Let’s have a quick peek.

Conclusion

That’s it from me. If you want to know more about data visualization tools, check out Lisa Charlotte Rost’s “What I Learned Recreating One Chart Using 24 Tools”.

The R and Python source code for all these charts can be found on GitHub. I would be pleased to receive feedback or questions on any of the above. Until then, enjoy visualizing!

Written by Susan Li