
Altair: Statistical Visualization Library for Python

A practical introductory guide

Photo by Isaac Smith on Unsplash [1].

Altair is a statistical visualization library for Python. Its syntax is clean and easy to understand as we will see in the examples. It is also very simple to create interactive visualizations with Altair.

Altair is highly flexible in terms of data transformations. We can apply many different kinds of transformations while creating a visualization, which makes the library well suited to exploratory data analysis.

What I think makes Altair special is that it leans more heavily on the statistical side than other popular Python data visualization libraries such as Matplotlib and Seaborn.

In this article, we will create some basic plots to get familiar with the syntax and structure of Altair. We will also see how data transformations are implemented in the process of creating the plots.


Note: Here is a list of the articles in Altair series.


We start by importing Altair. If you are using Google Colab, it is already installed and can be imported directly. If not, you can install it with pip (pip install altair).

import altair as alt

We will be using an insurance dataset that you can obtain from Kaggle. We will read the dataset into a Pandas dataframe.

import numpy as np
import pandas as pd
insurance = pd.read_csv("/content/insurance.csv")
insurance.head()
(image by author)

The dataset contains some measures (i.e. features) about the customers of an insurance company and the amount that is charged for the insurance.


Scatter plot

A scatter plot is mainly used to visualize the relationship between two numerical variables.

(alt.
  Chart(insurance).
  mark_circle(size=40).
  encode(x='charges', y='bmi').
  properties(height=400, width=500))

I put each step on a separate line to emphasize the chain-like operations. We start by passing the data to a top-level Chart object. The data can be a Pandas dataframe or a URL string pointing to a JSON or CSV file.

The second line describes the type of visualization (e.g. mark_circle, mark_line, and so on). The encode function specifies what to plot in the given dataframe. Thus, anything we write in the encode function must be linked to the dataframe. Finally, we specify certain properties of the plot using the properties function.

Here is the plot created with the code above.

(image by author)

We can make the plots more informative. For instance, we can use the color parameter in the encode function to separate data points based on a categorical variable. It is similar to the hue parameter of Seaborn.

We can also make the plot interactive just by adding the interactive function at the end.

(alt.
  Chart(insurance).
  mark_circle(size=50).
  encode(x='charges', y='bmi', color='smoker').
  properties(height=400, width=500).
  interactive())
(GIF by author)

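It is possible to add more functionality to this plot. We can use the tooltip parameter to display additional variables when we hover over points. Seaborn, which draws static Matplotlib figures, has no direct equivalent; the closest analogue is the hover_data parameter of Plotly Express.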

(alt.
  Chart(insurance).
  mark_circle(size=50).
  encode(x='charges', y='bmi', color='smoker',
         tooltip=['age', 'sex']).
  properties(height=400, width=500).
  interactive())
(GIF by author)

Bar plot

Altair makes it simple and efficient to implement data transformations in the process of creating visualizations. For instance, we can create a bar plot that shows the average charges of each category in the region column.

(alt.
  Chart(insurance).
  mark_bar().
  encode(x='region', y='mean(charges):Q').
  properties(height=300, width=400))
(image by author)

We specify the transformation as a string ('mean(charges):Q'), which is equivalent to the following syntax:

y=alt.Y(field='charges', aggregate='mean', type='quantitative')

Let’s calculate the same averages using the groupby function of Pandas to confirm the results.

insurance[['region','charges']].groupby('region').mean()
(image by author)

The results match, as expected. Altair performed the same aggregation as part of building the visualization.


Histogram

Histograms are mainly used to visualize the distribution of continuous variables. A histogram divides the value range of a continuous variable into discrete bins and shows how many values fall into each bin.

The following code will create a histogram of the bmi variable.

(alt.
  Chart(insurance).
  mark_bar().
  encode(alt.X('bmi:Q', bin=True), y='count()').
  properties(height=300, width=500))
(image by author)

We use the mark_bar function with a data transformation step to create a histogram. Inside the encode function, we divide the value range of the bmi variable into discrete bins and count the number of data points in each bin.


Grid of plots

It is extremely simple to create multiple plots in the same visualization.

We first need to assign the plots to variables which will then be used to combine plots or create a grid of plots.

p1 = (alt.
        Chart(insurance).
        mark_bar().
        encode(x='region', y='mean(charges):Q').
        properties(height=200, width=300))
p2 = (alt.
        Chart(insurance).
        mark_bar().
        encode(alt.X('bmi:Q', bin=True), y='count()').
        properties(height=200, width=300))

Once we have the variables, we can combine them with the | and & operators, which Altair overloads for charts: | places the plots side by side and & stacks them vertically.

p1 | p2
(image by author)
p1 & p2
(image by author)

As you can see, combining plots looks just like a math operation. The p1 + p2 syntax layers the plots on the same axes, which is not appropriate in our case. If we had a line plot and a bar plot that share the same axes, it would be an option to consider.

We can create a grid of plots by combining several plots in this way. For instance, (p1 & p2) | (p3 & p4) creates a grid of 4 plots (2 rows and 2 columns).


Conclusion

This article can be considered as an introduction to Altair. There is much more this library offers.

What I like most about Altair is the ease and simplicity of data transformations. It facilitates the data analysis process as well.

In the second part of the Altair series, I focus on how filtering and data transformations are used in the visualizations.

I will be writing more tutorials about Altair. Stay tuned for more advanced features of this library.

Thank you for reading. Please let me know if you have any feedback.

