
7 Visualizations with Python to Handle Multivariate Categorical Data

Ideas for displaying complex categorical data in simple ways.

Photo by Kaizen Nguyễn on Unsplash

Common datasets used for analysis, such as the well-known iris or penguins datasets, are fairly simple since they have only a few categorical variables. Real-world data, however, can be more complex and contain more than two levels of categories.

Multivariate categorical data is data described by several categorical variables at once. For example, let’s think about grouping people. We may end up with many possible groups, since a person can be characterized by categories such as gender, nationality, salary range, or educational level. Vehicles also have diverse categorical variables, such as brand, country of origin, fuel type, and segment.

Examples of visualization to display multivariate categorical data in this article. Images by Author.

Conducting exploratory data analysis (EDA) with data visualization is recommended to help understand the data. Bar charts and pie charts are basic choices for plotting simple categorical data. However, displaying multivariate categorical data can be more complicated, since there are many levels of categorical variables. This article therefore presents charts that can express data with multiple levels of categories.


Getting data

Start by importing the libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

This article will work with a mock-up dataset containing five categorical variables. The generated dataset contains grocery customer information: location, product, payment, gender, and age range. Each categorical variable can be generated using Python’s random library.
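As a minimal sketch of how such a dataset could be generated: the dictionary names dict_loca, dict_prod, and dict_gender are used later in this article, while dict_pay, dict_age, every category label, and the sample size of 1,000 are assumptions made for illustration only.

```python
import random

import pandas as pd

random.seed(42)  # fixed seed so the mock data is reproducible

# Hypothetical category levels (labels are illustrative, not from the article)
dict_loca = {0: "Downtown", 1: "Suburb", 2: "Countryside"}
dict_prod = {0: "Fruit", 1: "Dairy", 2: "Bakery"}
dict_pay = {0: "Cash", 1: "Card"}
dict_gender = {0: "Female", 1: "Male"}
dict_age = {0: "18-30", 1: "31-45", 2: "46-60", 3: "60+"}

n_customers = 1000  # assumed sample size


def draw(levels, n):
    """Draw n labels uniformly at random from a {code: label} dictionary."""
    return random.choices(list(levels.values()), k=n)


# One row per customer, one column per categorical variable
df = pd.DataFrame({
    "location": draw(dict_loca, n_customers),
    "product": draw(dict_prod, n_customers),
    "payment": draw(dict_pay, n_customers),
    "gender": draw(dict_gender, n_customers),
    "age": draw(dict_age, n_customers),
})
df.head()
```

Uniform sampling keeps the sketch short; weights could be passed to random.choices to make some categories more common than others.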

You can skip the next step if you want to try the visualization code with another multivariate categorical dataset.

Let’s group the DataFrame with groupby to obtain the frequency of every combination of categories. After that, add the obtained result to the DataFrame.
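One possible way to do this is sketched below. The names df_m and 'freq' match the ones used later in this article; the small stand-in dataset and the choice to keep one row per combination are assumptions.

```python
import random

import pandas as pd

random.seed(0)

# Stand-in for the mock dataset (category labels are illustrative)
levels = {
    "location": ["Downtown", "Suburb", "Countryside"],
    "product": ["Fruit", "Dairy", "Bakery"],
    "payment": ["Cash", "Card"],
    "gender": ["Female", "Male"],
    "age": ["18-30", "31-45", "46-60", "60+"],
}
cols = list(levels)
df = pd.DataFrame({c: random.choices(v, k=1000) for c, v in levels.items()})

# Count how many customers fall into each combination of the five categories
df_m = df.groupby(cols).size().reset_index(name="freq")
df_m.head()
```

Each row of df_m is now one combination of categories together with its frequency, which is the shape most of the charts below expect.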

Now that the DataFrame is ready, let’s continue to the visualization part.


Data Visualization

This article will cover 7 visualizations to display the multivariate categorical data. Each one will be explained with the concept, the Python code, and the obtained result.

Let’s get started…

1. Build a multilevel pie chart with a Sunburst chart

Basically, a sunburst chart is a multilevel pie chart. Because it can express multiple levels of data in one chart, it is a good option for displaying multivariate categorical data or hierarchical data. Within the same level, the area of each item represents its share relative to the other items.

One limitation of the sunburst chart is that the annotations become dense when there are too many categories in a level. This can be mitigated by using a color scale to distinguish the values or by creating an interactive sunburst chart that can be filtered.

We will use Plotly, a powerful Python library for data visualization. An advantage of Plotly is that it makes interactive charts easy to create.

Ta-da!!

Using a sunburst chart to display multivariate categorical data. Image by Author.

The following picture shows how the interactive function works.

Image by Author.

2. Using multiple rectangle areas in a Treemap chart

A treemap chart shares much the same concept as the sunburst chart, except that the plotting area changes from a circle to a rectangle. It is a good option for making the most of the plotting area, since it can occupy more of the available space than the previous chart.

Plotly also provides a function that facilitates creating a treemap with interactive functions quickly.

Using a treemap chart to display multivariate categorical data. Image by Author.

As with the sunburst chart, the color scale helps us distinguish the frequency values.


3. Applying cartesian product and subplots with a Heatmap chart

A heatmap is a two-dimensional chart that uses color to represent data values. To show multilevel categories with this chart (five levels in this article), we need to use multiple subplots and the Cartesian product of the categories. Note that two categories must be left out of the product, since they serve as the two axes of each heatmap.

The itertools library can be used to generate a cartesian product list. The following code shows how to get the product from ‘location,’ ‘product,’ and ‘gender.’ Each heatmap will display the frequency of ‘age’ and ‘payment.’ The categories in cartesian product can be changed; please feel free to modify the code below.

import itertools
pair_loca_prod_gend = list(itertools.product(dict_loca.values(),
                                             dict_prod.values(),
                                             dict_gender.values()))
pair_loca_prod_gend

Apply the obtained cartesian product with subplots to show multiple heatmaps. We will use the heatmap function from Seaborn to plot the result.
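A minimal sketch of this step is shown below. The 3 x 6 subplot grid, the shared color scale, and the stand-in data (a plain levels dictionary instead of the dict_loca, dict_prod, and dict_gender dictionaries) are assumptions.

```python
import itertools
import random

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

random.seed(0)

# Stand-in for the frequency table df_m (category labels are illustrative)
levels = {
    "location": ["Downtown", "Suburb", "Countryside"],
    "product": ["Fruit", "Dairy", "Bakery"],
    "payment": ["Cash", "Card"],
    "gender": ["Female", "Male"],
    "age": ["18-30", "31-45", "46-60", "60+"],
}
df = pd.DataFrame({c: random.choices(v, k=1000) for c, v in levels.items()})
df_m = df.groupby(list(levels)).size().reset_index(name="freq")

# Three variables form the Cartesian product; 'age' and 'payment' are the heatmap axes
combos = list(itertools.product(levels["location"], levels["product"], levels["gender"]))

fig, axes = plt.subplots(3, 6, figsize=(24, 10), sharex=True, sharey=True)
vmax = df_m["freq"].max()  # one color scale for all panels so they stay comparable

for ax, (loca, prod, gend) in zip(axes.flat, combos):
    mask = (
        (df_m["location"] == loca)
        & (df_m["product"] == prod)
        & (df_m["gender"] == gend)
    )
    # Reshape the subset into an age x payment grid; absent combinations become 0
    grid = (
        df_m[mask]
        .pivot(index="age", columns="payment", values="freq")
        .reindex(index=levels["age"], columns=levels["payment"])
        .fillna(0)
    )
    sns.heatmap(grid, annot=True, fmt=".0f", vmin=0, vmax=vmax,
                cmap="Blues", cbar=False, ax=ax)
    ax.set_title(f"{loca} | {prod} | {gend}")

fig.tight_layout()
```

Fixing vmin and vmax across the subplots is a design choice: without it, each heatmap would be colored on its own scale and the panels could not be compared at a glance.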

Voila…!!

Using a heatmap to display multivariate categorical data. Image by Author.

One thing to consider is that heatmaps can compare only two dimensions of the data at a time, while the other categories are used to build the Cartesian product.


4. Back to basics with a Clustered bar chart

Following the same concept as the heatmap approach, a clustered bar chart uses the Cartesian product and subplots to display multiple bar charts for comparing categories. The bar chart is simpler and easier to understand, since it is a basic chart that many people are familiar with.

The code below is almost identical to the previous one, except we use Seaborn’s bar chart function instead of the heatmap function.
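A sketch of that variation follows, again with stand-in data. Putting 'age' on the x-axis and 'payment' as the hue, and keeping a single legend, are assumptions.

```python
import itertools
import random

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

random.seed(0)

# Stand-in for the frequency table df_m (category labels are illustrative)
levels = {
    "location": ["Downtown", "Suburb", "Countryside"],
    "product": ["Fruit", "Dairy", "Bakery"],
    "payment": ["Cash", "Card"],
    "gender": ["Female", "Male"],
    "age": ["18-30", "31-45", "46-60", "60+"],
}
df = pd.DataFrame({c: random.choices(v, k=1000) for c, v in levels.items()})
df_m = df.groupby(list(levels)).size().reset_index(name="freq")

combos = list(itertools.product(levels["location"], levels["product"], levels["gender"]))
fig, axes = plt.subplots(3, 6, figsize=(24, 10), sharex=True, sharey=True)

for i, (ax, (loca, prod, gend)) in enumerate(zip(axes.flat, combos)):
    mask = (
        (df_m["location"] == loca)
        & (df_m["product"] == prod)
        & (df_m["gender"] == gend)
    )
    # One cluster per age range, one bar per payment type
    sns.barplot(data=df_m[mask], x="age", y="freq", hue="payment",
                order=levels["age"], hue_order=levels["payment"], ax=ax)
    ax.set_title(f"{loca} | {prod} | {gend}")
    legend = ax.get_legend()
    if i > 0 and legend is not None:
        legend.remove()  # keep the legend on the first subplot only

fig.tight_layout()
```

Sharing the y-axis across subplots keeps bar heights comparable between panels.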

Using a clustered bar chart to display multivariate categorical data. Image by Author.

5. Stack the bars into a Clustered stacked bar chart

The bar charts can easily be turned into stacked bar charts. Stacked bars are suitable for displaying the total of each category together with the share of each component within the bar.

However, keep in mind that a stacked bar chart can be misleading, since every component except the lowest one starts from a different baseline. This can make it challenging to interpret or compare the components of the stacked bars.

The following code also applies the Cartesian product and subplots, this time with the pandas DataFrame’s plot function, to build a clustered stacked bar chart.
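As a rough sketch of this approach, each subset is pivoted into an age x payment table and passed to DataFrame.plot with stacked=True. The stand-in data and the grid layout are assumptions.

```python
import itertools
import random

import matplotlib.pyplot as plt
import pandas as pd

random.seed(0)

# Stand-in for the frequency table df_m (category labels are illustrative)
levels = {
    "location": ["Downtown", "Suburb", "Countryside"],
    "product": ["Fruit", "Dairy", "Bakery"],
    "payment": ["Cash", "Card"],
    "gender": ["Female", "Male"],
    "age": ["18-30", "31-45", "46-60", "60+"],
}
df = pd.DataFrame({c: random.choices(v, k=1000) for c, v in levels.items()})
df_m = df.groupby(list(levels)).size().reset_index(name="freq")

combos = list(itertools.product(levels["location"], levels["product"], levels["gender"]))
fig, axes = plt.subplots(3, 6, figsize=(24, 10), sharex=True, sharey=True)

for i, (ax, (loca, prod, gend)) in enumerate(zip(axes.flat, combos)):
    mask = (
        (df_m["location"] == loca)
        & (df_m["product"] == prod)
        & (df_m["gender"] == gend)
    )
    # Rows become bars (age ranges); columns become stacked components (payment)
    grid = (
        df_m[mask]
        .pivot(index="age", columns="payment", values="freq")
        .reindex(index=levels["age"], columns=levels["payment"])
        .fillna(0)
    )
    grid.plot(kind="bar", stacked=True, ax=ax, legend=(i == 0), rot=0)
    ax.set_title(f"{loca} | {prod} | {gend}")

fig.tight_layout()
```

The total height of each bar is the frequency of that age range within the subplot, and the colored segments split it by payment type.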

Using a clustered stacked bar chart to display multivariate categorical data. Image by Author.

6. Handle multiple dimensions with a Parallel coordinates diagram

A parallel coordinates diagram can show an n-dimensional space by using multiple vertical axes. These axes all have the same length, are parallel, and are equally spaced. An advantage of this chart is that we can see the flow of data according to the order of the categories.

The chart can be too dense to interpret if we directly plot every frequency value. Thus, before plotting, let’s use the pandas cut function to group the frequencies into ranges.

f_range = pd.cut(x=df_m['freq'], bins=[0, 5, 10, 15, 20, 25])
df_m['freq_range'] = [str(i) for i in f_range] 
df_m.head()

The code below shows how to map the location with values for assigning color in the plot. Next, let’s build a parallel coordinates diagram with Plotly.

Using parallel coordinates diagram to display multivariate categorical data. Image by Author.

7. Showing parts-to-whole relationship with a Mosaic plot

This chart is also known as a Marimekko chart or a percent stacked bar plot. The idea behind the mosaic plot is to show a parts-to-whole relationship, the same as the treemap chart. In the result below, the chart looks like a set of stacked bar charts with varying widths.

To quickly create a mosaic plot, we can use the mosaic function from the Statsmodels library. The function will calculate the frequency from the input categories. Thus, we can use the DataFrame with no frequency variables.

Please note that the maximum number of categories that can be plotted with this function is four, which can be considered a limitation of the mosaic plot.
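A minimal sketch with the mosaic function from statsmodels.graphics.mosaicplot is shown below. The three variables chosen, the gap size, the hidden tile labels, and the stand-in data are assumptions.

```python
import random

import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

random.seed(0)

# Stand-in raw data: one row per customer, no frequency column needed
levels = {
    "location": ["Downtown", "Suburb", "Countryside"],
    "gender": ["Female", "Male"],
    "payment": ["Cash", "Card"],
}
df = pd.DataFrame({c: random.choices(v, k=1000) for c, v in levels.items()})

fig, ax = plt.subplots(figsize=(10, 6))

# index lists the columns to split on, in order; mosaic counts the rows itself.
# The labelizer returns an empty string to hide the text inside each tile.
_, rects = mosaic(
    df,
    index=["location", "gender", "payment"],
    ax=ax,
    gap=0.01,
    labelizer=lambda key: "",
)
ax.set_title("Location, gender and payment")
```

The returned rects dictionary maps each combination of categories to the position and size of its tile, which can be handy for custom annotations.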

Using a mosaic plot to display multivariate categorical data. Image by Author.

Summary

First, let’s recap the 7 data visualizations that this article has covered:

  • Sunburst chart
  • Treemap chart
  • Heatmap chart
  • Clustered bar chart
  • Clustered stacked bar chart
  • Parallel coordinates diagram
  • Mosaic plot

A closer look at these charts shows that they all have something in common. Not only can they express the levels in the data, but they can also show the ratio or proportion of data in each category. This is why they are suitable for displaying the numerous categories in multivariate categorical data.

Lastly, I’m quite sure there are more charts that could be used than those mentioned here. The charts in this article are just examples using Python. If you have any suggestions, please feel free to leave a comment.

Thanks for reading.


Here are some of my data visualization articles that you may find interesting:

  • 8 Visualizations with Python to Handle Multiple Time-Series Data
  • 7 Visualizations with Python to Express changes in Rank over time
  • 9 Visualizations with Python to show Proportions or Percentages instead of a Pie chart
  • 9 Visualizations with Python that Catch More Attention than a Bar Chart
  • Creating Animation to Show 4 Centroid-Based Clustering Algorithms using Python and Sklearn
