
Visualizing Multidimensional Categorical Data using Plotly

On the Example of the Famous Mushroom Dataset

Poisonous versus Edible Mushrooms by Evgeniy

Introduction

Data visualization is a key skill for any data expert and is applied in a wide variety of areas, from scientific research to industrial applications. Hence, it is not surprising that multiple tools have evolved over the years to ease the development of data visualizations. In this article, I would like to show you a few examples of how you can visualize categorical data using the popular Plotly library. The plots cover one to five dimensions. Additionally, I will provide examples of how you can embed explanations.

Data

First, let’s get the data ready. The dataset was published in the UCI Machine Learning Repository¹ and can also be downloaded from the Kaggle website. Once downloaded, you can use the following code to read the data:

import pandas as pd

# Replace this path with your file path
file_path = r".Datamushrooms.csv"
# read the CSV file into a pandas dataframe, treating "?" as missing (NaN)
df = pd.read_csv(file_path, na_values="?")
# rename the target column
df.rename(columns={"class": "is-edible"}, inplace=True)

Note that the na_values keyword argument was specified with a ? string. This replaces every ? with a NaN value. Additionally, the target variable was renamed from class to is-edible. The intention is to make this small dataset human-readable. Similarly, code was used to replace the default single-letter values with human-readable equivalents. At the end of the article you can find a GitHub link to view the full code.
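The loading and relabeling steps can be sketched on a tiny in-memory sample. This is only an illustration: the three rows and the label_maps dictionary are assumptions made for this example (the full mapping lives in the linked repository), and only two columns are relabeled here.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the raw single-letter encoding.
raw_csv = io.StringIO(
    "class,population,stalk-root\n"
    "p,s,e\n"
    "e,n,?\n"
    "e,v,c\n"
)

# "?" is parsed as NaN instead of being kept as a literal string.
sample_df = pd.read_csv(raw_csv, na_values="?")
sample_df = sample_df.rename(columns={"class": "is-edible"})

# Assumed code-to-label mappings for two of the columns.
label_maps = {
    "is-edible": {"e": "edible", "p": "poisonous"},
    "population": {"n": "numerous", "s": "scattered", "v": "several"},
}
sample_df = sample_df.replace(label_maps)
print(sample_df)
```

The nested dictionary passed to replace is keyed by column name, so each mapping only touches its own column.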

Let’s visually inspect the data:

Mushroom Classification Data-Set - head

One-dimensional Plots: Pie Charts

Now, let’s get started with a simple one-dimensional plot. The pie chart is a classic because it is easy to read and interpret, so it belongs in any categorical data analysis. While you can plot basic pie charts using Plotly Express, the more generic Plotly graph objects library (imported as go) lets you customize your charts with ease. This article mainly uses Plotly go charts.

The pie chart will be drawn using the population feature. First, a group-by on population is performed, followed by size, which counts the number of items (mushrooms) in each group. Then the resulting data is sorted alphabetically by the population name.

## dataframe creation - for plotting
# create new pandas dataframe which contains all counts sorted by population
population_df = (
    df.groupby(["population"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["population"])
)

The unique occurrences of each population string are taken as the labels, and the previously derived Counts are taken as the values. Note that you can specify custom colors as RGB strings. I recommend choosing discrete colors using RapidTables. In this chart, the colors are stored in earth_colors.

Next, the pull keyword argument is specified. The pull of the figure indicates by how far each sector is exploded out of the center of the chart. Pulling can greatly enhance the readability of your chart, especially if you have a lot of sectors. But bear in mind that the chart will become impossible to read if the cardinality of a feature is too high.

## Creating a pie chart
import plotly.graph_objects as go

# create labels using all unique values in the column named "population"
labels = population_df["population"].unique()
# use the group sizes computed above as the pie values
values = population_df["Counts"]
# Custom define a list of colors to be used for the pie chart.
# Note that the same number of colors are specified as there are unique populations. It is not mandatory, but
# will avoid a single color to be used multiple times.
earth_colors = [
    "rgb(210,180,140)",
    "rgb(218,165,32)",
    "rgb(139,69,19)",
    "rgb(175, 51, 21)",
    "rgb(35, 36, 21)",
    "rgb(188,143,143)",
]
# defining the actual figure using the dimension: population
# Note that a pull keyword was specified to explode pie pieces out of the center
fig = go.Figure(
    data=[
        go.Pie(
            labels=labels,
            values=values,
            # pull is given as a fraction of the pie radius
            pull=[0, 0, 0.07, 0.08, 0.02, 0.2],
            # iterate through earth_colors list to color individual pie pieces
            marker_colors=earth_colors,
        )
    ]
)
# Update layout to show a title
fig.update_layout(title_text="Mushroom Population")
# display the figure
fig.show()
Mushroom Population Pie Chart

Mushrooms are classified by the kind of population they live in. This pie chart shows the proportion of mushrooms in each population. Most mushrooms live in groups of several, while less than 15% live in groups that can be described as numerous, abundant, or clustered. Approximately one third of the mushrooms in this dataset occur either solitarily or scattered.
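The percentages a pie chart displays are simply each group's count divided by the total. As a minimal sketch, assuming a small made-up population column (toy_df and its values are not the real dataset), the same group-by and size pattern can be extended with a Percent column:

```python
import pandas as pd

# Hypothetical toy data standing in for the "population" column.
toy_df = pd.DataFrame(
    {
        "population": ["several"] * 5
        + ["solitary"] * 2
        + ["scattered"] * 2
        + ["numerous"]
    }
)

# Same aggregation as for the pie chart: one row per population with its count.
population_share = (
    toy_df.groupby("population")
    .size()
    .reset_index(name="Counts")
    .sort_values(by="population")
)

# Share of each group in percent of all rows.
population_share["Percent"] = (
    100 * population_share["Counts"] / population_share["Counts"].sum()
).round(1)
print(population_share)
```

Having the numbers in a table makes it easier to quote exact shares in the text that accompanies a chart.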

Two-dimensional Plots: Pie Charts

As you likely guessed, this dataset is frequently used to classify mushrooms as either poisonous or edible. Therefore, the curious mind would like to know whether the proportion of mushrooms belonging to each population differs between edible and poisonous mushrooms.

This can be achieved using two pie chart subplots, where each subplot corresponds to either the poisonous or the edible class. First, the data needs to be prepared. As in the previous example, a group-by followed by size is performed. Afterwards, the result is sorted according to the population variable. This time, however, we first use the .loc indexer to filter for either poisonous or edible entries. The result is two pandas data frames summarizing the mushroom counts.

## dataframe creation - for plotting
# create new pandas dataframe which contains all counts filtered by 'is-edible' == "edible" and sorted by population
edible_population_df = (
    df.loc[df["is-edible"] == "edible"]
    .groupby(["population"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["population"])
)
# create new pandas dataframe which contains all counts filtered by 'is-edible' == "poisonous" and sorted by population
poisonous_population_df = (
    df.loc[df["is-edible"] == "poisonous"]
    .groupby(["population"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["population"])
)
# get unique values from the just created pandas dataframes and store them in an array
labels_edible_population = edible_population_df["population"].unique()
labels_poisonous_population = poisonous_population_df["population"].unique()
# get all the counts from the created pandas dataframes and store them in an array
values_edible_population = edible_population_df["Counts"]
values_poisonous_population = poisonous_population_df["Counts"]

The data frames are then passed to the Plotly graph objects for plotting. First, two subplots are created and arranged using the keywords rows, cols and specs. Then the marker colors are defined in the variable earth_colors. Next, the data is passed to the two subplots using add_trace. Again, labels and values are specified, the charts are given names, and we indicate which chart goes to which position. Using update_traces, we make the pie charts even more interactive by displaying hover information to the user. Finally, the figure layout is updated to display a chart title and pie chart descriptions.

## Creating two pie charts
from plotly.subplots import make_subplots

# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{"type": "domain"}, {"type": "domain"}]])
# create an array of colors which will be custom colors to the plot
earth_colors = [
    "rgb(210,180,140)",
    "rgb(218,165,32)",
    "rgb(139,69,19)",
    "rgb(175, 51, 21)",
    "rgb(35, 36, 21)",
    "rgb(188,143,143)",
]
# create traces to specify the various properties of the first pie chart subplot
fig.add_trace(
    go.Pie(
        labels=labels_edible_population,
        values=values_edible_population,
        name="Edible Mushroom",
        marker_colors=earth_colors,
    ),
    1,
    1,
)
# create traces to specify the various properties of the second pie chart subplot
fig.add_trace(
    go.Pie(
        labels=labels_poisonous_population,
        values=values_poisonous_population,
        name="Poisonous Mushroom",
        marker_colors=earth_colors,
    ),
    1,
    2,
)
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=0.4, hoverinfo="label+percent+name")
# adapt layout of the chart for readability
fig.update_layout(
    title_text="Mushroom Population by Edibility",
    # Add annotations in the center of the donut pies.
    annotations=[
        dict(text="Edible", x=0.18, y=0.5, font_size=17, showarrow=False),
        dict(text="Poisonous", x=0.82, y=0.5, font_size=17, showarrow=False),
    ],
)
fig.show()
Mushroom Population by Edibility Pies

Indeed, the proportions differ considerably between poisonous and edible mushrooms! The first thing that catches the eye is that almost three quarters of all poisonous mushrooms live in populations described as "numerous". On the other hand, only about 10% of edible mushrooms fall into the "numerous" sector. Furthermore, all solitary mushrooms are edible. In conclusion, if we were to design a classification algorithm, we would know that "population" is indeed an important feature with respect to the target variable.
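The shares that the two pies display can also be checked numerically. The following is a minimal sketch using pandas crosstab with row-wise normalization; toy_df and its eight rows are made up for illustration and do not reproduce the real proportions.

```python
import pandas as pd

# Hypothetical toy data: four edible and four poisonous mushrooms.
toy_df = pd.DataFrame(
    {
        "is-edible": ["edible"] * 4 + ["poisonous"] * 4,
        "population": [
            "solitary", "solitary", "several", "numerous",
            "several", "several", "several", "scattered",
        ],
    }
)

# normalize="index" divides each row by its row total, so every row
# holds the population shares within one edibility class.
shares = pd.crosstab(toy_df["is-edible"], toy_df["population"], normalize="index")
print(shares.round(2))
```

Each row of the resulting table corresponds to one of the two pie charts, which makes it a convenient cross-check for statements about the plot.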

Three-dimensional Plots: Bar Charts

How many dimensions of categorical data can be illustrated without diminishing clarity? Plotting three categories is still straightforward when using bar charts. In one-dimensional bar charts, the category is on the x-axis and the numbers are on the y-axis. To add more dimensions, one can make use of coloring and shading.

In the following bar plot, the mushroom characteristic cap-shape is plotted on the x-axis. The color is given by is-edible and the pattern by gill-size. As before, the result is a Plotly figure object; this time it is built with the Plotly Express histogram function, to which we pass the data frame and the various keyword arguments. Note that a discrete color sequence needs to be given in addition to the color argument. Furthermore, Plotly can normalize the histogram values automatically, using either histnorm or barnorm; which one you choose depends on your needs. For more information, see the Plotly documentation on histograms.

In the next step, the layout is updated to display a title text. The x key in the title dictionary does not refer to the data. Instead, it sets the horizontal position of the title, and a value of 0.5 centers it. Again, you can look up all the available options in the documentation.

Last but not least, the axis titles can be set manually with the yaxis_title and xaxis_title arguments. Finally, the update_xaxes method is called so that the categories are sorted in ascending alphabetical order.

## Creating bar chart
import plotly.express as px

# build the histogram figure with Plotly Express
fig = (
    px.histogram(
        df,
        x="cap-shape",
        color="is-edible",
        pattern_shape="gill-size",
        # CSS color names, one per "is-edible" category
        color_discrete_sequence=["mediumvioletred", "seagreen"],
        barmode="relative",
        # barnorm = "percent",
        # histnorm = "probability",
        text_auto=True,
        labels={"cap-shape": "Mushroom Cap-Shape & Gill Size",},
    )
    .update_traces(hoverinfo="all")
    .update_layout(
        # update layout with titles
        title={
            "text": "Percent - Dataset: Cap Shape & Gill Size - Edibility",
            "x": 0.5,
        },
        yaxis_title="Absolute Counts",
    )
    .update_xaxes(categoryorder="category ascending")
)
# display the figure
fig.show()

Shown is the absolute number of poisonous and edible mushrooms according to cap-shape and gill-size. Most mushroom caps (tops) are either convex or flat. Convex, flat, and bell-shaped caps occur on both poisonous and edible mushrooms. Only conical caps occur solely on poisonous mushrooms, while mushrooms with sunken caps are all edible. Note that sunken and conical caps occur the least frequently. Convex and flat cap-shaped mushrooms have both narrow and broad gill sizes in the poisonous as well as the edible category.

Four-dimensional Plots: Sunburst Charts

With an increasing number of dimensions, plots get harder to interpret and to program. The sunburst chart visualizes hierarchical data spanning outwards radially from root to leaves and can be a great option, depending on the dataset.

The sunburst plot with Plotly express is created as described next. The data is passed to the sunburst object. Then a path is specified. The path corresponds to a list of columns which one would like to plot in the sunburst chart. The columns are passed in the order in which one would like them to be drawn. For example, if one wants is-edible to be in the center of the plot, it should be named first.

The values keyword is used to make each wedge sized proportionally to the given value. In this example, the count column is passed to values.

## dataframe creation - for plotting
df_combinations = (
    df.groupby(["is-edible", "population", "habitat", "bruises"])
    .size()
    .reset_index()
    .rename(columns={0: "count"})
)
## Creating sunburst chart
# define figure element
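fig = px.sunburst(
    # use the aggregated combinations and size each wedge by its count
    df_combinations,
    path=["is-edible", "bruises", "population", "habitat"],
    values="count",
    title="Edibility Mushrooms - bruises, population & habitat",
    color="is-edible",
    # map each class to a CSS color name explicitly
    color_discrete_map={"poisonous": "mediumvioletred", "edible": "seagreen"},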
    height=800,
)
# display the figure
fig.show()
Sunburst Chart

The chart shows the number of mushrooms proportionally classified by their characteristics from root node to leaf node. While there is an almost equal number of poisonous and edible mushrooms, one can tell that poisonous mushrooms are bruised much less frequently. Poisonous mushrooms which are not bruised and live in populations of several grow in woods, in leaves, on grass or on paths. Bruised, poisonous mushrooms live in scattered populations and preferably in urban or grassy habitats. Edible mushrooms with bruises live mostly in solitary or scattered populations and most frequently in wooded areas. Edible mushrooms without bruises live mostly in scattered and abundant populations and prefer habitats of grasses and leaves. In conclusion, the sunburst chart shows how mushrooms differ when grouped by the various characteristics. Notice that the explanation needs to cover a wide range of details.

Five-dimensional Plots: Parallel Coordinates Diagram

The parallel categories diagram (parcats), the categorical counterpart of the parallel coordinates plot, is an elegant tool that can be used to visualize multiple categorical variables.

To generate the plot, first map the values of the target variable to integers. This is done because the category strings cannot be passed directly as the line color of the parcats trace. Then, the five dimension objects are created for the variables that one would like to use. The colors are defined as usual, except that the previously created variable is-edible-bool is passed to color. Finally, the figure object is defined, passing the dimension variables and specifying various other keyword arguments to increase the readability of the chart.

## Creating parallel categories chart
# map the target to integers (edible -> 1, poisonous -> 0) for the line coloring
df["is-edible-bool"] = df["is-edible"].map({"edible": 1, "poisonous": 0})
# Create dimensions
# stalk-shape
stalk_shape_dim = go.parcats.Dimension(
    values=df["stalk-shape"], categoryorder="category ascending", label="Stalk-Shape"
)
# stalk-root
stalk_root_dim = go.parcats.Dimension(values=df["stalk-root"], label="Stalk-Root")
# stalk-surface-above-ring
stalk_surface_above_ring_dim = go.parcats.Dimension(
    values=df["stalk-surface-above-ring"], label="Stalk-Surface-above-Ring"
)
# stalk-surface-below-ring
stalk_surface_bellow_ring_dim = go.parcats.Dimension(
    values=df["stalk-surface-below-ring"], label="Stalk-Surface-below-Ring"
)
# is-edible
edible_dim = go.parcats.Dimension(
    values=df["is-edible"],
    label="Is Edible",
    categoryarray=["edible", "poisonous"],
    ticktext=["edible", "poisonous"],
)
# Create parcats trace
color = df["is-edible-bool"]
colorscale = [[0, "mediumvioletred"], [1, "seagreen"]]
# create figure object
fig = go.Figure(
    data=[
        go.Parcats(
            dimensions=[
                stalk_shape_dim,
                stalk_surface_above_ring_dim,
                stalk_root_dim,
                stalk_surface_bellow_ring_dim,
                edible_dim,
            ],
            line={"color": color, "colorscale": colorscale},
            hoveron="color",
            hoverinfo="count+probability",
            labelfont={"size": 18, "family": "Times"},
            tickfont={"size": 16, "family": "Times"},
            arrangement="freeform",
        )
    ]
)
# display the figure
fig.show()
Parallel Coordinates Plot

A mushroom distinguishes itself from other mushrooms through its stalk-shape, stalk-surface-above-the-ring, stalk-root, and stalk-surface-below-the-ring. Can those features explain a mushroom’s edibility? It does not seem to make much of a difference whether a mushroom’s stalk-shape is enlarging or tapering. However, the surface above the "mushroom-ring" gives more information. If we were to select a mushroom with a smooth or fibrous stalk-surface, we could feel more confident that the mushroom is edible. A silky surface above the mushroom-ring should be avoided. The "stalk-root" should be club or equal. If the stalk surface below the ring is smooth, chances are again good that an edible mushroom has been found. Taken in combination, the features increase the chance of distinguishing poisonous from edible mushrooms.

Visualizing categorical data in a higher-dimensional space is a challenge. If you are interested in how other people have solved plotting multidimensional data, I can recommend the article "The Art of Effective Visualization of Multi-dimensional Data" by Dipanjan (Matplotlib & Seaborn). For more detailed information on how to work with Plotly graphs, I can recommend this guide.

Hopefully you found this article entertaining and learned something new. The code to reproduce the plots can be found in my GitHub repo. Note that Plotly is open source and can be used for both commercial and private purposes.

I would like to give a big thanks to Evgeniy, who was very kind and provided the artwork for this article. If you enjoy his paintings, follow him on Instagram.

Edible Mushroom by Evgeniy

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

