
A boxplot is one of the most popular ways to display the statistical summary of a dataset. However, sometimes we need to visualize additional statistical information and take a more granular look at the data. This is where other types of charts come into play: violin, strip, and swarm plots, as well as their hybrids, the most interesting of which is the raincloud plot. In this article, we’ll explore these alternatives to a boxplot in Python’s seaborn library and find out in which cases each of them is most applicable.
For our experiments, we’ll use one of seaborn’s example datasets, diamonds. Let’s load it and take a quick look at it:
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic command; omit this line when running outside a notebook.
%matplotlib inline
diamonds = sns.load_dataset('diamonds')
print(f'Number of diamonds: {diamonds.shape[0]:,}\n'
      f"Diamond cut types: {diamonds['cut'].unique().tolist()}\n"
      f"Diamond colors: {sorted(diamonds['color'].unique().tolist())}\n\n"
      f'{diamonds.head(3)}\n')
Output:
Number of diamonds: 53,940
Diamond cut types: ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']
Diamond colors: ['D', 'E', 'F', 'G', 'H', 'I', 'J']
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
The dataset is rather big. Let’s narrow our focus to the diamonds of ideal or premium cut that weigh more than 2 carats, and work only with this smaller dataset. After all, we’re interested only in the best diamonds! 😀
df = diamonds[((diamonds['cut']=='Ideal')|(diamonds['cut']=='Premium')) & (diamonds['carat']>2)]
print(f'Number of diamonds in "df": {df.shape[0]:,}')
Output:
Number of diamonds in "df": 1,216
Boxplot
Now, we can create a boxplot for the price range of each diamond color category. The colors are denoted by capital letters, and we can find more information on diamond color grading scales in this Wikipedia article. According to that notation, the diamonds in our dataset are all colorless or near-colorless.
The main purpose of a boxplot is to show the five-number summary of a dataset: the minimum and maximum values, the median, and the first (Q1) and third (Q3) quartiles. In addition, it displays the upper and lower outliers (if any), and we can optionally add a sixth statistic to the graph, the mean value:
sns.set_style('white')
plt.figure(figsize=(12, 7))
sns.boxplot(x='price', y='color', data=df, color='yellow', width=0.6, showmeans=True)
# Create a function to customize the axes of all the subsequent graphs in a uniform way.
def add_cosmetics(title='Prices by color for ideal/premium cut diamonds > 2 ct',
                  xlabel='Price, USD', ylabel='Color'):
    plt.title(title, fontsize=28)
    plt.xlabel(xlabel, fontsize=20)
    plt.ylabel(ylabel, fontsize=20)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    sns.despine()
add_cosmetics()

Apart from customizing the figure and axes, it took practically one line of seaborn code to create the boxplots above. We adjusted only the plot color and width and added the mean value to each box.
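To make the statistics behind each box more tangible, here is a minimal sketch that computes them by hand with the standard library. It assumes seaborn's default whisker rule, where the whiskers reach the most extreme observations lying within 1.5 interquartile ranges of the quartiles and anything beyond is drawn as an outlier. The function name `box_stats` and the toy prices are made up for illustration and are not taken from the diamonds dataset.

```python
import statistics


def box_stats(values):
    """Return the quantities that a default box-and-whisker plot encodes."""
    data = sorted(values)
    # The 'inclusive' method interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        'mean': statistics.fmean(data),
        'whisker_low': inside[0],
        'whisker_high': inside[-1],
        'outliers': [v for v in data if v < low_fence or v > high_fence],
    }


# Toy, left-skewed prices (hypothetical values, not from the diamonds dataset).
toy_prices = [4000, 12000, 14000, 15000, 15500,
              16000, 16500, 17000, 17500, 18000]
stats = box_stats(toy_prices)
for name, value in stats.items():
    print(f'{name}: {value}')
```

In this toy sample the single very low price ends up as a lower outlier and pulls the mean below the median, which is the same pattern we are about to read from the real boxplots.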
The boxplots above clearly show the overall statistics for the price range of each color category. Moreover, from their shape, the presence of lower outliers, and the mean values being lower than the medians in almost all cases, we can suppose that the distribution of prices in each category is left-skewed, meaning that most diamonds are priced toward the upper end of the range. However, we can’t understand the actual shape and structure of the underlying data distributions by looking at these plots alone. For example, is the distribution for a particular color category unimodal or multimodal? How many observations does each category contain? Are the sample sizes comparable across the categories? Where exactly are the individual observations located in each distribution?
Let’s see if creating a violin plot helps us answer these questions.
Violin plot
A violin plot is similar to a box plot and shows the same statistical summary for a dataset, except that it also displays the kernel density plot of the underlying data:
plt.figure(figsize=(12, 8))
sns.violinplot(x='price', y='color', data=df, color='yellow', cut=0)
add_cosmetics()

We only adjusted the cut parameter, setting it to 0. This constrains each violin within the range of the actual data, without extending it outwards.
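The effect of cut is easier to see with a hand-rolled density estimate. Seaborn documents cut as the distance, in units of bandwidth, by which the density is extended past the extreme data points. The sketch below is only an illustration of that idea: it builds a plain Gaussian kernel density estimate with a Scott's-rule bandwidth (an assumption of this sketch, not a claim about seaborn's internals), and the function name `kde_on_grid` and the toy prices are hypothetical.

```python
import math
import statistics


def kde_on_grid(values, cut=2.0, points=200):
    """Evaluate a Gaussian KDE on an evenly spaced grid.

    The grid spans the data range, extended by `cut` bandwidths on each side.
    """
    n = len(values)
    # Scott's rule of thumb for the bandwidth (an assumption of this sketch).
    bandwidth = statistics.stdev(values) * n ** (-1 / 5)
    low = min(values) - cut * bandwidth
    high = max(values) + cut * bandwidth
    step = (high - low) / (points - 1)
    grid = [low + i * step for i in range(points)]
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    # Sum one Gaussian bump per observation at every grid position.
    density = [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values) / norm
        for g in grid
    ]
    return grid, density


# Toy prices (hypothetical values, not from the diamonds dataset).
toy_prices = [4000, 12000, 14000, 15000, 15500,
              16000, 16500, 17000, 17500, 18000]
grid_cut0, _ = kde_on_grid(toy_prices, cut=0)
grid_cut2, _ = kde_on_grid(toy_prices, cut=2)
print(f'cut=0 spans {grid_cut0[0]:,.0f} to {grid_cut0[-1]:,.0f}')
print(f'cut=2 spans {grid_cut2[0]:,.0f} to {grid_cut2[-1]:,.0f}')
```

With cut=0 the curve starts and ends exactly at the smallest and largest observation, while a positive cut lets it run past them, possibly into price ranges that contain no diamonds at all.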
Returning to our questions above, we can say that, in addition to the overall statistics for each category taken from the "mini-boxplots" in the center of each violin, we see now the shape of each distribution. And yes, our assumption about left-skewed distributions is now perfectly confirmed.
What about the structure of the underlying data by category? We can tune the inner parameter to visualize the location and density of the observations inside each violin:
plt.figure(figsize=(12, 8))
sns.violinplot(x='price', y='color', data=df, color='yellow', cut=0,
inner='stick')
add_cosmetics()

Now we see the observation density along each category’s range. Evidently, there are far fewer diamonds of D and E colors than of H and I, even though the corresponding distribution shapes look very similar.
However, having tuned that parameter, we can no longer see the miniature boxplot inside each violin. Furthermore, we still can’t see each underlying data point.
Strip and swarm plots
These two types of plots are implementations of a scatterplot for a categorical variable, i.e., they both show the inner structure of a distribution, in particular its sample size and the location of the individual observations. The main difference is that in a swarm plot, the data points don’t overlap because they are adjusted along the categorical axis. In a strip plot, the issue of overlapping points can be partially fixed by setting the alpha parameter, which regulates point transparency.
Let’s compare these plots:
plt.figure(figsize=(16, 11))
plt.subplot(2, 1, 1)
sns.stripplot(x='price', y='color', data=df, color='blue',
alpha=0.3, size=4)
add_cosmetics(xlabel=None)
plt.subplot(2, 1, 2)
sns.swarmplot(x='price', y='color', data=df, color='blue', size=4)
add_cosmetics(title=None)
plt.tight_layout()

The main drawback of both strip and swarm plots is that they work well only on relatively small datasets. Also, they don’t show the five-number descriptive statistics, as boxplots and violin plots do.
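To illustrate what "adjusted along the categorical axis" means for a swarm plot, here is a deliberately simplified greedy layout. It is not seaborn's actual algorithm, only a sketch of the idea: every point keeps its value and is nudged sideways in whole point diameters until it no longer touches any point placed before it. The function name `swarm_offsets` and the sample values are hypothetical.

```python
import math


def swarm_offsets(values, diameter):
    """Greedy swarm-like layout.

    Returns (value, offset) pairs, where the offset along the categorical
    axis is chosen so that no two points are closer than `diameter`.
    """
    placed = []  # (value, offset) pairs that are already positioned
    for v in sorted(values):
        step = 0
        while True:
            # Try the center line first, then +1, -1, +2, -2, ... diameters.
            if step == 0:
                candidates = [0.0]
            else:
                candidates = [step * diameter, -step * diameter]
            free = [
                c for c in candidates
                if all(math.hypot(v - pv, c - pc) >= diameter
                       for pv, pc in placed)
            ]
            if free:
                placed.append((v, free[0]))
                break
            step += 1
    return placed


# Values expressed in units of one point diameter (hypothetical data).
sample = [1.0, 1.2, 1.3, 3.0, 3.1, 5.0]
layout = swarm_offsets(sample, diameter=1.0)
for value, offset in layout:
    print(f'value={value:.1f}  offset={offset:+.1f}')
```

In a strip plot all these points would sit on (or be randomly jittered around) the same line and partly hide each other. The sketch also hints at why swarm plots scale poorly: the denser the data, the further the points have to spread along the categorical axis.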
Hybrid plots
To avoid losing valuable information and to combine the strengths of different chart types, we can consider creating hybrid plots. Let’s combine, for example, violin and swarm plots for each category:
plt.figure(figsize=(15, 8))
sns.violinplot(x='price', y='color', data=df, color='yellow', cut=0)
sns.swarmplot(x='price', y='color', data=df, color='blue')
add_cosmetics()

We clearly see now that the inner structure of the violins varies considerably across the categories, despite their external shapes being rather comparable. In practice, for the D and E color categories with very few data points, creating violin plots doesn’t make sense and can even lead to wrong conclusions. However, for the categories with many data points, the combination of swarm and violin plots helps us understand the bigger picture.
It’s worth noting that on the graph above, we can hardly see the mini-boxplots, which are now covered by points (unless we decide to introduce the alpha parameter), so we’ll remove the boxes. Also, let’s add another dimension to the swarm plot by distinguishing between the data points for ideal and premium diamond cuts:
plt.figure(figsize=(15, 8))
sns.violinplot(x='price', y='color', data=df, color='yellow',
cut=0, inner=None)
sns.swarmplot(x='price', y='color', hue='cut', data=df,
palette=['blue', 'deepskyblue'])
plt.legend(frameon=False, fontsize=15, loc='upper left')
add_cosmetics()

We can observe that the relatively "cheap" diamonds are mostly of premium cut, instead of the higher-classified ideal cut.
Strip and swarm plots are good for distinguishing individual data points from different groups if the number of groups doesn’t exceed three. For the same purpose, we could try instead another approach: creating grouped violin plots for ideal and premium cuts separately by color category. However, considering that some of our color categories are already very small, splitting them for creating grouped violin plots would lead to a further decrease of the sample size and data density of each part, making such plots even less representative. Hence, in this case, strip and swarm plots look like a better choice.
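Before splitting violins by group, it can help to check how many observations each part would contain. The sketch below counts observations per color and per (color, cut) pair with the standard library, using made-up records in place of the rows of `df`. The threshold `MIN_POINTS` is an arbitrary, hypothetical cutoff chosen only for the example. On the real data, something like `df['color'].value_counts()` or `pd.crosstab(df['color'], df['cut'])` should give the same kind of table.

```python
from collections import Counter

# Hypothetical (color, cut) records standing in for the rows of `df`.
records = (
    [('D', 'Ideal')] * 1 + [('D', 'Premium')] * 1
    + [('H', 'Ideal')] * 4 + [('H', 'Premium')] * 5
    + [('I', 'Ideal')] * 2 + [('I', 'Premium')] * 3
)

MIN_POINTS = 3  # arbitrary cutoff for this example, not a statistical rule

per_color = Counter(color for color, _ in records)
per_group = Counter(records)

# Groups that would be too small for a density estimate to be meaningful.
too_small = sorted(group for group, n in per_group.items() if n < MIN_POINTS)

print('Observations per color:', dict(per_color))
print('Groups below the cutoff:', too_small)
```

If many groups fall below whatever cutoff seems sensible for the task, showing the individual points, as strip and swarm plots do, is usually the more honest option.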
There is one type of hybrid plot that deserves special attention, so let’s discuss it in more detail.
Raincloud plot
A raincloud plot is essentially a combination of half-violin, box, and strip plots. Placed in this order from top to bottom, these plots together resemble a raincloud, hence the name of the hybrid plot. Unfortunately, there is no predefined solution for this kind of plot, neither in seaborn nor in Python in general (at least for now, and at least in an easy-to-use and comprehensible form). Hence, we’ll create it from scratch, combining and tuning the available tools. The technical details at each step are explained in the code comments:
plt.figure(figsize=(15, 10))
# Create violin plots without mini-boxplots inside.
ax = sns.violinplot(x='price', y='color', data=df,
color='mediumslateblue',
cut=0, inner=None)
# Clip the lower half of each violin.
for item in ax.collections:
    x0, y0, width, height = item.get_paths()[0].get_extents().bounds
    item.set_clip_path(plt.Rectangle((x0, y0), width, height/2,
                                     transform=ax.transData))
# Create strip plots with partially transparent points of different colors depending on the group.
num_items = len(ax.collections)
sns.stripplot(x='price', y='color', hue='cut', data=df,
              palette=['blue', 'deepskyblue'], alpha=0.4, size=7)
# Shift each strip plot strictly below the corresponding violin
# (move only the categorical y-coordinate and leave the prices untouched).
for item in ax.collections[num_items:]:
    item.set_offsets(item.get_offsets() + [0, 0.15])
# Create narrow boxplots on top of the corresponding violin and strip plots, with thick lines, the mean values, without the outliers.
sns.boxplot(x='price', y='color', data=df, width=0.25,
showfliers=False, showmeans=True,
meanprops=dict(marker='o', markerfacecolor='darkorange',
markersize=10, zorder=3),
boxprops=dict(facecolor=(0,0,0,0),
linewidth=3, zorder=3),
whiskerprops=dict(linewidth=3),
capprops=dict(linewidth=3),
medianprops=dict(linewidth=3))
plt.legend(frameon=False, fontsize=15, loc='upper left')
add_cosmetics()

From the raincloud plots above, we can extract full statistical information about the price range for each color category: the overall five-number statistics, the mean value, the distribution shape, the sample size, the inner structure of the underlying data including the location of the individual data points and distinguishing between two different groups inside each category. Then, we can compare the color categories to understand their relationships and general trends.
To create a vertical raincloud plot, we have to introduce some minor changes in the code above. In particular, we have to substitute x with y and vice versa when creating each type of the inner plots and to clip the right half of each violin (i.e., divide the width by 2 and leave the height untouched). As for the decorative adjustments, we have to exchange x- and y-axis labels, and put the legend in the lower-left corner:
plt.figure(figsize=(15, 10))
# Create violin plots without mini-boxplots inside.
ax = sns.violinplot(y='price', x='color', data=df,
color='mediumslateblue',
cut=0, inner=None)
# Clip the right half of each violin.
for item in ax.collections:
    x0, y0, width, height = item.get_paths()[0].get_extents().bounds
    item.set_clip_path(plt.Rectangle((x0, y0), width/2, height,
                                     transform=ax.transData))
# Create strip plots with partially transparent points of different colors depending on the group.
num_items = len(ax.collections)
sns.stripplot(y='price', x='color', hue='cut', data=df,
              palette=['blue', 'deepskyblue'], alpha=0.4, size=7)
# Shift each strip plot to the right of the corresponding violin
# (move only the categorical x-coordinate and leave the prices untouched).
for item in ax.collections[num_items:]:
    item.set_offsets(item.get_offsets() + [0.15, 0])
# Create narrow boxplots on top of the corresponding violin and strip plots, with thick lines, the mean values, without the outliers.
sns.boxplot(y='price', x='color', data=df, width=0.25,
showfliers=False, showmeans=True,
meanprops=dict(marker='o', markerfacecolor='darkorange',
markersize=10, zorder=3),
boxprops=dict(facecolor=(0,0,0,0),
linewidth=3, zorder=3),
whiskerprops=dict(linewidth=3),
capprops=dict(linewidth=3),
medianprops=dict(linewidth=3))
plt.legend(frameon=False, fontsize=15, loc='lower left')
add_cosmetics(xlabel='Color', ylabel='Price, USD')

Of course, we can easily make all the previous graphs vertical as well by substituting x with y and vice versa, exchanging the x- and y-axis labels, and moving the legend (where applicable).
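One possible way to avoid editing every call by hand is to keep the plotting arguments in a dictionary and swap the two axis entries. The helper `to_vertical` below is hypothetical (it is not part of seaborn) and only rearranges keyword arguments, so it runs without any plotting library. The commented lines show how the result might be passed on to seaborn together with the article's `df` and `add_cosmetics`.

```python
def to_vertical(plot_kwargs):
    """Return a copy of the keyword arguments with x and y exchanged."""
    swapped = dict(plot_kwargs)
    swapped['x'], swapped['y'] = plot_kwargs.get('y'), plot_kwargs.get('x')
    return swapped


# Arguments of the horizontal boxplot from the beginning of the article.
horizontal = dict(x='price', y='color', color='yellow',
                  width=0.6, showmeans=True)
vertical = to_vertical(horizontal)
print(vertical)

# With the article's `df`, the vertical boxplot could then be drawn as:
# sns.boxplot(data=df, **vertical)
# add_cosmetics(xlabel='Color', ylabel='Price, USD')
```

The axis labels and the legend position still have to be adjusted separately, as described above.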
Conclusion
In this article, we explored various alternatives to a boxplot in the seaborn library of Python, namely violin, strip, and swarm plots, and their hybrids, including a raincloud plot as a particular case. We discussed the strengths and limitations of each of these types of visualizations, how each of them can be tuned and what kind of information it can reveal. Finally, we considered the modifications to apply for rotating the plots vertically.
Selecting the right type of graphs for a real-world task doesn’t necessarily mean trying to display all possible information from the data. Instead, it depends on the task itself and on the data available. Sometimes, creating just a boxplot is more than enough, while in other cases, we have to dig deeper into the data to obtain meaningful insights and discover hidden trends.
Thanks for reading!