5 Unconventional Ways to Visualize your Data in Python
"Spaceborn" visualizations applied to a UFO dataset

While bar charts, histograms, scatter plots, line charts, and box plots are widespread and efficient tools for displaying data and finding patterns in it, there are other graphs, less popular but still very useful for creating excellent visualizations. In this article, we’re going to explore the following ones:
1. Stem Plot
2. Word Cloud
3. Treemap
4. Venn Diagram
5. Swarm Plot
To make our experiments with these plots more interesting, we’ll apply them to another kind of lesser-known objects: the unidentified flying ones 🛸. For this purpose, we’ll use the Kaggle dataset UFO sightings 1969 to 2019 reported in North America.
First, we’ll import the dataset and do some essential cleaning. The province abbreviations were sorted out based on the corresponding Wikipedia pages for the USA and Canada.
import pandas as pd
import numpy as np
df = pd.read_csv('nuforc_reports.csv')
print('Number of UFO sightings:', len(df), '\n')
print(df.columns.tolist())
Output:
Number of UFO sightings: 88125
['summary', 'city', 'state', 'date_time', 'shape', 'duration', 'stats', 'report_link', 'text', 'posted', 'city_latitude', 'city_longitude']
Data cleaning:
# Leaving only the necessary columns
df = df[['city', 'state', 'date_time', 'shape', 'text']]
# Removing rows with missing values
df = df.dropna(axis=0).reset_index(drop=True)
# Fixing an abbreviation duplication issue
df['state'] = df['state'].apply(lambda x: 'QC' if x=='QB' else x)
# Creating a list of Canadian provinces
canada = ['ON', 'QC', 'AB', 'BC', 'NB', 'MB',
'NS', 'SK', 'NT', 'NL', 'YT', 'PE']
# Creating new columns: `country`, `year`, `month`, and `time`
df['country'] = df['state'].apply(
lambda x: 'Canada' if x in canada else 'USA')
df['year'] = df['date_time'].apply(lambda x: x[:4]).astype(int)
df['month'] = df['date_time'].apply(lambda x: x[5:7]).astype(int)
df['month'] = df['month'].replace({1: 'Jan', 2: 'Feb', 3: 'Mar',
4: 'Apr', 5: 'May', 6: 'Jun',
7: 'Jul', 8: 'Aug', 9: 'Sep',
10: 'Oct', 11: 'Nov', 12: 'Dec'})
df['time'] = df['date_time'].apply(lambda x: x[-8:-6]).astype(int)
# Dropping an already used column
df = df.drop(['date_time'], axis=1)
# Dropping duplicated rows
df = df.drop_duplicates().reset_index(drop=True)
print('Number of UFO sightings after data cleaning:', len(df), '\n')
print(df.columns.tolist(), '\n')
print(df.head(3))
Output:
Number of UFO sightings after data cleaning: 79507
['city', 'state', 'shape', 'text', 'country', 'year', 'month', 'time']
city state shape
0 Chester VA light
1 Rocky Hill CT circle
2 Ottawa ON teardrop
text country year month time
0 My wife was driving southeast on a fa... USA 2019 Dec 18
1 I think that I may caught a UFO on th... USA 2019 Mar 18
2 I was driving towards the intersectio... Canada 2019 Apr 2
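The string slicing used in the cleaning step relies on every `date_time` value having the same fixed-width layout. As a minimal alternative sketch, assuming ISO-like timestamps such as `2019-12-12T18:43:00` (the sample values and the `month_number` column name below are made up for illustration), the column can be parsed with `pd.to_datetime` and read through the `.dt` accessor:

```python
import pandas as pd

# Hypothetical sample in the assumed 'YYYY-MM-DDTHH:MM:SS' layout
sample = pd.DataFrame({'date_time': ['2019-12-12T18:43:00',
                                     '2019-03-22T18:30:00',
                                     '2019-04-17T02:00:00']})

# Parsing once, then extracting the parts we need
parsed = pd.to_datetime(sample['date_time'])
sample['year'] = parsed.dt.year
sample['month_number'] = parsed.dt.month
sample['time'] = parsed.dt.hour  # hour of the day, 0-23
```

Parsing has the side benefit of raising an error on malformed timestamps instead of silently slicing the wrong characters.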
Now we have a cleaned dataset of 79,507 UFO sightings that occurred from 1969 to 2019 inclusive in the USA and Canada. It turns out that the vast majority of them (96%) were reported in the USA:
round(df['country'].value_counts(normalize=True)*100)
Output:
USA 96.0
Canada 4.0
Name: country, dtype: float64
Let’s finally start our ufological experiments.
1. Stem Plot
A stem plot is essentially a modified bar plot. It’s a good alternative to both bar plots (especially those with a lot of bars, or with bars of similar length) and pie plots, since it helps to maximize the data-ink ratio of a chart, making it more readable and comprehensible.
To create a stem plot, we can use the stem() function, or the hlines() and vlines() functions. The stem() function plots vertical lines at each x location from the baseline to y, and places a marker there.
We’ll start by creating a basic stem plot of UFO occurrences by month, adding only some common matplotlib customization. For a classical (horizontal) stem plot, where the categories run along the x-axis, we can use either stem() or vlines() – the result will be the same.
import matplotlib.pyplot as plt
import seaborn as sns
# Creating a series object for UFO occurrences by month, in %
months = df['month'].value_counts(normalize=True)[
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']]*100
# Defining a function for creating and customizing a figure in matplotlib (will be used for the next 3 plots)
def create_customized_fig():
    fig, ax = plt.subplots(figsize=(12,6))
    plt.title('UFO occurrences by month, %', fontsize=27)
    plt.ylim(0,15)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    ax.tick_params(bottom=False)
    sns.despine()
    return fig, ax
# PLOTTING
create_customized_fig()
# Creating a stem plot
plt.stem(months.index, months)
# ALTERNATIVE WAY TO CREATE A STEM PLOT
# plt.vlines(x=months.index, ymin=0, ymax=months)
# plt.plot(months.index, months, 'o')
plt.show()

We see that the majority of UFO sightings in the USA and Canada fall in the summer and autumn months, with a maximum of around 12% in July, while in the winter-spring period there is much less activity, with a minimum of 5% in February.
There are a few optional parameters for adjusting a stem plot:
- linefmt – a string defining the properties of the vertical lines (color or line style). The lines can be solid ('-'), dashed ('--'), dash-dot ('-.'), dotted (':'), or there can be no lines at all.
- markerfmt – a string defining the properties of the markers at the stem heads: 'o', '*', 'D', 'v', 's', 'x', etc., including ' ' for the absence of markers.
- basefmt – a string defining the properties of the baseline.
- bottom – the y-position of the baseline.
Let’s apply them to our plot:
# Creating and customizing a figure in matplotlib
create_customized_fig()
# Creating and customizing a stem plot
plt.stem(months.index, months,
linefmt='C2:', # line color and style
markerfmt='D',
basefmt=' ')
plt.show()

There are also some other properties, such as linewidth and markersize, not included in the standard keyword arguments of the stem() function. To tune them, we have to create markerline, stemlines, and baseline objects:
# Creating and customizing a figure in matplotlib
create_customized_fig()
# Creating `markerline`, `stemlines`, and `baseline` objects
# with the same properties as in the code above
markerline, stemlines, baseline = plt.stem(months.index, months,
linefmt='C2:',
markerfmt='D',
basefmt=' ')
# Advanced stem plot customization
plt.setp(markerline, markersize=10)
plt.setp(stemlines, 'linewidth', 5)
markerline.set_markerfacecolor('yellow')
plt.show()

Finally, we can consider creating a vertical stem plot, with the categories along the y-axis and horizontal stems. By default, the stem() function draws only vertical lines, so here we use hlines() in combination with plot() instead. Apart from the required parameters y, xmin, and xmax, we can also tune the optional parameters color and linestyle ('solid', 'dashed', 'dashdot', 'dotted'). In addition, we have plenty of options to adjust in the plot() function itself, including colors, markers, and lines.
Let’s create a vertical stem plot for the UFO shape frequency distribution, to check whether some shapes are more common than the others:
# Creating a series of shapes and their frequencies
# in ascending order
shapes = df['shape'].value_counts(normalize=True,
ascending=True)*100
fig, ax = plt.subplots(figsize=(12,9))
# Creating a vertical stem plot
plt.hlines(y=shapes.index,
xmin=0, xmax=shapes,
color='slateblue',
linestyle='dotted', linewidth=5)
plt.plot(shapes, shapes.index,
'*', ms=17,
c='darkorange')
plt.title('UFO shapes by sighting frequency, %', fontsize=29)
plt.xlim(0,25)
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)
ax.tick_params()
sns.despine()
plt.show()

We see that UFOs, according to their witnesses, can take a wide range of incredible forms, including diamonds, cigars, chevrons, teardrops, and crosses. By far the most frequent form (22%), however, is described as just a light.
Here a vertical stem plot looks like a better choice, since the names of the shapes are rather long, and in a horizontal plot they would have to be rotated vertically, reducing their readability.
As a reminder, for creating horizontal stem plots, we can use a similar function vlines() instead of stem(). All the parameters are the same as for hlines(), except for the "mirrored" required parameters x, ymin, and ymax.
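In matplotlib 3.4 and later, stem() also accepts an orientation parameter, so the hlines() plus plot() combination is not the only route to horizontal stems. The sketch below is a minimal illustration under that version assumption; the labels and values are hypothetical and not taken from the dataset, and numeric positions with explicit tick labels are used to keep the example simple.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical shape frequencies (in %), made up for illustration
labels = ['cross', 'cigar', 'diamond', 'light']
values = [1.0, 3.5, 2.0, 22.0]
positions = list(range(len(labels)))

fig, ax = plt.subplots(figsize=(8, 4))
# orientation='horizontal' draws the stems along the x-axis
# (this parameter requires matplotlib 3.4 or later)
container = ax.stem(positions, values,
                    orientation='horizontal', basefmt=' ')
ax.set_yticks(positions)
ax.set_yticklabels(labels)

# Each stem is a segment [[x_start, y], [x_end, y]]
segments = container.stemlines.get_segments()
```

On older matplotlib versions the call fails with an unexpected keyword argument, in which case hlines() with plot() remains the way to go.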
That’s enough stem plot customization. Let’s learn something else about our alien friends.
2. Word Cloud
A word cloud is a visualization of text data, where the size of each word indicates its frequency. Using it, we can find the most important words in any piece of text.
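To make the idea of "size indicates frequency" concrete, here is a minimal sketch of the counting step that underlies any word cloud, written with the standard library only. The descriptions and the stopword set are hypothetical, and the tokenization is deliberately naive; it is not how the wordcloud library processes text internally.

```python
import re
from collections import Counter

# Hypothetical sighting descriptions, made up for illustration
descriptions = [
    'A bright light moving in the sky',
    'Bright white object in the sky',
    'Red light, then a white light',
]
# Hypothetical stopwords to leave out of the count
stopwords = {'a', 'in', 'the', 'then'}

words = []
for description in descriptions:
    # Lowercase the text and keep alphabetic tokens only
    for word in re.findall(r'[a-z]+', description.lower()):
        if word not in stopwords:
            words.append(word)

frequencies = Counter(words)
top_three = frequencies.most_common(3)
```

Here the word light occurs three times, so it would be drawn largest. A dictionary like this can also be passed to the generate_from_frequencies() method of a WordCloud object instead of raw text.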
Let’s analyze all the descriptions of UFO sightings given by American witnesses. For this purpose, we’ll install and import the wordcloud library (installation: pip install wordcloud), and create a basic graph:
from wordcloud import WordCloud, STOPWORDS
# Gathering sighting descriptions from all American witnesses
text = ''
for t in df[df['country']=='USA'].loc[:, 'text']:
    text += ' ' + t
fig = plt.subplots(figsize=(10,10))
# Creating a basic word cloud
wordcloud = WordCloud(width=1000, height=1000,
collocations=False).generate(text)
plt.title('USA collective description of UFO', fontsize=27)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
# Saving the word cloud
wordcloud.to_file('wordcloud_usa.png')

The most common words are light, object, and sky, followed by bright, time, moving, white, red, craft, star. Among the most frequent words, there are some low-informative ones, like one, second, saw, see, seen, looked, etc. We can assume that American witnesses mostly observed bright craft objects of white or red color, moving in the sky and emitting light.
In the word cloud above, we used the following parameters:
- width and height – the width and height of the word cloud canvas.
- collocations – whether to include collocations of two words. We set it to False to avoid word duplication in the resulting graph.
To add more advanced functionality to our word cloud, we can use the following parameters:
- colormap – a matplotlib colormap to draw colors from for each word.
- background_color – word cloud background color.
- stopwords – the words to be excluded from the analysis. The library already has the built-in STOPWORDS list containing some low-informative words like how, not, the, etc. This list can be supplemented with a user word list, or replaced with it.
- prefer_horizontal – the ratio of times to try horizontal fitting as opposed to vertical. If this parameter is less than 1, the algorithm will try rotating the word if it doesn’t fit.
- include_numbers – whether to include numbers as phrases or not (False by default).
- random_state – a seed number used for always reproducing the same cloud.
- min_word_length – a minimum number of letters a word must have to be included.
- max_words – a maximum number of words to display in the word cloud.
- min_font_size and max_font_size – minimum and maximum font sizes to be used for displaying words.
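When supplementing a built-in stopword set with a user list, it matters which set method is used, because union() and update() behave differently. The sketch below uses a small hypothetical set as a stand-in for the library’s STOPWORDS, so it runs without wordcloud installed:

```python
# Hypothetical stand-in for the library's built-in stopword set
builtin_stopwords = {'how', 'not', 'the'}
user_stopwords = ['saw', 'see', 'seen']

# union() returns a new set and leaves the original untouched
combined = builtin_stopwords.union(user_stopwords)

# update() changes a set in place and returns None
# (applied to a copy here, so the original stays intact)
result_of_update = builtin_stopwords.copy().update(user_stopwords)
```

Passing the return value of update() as an argument therefore passes None, while union() hands over the combined set without modifying the shared built-in one.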
Armed with this new information, let’s create a more tuned word cloud. We’ll add a colormap for the words and background color, reduce the maximum number of words from 200 (by default) to 100, consider only the words with 3+ letters (to avoid words like u and PD), allow more vertical words (0.85 instead of the default 0.9), exclude some low-informative words from the analysis and ensure the replicability of the word cloud.
This time, however, we’re curious to know Canadian people’s collective opinion about UFO:
# Gathering sighting descriptions from all Canadian witnesses
text = ''
for t in df[df['country']=='Canada'].loc[:, 'text']:
    text += ' ' + t
# Creating a user stopword list
stopwords = ['one', 'two', 'first', 'second', 'saw', 'see', 'seen',
'looked', 'looking', 'look', 'went', 'minute', 'back',
'noticed', 'north', 'south', 'east', 'west', 'nuforc',
'appeared', 'shape', 'side', 'witness', 'sighting',
'going', 'note', 'around', 'direction', 'approximately',
'still', 'away', 'across', 'seemed', 'time']
fig = plt.subplots(figsize=(10,10))
# Creating and customizing a word cloud
wordcloud = WordCloud(width=1000, height=1000,
collocations=False,
colormap='cool',
background_color='yellow',
stopwords=STOPWORDS.union(stopwords),
prefer_horizontal=0.85,
random_state=100,
max_words=100,
min_word_length=3).generate(text)
plt.title('Canadian collective description of UFO', fontsize=27)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
# Saving the word cloud
wordcloud.to_file('wordcloud_canada.png')

It seems that the descriptions given by Canadian people are rather similar to those from Americans, with the addition of some other frequent words: orange, plane, night, minutes, seconds, cloud, flying, speed, sound. We can assume that Canadians witnessed bright craft objects, of white, red, or orange color, mostly at night time, moving/flying in the sky and emitting light and, probably, sound. At first, the objects looked like stars, planes, or clouds, and the whole process lasted several seconds to minutes.
The difference between Canadian and American collective descriptions can be partially explained by adding some more words to the stopword list. Or, maybe, "Canadian" aliens are really more orange, plane- or cloud-like, and noisy 😀
3. Treemap
A treemap is a visualization of hierarchical data as a set of nested rectangles, where the area of each rectangle is proportional to the value of the corresponding data. In other words, treemaps show what the whole data consists of, and can be a good alternative to pie charts.
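The statement that each rectangle’s area is proportional to its value can be sketched with plain arithmetic. The counts and the canvas size below are hypothetical, and the sketch covers only the area normalization, not the layout algorithm that decides where each rectangle goes:

```python
# Hypothetical sighting counts per state, made up for illustration
counts = {'CA': 50, 'FL': 25, 'WA': 15, 'TX': 10}

# Hypothetical canvas dimensions
canvas_width, canvas_height = 12, 6
canvas_area = canvas_width * canvas_height
total = sum(counts.values())

# Each rectangle gets a share of the canvas equal to its share of the total
areas = {state: canvas_area * count / total
         for state, count in counts.items()}
```

With these numbers, CA accounts for half of all sightings and so receives half of the canvas, twice the area of FL.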
Let’s find out which states of the USA UFOs especially prefer to visit. We’ll install and import the squarify library (installation: pip install squarify), and create a basic treemap:
import squarify
# Extract the data
states = df[df['country']=='USA'].loc[:, 'state'].value_counts()
fig = plt.subplots(figsize=(12,6))
# Creating a treemap
squarify.plot(sizes=states.values, label=states.index)
plt.title('UFO sighting frequencies by state, the USA', fontsize=27)
plt.axis('off')
plt.show()

Looks like California is a real extraterrestrial base in the USA! It’s followed with a big gap by Florida, Washington, and Texas, while the territories of District of Columbia and Puerto Rico are visited by UFO very rarely.
The parameters sizes and label used above represent the numeric input for squarify and the corresponding label text. Other parameters that can be adjusted:
- color – a user list of colors for the rectangles,
- alpha – a parameter regulating color intensity,
- pad – whether to draw rectangles with a small gap between them,
- text_kwargs – a dictionary of keyword arguments (color, fontsize, fontweight, etc.) to tune the label text properties.
Let’s check at what hours the most and the fewest aliens were seen, and in the meantime practice the optional parameters:
import matplotlib
# Extracting the data
hours = df['time'].value_counts()
# Creating a list of colors from 2 matplotlib colormaps
# `Set3` and `tab20`
cmap1 = matplotlib.cm.Set3
cmap2 = matplotlib.cm.tab20
colors = []
for i in range(len(hours.index)):
    colors.append(cmap1(i))
    if cmap2(i) not in colors:
        colors.append(cmap2(i))
fig = plt.subplots(figsize=(12,6))
# Creating and customizing a treemap
squarify.plot(sizes=hours.values, label=hours.index,
color=colors, alpha=0.8,
pad=True,
text_kwargs={'color': 'indigo',
'fontsize': 20,
'fontweight': 'bold'})
plt.title('UFO sighting frequencies by hour', fontsize=27)
plt.axis('off')
plt.show()

The respondents from our dataset mostly observed UFO in the time range from 20:00 till 23:00, or, more generally, from 19:00 till midnight. The least "UFO-prone" hours are 07:00–09:00. However, it doesn’t necessarily mean the "lack of aliens" in certain hours of the day and instead can be explained more pragmatically: usually people have free time in the evening after work, while in the morning the majority of people are going to work and are a bit too immersed in their thoughts to notice interesting phenomena around them.
4. Venn Diagram
A Venn diagram shows the relationships between several datasets, where each group is displayed as an area-weighted circle, and the overlaps (if any) of the circles represent the intersections between the corresponding datasets and their sizes. In Python, we can use the matplotlib-venn library to create Venn diagrams for 2 or 3 datasets. For the first case, the package provides the venn2 and venn2_circles functions, for the second – venn3 and venn3_circles.
Let’s practice this tool on 2 subsets from our UFO dataset. For example, we want to extract the data for all cross-shaped and cigar-shaped UFO sightings (for simplicity, we’ll call them from now on crosses and cigars) that occurred in North America in the last 5 years (which in the context of our dataset means from 2015 to 2019 inclusive), and check if there are some cities where both shapes were observed in that period. Let’s install and import the matplotlib-venn library (installation: pip install matplotlib-venn), and create a basic Venn diagram for crosses and cigars:
from matplotlib_venn import venn2, venn2_circles, venn3, venn3_circles
# Creating the subsets for crosses and cigars
crosses = df[(df['shape']=='cross')&
(df['year']>=2015)&(df['year']<=2019)].loc[:, 'city']
cigars = df[(df['shape']=='cigar')&
(df['year']>=2015)&(df['year']<=2019)].loc[:, 'city']
fig = plt.subplots(figsize=(12,8))
# Creating a Venn diagram
venn2(subsets=[set(crosses), set(cigars)],
set_labels=['Crosses', 'Cigars'])
plt.title('Crosses and cigars by number of cities, 2015-2019',
fontsize=27)
plt.show()

In the period from 2015 till 2019 inclusive, there were 18 cities in North America where both crosses and cigars were registered. In 79 cities, only crosses were observed (from these 2 shapes), in 469 – only cigars.
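The three numbers in such a two-circle diagram come down to ordinary set operations. The sketch below reproduces the logic on small hypothetical sets of city names (the variable names are illustrative, not the ones from the dataset code):

```python
# Hypothetical sets of cities, made up for illustration
crosses_cities = {'Austin', 'Boise', 'Denver'}
cigars_cities = {'Denver', 'Miami', 'Reno', 'Tulsa'}

# Cities where only one of the two shapes was seen
only_crosses = crosses_cities - cigars_cities
only_cigars = cigars_cities - crosses_cities
# Cities where both shapes were seen
both = crosses_cities & cigars_cities

# Region sizes in the order: left only, right only, intersection
region_sizes = (len(only_crosses), len(only_cigars), len(both))
```

As far as I know, venn2() also accepts such a tuple of three region sizes as its subsets argument, which is handy when the counts are precomputed.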
Now, we’re going to add one more exotic UFO shape from our collection – diamonds – and apply some customization to the Venn diagram. Earlier, we already used the self-explanatory optional parameter set_labels. In addition, we can pass to the venn2() and venn3() functions:
- set_colors – a list of colors of the circles, based on which the colors of the intersections will be computed,
- alpha – a parameter regulating color intensity, 0.4 by default.
The other 2 functions – venn2_circles() and venn3_circles() – serve to adjust the circumferences of the circles using the parameters color, alpha, linestyle (or ls), and linewidth (or lw).
# Creating a subset for diamonds
diamonds = df[(df['shape']=='diamond')&
(df['year']>=2015)&(df['year']<=2019)].loc[:, 'city']
# Creating a list of subsets
subsets=[set(crosses), set(cigars), set(diamonds)]
fig = plt.subplots(figsize=(15,10))
# Creating a Venn diagram for the 3 subsets
venn3(subsets=subsets,
set_labels=['Crosses', 'Cigars', 'Diamonds'],
set_colors=['magenta', 'dodgerblue', 'gold'],
alpha=0.3)
# Customizing the circumferences of the circles
venn3_circles(subsets=subsets,
color='darkviolet', alpha=0.9,
ls='dotted', lw=4)
plt.title('Crosses, cigars, and diamonds \nby number of cities, 2015-2019', fontsize=26)
plt.show()

Hence, in the period of interest there were 6 cities in North America where all 3 shapes were registered, 66 cities – where only cigars and diamonds, 260 – only diamonds, etc. Let’s check those 6 cities in common for all the 3 shapes:
print(set(crosses) & set(cigars) & set(diamonds))
Output:
{'Albuquerque', 'Rochester', 'Staten Island', 'Lakewood', 'Savannah', 'New York'}
All of them are located in the USA.
Venn diagrams can be further beautified through the get_patch_by_id() and get_label_by_id() methods. They allow us to select any of the diagram zones by its id and change its color (set_color()) and transparency (set_alpha()), or change its text (set_text()) and adjust the font size (set_fontsize()). The possible values of id for a two-circle Venn diagram are '10', '01', '11', for a three-circle one – '100', '010', '001', '110', '101', '011', '111'. The logic behind these values is the following:
- the number of digits reflects the number of circles,
- each digit represents a dataset (subset) in the order of their assignment,
- 1 means the presence of a dataset in the zone, while 0 – the absence.
For example, '101' is related to the zone where the 1st and 3rd datasets are present, and the 2nd is absent in a three-circle diagram, i.e. to the intersection of the 1st and the 3rd circles excluding the 2nd one. In our case, it’s the crosses-diamonds intersection, which is equal to 9 cities where only these two shapes were observed in the period of interest.
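The id logic described above can be sketched as a small helper that, for any list of sets, counts the elements falling into each zone. The function name region_sizes and the city sets are hypothetical; this is an illustration of the naming scheme, not code from the matplotlib-venn library.

```python
from itertools import product

def region_sizes(sets):
    """Return {zone id: number of elements} for every id containing a '1'."""
    universe = set().union(*sets)
    sizes = {}
    for flags in product('01', repeat=len(sets)):
        region_id = ''.join(flags)
        if '1' not in region_id:
            continue  # '000' lies outside every circle
        # An element belongs to the zone if its membership in each set
        # matches the corresponding digit of the id
        members = {item for item in universe
                   if all((item in s) == (flag == '1')
                          for s, flag in zip(sets, flags))}
        sizes[region_id] = len(members)
    return sizes

# Hypothetical sets of cities, made up for illustration
a = {'Albany', 'Boston', 'Dallas'}
b = {'Boston', 'Dallas', 'Eugene'}
c = {'Dallas', 'Fresno'}
sizes = region_sizes([a, b, c])
```

Here Boston is in the 1st and 2nd sets only, so it is counted under '110', while Dallas, present in all three, falls under '111'.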
Let’s try to change the color of the intersection zones of our Venn diagram and add short pieces of text instead of numbers to the zones representing only one shape. Furthermore, to make it more fun, let it be not just boring text, but some ASCII art symbols reflecting each shape:
fig = plt.subplots(figsize=(15,10))
# Assigning the Venn diagram to a variable
v = venn3(subsets=subsets,
set_labels=['Crosses', 'Cigars', 'Diamonds'],
set_colors=['magenta', 'dodgerblue', 'gold'],
alpha=0.3)
# Changing the color of the intersection zones
v.get_patch_by_id('111').set_color('white')
v.get_patch_by_id('110').set_color('lightgrey')
v.get_patch_by_id('101').set_color('lightgrey')
v.get_patch_by_id('011').set_color('lightgrey')
# Changing text and font size
v.get_label_by_id('100').set_text('✠')
v.get_label_by_id('100').set_fontsize(25)
v.get_label_by_id('010').set_text('(̅_̅_̅_̅(̅_̅_̅_̅_̅_̅_̅_̅_̅̅_̅()~~~')
v.get_label_by_id('010').set_fontsize(9)
v.get_label_by_id('001').set_text('♛')
v.get_label_by_id('001').set_fontsize(35)
# Customizing the circumferences of the circles
venn3_circles(subsets=subsets,
color='darkviolet', alpha=0.9,
ls='dotted', lw=4)
plt.title('Crosses, cigars, and diamonds \nby number of cities, 2015-2019', fontsize=26)
plt.show()

Finally, it’s possible to adjust any of the circles separately, assigning the result of the venn3_circles() function to a variable and then referring to the circles by index (0, 1, or 2, in case of a three-circle Venn diagram). The methods to be used here are self-explanatory and similar to the ones discussed above: set_color(), set_edgecolor(), set_alpha(), set_ls(), and set_lw().
Let’s emphasize the circle for diamonds (everybody likes diamonds! 🙂💎 )
##### PREVIOUS CODE #####
fig = plt.subplots(figsize=(15,10))
# Assigning the Venn diagram to a variable
v = venn3(subsets=subsets,
set_labels=['Crosses', 'Cigars', 'Diamonds'],
set_colors=['magenta', 'dodgerblue', 'gold'],
alpha=0.3)
# Changing the color of the intersection zones
v.get_patch_by_id('111').set_color('white')
v.get_patch_by_id('110').set_color('lightgrey')
v.get_patch_by_id('101').set_color('lightgrey')
v.get_patch_by_id('011').set_color('lightgrey')
# Changing text and font size
v.get_label_by_id('100').set_text('✠')
v.get_label_by_id('100').set_fontsize(25)
v.get_label_by_id('010').set_text('(̅_̅_̅_̅(̅_̅_̅_̅_̅_̅_̅_̅_̅̅_̅()~~~')
v.get_label_by_id('010').set_fontsize(9)
v.get_label_by_id('001').set_text('♛')
v.get_label_by_id('001').set_fontsize(35)
##### NEW CODE #####
# Assigning the Venn diagram circles to a variable
c = venn3_circles(subsets=subsets,
color='darkviolet', alpha=0.9,
ls='dotted', lw=4)
# Changing the circle for diamonds by index
c[2].set_color('gold')
c[2].set_edgecolor('darkgoldenrod')
c[2].set_alpha(0.6)
c[2].set_ls('dashed')
c[2].set_lw(6)
plt.title('Crosses, cigars, and diamonds \nby number of cities, 2015-2019', fontsize=26)
plt.show()

5. Swarm Plot
While its more famous "relative", the box plot, is great at displaying the overall distribution statistics, and the less known violin plot describes the distribution of the data for one or several categories, the underestimated swarm plot provides some additional information about the dataset. Namely, it gives us an idea of:
- the sample size,
- the overall distribution of a numeric variable across one or more categories,
- where exactly the individual observations are located in the distribution.
The points in a swarm plot are adjusted along the categorical axis so that they stay close to each other without overlapping. Consequently, this plot works well only with a relatively small number of data points, while for larger samples violin plots are more suitable (they, on the contrary, require a sufficient number of data points to avoid misleading estimations). Also, as we’ll see soon, swarm plots are good for distinguishing individual data points from different groups (optimally, no more than 3 groups).
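To get a feeling for how points can be spread along the categorical axis without overlapping, here is a deliberately simplified sketch. It only handles repeated discrete values (such as integer years) by stacking them alternately on both sides of the centre line; it is an assumption-laden toy, not the algorithm seaborn actually uses, and the function name swarm_offsets is hypothetical.

```python
from collections import defaultdict

def swarm_offsets(values, step=1.0):
    """Assign each value an offset so that equal values do not overlap."""
    seen = defaultdict(int)
    offsets = []
    for value in values:
        k = seen[value]  # how many equal values came before this one
        seen[value] += 1
        if k == 0:
            offsets.append(0.0)  # the first point sits on the centre line
            continue
        # Later points alternate: +1, -1, +2, -2, ... steps from the centre
        magnitude = (k + 1) // 2
        sign = 1 if k % 2 == 1 else -1
        offsets.append(sign * magnitude * step)
    return offsets

# Hypothetical sighting years, made up for illustration
years = [2015, 2015, 2015, 2019, 2008, 2015]
offsets = swarm_offsets(years)
```

The more sightings share a year, the wider the stack grows, which is exactly why a swarm plot reveals the sample size and why it becomes unwieldy for large samples.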
A swarm plot can be a good alternative or supplement to a box plot or a violin plot.
Let’s extract a couple of relatively small subsets from our UFO dataset, create for them swarm plots, and compare them with box and violin plots. In particular, we can select one state from the USA and one from Canada, extract all the UFO sightings of conic or cylindric shapes for both, and observe the corresponding data point distribution along the years (from 1969 till 2019). From our treemap experiments, we remember that the biggest number of UFO sightings in the USA was registered in California. Let’s now find the leader in Canada:
df[df['country']=='Canada'].loc[:, 'state'].value_counts()[:3]
Output:
ON 1363
BC 451
AB 369
Name: state, dtype: int64
So, we’ll select California from the USA and Ontario from Canada as the candidates for our further plotting. First, let’s extract the data and create for it basic swarm plots, superimposed on the corresponding box plots for comparison:
# Extracting the data for cylinders and cones
# from California and Ontario
CA_ON_cyl_con = df[((df['state']=='CA')|(df['state']=='ON'))&
((df['shape']=='cylinder')|(df['shape']=='cone'))]
fig = plt.subplots(figsize=(12,7))
sns.set_theme(style='white')
# Creating swarm plots
sns.swarmplot(data=CA_ON_cyl_con,
x='year', y='state',
palette=['deeppink', 'blue'])
# Creating box plots
sns.boxplot(data=CA_ON_cyl_con,
x='year', y='state',
palette=['palegreen', 'lemonchiffon'])
plt.title('Cylinders and cones in California and Ontario',
fontsize=29)
plt.xlabel('Years', fontsize=18)
plt.ylabel('States', fontsize=18)
sns.despine()
plt.show()
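A side note on the filtering step: since & binds more tightly than | in Python, each group of alternatives needs its own pair of parentheses. A minimal sketch on a hypothetical five-row frame shows the explicit form next to an equivalent and arguably more readable one based on isin():

```python
import pandas as pd

# Hypothetical mini-dataset, made up for illustration
sample = pd.DataFrame({
    'state': ['CA', 'ON', 'TX', 'CA', 'ON'],
    'shape': ['cylinder', 'cone', 'cone', 'light', 'cylinder'],
})

# Explicit form: every OR-group is wrapped in its own parentheses
mask_explicit = (((sample['state'] == 'CA') | (sample['state'] == 'ON')) &
                 ((sample['shape'] == 'cylinder') | (sample['shape'] == 'cone')))

# Equivalent form using isin() for each list of alternatives
mask_isin = (sample['state'].isin(['CA', 'ON']) &
             sample['shape'].isin(['cylinder', 'cone']))

selected = sample[mask_isin]
```

Both masks keep the rows at index 0, 1, and 4 and drop the Texan cone and the Californian light.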

We can make the following observations here:
- Since the numeric variable in question (year) is an integer, the data points are aligned.
- Both subsets are quite different in terms of their sample size. It’s clearly seen on the swarm plots, while the box plots hide this information.
- The Californian subset is heavily left-skewed and contains a lot of outliers.
- None of the box plots gives us an idea about the underlying data distributions. In the case of the Californian subset, the swarm plot shows that there are a lot of conic or cylindric UFO related to the 3rd quartile of the distribution, as well as to the most recent year, 2019.
- We definitely should add to our "wish list" the possibility to distinguish between cylinders and cones for each subset.
So, our next steps will be:
- to exclude the outliers from the visualization and zoom it in on the x-axis,
- to add the hue parameter to the swarm plots, to be able to display the second categorical variable (shape).
fig = plt.subplots(figsize=(12,7))
# Creating swarm plots
sns.swarmplot(data=CA_ON_cyl_con,
x='year', y='state',
palette=['deeppink', 'blue'],
hue='shape')
# Creating box plots
sns.boxplot(data=CA_ON_cyl_con,
x='year', y='state',
palette=['palegreen', 'lemonchiffon'])
plt.title('Cylinders and cones in California and Ontario',
fontsize=29)
plt.xlim(1997,2020)
plt.xlabel('Years', fontsize=18)
plt.ylabel('States', fontsize=18)
plt.legend(loc='upper left', frameon=False, fontsize=15)
sns.despine()
plt.show()

Now both swarm plots show that the vast majority of UFOs in these 2 subsets are cylinders. For the Californian subset, we can distinguish the years of particularly frequent occurrences of cylindric/conic UFOs: 2008, 2015, and 2019. Moreover, in 2015, we observe an unexpected boom of cones, even though they are much rarer in general.
Let’s now set the box plots aside and compare swarm and violin plots for each subset. This time, though, we’ll customize the swarm plots a bit more, using some of the parameters below:
- order, hue_order – the order to plot the categorical variables in. If we create a swarm-box hybrid plot like above (or swarm-violin), we have to apply this order also to the second type of plot.
- dodge – assigning it to True will separate the strips for different hue levels (if applicable) along the categorical axis.
- marker, color, alpha, size, edgecolor, linewidth – marker style ('o' by default), color, transparency, radius (5 by default), edge color ('gray' by default), and edge width (0 by default).
- cmap – a colormap name.
fig = plt.subplots(figsize=(12,7))
# Creating and customizing swarm plots
sns.swarmplot(data=CA_ON_cyl_con,
x='year', y='state',
palette=['deeppink', 'blue'],
hue='shape',
marker='D',
size = 8,
edgecolor='green',
linewidth = 0.8)
# Creating violin plots
sns.violinplot(data=CA_ON_cyl_con,
x='year', y='state',
palette=['palegreen', 'lemonchiffon'])
plt.title('Cylinders and cones in California and Ontario', fontsize=29)
plt.xlim(1997,2020)
plt.xlabel('Years', fontsize=18)
plt.ylabel('States', fontsize=18)
plt.legend(loc='upper left', frameon=False, fontsize=15)
sns.despine()
plt.show()

Here we can make the following observations:
- As it was with the box plots, the violin plots don’t reflect the sample size of both subsets.
- The violin plots don’t distinguish between cylinders and cones.
We could resolve the last issue by creating grouped violin plots instead (using the parameters split and hue). However, given that our subsets are already rather small, splitting them for grouped violin plots would further decrease the sample size and data density of each part, making these plots even less representative. Hence, in such cases, swarm plots look like a better choice.
Conclusion
To sum up, we’ve explored five rarely used plot types, their application cases, limitations, alternatives, ways of customization, and the approaches to analyze the resulting graphs. Besides, we’ve investigated a little bit the mysterious world of UFOs.
If by any chance, there are some extraterrestrial beings reading this right now, then I would like to thank them for visiting our planet every now and again. Please next time come also to my country, probably I will be able to visualize you better 👽🎨 .
Thank you, dear reader, for your attention. I hope you enjoyed my article and found something useful for you.