Venn Diagram: A Not So Common Visualization Tool

Examples of using Venn diagrams to aid the EDA process

Elena V Kazakova
Towards Data Science
6 min read · Jun 8, 2021


Introduction

A Venn diagram is a schematic representation of all the possible relationships (union, intersection, difference, symmetric difference) of several subsets of a universal set.
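As a quick illustration of the four relationships named above, here is a minimal sketch using Python's built-in set type. The two sets and their values are made up for the example and are not taken from any dataset in this post.

```python
# Toy sets of student IDs (hypothetical values, for illustration only)
has_tv = {1, 2, 3, 4}
has_internet = {3, 4, 5}

union = has_tv | has_internet                  # members of either set
intersection = has_tv & has_internet           # members of both sets
difference = has_tv - has_internet             # TV but no internet
symmetric_difference = has_tv ^ has_internet   # members of exactly one set

print(union, intersection, difference, symmetric_difference)
```

Each region of a two-circle Venn diagram corresponds to one of these results: the left-only region is the difference, the lens in the middle is the intersection, and everything shaded is the union.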

Venn diagrams (sometimes called Euler-Venn diagrams) have been around as a tool for logic problem solving since at least the 1880s. John Venn introduced the concept in his Symbolic Logic, Chapter V, “Diagrammatic Representation.” Leonhard Euler (hello Prof. Euler, we love you) and Christian Weise used a similar approach in their work even earlier. However, contrary to popular belief (or not so popular, since not everyone lives and breathes logic and set theory), Euler’s version of the diagram is not the same as Venn’s. Euler’s diagrams are sometimes called Euler circles (no, they are not always circular), and they were first created to address Aristotelian syllogistics, while Venn used his to solve mathematical logic problems.

Here’s a simple example of a Data Science Venn diagram. There are numerous versions of it, but I like this one the most:

Data Science Venn diagram, image by Author

I would also add ‘Psychology Knowledge’ and ‘Artistry Skills’ to it but let’s keep it simple. Footnote: yes, data scientists are unicorns; let’s embrace it.

That is probably enough wiki-level content. Now, off we go to explore the more practical side of Venn diagrams: their usage in data visualization.

Use cases

Example #1:

Recently I faced the problem of understanding how certain features in a classification problem relate to each other. The features in the dataset were linked to the socioeconomic statuses of Colombian engineering students. The final goal of the project was to model a student’s performance in professional tests. Most of the features in the dataset were categorical and included, among other things, quality of housing, family income, and socioeconomic levels. Other categories that were explored included whether someone owned a computer, a cell phone, had access to fresh water, and even whether they owned a microwave. Common sense led me to believe that whether or not someone owned a washing machine would largely overlap with the socioeconomic level of a student’s family. Likewise, access to fresh water would be closely tied to the quality of housing. However, I couldn’t be certain that my assumptions about features replicating each other would be correct. Sure, I could’ve used a correlation matrix, but I find that color differences are more difficult to assess than size differences. Besides, I wanted an intuitive way of visualizing the proportion of records with specific features in the overall pool of records. Venn diagrams seemed like a good tool for what I needed. Here’s how it went.

I started with installing the matplotlib-venn package. I prefer doing installations through conda:

conda install -c conda-forge matplotlib-venn

A pip install will do too:

!pip install matplotlib-venn

Followed by importing:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib_venn as venn
from matplotlib_venn import venn2, venn2_circles, venn3, venn3_circles

And finally, the magic command:

%matplotlib inline

The next step was importing the dataset. I had it partially preprocessed already; some records with errors were removed, values were cleaned up, and datatypes were changed from numeric to object. The final dataset can be found in my GitHub repo, data_for_venn_clean.csv.

I continued with creating sets based on conditions to be used with venn2 and venn3. The full code snippet can be accessed in the same directory as a Jupyter notebook or PDF, venn_diagram_examples.ipynb/venn_diagram_examples.pdf. Here is an example of a few of these sets:

TV_set=set(df_venn.loc[df_venn.TV=='Yes']['COD_S11'])
Internet_set=set(df_venn.loc[df_venn.INTERNET=='Yes']['COD_S11'])
...
SEL1_set=set(df_venn.loc[df_venn.SEL==1]['COD_S11'])
SEL2_set=set(df_venn.loc[df_venn.SEL==2]['COD_S11'])
SEL3_set=set(df_venn.loc[df_venn.SEL==3]['COD_S11'])
SEL4_set=set(df_venn.loc[df_venn.SEL==4]['COD_S11'])

I have also built several union sets:

SEL1_and_SEL2_set=SEL_IHE1_set.union(SEL_IHE2_set)
SEL3_and_SEL4_set=SEL_IHE3_set.union(SEL_IHE4_set)
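The set-building pattern above can be tried without the real file. The sketch below uses a small made-up DataFrame that only borrows the column names COD_S11, INTERNET, and SEL from the snippets; the rows, the `df_toy` name, and the `ids_where` helper are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_venn: same column names, made-up rows
df_toy = pd.DataFrame({
    "COD_S11": ["A1", "A2", "A3", "A4", "A5"],
    "INTERNET": ["Yes", "No", "Yes", "Yes", "No"],
    "SEL": [1, 2, 3, 4, 1],
})

def ids_where(df, mask):
    """Return the set of student IDs for rows where the boolean mask is True."""
    return set(df.loc[mask, "COD_S11"])

internet_ids = ids_where(df_toy, df_toy["INTERNET"] == "Yes")

# isin([1, 2]) selects the same rows as building the level-1 and level-2
# sets separately and taking their union
sel_low_ids = ids_where(df_toy, df_toy["SEL"].isin([1, 2]))
sel_high_ids = ids_where(df_toy, df_toy["SEL"].isin([3, 4]))

print(internet_ids, sel_low_ids, sel_high_ids)
```

Using the student ID column as the set element is what makes the later diagrams meaningful: two sets overlap exactly when the same student satisfies both conditions.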

To save space and time, I’ll refer anyone interested in the full description of the data variables to the README.md file in the same repo.

Visualization, Internet access versus SocioEconomic level:

plt.figure(figsize=(10, 10))

# Union sets of the two lower and the two higher socioeconomic levels
SEL1_and_SEL2_set = SEL_IHE1_set.union(SEL_IHE2_set)
SEL3_and_SEL4_set = SEL_IHE3_set.union(SEL_IHE4_set)

ax = plt.gca()
sets = [Internet_set, SEL1_and_SEL2_set, SEL3_and_SEL4_set]
labels = ('Internet access', 'SocioEconomic level 1 or 2', 'SocioEconomic level 3 or 4')

# Draw the three-set diagram, soften the internet-only region, and outline the circles
v = venn3(subsets=sets, set_labels=labels, ax=ax, set_colors=("orange", "blue", "red"))
v.get_patch_by_id('100').set_alpha(0.3)
venn3_circles(subsets=sets, linestyle="dashed", linewidth=1)

# Annotate the region that is only in the second set (level 1 or 2, no internet)
plt.annotate('SocioEconomic level 1 or 2\nHas no internet\n36%',
             xy=v.get_label_by_id('010').get_position() - np.array([0, 0.05]), xytext=(-130, -130),
             ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.4', color='gray'))

# Annotate the region that is only in the third set (level 3 or 4, no internet)
plt.annotate('SocioEconomic level 3 or 4\nHas no internet\n9%',
             xy=v.get_label_by_id('001').get_position() - np.array([0, -0.05]), xytext=(190, 190),
             ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.3', color='gray'))

plt.title('Internet access vs SocioEconomic level', fontsize=20)
plt.show()

Image by Author
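The region IDs used in the plotting code ('100', '010', '001') are three-character strings with one position per set, in the order the sets were passed: a '1' means the region lies inside that set and a '0' means outside. The sketch below reproduces that bookkeeping in plain Python so the counts behind each region can be checked without drawing anything. The function name `venn3_region_sizes` and the toy sets are assumptions made for this example.

```python
def venn3_region_sizes(a, b, c):
    """Count elements per region, keyed by membership in (a, b, c), e.g. '010'."""
    sizes = {}
    for item in a | b | c:
        # Build the ID one character per set: '1' if the item is in it, else '0'
        region_id = "".join("1" if item in s else "0" for s in (a, b, c))
        sizes[region_id] = sizes.get(region_id, 0) + 1
    return sizes

# Hypothetical toy sets of student IDs
internet = {1, 2, 3}
sel_low = {3, 4, 5, 6}
sel_high = {1, 2, 7}

sizes = venn3_region_sizes(internet, sel_low, sel_high)
print(sizes)
```

In this toy data, the '010' region (low socioeconomic level, no internet) holds three students and the '001' region (high level, no internet) holds one, which are the same two regions the annotations in the diagram point at.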

From the diagram above, it’s obvious that the percentage of students with no internet access is almost 3.5 times higher among students from low socioeconomic level families than among students from wealthier families. The same pattern is repeated when it comes to ownership of a computer. In fact, internet access and computer ownership were essentially overlapping:

Image by Author

The diagrams led to my conclusion that there is, indeed, a strong correlation between socioeconomic level and access to the internet/a computer. Also, I learned that owning a computer correlates with internet access availability.
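The comparisons read off the diagrams can also be computed directly from the sets. The following is a minimal sketch with made-up sets, so the numbers it produces are not the ones reported in this post. The helper `share_without` and the overlap measure (intersection size divided by union size) are illustrative choices, not something taken from the original notebook.

```python
def share_without(group, feature):
    """Fraction of `group` members that are not in `feature`."""
    return len(group - feature) / len(group)

# Hypothetical toy sets of student IDs
internet = {1, 2, 3, 4, 5, 6}
computer = {1, 2, 3, 4, 5, 9}
sel_low = {5, 6, 7, 8, 9, 10}
sel_high = {1, 2, 3, 4, 11}

low_rate = share_without(sel_low, internet)    # 4 of 6 low-level students lack internet
high_rate = share_without(sel_high, internet)  # 1 of 5 high-level students lacks internet
ratio = low_rate / high_rate

# Share of students with internet or a computer who have both
overlap = len(internet & computer) / len(internet | computer)

print(round(low_rate, 3), round(high_rate, 3), round(ratio, 2), round(overlap, 3))
```

A ratio well above 1 corresponds to the kind of gap visible between the two "no internet" regions, and an overlap close to 1 corresponds to two circles that nearly coincide.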

I used the same approach to visualize the relationship between access to fresh water and the quality of housing. Colombia uses strata (estrato) to categorize neighborhoods and areas based on several criteria. These numbers are intended to classify properties and other constructions, not necessarily the social status of the people who live there, so income is not factored into the assessment. The generated diagram leads to the conclusion that the quality of housing strongly correlates with access to fresh water. Students from higher strata families were six times more likely to have access to fresh water than their peers from lower strata families.

Image by Author

The visualizations above helped me answer the question “should I leave all of the features in the dataset for building models or should I drop some of them?”

Example #2

I also found Venn diagrams helpful for understanding text data in NLP prior to vectorization.

I won’t show the code here, but anyone interested can find it in the same GitHub repo along with the csv file, ‘satire_nosatire.csv’. It’s a dataset that includes both satirical (The Onion) and real news (Reuters) articles. The entire set of articles is referred to as the corpus. It’s pre-processed and split into two sets, satire and news. Below is a Venn diagram visualizing the sets of satire and news words in the training dataset.

Image by Author
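The actual code lives in the repo, so what follows is only a rough sketch of the idea under my own assumptions: build one vocabulary set per class and count the three regions a two-set diagram would show. The tokenizer regex, the `vocabulary` helper, and the four example sentences are invented for illustration and are not the preprocessing used for the diagram above.

```python
import re

def vocabulary(texts):
    """Lowercase each text and collect its distinct alphabetic tokens."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z']+", text.lower()))
    return words

# Made-up headlines standing in for the satire and news articles
satire_texts = ["Area man wins argument", "Local man loses keys"]
news_texts = ["Markets rally as man wins election", "Oil prices rise"]

satire_vocab = vocabulary(satire_texts)
news_vocab = vocabulary(news_texts)

only_satire = len(satire_vocab - news_vocab)  # words seen only in satire
only_news = len(news_vocab - satire_vocab)    # words seen only in news
shared = len(satire_vocab & news_vocab)       # words seen in both

print(only_satire, only_news, shared)
```

These three counts are in the order venn2 expects when given sizes rather than sets, so passing `subsets=(only_satire, only_news, shared)` to venn2 should draw the corresponding diagram. A large shared region suggests that raw word presence alone may not separate the two classes well.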

Conclusion

The goal of this post was to demonstrate how Venn diagrams can aid in understanding your data before building models, in two different domains: big categorical datasets and unstructured NLP data. I sincerely hope that you will find this helpful in your future data analysis projects.
