Hands-on Tutorials

The health passport, also known as the Green Passport, Sanitary Passport or Vaccine Passport, emerged as a result of the COVID-19 pandemic that started in 2019. It has caused wide controversy around the world: while various governments consider it a solution to limit the spread of the virus, many people and groups stand firmly against it and view it as a human rights violation. As a result, I decided to train my skills in data analysis, specifically in natural language processing (NLP) and data visualisation, on tweets that talked about the health passport.
To start this project, I first had to create a Twitter developer account and acquire the keys and tokens needed to retrieve data through the Twitter API. This article will not go into the details of the tweet extraction; yet, it is worth mentioning that only English tweets were extracted, and the search query used was as follows:
search_query = """ "vaccine passport" OR "vaccine pass" OR
"pass sanitaire" OR "sanitary pass" OR
"sanitary passport" OR "health pass" OR
"covid passport" OR "covid pass"
-filter:retweets -filter:media """
This search_query matches any tweet that contains one or more of the above keywords, excluding retweets and tweets with media.
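For illustration, here is a minimal extraction sketch using the tweepy library; the credentials and the number of tweets are placeholders, and the exact extraction code used for this project may differ.
# Twitter API access (a sketch; requires tweepy >= 4 and valid credentials)
import tweepy
import pandas as pd
# placeholder credentials - replace with your own developer keys and tokens
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
# collect English tweets matching the search query
results = tweepy.Cursor(api.search_tweets, q=search_query,
                        lang="en", tweet_mode="extended").items(1000)
tweets = pd.DataFrame(
    [[t.created_at, t.user.location, t.full_text] for t in results],
    columns=["date", "user_location", "text"])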
The dataset created looked as follows:

For this project I was interested in the following three columns:
- user_location column – to check and visualise the number of tweets per country
- date – to check the dates of the tweets
- text – to do text analysis through unigrams, bigrams and trigrams
Date Analysis
The analysis of the date column was a rather simple task. In order to verify the dates of the tweets, the time component was removed from the date column using the following line of code:
tweets['date'] = pd.to_datetime(tweets['date']).dt.date
Then, the counts of the unique dates were displayed using the value_counts() method.
tweets['date'].value_counts()
The Output:

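Although the article relies on the raw counts above, the daily distribution can also be plotted quickly; here is a minimal sketch using matplotlib.pyplot:
import matplotlib.pyplot as plt
# plot the number of tweets per day in chronological order
daily_counts = tweets['date'].value_counts().sort_index()
daily_counts.plot(kind='bar', figsize=(10, 5))
plt.xlabel("Date")
plt.ylabel("Number of tweets")
plt.title("Tweets per Day")
plt.tight_layout()
plt.show()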
Location Analysis
The objective of the location analysis is to obtain a general overview of where the tweets originated by calculating the number of tweets per country. To achieve that, it was necessary to do some preprocessing on the user_location column. matplotlib.pyplot and geopandas were then used to visualise the results in the form of a pie chart and a geospatial map.
Preprocessing the user_location
Preprocessing the user_location column is an essential step to extract country names from the free-text data found in this column.
Some of the entries in the user_location column do not make any sense as locations, such as ‘Lionel Messi’s Trophy Room’ and ‘Where are you’; therefore, the first step was to remove any content that is not a location. This was achieved through named entity recognition using the spaCy library.
# NLP
import spacy

nlp = spacy.load('en_core_web_sm')

# create a list of raw locations - i.e. locations entered by users
raw_locations = tweets.user_location.unique().tolist()
# replace nan by "" - the first element of the list is nan
raw_locations[0] = ""

# the locations list will only include relevant locations
locations = []
# keep only geopolitical entities (GPE: countries, cities, states)
for loc in raw_locations:
    text = ""
    loc_ = nlp(loc)
    for ent in loc_.ents:
        if ent.label_ == "GPE":
            text = text + " " + ent.text
    locations.append(text)
The above code removed all content that is meaningless in terms of location; yet, the user_location column not only includes country names, but also cities and states such as ‘London’ or ‘New York, NY’. As a result, I decided to use the geopy library to obtain the country name from city and state names.
# Geocoding Webservices
from geopy.geocoders import Nominatim
# NLP
import spacy

nlp = spacy.load('en_core_web_sm')
geolocator = Nominatim(user_agent="geoapiExercises")

# Get the country name from cities' and states' names
countries = []
for loc in locations:
    location = geolocator.geocode(loc, timeout=10000)
    if location is None:
        # keep the original text if the geocoder finds nothing
        countries.append(loc)
        continue
    location_ = nlp(location.address)
    if "," in location_.text:
        # Example of a location_.text: "New York, United States"
        # get the name after the last ","
        countries.append(location_.text.split(",")[-1])
    else:
        countries.append(location.address)
To display the country names, I used NumPy's np.unique() function.
import numpy as np
np.unique(np.array(countries))
The output:

As you can notice, some of the results are in languages other than English, and some are in more than one language separated by "/" or "-". Additionally, some of the results still do not indicate a country name, such as ‘Toronto Hargeisa’ and ‘Detroit Las Vegas’. I tackled these issues by splitting the texts on "/" and "-" and keeping only the last name found after these separators. I also manually replaced some location names with their relevant country names. Finally, I used the googletrans library to automatically translate the non-English country names into English. Note that I left locations that contain cities from different countries (e.g. ‘London Bxl Paris’) unchanged. Here is the full code of the above steps:
# Translation
from googletrans import Translator

# get the last name only when "/" or "-" is found
# ("/" and "-" separate names in different languages)
countries = [country.split("/")[-1] for country in countries]
countries = [country.split("-")[-1] for country in countries]
# remove white space found at the beginning of a string
countries = [country.lstrip() for country in countries]
# Manually replace locations with their relevant country name
countries = ['United States' if country in
             ["LA", "Detroit Las Vegas", "Atlanta Seattle"]
             else country for country in countries]
countries = ['Canada' if country in
             ["Calgary Mohkinstis", "Calgary Mohkinstis Alberta"]
             else country for country in countries]
# translate countries in a foreign language to English
translator = Translator()
countries = [translator.translate(country).text
             for country in countries]
print(len(countries))
print(np.unique(np.array(countries)))
The Output:

Unfortunately, some country names were still not translated properly, so I had to manually replace them with their English names.
# for those that were not translated properly, replace them manually
# unknown, to be added to "others" later: joke To L Poo l ދިވެހިރާއްޖެ
countries = ['Germany' if country == "Deutschland"
             else country for country in countries]
countries = ['Spain' if country == "España"
             else country for country in countries]
countries = ['Iceland' if country == "Ísland"
             else country for country in countries]
countries = ['Greece' if country == "Ελλάς"
             else country for country in countries]
countries = ['Ukraine' if country == "Україна"
             else country for country in countries]
countries = ['Iran' if country == "ایران"
             else country for country in countries]
countries = ['Japan' if country == "日本"
             else country for country in countries]
countries = ['Switzerland' if country == "Svizra"
             else country for country in countries]
countries = ['Poland' if country == "Polska"
             else country for country in countries]
countries = ["The Democratic People's Republic of Korea"
             if country == "조선민주주의인민공화국"
             else country for country in countries]
Finally, I created two dictionaries: 1) countries_values, which stores each country name as a key and the number of tweets per country as its value, and 2) main_countries, which stores the countries with the most tweets and groups all other countries under a key called ‘others’.
from collections import Counter

# Use Counter to create a dictionary with all countries
# and their equivalent number of tweets
countries_values = Counter(countries)

# Create a dictionary of the countries having the most tweets
# the "others" key represents all other countries
main_countries = {'others': 0}
other_countries = []
for key, val in countries_values.items():
    if val >= 20:
        main_countries[key] = val
    else:
        main_countries["others"] += val
        other_countries.append(key)
Plotting the Results using matplotlib.pyplot and geopandas
After preprocessing the user_location column to reduce each location to a country name, I used matplotlib.pyplot and geopandas to visualise the results in the form of a map and a pie chart. For plotting the pie chart, it is sufficient to use the ‘main_countries’ dictionary created previously. However, to display a map showing the number of tweets per country, it is necessary to create a GeoDataFrame.
Creating the GeoDataFrame
The first step in creating the GeoDataFrame was building a DataFrame from the ‘countries_values’ dictionary and adding the ISO alpha-3 code of each country. The pycountry library was used to obtain the ISO codes.
# Library to get the iso codes of the countries
import pycountry

# create a DataFrame of country names and their codes
df_countries = pd.DataFrame()
df_countries["country_name"] = list(countries_values.keys())
df_countries["country_value"] = list(countries_values.values())

def get_cntry_code(column):
    CODE = []
    for country in column:
        try:
            # .alpha_3 means 3-letter country code
            # .alpha_2 means 2-letter country code
            code = pycountry.countries.get(name=country).alpha_3
            CODE.append(code)
        except:
            # the country name was not recognised by pycountry
            CODE.append('None')
    return CODE

# create a column for the code
df_countries["country_code"] = get_cntry_code(df_countries.country_name)
df_countries.head()

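Note that pycountry will not recognise every entry (for example grouped locations such as ‘London Bxl Paris’), so it can be helpful to check which rows were left with a 'None' code; a quick check:
# list the country names whose ISO code could not be resolved
print(df_countries[df_countries.country_code == 'None'].country_name.tolist())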
The second step was to load the world GeoDataFrame using the geopandas library as follows:
# Geospatial Data
import geopandas
world = geopandas.read_file(
geopandas.datasets.get_path('naturalearth_lowres'))
world

It is important to note that some countries’ iso_a3 values are set to -99 in this dataset, so I had to update those codes manually, as seen below:
world.loc[world['name'] == 'France', 'iso_a3'] = 'FRA'
world.loc[world['name'] == 'Norway', 'iso_a3'] = 'NOR'
world.loc[world['name'] == 'N. Cyprus', 'iso_a3'] = 'CYP'
world.loc[world['name'] == 'Somaliland', 'iso_a3'] = 'SOM'
world.loc[world['name'] == 'Kosovo', 'iso_a3'] = 'RKS'
To finalise the GeoDataFrame, the world GeoDataFrame and the countries DataFrame (df_countries) were merged on the "country_code" column (i.e. the "iso_a3").
# rename columns to merge the DataFrames
world = world.rename(columns={"iso_a3": "country_code"})
df_merged = pd.merge(world, df_countries,
on="country_code", how='outer')
# fill empty value with zero
# any country with no tweet will have a value of 0
df_merged.country_value = df_merged.country_value.fillna(0)
Plotting the Map and Pie Chart
The below code shows the steps taken to plot the Map and the Pie Chart.
# Geospatial Data
import geopandas
# Data Visualisation
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig = plt.figure(figsize=(15,10), facecolor='#eef3f8')
ax1 = fig.add_axes([0, 0, 1, 1]) # the map
ax2 = fig.add_axes([0, 0.1, 0.2, 0.2]) # the pie chart
divider = make_axes_locatable(ax1)
cax = divider.append_axes("right", size="2%", pad=0) # legend of map
### MAP ###
df_merged = df_merged[(df_merged.name != "Antarctica") &
(df_merged.name != "Fr. S. Antarctic Lands")]
df_merged.to_crs(epsg=4326, inplace=True)
df_merged.plot(column='country_value', cmap='Greens',
linewidth=1.0, ax=ax1, edgecolor='0.8',
legend=True, cax=cax)
# remove axis
ax1.axis('off')
# add a title of the map
font_t1 = {'fontsize':'20', 'fontweight':'bold', 'color':'#065535'}
ax1.set_title('NUMBER OF TWEETS PER COUNTRY',
fontdict=font_t1, pad=24)
### PIE CHART ###
total = len(countries) # 1241
# Sort the dictionary - this is only for the sake of visualisation
sorted_main_countries = dict(sorted(main_countries.items(),
key=lambda x: x[1]))
# get labels for pie chart
labels = list(sorted_main_countries.keys())
# get percentages for pie chart
percentages = [(val/total)*100
for val in list(sorted_main_countries.values())]
# get theme colors to be coherent with that of the map's colors
theme = plt.get_cmap('Greens')
ax2.set_prop_cycle("color", [theme(1. * i / len(percentages))
for i in range(len(percentages))])
wedgeprops = {'linewidth':1, 'edgecolor':'white'}
_, texts, autotexts = ax2.pie(percentages, labels=labels,
labeldistance=1.1, autopct='%.0f%%',
radius=1.8, wedgeprops=wedgeprops)
# set the color of the percentage labels (autotexts)
for auto in autotexts:
    auto.set_color('black')
# set the color of the country labels
for text in texts:
    text.set_color('#909994')
# add a title for the pie chart
font_t2 = {'fontsize':'12', 'fontweight':'bold', 'color':'#065535'}
ax2.set_title('Percent of Tweets by Country', fontdict=font_t2,
pad=60, style='oblique')
# save figure
plt.savefig("no_of_tweets_per_country.png",
bbox_inches = 'tight', facecolor='#eef3f8')
plt.show()
The Final Result:

The result indicates that most tweets come from English-speaking countries (United States, United Kingdom, Canada, Australia and South Africa). This is likely because only English tweets were collected; if other languages had been included, the results might differ. France is one of the strictest European countries in applying the sanitary pass, and protests against the "Pass Sanitaire" are held regularly, which can explain why France is among the top countries even though it is not an English-speaking country.
Text Analysis
The text analysis focused on obtaining the unigrams, bigrams and trigrams of the tweets. Unigrams were visualised through a word cloud, whereas bigrams and trigrams were displayed on bar charts. Both were obtained by applying TF-IDF (Term Frequency-Inverse Document Frequency) weighting. The goal is to get a glimpse of the most used terms in the tweets and, hopefully, a general sense of where the tweeters stand. A toy example of TF-IDF weighting is sketched below.
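As a quick illustration of the idea (separate from the project code), here is a minimal sketch of TF-IDF weighting on a toy corpus using scikit-learn's TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# a toy corpus of three short "tweets"
corpus = ["vaccine passport protest",
          "stop the vaccine passport",
          "health pass mandate protest"]

# each term is weighted by its frequency in a document,
# discounted by how common it is across the whole corpus
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)

# in older scikit-learn versions (as used later in this article),
# replace get_feature_names_out() with get_feature_names()
print(pd.DataFrame(scores.toarray(),
                   columns=vectorizer.get_feature_names_out()))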
Text Preprocessing
Before obtaining the unigrams, bigrams and trigrams, it is important to preprocess the text. Below, I list the steps taken to preprocess the text; the full code is shown at the end of this section.
1- Convert to lower case
2- Remove URLs
3- Convert slang to its original form (to achieve this, I had to scrape this website; the details of the web scraping are not covered in this article)
4- Remove mentions
5- Remove punctuation
6- Lemmatise
7- Remove stop-words
8- Remove numbers (emoticons are converted into numbers; to avoid getting numbers in the top n-grams, I decided to remove them. However, one could keep them and use emoticons as part of the analysis)
9- Remove countries’ and cities’ names (to avoid getting a country or a city name in the top n-grams, I decided to remove them from the tweets)
Full Code for the Text Preprocessing:
# NLP
import spacy
import regex as re
import string
import json

nlp = spacy.load('en_core_web_sm')

tweets["text_processed"] = ""

# convert to lower case
tweets["text_processed"] = tweets["text"].apply(lambda x: str.lower(x))

# remove URLs
tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    re.sub(r"(?:@|http?://|https?://|www)\S+", ' ', x))

# convert slang
# load the slang dictionary (built by scraping a slang website)
with open('slangdict.json') as json_file:
    slang_dict = json.load(json_file)

for i in range(len(tweets["text_processed"])):
    txt = ""
    doc = nlp(tweets["text_processed"].iloc[i])
    for token in doc:
        if token.text in slang_dict:
            txt = txt + " " + slang_dict[token.text]
        else:
            txt = txt + " " + token.text
    # assign via .loc to avoid chained-assignment issues
    tweets.loc[tweets.index[i], "text_processed"] = txt
# remove mentions
def remove_entities(text, entity_list):
    # replace punctuation (except the entity markers) with spaces
    for separator in string.punctuation:
        if separator not in entity_list:
            text = text.replace(separator, ' ')
    # drop any word starting with an entity marker (e.g. "@user")
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_list:
                words.append(word)
    return ' '.join(words)

tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    remove_entities(x, ["@"]))

# remove punctuation
tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    re.sub(r'[^\w\s]', ' ', x))
# lemmatise
def lemmatize(sentence):
    doc = nlp(sentence)  # tokenize the text and produce a Doc object
    lemmas = [token.lemma_ for token in doc]
    return " ".join(lemmas)

tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    lemmatize(x))

# remove stopwords
def remove_stopwords(sentence):
    doc = nlp(sentence)  # tokenize the text and produce a Doc object
    all_stopwords = nlp.Defaults.stop_words
    doc_tokens = [token.text for token in doc]
    tokens_without_sw = [word for word in doc_tokens
                         if word not in all_stopwords]
    return " ".join(tokens_without_sw)

tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    remove_stopwords(x))

# remove -PRON- (a by-product of lemmatisation in older spaCy versions)
tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    re.sub('-PRON-', " ", x))

# remove numbers
tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    re.sub(r'[0-9]', " ", x))

# remove country and city names
def remove_country_city(sentence):
    doc = nlp(sentence)
    # keep only tokens that are not part of a named entity
    return " ".join([token.text for token in doc if not token.ent_type_])

tweets["text_processed"] = tweets["text_processed"].apply(lambda x:
    remove_country_city(x) if not pd.isna(x) else x)
Creating a word cloud from unigrams
As shown below, I decided to create a word cloud to display the most prominent words in the tweets. I achieved this using TfidfVectorizer, whose default ngram_range of (1,1) corresponds to unigrams.
from PIL import Image
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
vectorizor = TfidfVectorizer(stop_words='english')
vecs = vectorizor.fit_transform(tweets.text_processed)
feature_names = vectorizor.get_feature_names()
dense = vecs.todense()
l = dense.tolist()
df = pd.DataFrame(l, columns=feature_names)
# mask is the image used to reshape the cloud
mask = np.array(Image.open('./images/syringe44_.jpeg'))
word_cloud = WordCloud(collocations=False, background_color='white',
max_words=200, width=3000,
height=2000, colormap='viridis',
mask=mask).generate_from_frequencies(
df.T.sum(axis=1))
plt.figure(figsize=[15,10])
plt.imshow(word_cloud)
plt.axis("off")
plt.show()
word_cloud.to_file("tfidf_ps_word_cloud.png")
The word cloud:

As expected, words such as ‘passport’, ‘vaccine’, ‘vaccination’, ‘pass’ and ‘covid’ are very abundant. This is to be expected, since they are the topic of the tweets; additionally, the search query used to extract the tweets focused on these keywords. However, if we look closer into the word cloud, we can notice other words of interest that can be useful for further analysis, such as ‘protest’, ‘stop’, ‘enforce’, ‘refuse’, ‘right’ and ‘mandate’.
With unigrams, the occurrence of each word is considered independent of the one preceding it, which does not always make them the best option for text analysis. Therefore, I decided to go further and check the bigrams and trigrams to see if they could provide more insight.
Obtaining and visualising bigrams and trigrams
The steps to acquire unigrams, bigrams and trigrams are very similar. The only argument to update is ngram_range in the TfidfVectorizer: for unigrams it is (1,1) (the default), for bigrams (2,2) and for trigrams (3,3).
from sklearn.feature_extraction.text import TfidfVectorizer
# bigrams
vectorizor = TfidfVectorizer(stop_words='english',
ngram_range =(2, 2))
vecs = vectorizor.fit_transform(tweets.text_processed)
feature_names = vectorizor.get_feature_names()
dense = vecs.todense()
l = dense.tolist()
df = pd.DataFrame(l, columns=feature_names)
n_grams = df.T.sum(axis=1).sort_values(ascending=False)
n_grams.to_csv("bigrams.csv")
I used bar charts to visualise the top 100 bigrams and trigrams. The matplotlib.pyplot and seaborn libraries were used for this visualisation. Here is an example of the code used to visualise the data:
# read the bigram dataset
# (name the columns here; this assumes the CSV saved from the Series
#  has no meaningful column names - adjust if needed)
bigrams = pd.read_csv("bigrams.csv", names=["bigram", "value"], header=0)
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# plot a bar graph of the top 100 bigrams
plt.figure(figsize=(10,20))
sns.barplot(x=bigrams[:100].value, y=bigrams[:100].bigram)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.xlabel("Value",
           fontdict={'fontsize':'12', 'fontweight':'bold'})
plt.ylabel("Bigrams",
           fontdict={'fontsize':'12', 'fontweight':'bold'})
plt.title("Top 100 Bigrams",
          fontdict={'fontsize':'16', 'fontweight':'bold'})
plt.savefig("Top 100 Bigrams.png", bbox_inches='tight')
The results:


As with the word cloud, it is understandable that terms such as ‘vaccine passport’, ‘covid pass’ and ‘health pass’ occupy the first ranks. However, if we look further, we can find other bi- and trigrams that might be useful for the analysis. Some intriguing bigrams are ‘herd immunity’, ‘pass protestor’, ‘fake vaccine’, ‘kill population’, ‘spread virus’ and ‘fully vaccinated’, whereas trigrams of interest include ‘real threat people’ and ‘anti vaccine passport’. That said, deeper analysis is needed to reach an accurate conclusion. I also recommend looking beyond the top 100 bi- and trigrams and applying sentiment analysis, as sketched below.
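As one possible starting point for that follow-up (not part of this project's code), here is a minimal sketch of sentiment analysis on the tweets using VADER from the nltk library; it could equally be applied to the extracted n-grams, and the threshold values are illustrative only.
# Sentiment analysis sketch using VADER (requires nltk)
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# score each tweet between -1 (very negative) and +1 (very positive)
tweets["sentiment"] = tweets["text"].apply(
    lambda x: sia.polarity_scores(x)["compound"])

# label tweets using an illustrative threshold of +/- 0.05
tweets["sentiment_label"] = tweets["sentiment"].apply(
    lambda s: "positive" if s > 0.05
    else ("negative" if s < -0.05 else "neutral"))
print(tweets["sentiment_label"].value_counts())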
In this article, I explained the steps taken to conduct data analysis on tweets about the health passport. Several natural language processing techniques were used, such as named entity recognition, text preprocessing and text analysis through unigrams, bigrams and trigrams. Additionally, the results were displayed through data visualisation in the form of pie charts, geospatial maps, bar charts and word clouds.