Context-specific Pre-processing for NLP with spaCy: Tweets
Pre-processing depends on context: with tweets, that means deciding what to do with mentions, hashtags and URLs.

Natural Language Processing is a field of machine learning concerned with understanding human language. As opposed to numerical data, NLP works primarily with text. Exploring and pre-processing text data requires different techniques and libraries, and this tutorial demonstrates the basics.
However, pre-processing is not a one-size-fits-all procedure. In Data Science tasks, the context of the data often determines which aspects of the data are valuable and which are irrelevant or unreliable. In this tutorial, we explore text pre-processing in the context of tweets, or more generally, social media.
Kaggle’s Getting Started competition, Real or Not? NLP with Disaster Tweets, presents a reasonably sized dataset (around 7,500 tweets in the training set) for practice. The challenge is to classify tweets, given their text, keyword and location, as being about real disasters or not.
The code for this tutorial can be followed at this notebook and repository.
Before we get started, download the NLP-getting-started data from Kaggle. In my project directory, I put train.csv, test.csv and sample_submission.csv under a data subdirectory.
Data Exploration
Let’s start by importing typical and useful data science libraries and creating a dataframe out of train.csv. I won’t delve into the details of libraries that are not NLP-specific.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('data/train.csv', index_col='id')
data.head()
keyword location text target
id
1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
4 NaN NaN Forest fire near La Ronge Sask. Canada 1
5 NaN NaN All residents asked to 'shelter in place' are ... 1
6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1
Our data comprises 4 columns: keyword, location, text and target. Quoting the data description on Kaggle:
- id – a unique identifier for each tweet
- text – the text of the tweet
- location – the location the tweet was sent from (may be blank)
- keyword – a particular keyword from the tweet (may be blank)
- target – in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
To verify the number of rows and columns in our dataset, and to make judgements about the generalizability of our training set, let’s look at the size of our training data.
data.shape
(7613, 4)
Examining closer, we find there are 52 duplicate rows (different ids, but the same keyword, location, text and target).
np.sum(data.duplicated()) # 52 duplicates
So let’s drop the duplicate rows. The index (set as id) remains intact. After removing duplicate rows, we are left with 7561 tweets (the integrity check mentioned earlier), a manageable amount for this tutorial.
However, 7561 data points is still relatively little for NLP, especially if we are using deep learning models. Given the sheer volume of tweets posted every day, I doubt a model trained on just 7561 data points generalizes well.
# let's drop the duplicates!
df = data.drop_duplicates()
df.shape
Our output:
(7561, 4)
Aside from training size, the balance of classes (target) in the training set is also important. A training set where every target value is 0 would leave the model classifying every tweet as not about a disaster, and the reverse is also true. Ideally, the training set should have all classes represented roughly equally.
We can use the pandas value_counts() method to count the number of rows for each class. With 4322 tweets not about disasters (target=0) and 3239 tweets about disasters (target=1), we have a roughly 4:3 class balance. That’s not perfectly balanced, but it’s not disastrously imbalanced.
# target 1 refers to disaster tweet, 0 is not a disaster tweet
df['target'].value_counts()
Our output:
0 4322
1 3239
Name: target, dtype: int64
Let’s also look into the completeness of the data. We can sum the Series returned by the pandas isna() method to count the number of missing (na) entries in each column.
# checking for completeness of data
print(f"{np.sum(df['keyword'].isna())} rows have no keywords")
print(f"{np.sum(df['location'].isna())} rows have no location")
print(f"{np.sum(df['text'].isna())} rows have no text")
print(f"{np.sum(df['text'].isna())} rows have no target")
Our output:
61 rows have no keywords
2533 rows have no location
0 rows have no text
0 rows have no target
Ideally, we would further characterize and explore the data by analyzing the word lengths, sentence lengths, word frequencies and more. While that’s out of scope for this tutorial, you can learn more about it here.
NLP Pre-processing concepts
Now that we have explored the data, let’s pre-process the tweets and represent them in a form our models can take as input.
The most common numerical representation for texts is the bag-of-words representation.
Bag of words
Bag of words is a way to represent text data numerically. The text is split into words (or more accurately, tokens), which become the features; the frequency of each token in a piece of text is the corresponding feature value. For example, we might represent "I love cake very very much" as a bag-of-words dictionary:
{
'I':1,
'love':1,
'cake':1,
'very':2,
'much':1
}
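As a minimal sketch of how such a dictionary could be built (using Python’s collections.Counter and a naive whitespace split; proper tokenization with spaCy comes later):
from collections import Counter

# Naive whitespace "tokenization" -- spaCy will handle this properly later on
text = "I love cake very very much"
bag_of_words = dict(Counter(text.split()))
print(bag_of_words)  # {'I': 1, 'love': 1, 'cake': 1, 'very': 2, 'much': 1}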
Tokenization
Tokenization breaks text data up into tokens at the chosen level of analysis (word, phrase or sentence). The lowest (and most common) level is the word, which fits perfectly with our bag-of-words representation. However, these tokens could also include punctuation, stop words and other custom tokens. We’ll consider these in the context of tweets and the challenge in the next section.
Stemming
Stemming refers to stripping affixes (prefixes or suffixes) from words to approximate them to their root form. This is often done with simple rules and a lookup list of prefixes and suffixes, making it computationally fast.
However, there is a performance trade-off: in English, some affixes change the meaning of the word completely, resulting in inaccurate feature representation.
Lemmatization
The alternative to stemming is lemmatization, where words are reduced to their lemmas, or dictionary root forms. This is done using a lookup dictionary of words and their lemmas, which makes it more computationally expensive. However, performance is often better, since features are represented more accurately.
Given the relatively smaller size of our dataset, we will use lemmatization.
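To make the difference concrete, here is a small illustrative comparison. It uses NLTK’s PorterStemmer purely for contrast (NLTK is not used anywhere else in this tutorial, so treat this as an optional aside) alongside the spaCy lemmatizer we rely on later; exact lemmas can vary with the model version and the part-of-speech tags assigned.
# Optional aside: stemming (NLTK) vs lemmatization (spaCy)
from nltk.stem import PorterStemmer  # assumes nltk is installed; illustration only
import spacy

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

words = ["flies", "studies", "caring"]
# Rule-based truncation: e.g. 'flies' -> 'fli', which is not a real word
print("stems: ", [stemmer.stem(w) for w in words])
# Dictionary-based lemmas informed by part-of-speech tags, e.g. 'flies' -> 'fly'
print("lemmas:", [token.lemma_ for token in nlp(" ".join(words))])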
In the context of tweets
Getting from tweets to their bag of words representation is less straightforward. What about:
- words of different cases e.g. cake vs Cake,
- punctuation,
- stopwords,
- numbers,
- mentions,
- hashtags, and
- URLs?
When deciding what we want to do with these elements, we have to consider the context of the data and reconcile that with the challenge.
Words of different cases – standardize to lower-case
In internet lingo, different cases can communicate different sentiments (e.g. danger vs DANGER!) or different parts of speech (e.g. the start of a sentence, or a proper noun like The Fight Club). By changing all tokens to upper- or lower-case, we could be losing data helpful for classification.
However, since we have a small dataset (roughly 7,500 tweets), there is unlikely to be sufficient data for each upper-/lower-case variant, so let’s standardize to lower-case.
Punctuation
Tweets are undoubtedly going to contain punctuation, which can convey different sentiments or emotions too. Consider, in internet lingo, the difference between:
- Help needed?
- Help needed!
We’ll keep each punctuation mark as its own token, with special cases like ‘…’ treated as a separate token from ‘.’. That way we don’t lose data; we can always disregard punctuation (and even tune which marks to ignore) when tuning our hyper-parameters.
Stop words
Stop words are essentially words so common that they have little significant contribution to the meaning of the text. These words include articles (the, a, that) and other commonly used words (what, how, many).
Stop word tokens are often ignored in NLP processing. Plus, the character limit of a tweet (280 characters) often results in grammatically shortened tweets, where articles are dropped.
However, rather than ignoring stop words from the start, let’s keep them and only disregard them (and even tune which stop words to ignore) when tuning our hyper-parameters, so we don’t lose data. A small sketch of what that toggle could look like follows below.
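As a rough sketch of how such filtering could later be toggled as a hyper-parameter (the function and flag names here are hypothetical, not part of the pipeline we build below), spaCy exposes is_punct and is_stop on every token:
# Hypothetical helper: filtering punctuation/stop words as a tunable choice
import spacy

nlp = spacy.load("en_core_web_sm")

def tokens_to_keep(text, drop_stopwords=False, drop_punct=False):
    """Return token texts, optionally filtering stop words and/or punctuation."""
    return [token.text for token in nlp(text)
            if not (drop_stopwords and token.is_stop)
            and not (drop_punct and token.is_punct)]

print(tokens_to_keep("Help needed! The fire is spreading."))
print(tokens_to_keep("Help needed! The fire is spreading.", drop_stopwords=True, drop_punct=True))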
Numbers
Numbers in tweets can convey the quantity of literal objects, but can also convey the scale of something (e.g. 7.9 Richter scale earthquake) or the year (e.g. 2005 Hurricane Katrina).
In the latter two cases, such numerical information may be valuable depending on the level of NLP we choose to do later down the road (word-level vs phrase- or sentence- level), or if we want to filter tweets about historical disasters vs current disasters.
As such, we will retain numbers as tokens, with the option of ignoring them (or even only counting numbers that are years) when tuning our hyper-parameters.
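spaCy also makes numeric tokens easy to flag later on, which is one way the “ignore numbers” or “keep only years” options could be implemented; the year check below is a deliberately crude, hypothetical heuristic:
# Illustrative only: flagging number tokens and (crudely) years
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("7.9 Richter scale earthquake, worse than the 2005 disaster")

numbers = [token.text for token in doc if token.like_num]                                # e.g. ['7.9', '2005']
years = [token.text for token in doc if token.text.isdigit() and len(token.text) == 4]  # ['2005']
print(numbers, years)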
Mentions
On Twitter, mentions allow users to address each other in a tweet. While mentions between personal accounts may be less significant, mentions directed at authorities to alert them of disasters can be telling (consider @policeauthorities, gun shooting down Brick Lane right now!).
Let’s tokenize mentions along with their usernames, but also count the number of mentions, which could indicate a conversation (a simple counting sketch follows at the end of the URLs section below).
Hashtags
Hashtags on Twitter allow users to discover content related to a specific theme or topic. When it comes to natural disasters, hashtags like #prayforCountryX and #RIPxyzShootings can differentiate tweets about disasters from everyday tweets.
As such, let’s tokenize hashtags with their content, but also count the number of hashtags. The number of hashtags could flag sensationalized social media marketing tweets (e.g. This beat drop is the bomb! #edm #music #dubstep #newrelease) that use disaster keywords.
URLs
Disaster tweets could include URLs to news articles, relief efforts, or images. However, the same can be said of everyday tweets. Since we’re unsure if disaster tweets are more likely to have URLs or a certain type of URL, let’s keep URLs as tokens and the number of URLs as a feature.
This challenge’s dataset features tweets with Twitter-shortened URLs (http://t.co), but more recent tweet data could include the full domains, which can then be extracted (I imagine the Red Cross domain would be highly correlated with disaster tweets). For more complicated algorithms, one can also consider visiting the shortened URL and scraping web page elements.
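Before wiring any of this into spaCy, here is a rough, regex-only way to count mentions, hashtags and URLs; the patterns are simplifications for illustration, and the spaCy-based counting we actually use appears in the preprocess function later on.
# Illustrative only: counting mentions, hashtags and URLs with regular expressions
import re

tweet = "@policeauthorities shooting down Brick Lane right now! #breaking #help http://t.co/abc123"

n_mentions = len(re.findall(r"@\w+", tweet))           # 1
n_hashtags = len(re.findall(r"#\w+", tweet))           # 2
n_urls     = len(re.findall(r"https?://\S+", tweet))   # 1
print(n_mentions, n_hashtags, n_urls)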
NLP with the spaCy library
spaCy is an open-source Python library for natural language processing. It integrates well with the rest of the Python machine learning ecosystem (scikit-learn, TensorFlow, PyTorch and more), and uses an object-oriented approach to keep its interface readable and easy to use. Notably, its model returns Doc objects, which consist of tokens carrying various useful annotations (e.g. the lemma, whether the token is a stop word) as attributes.
Let’s import spaCy, download its small English model, and load it.
# download spaCy's small English model
!python3 -m spacy download en_core_web_sm
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
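If you prefer not to import the model package directly, the following equivalent pattern loads the model by name and, as an assumption about your environment, downloads it first when it is missing:
# Alternative: load the model by name, downloading it on first use
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")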
How well does out-of-the-box spaCy do with tweets?
Before we customize spaCy, we can take a look at how the out-of-the-box model tokenizes tweets with its default rules. I created a tweet that included a number, a contraction, a hashtag, a mention and a link.
As shown below, out-of-the-box spaCy already breaks up contractions and gives us the relevant lemmas. It also recognizes numbers, mentions and URLs as their own tokens according to the default rules. That leaves hashtags, which are split into a ‘#’ punctuation token and the hashtag content, instead of remaining a single token.
# Let's see what spaCy does with numbers, contractions, #hashtags, @mentions and URLs
s = "2020 can't get any worse #ihate2020 @bestfriend https://t.co"
doc = nlp(s)
# Let's look at the lemma and stop-word flag of each token
print(f"Token\t\tLemma\t\tStopword")
print("="*40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")
This prints:
Token Lemma Stopword
========================================
2020 2020 False
ca can True
n't not True
get get True
any any True
worse bad False
# # False
ihate2020 ihate2020 False
@bestfriend @bestfriend False
https://t.co https://t.co False
Modifying spaCy
We can modify spaCy’s model to recognize hashtags as entire tokens.
spaCy’s tokenizer can be modified (you can also build a custom tokenizer if you want!) by redefining its default rules. spaCy’s tokenizer prioritizes rules in the following order: token match pattern, prefix, suffix, infixes, URL, special cases (see How spaCy’s Tokenizer Works).
In our case, we’ll modify the tokenizer’s token-matching regex pattern (read more about regex here: A simple intro to Regex with Python) by appending '#\w+', which is a pattern for the hash symbol followed by a word.
# We want to also keep #hashtags as a token, so we will modify the spaCy model's token_match
import re
# Retrieve the default token-matching regex pattern
re_token_match = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)
# Add #hashtag pattern
re_token_match = f"({re_token_match}|#\w+)"
nlp.tokenizer.token_match = re.compile(re_token_match).match
# Now let's try again
s = "2020 can't get any worse #ihate2020 @bestfriend <https://t.co>"
doc = nlp(s)
# Let's look at the lemmas and is stopword of each token
print(f"TokenttLemmattStopword")
print("="*40)
for token in doc:
print(f"{token}tt{token.lemma_}tt{token.is_stop}")
Our code prints:
Token Lemma Stopword
========================================
2020 2020 False
ca can True
n't not True
get get True
any any True
worse bad False
#ihate2020 #ihate2020 False
@bestfriend @bestfriend False
https://t.co https://t.co False
Pre-processing algorithm
We can then proceed to create a pre-processing routine and put it in a function, so it can be called on every tweet in the training set. In the following preprocess function, each tweet:
- Is changed to lower case
- Is tokenized with our modified spaCy model
- Has the set of its token lemmas unioned with our features set
- Has its bag-of-words representation constructed in a dictionary freq
- Has its hashtags, mentions and URLs counted
# Create a pre-process function for each tweet
def preprocess(s, nlp, features):
    """
    Given string s, spaCy model nlp, and set features (lemmas encountered),
    pre-process s and return updated features and bag-of-words representation dict freq
    - changes s to lower-case
    - tokenizes s using nlp to create a doc
    - updates features with lemmas encountered in s
    - creates bag-of-words representation in dict freq, including counts for hashtags, mentions and URLs
    """
    # To lowercase
    s = s.lower()

    # Creating a doc with spaCy
    doc = nlp(s)
    lemmas = []
    for token in doc:
        lemmas.append(token.lemma_)

    # Union between lemmas and our features set
    features |= set(lemmas)

    # Constructing a bag of words for the tweet
    freq = {'#': 0, '@': 0, 'URL': 0}
    for word in lemmas:
        freq[str(word)] = 0
    for token in doc:
        if '#' in str(token): freq['#'] += 1          # Count number of hashtags, regardless of hashtag
        if '@' in str(token): freq['@'] += 1          # Count number of mentions, regardless of mention
        if 'http://' in str(token): freq['URL'] += 1  # Count number of URLs, regardless of URL
        freq[str(token.lemma_)] += 1
    return features, freq
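As a quick sanity check (on a made-up tweet, not one from the dataset), we can call preprocess once and inspect the counters:
# Sanity check on a made-up example tweet
demo_features = {'#', '@', 'URL'}
demo_features, demo_freq = preprocess(
    "Forest fire near La Ronge! #wildfire @localnews http://t.co/xyz", nlp, demo_features)
print(demo_freq['#'], demo_freq['@'], demo_freq['URL'])  # expect: 1 1 1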
Setting it up
We’ll create a copy of our de-duplicated data as a best practice, so that pre-processing changes do not affect the original state of our training data. Then, we will initialize a Python set features, which will contain all features across the tweets. In addition to all lemmas encountered while tokenizing each tweet, features will include the number of hashtags (#), the number of mentions (@), and the number of URLs (URL).
preprocess_df = df.copy()       # Copy for preprocessing
features = {'#', '@', 'URL'}    # Set containing all features: lemmas seen, plus the three counters
Using our preprocess function, we’ll pre-process every tweet, updating features each time with new lemmas seen. Each tweet’s bag-of-words representation freq is also appended to a list of bag-of-words representations, bow_array.
# List bow_array of bow representations for each tweet;
# bow_array[i] is the bow representation for the i-th row of preprocess_df
bow_array = []
for i in range(len(preprocess_df)):
features, freq = preprocess(preprocess_df.iloc[i]['text'],nlp,features)
bow_array.append(freq)
len(bow_array) # 7561
With all lemmas encountered across all tweets collected in features, we can create a dataframe bow to represent the features of all the tweets.
# Create dataframe for bag of words representation for each tweet
bow = pd.DataFrame(0, columns=list(features), index=range(len(preprocess_df)))
bow['id']=preprocess_df.index
bow.set_index('id',drop=True,inplace=True)
Now, let’s update our dataframe with the feature values of each tweet.
# Update each row of bow with the bag-of-words freq of the corresponding tweet
for i in range(len(preprocess_df)):
    freq = bow_array[i]
    tweet_id = preprocess_df.index[i]   # tweet ids are not consecutive after de-duplication
    for f in freq:
        bow.loc[tweet_id, f] = freq[f]
Finally, we will join our training data dataframe with our bag-of-words dataframe. The pandas join method adds columns from one dataframe to another for rows where the index matches. Note that we append '_data' as a suffix to columns from the ‘left’ dataframe (preprocess_df) whose names clash with columns in bow. This avoids conflicts between the keyword given as part of the training data and ‘keyword’ as a lemma-token feature. Remember to save the pre-processed .csv file for easier next steps!
# Join bag-of-words representation to train dataframe
# Append _data suffix to 'keyword','location','text','target' for features that are not lemma tokens
preprocess_df = preprocess_df.join(bow,lsuffix='_data')
# Saving bag-of-words representation for collaborators
preprocess_df.to_csv("data/train_preprocessed.csv",index=True,index_label='id')
Splitting into training and validation data
Now that we’ve pre-processed our data, there’s just one last step before we can jump into using it to train our model of choice: splitting it, stratified according to the distribution of classes, into training and validation sets, using train_test_split from sklearn.model_selection:
from sklearn.model_selection import train_test_split
# stratify=y creates a balanced validation set
y = preprocess_df['target_data']
df_train, df_val = train_test_split(preprocess_df, test_size=0.10, random_state=101, stratify=y)
# Saving csv files for collaborators
df_train.to_csv("data/train_preprocessed_split.csv",index=True)
df_val.to_csv("data/val_preprocessed_split.csv",index=True)
print(df_train.shape, df_val.shape)
(6851, 21330) (762, 21330)
Just to be sure, we can check our balance.
# Checking balance
print(f"""
Ratio of target=1 to target=0 tweets in:n
Original data set = {np.sum(preprocess_df['target_data']==1)/np.sum(preprocess_df['target_data']==0)},
n
Training data set = {np.sum(df_train['target_data']==1)/np.sum(df_train['target_data']==0)},
n
Validation data set = {np.sum(df_val['target_data']==1)/np.sum(df_val['target_data']==0)}""")
This prints:
Ratio of target=1 to target=0 tweets in:
Original data set = 0.7533394748963611,
Training data set = 0.7535193242897363,
Validation data set = 0.7517241379310344
Ready, set, go!
Pre-processing vs Tuning hyperparameters
If you have seen other NLP pre-processing tutorials, you’ll notice that many of their steps have been discussed here as considerations but not implemented. These include removing punctuation, numbers and stop words. However, our training dataset is small, and these steps could remove information valuable in the context of tweets and the challenge. Hence, rather than eliminating this data at the pre-processing stage, we have left these choices as hyper-parameters to tune for our model.
Possible extensions
Through this tutorial, we have pre-processed tweets into their bag-of-words representation. However, you may choose to go a few steps further with Term Frequency – Inverse Document Frequency (TF-IDF), which penalizes terms that appear too often (as they become less discriminatory as features), or with word vectors, which also numerically capture a word’s context and semantics. Word-vector encodings generally perform better than TF-IDF, which in turn generally performs better than plain bag-of-words.
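As a minimal sketch of the TF-IDF alternative (using scikit-learn on the raw tweet text rather than on our custom bag of words; the token_pattern shown is an illustrative choice that also keeps hashtags):
# Illustrative TF-IDF encoding of the raw tweet text with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=True, token_pattern=r"(?u)#?\w[\w']+")
X_tfidf = vectorizer.fit_transform(df['text'])
print(X_tfidf.shape)  # (number of tweets, vocabulary size)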
We have ignored location and keyword in this tutorial, focusing entirely on the tweet text. You could consider encoding location by similarity, accounting for different spellings of the same place (e.g. USA vs U.S.) and for missing values. You could also weight keywords more heavily and see how that affects the performance of your model.
Lastly, there may be valuable information in the URLs that we are missing out on. Given that they are in shortened form, we are unable to extract the domain name or page content from the text data alone. You could consider building an algorithm that visits the site and extracts the domain name, as well as scrapes relevant elements on the page (e.g. the page title).
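For completeness, here is a hypothetical sketch of resolving a shortened link and extracting its domain; it assumes network access and the requests library, and many of the dataset’s old t.co links may no longer resolve.
# Hypothetical sketch: expand a shortened URL and extract its domain
import requests
from urllib.parse import urlparse

def expand_domain(short_url, timeout=5):
    """Follow redirects from a shortened URL and return the final domain, or None on failure."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return urlparse(resp.url).netloc  # e.g. 'www.redcross.org'
    except requests.RequestException:
        return None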
Next steps
Now that we have performed basic pre-processing on our dataset, we can move forward in two possible directions. You can either continue with more advanced pre-processing, such as spell checking and correction (e.g. with Pyspellchecker) or geoparsing locations (e.g. with Mordecai), or try out and evaluate candidate machine learning models! Possible models for such classification problems include logistic regression, neural networks and SVMs.
References
[1] Kaggle, Disaster tweets classification challenge on Kaggle (2020), Kaggle
[2] D. Becker and M. Leonard, Intro to NLP (n.d.), Natural Language Processing Course on Kaggle
[3] D. Becker and M. Leonard, Text Classification with SpaCy (n.d.), Natural Language Processing Course on Kaggle
[4] D. L. Yse, Your Guide to Natural Language Processing (2019), Towards Data Science
[5] Explosion AI, spaCy 101: Everything you need to know (n.d.), spaCy