How to preprocess social media data and text messages

coz imo many ppl write like thisss onl 🤦🏻‍♂️ (╯°□°)╯︵ ┻━┻

Wanshun Wong
Towards Data Science

One of the biggest challenges of performing NLP tasks on social media data and text messages is that internet English differs vastly from standard English. Reddit posts, tweets, WhatsApp messages, and the like are full of slang, acronyms, abbreviations, initialisms, emojis, hashtags, URLs, and misspellings that do not appear in standard English. In a deep learning utopia we would scrape as much data as possible and train our model and tokenizer from scratch to figure out the meaning of all this slang. In real life, however, we often want to make use of existing pre-trained models (and their associated tokenizers) such as BERT. Since these models and their tokenizers are usually pre-trained on English Wikipedia and the BooksCorpus data, we need to preprocess our internet English data first. On the other hand, if we decide to use a non-deep learning approach, then data preprocessing is even more important, as it has a huge impact on feature engineering. In this article we cover some data preprocessing steps that are either specific to internet English or require special attention.

Common NLP Data Preprocessing Steps

In this section we discuss some of the most common NLP data preprocessing steps. They are so prevalent that we might take them for granted and perform them without a second thought. We will most likely still need these steps for internet English data, but with suitable modifications.

Convert Text to Lowercase

For tasks like sentiment analysis, letter case can sometimes be a useful feature. The following sentences

  • “i have passed my driving test”
  • “i have PASSED my driving test”
  • “I HAVE PASSED MY DRIVING TEST”

show different levels of excitement on the part of the author. However, they become indistinguishable once we convert the text to lowercase.

Suggestion: Engineer some features based on letter case before converting the text to lowercase.
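
As a concrete illustration, here is a minimal sketch that extracts letter-case features before lowercasing destroys them; the feature names are our own invention:

```python
def letter_case_features(text: str) -> dict:
    """Extract letter-case features before lowercasing destroys them."""
    words = text.split()
    all_caps_words = sum(1 for w in words if len(w) > 1 and w.isupper())
    upper_chars = sum(1 for c in text if c.isupper())
    return {
        "all_caps_word_ratio": all_caps_words / max(len(words), 1),
        "upper_char_ratio": upper_chars / max(len(text), 1),
    }

features = letter_case_features("i have PASSED my driving test")
text = "i have PASSED my driving test".lower()  # lowercase only afterwards
```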

Remove Punctuation

Punctuation, like letter case, is very useful for sentiment analysis. A simple example is the use of consecutive exclamation marks:

  • “i have passed my driving test”
  • “I have passed my driving test!!!!!!!!”

Moreover, internet slang quite commonly uses currency symbols to indicate greed and corruption, as in “micro$oft”.

Punctuations are also capable of conveying different feelings via emoticons such as :-) and (>_<). Obviously, we do not want to blindly throw away all this information.

Suggestion: Engineer features on the appearances of different punctuation marks, especially on consecutive appearances. For emoticons, we can either engineer new features or replace them with English words using a dictionary; for example, we can replace :-) with “smiley face”. Finally, internet slang such as “micro$oft” is covered in a later section.

Note that punctuation removal also needs to happen after the preprocessing of hashtags, URLs, etc.
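
To make these suggestions concrete, here is a minimal sketch; the emoticon dictionary and feature names are illustrative placeholders to be filled out during exploratory data analysis:

```python
import re

# Illustrative mini-dictionary; a real one would be curated during EDA.
EMOTICON_MAP = {
    ":-)": "smiley face",
    "(>_<)": "frustrated face",
}

def punctuation_features(text: str) -> dict:
    # Consecutive exclamation marks often signal strong sentiment.
    runs = re.findall(r"!+", text)
    return {
        "num_exclamations": text.count("!"),
        "max_exclamation_run": max((len(r) for r in runs), default=0),
    }

def replace_emoticons(text: str) -> str:
    # Run this before punctuation removal, which would destroy emoticons.
    for emoticon, words in EMOTICON_MAP.items():
        text = text.replace(emoticon, " " + words + " ")
    return text
```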

Remove Numbers

Numbers have a very interesting role in NLP. For tasks like question answering, numbers are essential. For named entity recognition, numbers are not important unless they are part of entity names (e.g. the 76ers in the NBA, or C9 and G2 in esports). For sentiment analysis, numbers are usually irrelevant.

Having said that, there are always exceptions that we need to pay attention to. “1984” appearing in the politics subreddit is not just a number, and “1–7” conveys negative sentiment for Brazilian football fans.

What we have covered so far holds for both internet English and standard English, but internet English has an extra complication: the use of numbers in slang, such as “10q” (“thank you”) and “2mr” (“tomorrow”).

Suggestion: There are so many different usages of numbers that there is no fixed rule on how to preprocess them. The right preprocessing step depends heavily on the nature of the data set and on the results of exploratory data analysis. Slang is covered in the next section.
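
That said, for data sets where exploratory data analysis shows that standalone numbers carry little signal, one possible baseline (a sketch, not a universal rule) is to drop purely numeric tokens while keeping alphanumeric entity names:

```python
import re

def remove_standalone_numbers(text: str) -> str:
    # r"\b\d+\b" only matches tokens made purely of digits, so "76ers",
    # "C9", and "G2" survive while "3" is dropped.
    text = re.sub(r"\b\d+\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_standalone_numbers("the 76ers won 3 games"))  # -> "the 76ers won games"
```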

Expand Abbreviations, Acronyms, Initialisms, and Slang

The expansion itself is quite easy, as it only requires looking up a dictionary. The difficult part is maintaining this dictionary. For internet slang, we can obtain meanings via websites like Urban Dictionary or even just Google. However, many internet slang terms are initialisms for exaggerative phrases containing swear words and body parts, and we may want to tone them down during the expansion. For example, “lmfao” can be expanded simply to “laughter”.

Suggestion: During exploratory data analysis, collect all the high-frequency terms that are not standard English words. Create a dictionary for expanding these terms, and tone down the exaggeration and/or offensiveness of the expanded phrases if needed.
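
A minimal sketch of such a dictionary-based expansion; the entries shown are a tiny hand-picked sample, and building out the dictionary is the real work:

```python
# Tiny hand-curated sample; maintaining this dictionary is the hard part.
SLANG_MAP = {
    "lmfao": "laughter",  # toned down from the literal expansion
    "10q": "thank you",
    "2mr": "tomorrow",
    "imo": "in my opinion",
}

def expand_slang(text: str) -> str:
    # Token-level lookup; assumes the text is already lowercased.
    return " ".join(SLANG_MAP.get(token, token) for token in text.split())

print(expand_slang("10q see you 2mr"))  # -> "thank you see you tomorrow"
```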

Data Preprocessing Steps For Internet English

In this section, we introduce new data preprocessing steps that are tailor-made for internet English. Most of them work for the internet versions of other languages as well.

URLs, User Mentions, and Hashtags

For most tasks, URLs are not relevant and can simply be removed. However, for topic modeling and topic classification, it is crucial to understand what the URLs are about. One easy way to incorporate this information is to replace each URL with the title of the page it points to. This can be done by first making an HTTP GET request and then parsing the HTML response.
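
A minimal sketch of this replacement, assuming the third-party requests and beautifulsoup4 packages; a production version would also want caching, rate limiting, and better error handling:

```python
import re

import requests
from bs4 import BeautifulSoup

URL_PATTERN = re.compile(r"https?://\S+")

def url_to_title(url: str) -> str:
    # Fetch the page and return its HTML <title>, or "" on any failure.
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.text, "html.parser")
        return soup.title.get_text(strip=True) if soup.title else ""
    except requests.RequestException:
        return ""

def replace_urls_with_titles(text: str) -> str:
    return URL_PATTERN.sub(lambda match: url_to_title(match.group()), text)
```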

The same idea also applies to user mentions and hashtags. Nonetheless, it is much harder to decide what replacement text to use, and the process requires a lot of manual inspection. Consider “@ManCity” on Twitter, for example. Its user name is “Manchester City”, but we might want to use “Manchester City Football Club” instead in order to better capture what this user mention refers to.

Emojis

The preprocessing step for emojis is basically the same as that for emoticons. For instance, we can replace 😀 with “grinning face”. The meanings of different emojis can be found at, for example, Emojipedia.

Note that emojis are sometimes used in place of English characters. In this case, the meaning of the emoji can be appended to the end of the sentence. As an illustration, “g⚽al” is converted to “goal football”.
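
A minimal sketch of the simple replacement case, assuming the open-source emoji package, which maps each emoji to its short name:

```python
import emoji  # the open-source `emoji` package

def replace_emojis(text: str) -> str:
    # demojize() turns an emoji into its short name, e.g. 😀 -> "grinning_face".
    text = emoji.demojize(text, delimiters=(" ", " "))
    return text.replace("_", " ")

print(replace_emojis("i passed 😀"))  # -> "i passed  grinning face "
```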

Misspellings

There are different types of misspellings, and each requires a different preprocessing method.

  • Typo: This is the easiest one, as we only need to run our text through a spell-checking library. Having said that, keep in mind that no spell-checking library is perfect, and we should always be prepared to add extra logic.
  • Repeated Characters: Repeated characters like “goaaallllllllll” are useful for sentiment analysis and worth some feature engineering. For the spelling-correction part, notice that most spell-checking libraries rely on the Levenshtein distance, so repeated characters hurt their performance. To solve this problem, we can first use a regular expression to reduce any run of repeated characters down to 2 (since standard English words contain at most double characters), and then apply our spell-checking library, as in the sketch after this list.
  • Others: There are numerous ways for users to write meaningful messages with misspellings, from “hahahahahahahaha” to “G O A L”. A practical approach is to find the most common patterns in the data set by exploratory data analysis and focus only on them.
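
A minimal sketch of the repeated-characters fix, assuming the third-party pyspellchecker package as the spell-checking library:

```python
import re

from spellchecker import SpellChecker  # the pyspellchecker package

spell = SpellChecker()

def correct_repeated_characters(word: str) -> str:
    # Collapse runs of 3+ identical characters down to 2 (standard English
    # words contain at most double letters), then spell-check the result.
    reduced = re.sub(r"(.)\1{2,}", r"\1\1", word)
    return spell.correction(reduced) or reduced

print(correct_repeated_characters("goaaallllllllll"))  # e.g. -> "goal"
```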

Further Reading

  1. [1] takes a deep learning approach to tackling internet slang. It trains a set of word embeddings on the content of Urban Dictionary, and the initial evaluation results look quite promising.
  2. The HuggingFace transformers documentation on tokenizers [2] is a great introduction to the subject. In particular, it provides many examples and references for different kinds of tokenizers. If we want to use a pre-trained model such as BERT, we have to understand how to preprocess our data so that it works well with the BERT tokenizer.

References

  1. S. Wilson, W. Magdy, B. McGillivray, K. Garimella, and G. Tyson. Urban Dictionary Embeddings for Slang NLP Applications (2020), LREC 2020
  2. Tokenizer Summary, HuggingFace Transformers documentation
