
Regex essential for NLP

Understanding various regular expressions and applying them to frequently encountered situations in Natural Language Processing

Photo by Nathaniel Shuman on Unsplash

Why are regular expressions essential for NLP?

Whenever we deal with text data, it is almost never in the form we want it to be. The text may have words we want to remove, punctuation that is not needed, hyperlinks or HTML that can be done away with, and dates or numerical entities that can be made simpler. In this article, I will describe some basic regular expressions in Python using the re module and some common situations in NLP where I end up using them. Please note that the regular expressions in this article are presented in increasing order of complexity.


Some basic terminology

Before we get started let’s just get some basics straight. I will only be explaining terms that are later used in this article so that nothing is too overwhelming.

\w  represents any alphanumeric character (including underscore)
\d  represents any digit
.   represents ANY character except a newline (do not confuse it with a literal period)
abc literally matches the characters abc in a string
[abc] matches either a or b or c (characters within the brackets)
?   after a character indicates that the character is optional
*   after a character indicates it can be repeated 0 or more times
+   after a character indicates it can be repeated 1 or more times
\   is used to escape special characters (we'll use this a lot!)
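To make the terminology concrete, here is a small sketch that tries each token with re.findall. The sample strings and variable names are my own illustrative assumptions, not part of any dataset.

```python
import re

sample = "Room_42 costs 1500 rupees."  # made-up sample sentence

# \w+ : runs of alphanumeric characters or underscores
words = re.findall(r"\w+", sample)

# \d+ : runs of one or more digits
numbers = re.findall(r"\d+", sample)

# . : any single character between c and t
any_char = re.findall(r"c.t", "cat cot c-t")

# [ao] : exactly one of the characters listed in the brackets
either = re.findall(r"c[ao]t", "cat cot cut")

# ? : the preceding character is optional
optional = re.findall(r"colou?r", "color or colour")

# * allows zero repetitions, + needs at least one
zero_or_more = re.findall(r"lo*l", "ll lol lool")
one_or_more = re.findall(r"lo+l", "ll lol lool")

print(words, numbers, any_char, either, optional, zero_or_more, one_or_more)
```

Comparing zero_or_more with one_or_more is probably the quickest way to see the difference between * and +: only the first one accepts ll.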

Escaping special characters

From the section above you can see that many characters are used by re as special characters and have their own meanings. For example . ? and $ are some of the characters that do not match themselves literally. If we literally want to match these characters we must precede them with a backslash (\). Consider \? which will literally match a ? in the string. (The forward slash / is not special in Python's re, so it needs no escaping.)
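A minimal sketch of escaping, assuming a made-up sentence. It compares an escaped \? and \. with an unescaped . and also uses re.escape from the standard library, which escapes special characters for you.

```python
import re

question = "Is this a question? Yes."  # made-up sample sentence

# Escaped: matches only the literal character
question_marks = re.findall(r"\?", question)
periods = re.findall(r"\.", question)

# Unescaped: . matches every character, not just periods
everything = re.findall(r".", "a?")

# re.escape builds an escaped pattern from a literal string
escaped = re.escape("?")

print(question_marks, periods, everything, escaped)
```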


Let’s get our hands dirty!

I will be using the re module in Python to substitute and search for patterns in strings. To use it simply import re. The following regular expressions and use cases are in increasing order of complexity so feel free to jump around.

Situation 1: Removing words occurring at the start or end of the string

Say we have a sentence the friendly boy has a nice dog, the dog is friendly

Now if we want to remove the first ‘the’ we can simply use the regex ^the. When using re.sub, the second parameter is the replacement; since we want to remove the word completely, we replace it with an empty string.

Similarly to remove friendly at the end we use the regex friendly$

Note how only the words at the beginning and end were removed and their other occurrences remained unchanged.
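The two substitutions described above can be sketched as follows. The variable names are my own; the sentence and the patterns ^the and friendly$ come from the text.

```python
import re

sentence = "the friendly boy has a nice dog, the dog is friendly"

# ^ anchors the match to the start of the string
no_leading_the = re.sub(r"^the", "", sentence)

# $ anchors the match to the end of the string
no_trailing_friendly = re.sub(r"friendly$", "", no_leading_the)

print(no_leading_the)
print(no_trailing_friendly)
```

The second 'the' and the first 'friendly' survive because neither sits at the anchored position. The spaces next to the removed words are left behind, so a final .strip() may be useful.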


Situation 2: Removing numbers and currencies

This one is pretty straightforward. In NLP we often remove numeric values as the model does not truly learn from them, especially when we have lots of different ones.

In the sentence I have 500 rupees, we can use the expression \d+ to match one or more digits.

Now let’s look at a slight variation. Say we have the sentence I have 500$ and you have $200. Note how the dollar sign’s position is different. To handle this we can use [\$\d+\d+\$] which uses \ to literally match $. Because the square brackets make this a character class, it matches any single $, digit or + character, so digits and dollar signs are removed whether the dollar sign comes before or after the number.
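Here is a hedged sketch of both cases. The first two substitutions use the expressions from the text; the third pattern, \$?\d+\$?, is an alternative I am assuming for illustration (an optional dollar sign on either side of a run of digits), not the expression from the article.

```python
import re

rupees = "I have 500 rupees."
dollars = "I have 500$ and you have $200."

# \d+ removes every run of digits
no_digits = re.sub(r"\d+", "", rupees)

# Character class from the text: removes any single $, digit or + character
char_class = re.sub(r"[\$\d+\d+\$]", "", dollars)

# Assumed alternative: digits with an optional $ before or after
stricter = re.sub(r"\$?\d+\$?", "", dollars)

# The two differ on text containing a plain + sign
plus_char_class = re.sub(r"[\$\d+\d+\$]", "", "2+2 costs $4")
plus_stricter = re.sub(r"\$?\d+\$?", "", "2+2 costs $4")

print(no_digits, char_class, stricter, plus_char_class, plus_stricter, sep="\n")
```

On the example sentence both patterns give the same result. The character class also deletes stray + signs, which may or may not be what you want.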


Situation 3: Handling all kinds of date formats

The problem with dates is you can write them in different ways. 14-07-2021, 14/07/2021 and 14.07.2021 all mean the same. The expression below handles all the formats.
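One possible expression for this, as a sketch: one or two digits, a separator from the set - / . and then the same again, followed by a two to four digit year. This pattern is my assumption of a reasonable solution and may differ from the one originally used; the sample sentence and the <DATE> placeholder are also made up.

```python
import re

# Assumed pattern: day, separator, month, separator, year
date_pattern = r"\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}"

text = "Meetings on 14-07-2021, 14/07/2021 and 14.07.2021 were cancelled."

found = re.findall(date_pattern, text)
masked = re.sub(date_pattern, "<DATE>", text)

print(found)
print(masked)
```

Inside the brackets the hyphen is placed first so that it is read as a literal character rather than as a range. Note that this sketch does not check that the day and month values are valid.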


Situation 4: Removing hyperlinks

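When we encounter URLs that we completely want to do away with we can use the expression https?://.*[\r\n]* This matches URLs starting with either http:// or https://, because the ? makes the s optional. Note that .* continues to the end of the line, so any text following the URL on the same line is removed as well.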
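A short sketch of the expression in use. The sample sentences are made up, and the second pattern, https?://\S+, is an assumed variant that stops at the first whitespace instead of consuming the rest of the line.

```python
import re

url_pattern = r"https?://.*[\r\n]*"

https_text = "Read my blog at https://rajsangani.me/about"
http_text = "Read my blog at http://rajsangani.me/"

# Both http and https URLs are removed
no_https = re.sub(url_pattern, "", https_text)
no_http = re.sub(url_pattern, "", http_text)

# Assumed variant: \S+ stops at whitespace, keeping the text after the URL
mid_sentence = "See https://rajsangani.me/about for details"
keep_rest = re.sub(r"https?://\S+", "", mid_sentence)

print(no_https, no_http, keep_rest, sep="\n")
```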


Situation 5: Extracting the main domain name of a URL

In NLP we may need to analyse URLs. If we are only interested in the domain name and not links to particular pages or query parameters then we need to use an expression to make all such links uniform.

Say I want to extract ONLY rajsangani from https://rajsangani.me/about, https://rajsangani.me/ or www.rajsangani.me (this last one doesn’t exist).

In that case, I can use the code below

Note how we use re.search instead of re.sub here. We are asking the code to search for the URL starting after either a ‘ . ‘ or a ‘/’ and ending before a ‘ . ‘
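The behaviour described above could be sketched with a lookbehind and a lookahead. This exact pattern is my assumption of one way to do it: (?<=[./]) requires a . or / just before the match and (?=\.) requires a . just after it, without including either in the result.

```python
import re

urls = [
    "https://rajsangani.me/about",
    "https://rajsangani.me/",
    "www.rajsangani.me",
]

# Assumed pattern: word characters preceded by . or / and followed by .
domain_pattern = r"(?<=[./])\w+(?=\.)"

domains = [re.search(domain_pattern, url).group() for url in urls]

print(domains)
```

One limitation of this sketch: for a URL such as https://www.rajsangani.me the first match is www, so a URL with both a scheme and a www prefix would need the prefix stripped first.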


Situation 6: Removing SOME punctuation

In NLP we must preprocess our text according to the task at hand. In some cases, we need to remove all punctuation in a string, but during a task like sentiment analysis it is important to hold onto some punctuation marks such as ‘!’ which express strong sentiment.

If we want to remove all punctuation we can simply use [^a-zA-Z0-9] which matches everything except alphanumeric characters. Keep in mind that this also matches whitespace.
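A quick sketch of this expression on a made-up sentence. The second pattern, which adds \s inside the brackets so that whitespace is kept, is my own assumed variation.

```python
import re

greeting = "Hi, I am Raj!"  # made-up sample sentence

# Removes everything that is not a letter or digit, spaces included
only_alnum = re.sub(r"[^a-zA-Z0-9]", "", greeting)

# Assumed variation: \s in the negated class preserves whitespace
keep_spaces = re.sub(r"[^a-zA-Z0-9\s]", "", greeting)

print(only_alnum)
print(keep_spaces)
```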

But say we want to remove everything except exclamation marks. We can then use the following piece of code.

import re

# Removing all punctuation except exclamation marks.
# The single quote is escaped so it does not end the raw string.
re.sub(r'[.;:,?"\'/]', '', '''Hi, I am :" Raj Sangani ! Nice to meet you!!!!!!''')
# Result: Hi I am  Raj Sangani ! Nice to meet you!!!!!!

Feel free to play around with the tokens inside the expression as per your need.


Situation 7: Replacing two or more Exclamation marks with a single one

Again, coming to sentiment analysis, we may face sentences that have several punctuation marks clubbed together. This is very common with exclamation marks, especially on social media platforms like Twitter. In the sentence Hi! I am Raj Sangani!! Nice to see you!!!!!! there are redundant exclamation marks. If we want to replace 2 or more exclamation marks with a single one we can use the expression below. {2,} indicates 2 or more.
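A minimal sketch of this substitution, using the sentence from the text and the {2,} quantifier applied to an exclamation mark. The variable names are my own.

```python
import re

excited = "Hi! I am Raj Sangani!! Nice to see you!!!!!!"

# !{2,} matches a run of two or more exclamation marks
calmer = re.sub(r"!{2,}", "!", excited)

print(calmer)
```

The single ! after Hi is untouched because it is not part of a run of two or more.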


Situation 8: Replacing 2 or more consecutive whitespaces with a single one

While scraping or converting PDFs into text files we are often left with strings that have more than one whitespace between words. What’s worse is that in some places words are separated perfectly (with one whitespace) and in others, there are 2 or even 3 whitespaces. To make the spacing uniform we can use the code below.

Again notice the space before {2,} within the quotes, this is what matches with whitespaces.
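As a sketch, the pattern is a single space followed by {2,}. The sample sentence is made up, and the \s{2,} variant is an assumption for cases where tabs or newlines should be collapsed too.

```python
import re

uneven = "This  sentence   has uneven    spacing"  # made-up sample

# A literal space followed by {2,}: two or more consecutive spaces
even = re.sub(r" {2,}", " ", uneven)

# Assumed variant: \s also covers tabs and newlines
mixed = re.sub(r"\s{2,}", " ", "tabs\t\tand  spaces")

print(even)
print(mixed)
```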


Situation 9: Combining words that are hyphenated or space separated

This is a really interesting situation that I recently came across while preprocessing for a project. Words like cheesecake are sometimes written as two spaced words cheese cake or appear in the hyphenated form cheese-cake. If we want both occurrences to be replaced by cheesecake we can use the code below.

For example in the sentence, I like strawberry cheese cake more than blueberry cheese-cake or chocolate cheesecake.

The ? here indicates that the whitespace and hyphen are optional.
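One way to write this, as a sketch: a character class containing whitespace and a hyphen, made optional with ?. The exact pattern cheese[\s-]?cake is my assumption based on the description; the sentence is the one from the text.

```python
import re

dessert = ("I like strawberry cheese cake more than blueberry "
           "cheese-cake or chocolate cheesecake.")

# [\s-]? : an optional whitespace character or hyphen between the two halves
merged = re.sub(r"cheese[\s-]?cake", "cheesecake", dessert)

print(merged)
```

The already joined cheesecake also matches, because the separator is optional, and is simply replaced by itself.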


Conclusion

I hope this was helpful to everyone who encounters these situations regularly while pre-processing text. Please let me know if I have missed any other situations that one commonly encounters.

Check out my GitHub for some other projects. You can contact me here. Thank you for your time!

