
A simple intro to Regex with Python

We go through basic examples of using Regex with Python and show how this framework can be used for powerful text analytics.

Introduction

Text mining is a hot topic in data science these days. The volume, variety, and complexity of textual data are increasing at an astounding pace.

As per this article, the global text analytics market was valued at USD 5.46 billion in 2019 and is expected to reach a value of USD 14.84 billion by 2025.

Regular expressions are used to check whether a pattern exists in a given sequence of characters (a string) and to locate the position of the pattern in a corpus of text. They help in manipulating textual data, which is often a prerequisite for data science projects that involve text analytics.

It is, therefore, important for budding data scientists to have a preliminary knowledge of this powerful tool for future projects and analysis tasks.

In Python, there is a built-in module called re, which needs to be imported for working with Regex.

import re

This is the starting point of the official documentation page.

In this short review, we will go through the basics of Regex usage in simple text processing with some practical examples in Python.

The ‘match’ method

We use the match method to check if a pattern matches at the beginning of a string/sequence. It is case-sensitive.
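A minimal sketch of this behaviour, using made-up strings (not the article's original snippet). It shows that re.match succeeds only when the pattern sits at the very start of the string and that letter case matters.

```python
import re

# match() only looks for the pattern at the start of the string
m1 = re.match(r"Py", "Python")     # pattern found at position 0
m2 = re.match(r"thon", "Python")   # 'thon' is present, but not at the start
m3 = re.match(r"py", "Python")     # case-sensitive, so lowercase 'py' fails

for m in (m1, m2, m3):
    print("Match" if m else "No match")
```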

A compiled program

Instead of repeating the code, we can use compile to create a regex program and use built-in methods.
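As an illustration, the sketch below compiles a pattern once and reuses it over a hypothetical word list. The names prog, words, and matched are chosen for this example only.

```python
import re

prog = re.compile(r"ing")   # compile once, reuse many times

words = ["Spring", "Cycling", "ingenious"]   # illustrative data
matched = []
for w in words:
    # prog.match returns a match object or None, so it works in an if-test
    if prog.match(w) is not None:
        matched.append(w)

print(matched)
```

Only the word that begins with 'ing' is collected, because match still anchors at the start of each string.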

So, compiled programs return special objects, e.g. match objects. But if the pattern doesn't match, they return None, which means we can still use the result in a conditional test!

Positional matching

We can easily use additional parameters of the match method to check for positional matching of a string pattern.
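A possible sketch of positional matching, assuming a compiled program prog with the pattern thon and two example strings picked for illustration. The pos argument of the compiled pattern's match method sets the index where matching starts.

```python
import re

prog = re.compile(r"thon")

r1 = prog.match("Python")            # None: 'thon' is not at index 0
r2 = prog.match("Python", pos=2)     # match: 'thon' starts at index 2
r3 = prog.match("Marathon", pos=2)   # None: index 2 starts 'rathon'
r4 = prog.match("Marathon", pos=4)   # match: 'thon' starts at index 4

for r in (r1, r2, r3, r4):
    print(r)
```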

Above, we notice that once we created a program prog with the pattern thon, we can use it any number of times with various strings.

Also, note that the pos argument indicates the position in the string where matching should start. For the last two code snippets, we change the starting position and get different results in terms of the match, although the string is identical.

A simple use case

Let’s see a use case. We want to find out how many words in a list end with the three letters ‘ing’.
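One way to sketch this use case is to start matching three characters before the end of each word. The word list here is invented for the example.

```python
import re

prog = re.compile(r"ing")
words = ["Spring", "Cycling", "Ringtone", "Running", "Ingenious"]

count = 0
for w in words:
    # Start matching at the third-to-last character (never below index 0)
    start = max(len(w) - 3, 0)
    if prog.match(w, pos=start):
        count += 1

print(count)
```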

The search method

For solving the problem above, we could have used a simple string method.
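For comparison, a sketch of the plain string-method version using str.endswith, with the same kind of illustrative word list:

```python
words = ["Spring", "Cycling", "Ringtone", "Running", "Ingenious"]

# str.endswith does the same job here without any regex
count = sum(w.endswith("ing") for w in words)
print(count)
```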

What’s so powerful about regex?

The answer is that it can match a very complex pattern. But to see such advanced examples, let’s first explore the search method.
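A small sketch contrasting the two methods on an example string chosen for illustration:

```python
import re

text = "Running"

m = re.match(r"ing", text)    # anchored at index 0
s = re.search(r"ing", text)   # scans through the whole string

print(m)
print(s)
```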

Note how the match method returns None (because we did not specify the proper starting position of the pattern in the text) but the search method finds the position of the match (by scanning through the text).

Naturally, we can use the span() method of the match object, returned by search, to locate the position of the matched pattern.
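For instance, under the assumption of a short example string, span() returns the start and end indices of the match, which can be used to slice the original text:

```python
import re

text = "Running is fun"
s = re.search(r"ing", text)

start, end = s.span()   # indices of the first 'ing' in the text
print(start, end)
print(text[start:end])
```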

The findall and finditer methods

The search method is powerful, but it is limited to finding the first match in the text. To discover all the matches in a long text, we can use the findall and finditer methods.

The findall method returns a list of all the matched strings. You can count the number of items to understand the frequency of the searched term in the text.
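A brief sketch of findall on an invented sentence, counting how often the pattern occurs:

```python
import re

text = "I am reading, writing and running."

hits = re.findall(r"ing", text)   # list of every non-overlapping match
print(hits)
print(len(hits))
```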

The finditer method produces an iterator. We can use this to see more information, as shown below.
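A sketch of finditer on an invented sentence. Each item yielded by the iterator is a match object, so the position of every occurrence is available.

```python
import re

text = "I am reading, writing and running."

spans = []
for m in re.finditer(r"ing", text):
    # Each m is a match object carrying the matched text and its position
    spans.append(m.span())
    print(m.group(), m.span())
```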

Wildcard matching (with single characters)

Now, we gently enter the arena where Regex shines through. The most common use of Regex is related to ‘wildcard matching’ or ‘fuzzy matching’. This is where you don’t have the full pattern but only a portion of it, and you still want to find where something similar appears in a given text.

Here are various examples. We will also apply the group() method on the object returned by search to return the matched string.

Single-character matching by DOT

Dot . matches any single character except the newline character.
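A short sketch with illustrative strings, showing the dot standing in for a letter or a symbol but not for a newline:

```python
import re

first = re.search(r"p.t", "A pet in a pot").group()   # first match only
every = re.findall(r"p.t", "pat pet p+t")             # letters and symbols
newline = re.search(r"p.t", "p\nt")                   # dot skips newline

print(first, every, newline)
```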

\w (lowercase w) to match any single letter, digit, or underscore

DOT matches almost any character, so it cannot restrict a match to a particular class of characters. For that, we need to expand the repertoire with other tools.
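A sketch of \w on an invented string. The letter, underscore, and digit variants match, while the one with a plus sign does not.

```python
import re

found = re.findall(r"p\wt", "pat p_t p3t p+t")
print(found)
```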

\W (uppercase W) matches anything not covered by \w

There are symbols other than letters, digits, and underscores. We use \W to catch them.
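The complementary sketch for \W, on the same kind of invented string. Only the variants with a symbol or a space in the middle are caught.

```python
import re

found = re.findall(r"p\Wt", "pat p_t p3t p+t p t")
print(found)
```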

Matching patterns with whitespace characters

\s (lowercase s) matches a single whitespace character like space, newline, tab, or return. Naturally, this is used to search for a pattern with whitespace inside it, e.g. a pair of words.

\d matches numerical digits 0–9
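A sketch of \s matching a pair of words, using example strings made up for illustration:

```python
import re

pattern = r"Data\sScience"

a = re.search(pattern, "I love Data Science")    # separated by a space
b = re.search(pattern, "I love Data\tScience")   # separated by a tab
c = re.search(pattern, "I love DataScience")     # no whitespace at all

print(a, b, c)
```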

Here is an example.
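A minimal sketch of \d with invented strings:

```python
import re

two_digits = re.search(r"\d\d", "My age is 42").group()   # two adjacent digits
singles = re.findall(r"\d", "Room 7, floor 12")           # every single digit

print(two_digits, singles)
```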

And here is an example of a practical application. Suppose we have a text describing the scores of some students in a test. Scores can range from 10–99, i.e. 2 digits. One of the scores is typed wrongly as a 3-digit number (Romie got 72 but it was typed as 721). The following simple code snippet catches it using \d wildcard matching.
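A sketch of how such a check could look. The sentence and the names other than Romie are invented, and the check assumes no score was typed with more than three digits.

```python
import re

text = "Ahmed got 93, Romie got 721, Max got 85."

# Three digits in a row can only come from a mistyped score
wrong = re.findall(r"\d\d\d", text)
print(wrong)
```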

Start of a string

The ^ (caret) matches a pattern at the beginning of a string (but not anywhere else).
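A short sketch of the caret with search, on two illustrative strings:

```python
import re

at_start = re.search(r"^Python", "Python is fun")   # pattern begins the string
elsewhere = re.search(r"^Python", "I like Python")  # pattern appears later

print(at_start, elsewhere)
```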

End of a string

The $ (dollar sign) matches a pattern at the end of the string. Following is a practical example where we are only interested in pulling out the patent information of Apple and discarding other companies. We check the end of the text for ‘Apple’ and, only if it matches, we pull out the patent number using the numerical digit matching code we showed earlier.
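A sketch of this idea with invented records. The 5-digit patent numbers and the sentence format are assumptions made for the example.

```python
import re

records = [
    "Patent no 12345 belongs to Apple",
    "Patent no 67890 belongs to Samsung",
    "Patent no 54321 belongs to Apple",
]

apple_patents = []
for line in records:
    # Keep the record only if it ends with 'Apple'
    if re.search(r"Apple$", line):
        number = re.search(r"\d\d\d\d\d", line).group()
        apple_patents.append(number)

print(apple_patents)
```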

Wildcard matching (with multiple characters)

Now, we can move on to more complex wildcard matching with multiple characters, which allows us much more power and flexibility.

Matching 0 or more repetitions

* matches 0 or more repetitions of the preceding regular expression.
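A brief sketch with the pattern ab* on made-up strings. The b may be absent or repeated.

```python
import re

zero = re.search(r"ab*", "ac").group()      # no 'b' at all still matches
many = re.search(r"ab*", "abbbc").group()   # takes as many 'b' as available

print(zero, many)
```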

Matching 1 or more repetitions

+ causes the resulting RE to match 1 or more repetitions of the preceding RE.
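The same illustrative strings with + instead of *. Now at least one b is required.

```python
import re

zero = re.search(r"ab+", "ac")              # no 'b', so no match
many = re.search(r"ab+", "abbbc").group()   # one or more 'b'

print(zero, many)
```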

Matching precisely 0 or 1 repetition

? causes the resulting RE to match precisely 0 or 1 repetitions of the preceding RE.
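A sketch using an optional letter, with spelling variants chosen for illustration:

```python
import re

us = re.search(r"colou?r", "color")        # zero 'u'
uk = re.search(r"colou?r", "colour")       # one 'u'
typo = re.search(r"colou?r", "colouur")    # two 'u' is too many

print(us, uk, typo)
```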

Controlling how many repetitions to match

{m} specifies exactly m copies of the preceding RE to match. Fewer copies cause a non-match, and None is returned.

{m,n} specifies from m to n copies of the preceding RE to match. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound.

{m,n}? specifies m to n copies of RE to match in a non-greedy fashion.
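The three forms above can be sketched side by side on an invented digit string. Note how the greedy form takes as many digits as allowed while the non-greedy form stops at the minimum.

```python
import re

too_few = re.search(r"\d{3}", "12")                  # needs exactly three digits
exact = re.search(r"\d{3}", "12345").group()
greedy = re.search(r"\d{2,4}", "123456").group()     # as many as allowed
lazy = re.search(r"\d{2,4}?", "123456").group()      # as few as allowed
open_ended = re.search(r"\d{2,}", "123456").group()  # no upper bound

print(too_few, exact, greedy, lazy, open_ended)
```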

Sets of matching characters

[xyz] matches x, y, or z. Note that commas are not separators inside a set: [x,y,z] would also match a literal comma.
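A small sketch of a character set on made-up strings, including what happens when a comma is placed inside the brackets:

```python
import re

letters = re.findall(r"[xyz]", "lazy fox")   # any one of x, y, z
with_comma = re.findall(r"[x,y]", "x,y")     # the comma is matched literally

print(letters, with_comma)
```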

Range of characters inside a set

A range of characters can be matched inside the set. This is one of the most widely used regex techniques. We denote range by using a -. For example, a-z or A-Z will match anything between a and z or A and Z i.e. the entire English alphabet.

Suppose we want to extract an email id. We use a regex that matches alphabetical characters + @ + .com. But it cannot catch an email id with some numerical digits in it.
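A sketch of such a letters-only pattern. The email addresses are invented for the example.

```python
import re

pattern = r"[a-zA-Z]+@[a-zA-Z]+\.com"

plain = re.search(pattern, "Reach me at coolguy@xyz.com")
digits = re.search(pattern, "Reach me at coolguy12@xyz.com")  # digits before @

print(plain.group())
print(digits)
```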

So, we expand the regex a little bit. But we are only extracting email ids with the domain name ‘.com’. So, it cannot catch the emails with other domains.
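One possible expansion, sketched with invented addresses: adding 0-9 to the first set lets digits through, but the hard-coded .com still rejects other domains.

```python
import re

pattern = r"[a-zA-Z0-9]+@[a-zA-Z]+\.com"

com = re.search(pattern, "Reach me at coolguy12@xyz.com")
org = re.search(pattern, "Reach me at coolguy12@xyz.org")  # different domain

print(com.group())
print(org)
```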

It is quite easy to expand on that, but a cleverly obfuscated email address may still evade extraction with such a regex.

Combining the power of Regex by OR-ing

Like any other good computable object, Regex supports boolean operations to expand its reach and power. OR-ing of individual Regex patterns is particularly interesting.

For example, if we are interested in finding phone numbers containing the ‘312’ area code, the following code fails to extract it from the second string.
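A sketch of the failing case, assuming a dash-separated pattern and two invented phone numbers:

```python
import re

pattern = r"312-\d{3}-\d{4}"

s1 = "My number is 312-555-1234"
s2 = "My number is 312.555.1234"   # same number, dots instead of dashes

r1 = re.search(pattern, s1)
r2 = re.search(pattern, s2)
print(r1, r2)
```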

We can combine the individual Regex patterns with the | (OR) operator to expand the power.
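A sketch of OR-ing two patterns with |. The variable names and phone numbers are illustrative, and joining the pattern strings is just one way to build the combination.

```python
import re

p_dash = r"312-\d{3}-\d{4}"
p_dot = r"312\.\d{3}\.\d{4}"

# '|' tries the left pattern first, then the right one
combined = re.compile(p_dash + "|" + p_dot)

r1 = combined.search("My number is 312-555-1234")
r2 = combined.search("My number is 312.555.1234")
print(r1.group(), r2.group())
```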

A combined example

Now, we show an example of extracting valid phone numbers from a text using findall() and the multi-character matching tricks we learned so far.

Note that a valid phone number with 312 area code is of the pattern 312-xxx-xxxx or 312.xxx.xxxx.
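A sketch of the combined extraction on an invented text. Numbers with a different area code or without separators are left out.

```python
import re

text = ("Call 312-555-1234 or 312.555.9876, "
        "not 3125551111 or 773-555-0000.")

pattern = r"312-\d{3}-\d{4}|312\.\d{3}\.\d{4}"
valid = re.findall(pattern, text)
print(valid)
```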

The Split method

Finally, we talk about a method that can be used in creative ways to extract meaningful text from an irregular corpus. A simple example is shown below, where we build a Regex pattern with the extraneous characters that are messing up a regular sentence and use the split() method to get rid of those characters.
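A minimal sketch of this clean-up, assuming a sentence polluted with a handful of symbols chosen for the example:

```python
import re

messy = "This$is&a#regular%sentence"

# Inside a set, these symbols are treated as literal characters
parts = re.split(r"[$&#%]", messy)
clean = " ".join(parts)

print(parts)
print(clean)
```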

Summary

We reviewed the essentials of defining Regex objects and search patterns with Python and how to use them for extracting patterns from a text corpus.

Regex is a vast topic, almost a small programming language in itself. Readers, particularly those who are interested in text analytics, are encouraged to explore this topic further from other authoritative sources. Here are a few links:

Regular Expressions Demystified: RegEx isn’t as hard as it looks

For JavaScript enthusiasts:

An Introduction to Regular Expressions (Regex) In JavaScript

Top 10 most wanted Regex expressions, ready-made for you:

Regex cookbook – Most wanted regex


Also, you can check the author’s GitHub repositories for code, ideas, and resources in machine learning and data science. If you are, like me, passionate about AI/machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

Tirthajyoti Sarkar – Sr. Principal Engineer – Semiconductor, AI, Machine Learning – ON…

