Regular expressions in Python

Search and split functionality in the re module

Sohail Hosseini
Towards Data Science

--

Introduction

A regular expression, or regex, is a sequence of characters that describes a search pattern. It is used to check whether that pattern occurs in a given text (string), for example, to find out whether “123” exists in “Practical123DataScie”. The regex parser interprets “123” as a sequence of ordinary characters, each of which matches only itself in the string. But the real power of regular expressions appears when a pattern contains special characters called metacharacters. These have a special meaning to the regex matching engine and vastly enhance the capability of the search.

Regex functionality resides in a module named re, which is part of Python's standard library. As with any module in Python, we only need to import it to start working with it:

import re

This tutorial covers two very useful functions in the re module: search(), for finding a pattern in a string, and split(), for splitting a string wherever a pattern matches. You will also learn to create more complex matching patterns with metacharacters.

I) .search() function

A regular expression search is typically written as:

re.search(pattern, string)

This function scans through the string to find the first location where the pattern matches and returns a Match object for it. If there is no match, it returns None. Let us look at the following example:

s1 = "Practical123DataScie"
re.search("123", s1)
Output: <re.Match object; span=(9, 12), match='123'>

The output provides a lot of information. It tells you that there is a match and that it is located at s1[9:12]. This is an easy case, and we often need to search for more complex patterns. Imagine now that you want to look for three consecutive digits, like “456” or “789”. Here we cannot search for a literal string, because we do not know in advance what the digits are. They could be “124”, “052” and so on. How can we do that?

s2 = "PracticalDataScie052"
re.search('[0-9][0-9][0-9]', s2)
Output: <re.Match object; span=(17, 20), match='052'>
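
Because the pattern describes a shape rather than specific digits, the same call should also work on other strings. The following is a minimal sketch of that idea. The helper name first_three_digits and every sample string other than the one used above are made up for illustration.

```python
import re

def first_three_digits(text):
    """Return the first run of three consecutive digits in text, or None."""
    match = re.search("[0-9][0-9][0-9]", text)
    # re.search returns None when nothing matches, so guard before .group()
    return match.group() if match else None

# Hypothetical sample strings; only the first one comes from the article
samples = ["PracticalDataScie052", "Data124Science", "NoDigitsHere", "only12here"]
for text in samples:
    print(text, "->", first_three_digits(text))
```

The last two samples return None: one has no digits at all, and the other has only two digits in a row.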

There are a lot of concepts here to talk about. The pattern used here is ‘[0-9][0-9][0-9]’. First, let us talk about square brackets ([]). A pattern of the form [...] matches any single character listed inside the square brackets. For example:

re.search('[0]', s2)
Output: <re.Match object; span=(17, 18), match='0'>

The pattern ‘[0]’ locates the first character 0 in s2 and reports the match. If I need to locate more characters, such as three digits, I can write:

re.search('[0][5][2]', s2)
Output: <re.Match object; span=(17, 20), match='052'>

OK, you are right: I could just type ‘052’ as the pattern to locate it in s2. But things get interesting now. Inside square brackets, a hyphen (-) between two characters defines a range, so the pattern matches any single character within that range. For example:

re.search('[0-9]', s2)
Output: <re.Match object; span=(17, 18), match='0'>

The pattern ‘[0-9]’ matches any single digit from zero to nine in s2. So now, let us get back to our question of locating three consecutive digits. To do that, I can simply write:

re.search('[0-9][0-9][0-9]', s2)
Output: <re.Match object; span=(17, 20), match='052'>

Each bracketed range matches one digit in s2, so three of them in a row match three consecutive digits. Ranges also work for letters. For example:

re.search('[a-z][0-9]', s2)
Output: <re.Match object; span=(16, 18), match='e0'>

This pattern locates two characters: the first is any lowercase letter and the second is a digit. The output (‘e0’) is exactly what we expect, since “e” is the last letter before “052” in s2. For ASCII text, the regular expression ‘\d’ is equivalent to ‘[0-9]’. So, for the previous example, I can also use the following, written as a raw string (r'...') so that Python passes the backslash through to the regex engine unchanged:

re.search(r'[a-z][\d]', s2)
Output: <re.Match object; span=(16, 18), match='e0'>
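
The Match object printed in these examples can also be queried directly. The sketch below uses only the standard re module. It shows one way to read the matched text and its position, and to handle the None that search() returns when nothing matches. The second pattern, with an uppercase range, is an extra case added here for illustration.

```python
import re

s2 = "PracticalDataScie052"

match = re.search(r"[a-z]\d", s2)
if match is not None:
    found = match.group()       # the matched text
    start, end = match.span()   # indices such that s2[start:end] == found
    print(found, start, end)

# No uppercase letter in s2 is directly followed by a digit,
# so search() returns None here instead of a Match object.
no_match = re.search(r"[A-Z]\d", s2)
print(no_match)
```

Checking for None before calling .group() avoids an AttributeError when the pattern is not found.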

II) .split() function

Similar to the search function, a regular expression split is typically written as:

re.split(pattern, string)

This function splits the string using the pattern as the delimiter and returns the substrings as a list. Let us look at the following example:

re.split('[;]', 'Data;Science and; Data Analysis;courses')
Output: ['Data', 'Science and', ' Data Analysis', 'courses']

In this example, the pattern is [;], which makes the semicolon (;) the delimiter. Wherever a semicolon occurs in the string, the string is split at that location and the pieces are returned in a list. A pattern can also describe more than one possible delimiter. Let us look at a more complex example.

string = "Data12Science567programbyAWS025GoogleCloud"
re.split(r'\d+', string)
Output: ['Data', 'Science', 'programbyAWS', 'GoogleCloud']

In this example, our pattern is ‘\d+’. As we saw, ‘\d’ matches any digit (0 to 9). Adding a ‘+’ at the end makes the pattern match one or more consecutive digits. Therefore, any run of consecutive digits is treated as a single delimiter, and the substrings between the runs are returned in a list.
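
To see why the ‘+’ matters, it may help to compare the two patterns on the same string. The following is a small sketch, and the variable names single and runs are chosen here only for illustration.

```python
import re

string = "Data12Science567programbyAWS025GoogleCloud"

# Without '+', every single digit is its own delimiter, so empty strings
# appear between adjacent digits.
single = re.split(r"\d", string)

# With '+', a whole run of digits counts as one delimiter.
runs = re.split(r"\d+", string)

print(single)
# ['Data', '', 'Science', '', '', 'programbyAWS', '', '', 'GoogleCloud']
print(runs)
# ['Data', 'Science', 'programbyAWS', 'GoogleCloud']
```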

Let us consider the following string. I have two courses in the format “[Course Number] [Programming Language] [Course Name]”. The string is written on two lines and the spacing between the words is not uniform.

string = '''101              Python       DataScience 
102 R DataAnalysis'''
re.split(r'\s+', string)
Output: ['101', 'Python', 'DataScience', '102', 'R', 'DataAnalysis']

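In this example, the ‘\s’ pattern matches any whitespace character, including spaces, tabs, and newlines. Adding a plus sign ‘+’ makes the pattern match one or more consecutive whitespace characters, which is why both the uneven spacing and the line break are treated as single delimiters.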
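
Splitting the whole string at once loses the information about which fields belong to which course. One possible way to keep the records apart is to split line by line, as in the sketch below. The dictionary keys number, language, and name are hypothetical labels, and the sketch assumes every line has exactly three fields.

```python
import re

# Same content as the two-line string above, with the line break written as \n
string = "101              Python       DataScience \n102 R DataAnalysis"

courses = []
for line in string.splitlines():
    # strip() removes leading/trailing whitespace so no empty fields appear
    number, language, name = re.split(r"\s+", line.strip())
    courses.append({"number": number, "language": language, "name": name})

print(courses)
```

Each entry in courses then holds one course, for example the number "101" with the language "Python" and the name "DataScience".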

III) Conclusion

This tutorial discussed the search and split functionality of the re module. Using metacharacters to build different patterns is very beneficial in text mining.
