An Introduction to Regular Expressions

A beginner’s guide to using regular expressions with ease

Michael Scognamiglio
Towards Data Science

Image by Author

Introduction

The goal of this blog is to give a simple and intuitive introduction to regular expressions in Python for those without prior experience or knowledge. Regular expressions, aka regexes, are sequences of characters that define a pattern to look for in a string or series of strings. To further our understanding of regex and how it works, I will walk through some tutorials below.

Tutorials

In the following tutorials, we will learn how to:

  • gain access to the regular expression module in Python
  • match a string’s pattern using re.search()
  • create more complicated patterns using regex’s metacharacters

How to access the regex module

There are actually two ways to import the regex module and the search function.

The first way is to import the entire regex module and then use the module name as a prefix when you want to use the search function.

import re
re.search(<regex>, <string>)

The second way is to import the search function from the regular expression module by name and thus the function can be called directly without any prefixes.

from re import search
search(<regex>, <string>)

As you can see, no prefix is used when calling the search function this way. Keep in mind that the first argument, ‘regex’, is the pattern you want to search for, and the second argument, ‘string’, is the string you are searching. One more important note: the regex search function returns only the first occurrence of the pattern, or None if the pattern is not found.
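As a minimal sketch of the two import styles and the first-occurrence behavior described above, the snippet below searches a made-up string (`text` is purely illustrative) that contains the same word twice.

```python
import re
from re import search

text = "cat hat cat"

# Style 1: call the function through the module name.
first = re.search("cat", text)

# Style 2: call the function directly after importing it by name.
same = search("cat", text)

# Both report only the first 'cat', even though the string contains two.
print(first)  # <re.Match object; span=(0, 3), match='cat'>
print(same)   # <re.Match object; span=(0, 3), match='cat'>
```

Either style gives the same result, so the choice is mostly a matter of taste. The `re.` prefix makes it obvious where the function comes from.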

Applying re.search()

To fully understand the search function and how to use it, let’s look at some simple example problems.

Code Input:
import re
s='A-b_123_e3'
re.search('123', s)
Code Output:
<re.Match object; span=(4, 7), match='123'>

As can be seen above, the regex search function returns two important pieces of information.

The first is the span. The span is the position of the matched substring within the entire string. In this case, the span tells us that the match starts at index 4 and ends at index 7. Python counts from zero, so index 4 is the fifth character of the string.

Please keep in mind that the end index is not included, so even though the index of the ‘3’ character is actually six, the search function reports seven.

The second piece of information given by the search function is the match. Essentially, the match is the text in the larger string that matched the pattern you searched for.

To summarize, the search function is a powerful tool because it lets you determine whether a pattern occurs in a larger string and, if it does, tells you where the first occurrence is located.
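The span and the match can also be read from the returned object in code rather than from the printed output. The sketch below reuses the string from the example above and uses the Match object's `span()` and `group()` methods. The variable names are just for illustration.

```python
import re

s = "A-b_123_e3"
m = re.search("123", s)

# span() returns the (start, end) pair; the end index is exclusive.
start, end = m.span()
print(start, end)    # 4 7

# group() returns the matched text itself.
print(m.group())     # 123

# Because the end is exclusive, the span works directly as a slice.
print(s[start:end])  # 123
```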

Another useful fact about the regex search function is that it can easily be used in conditional statements. Please see the example below.

Code Input:
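import re
a='Jonny paid $5 for lunch and $10 dollars for dinner.'
# '$' on its own is a metacharacter (end of string), so escape it to match a literal dollar sign
if re.search(r'\$', a):
    print('Found $ signs.')
else:
    print("Did not find $ signs.")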
Code Output:
Found $ signs.

As you can see, a regex search can be treated as True or False: a successful search returns a Match object, which counts as True in a condition, while a failed search returns None, which counts as False.
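To make the True/False behavior concrete, here is a small sketch using the same sentence. One caveat: a bare `$` is itself a metacharacter that anchors to the end of the string, so the sketch escapes it as `\$` to look for a literal dollar sign. The word 'euro' is just an arbitrary example of text that is absent.

```python
import re

a = "Jonny paid $5 for lunch and $10 dollars for dinner."

hit = re.search(r"\$", a)    # escaped: look for a literal dollar sign
miss = re.search("euro", a)  # a substring that is not in the text

print(hit)    # <re.Match object; span=(11, 12), match='$'>
print(miss)   # None
print(bool(hit), bool(miss))  # True False

# An unescaped '$' is an anchor for the end of the string, so it "matches"
# (with an empty match) even when the text has no dollar sign at all.
end_anchor = re.search("$", "no currency here")
print(end_anchor.span())  # (16, 16)
```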

Of course, these are very simple examples, but the same concepts apply to more complicated problems as well.

Complex Regex Queries using metacharacters

The queries we have done so far have certainly been useful, but their applications are limited: we have only been able to match exact substrings.

However, if we use metacharacters, we can really see the power of regular expressions.

Square brackets are one metacharacter that is really useful for queries. Any characters that are put into square brackets form a character class, and a character class matches any single character it contains. To understand this better, let’s look at an example.

#First Method
Code Input:
A = 'AB#$_+*87ba_seven'
re.search('[0-9]+', A)
Code Output:
<re.Match object; span=(7, 9), match='87'>
#Second Method
Code Input:
A = 'AB#$_+*87ba_seven'
re.search('[0-9][0-9]', A)
Code Output:
<re.Match object; span=(7, 9), match='87'>

As you can see, there are actually two methods shown above. One of the benefits of regex is there are often multiple ways to solve a problem.

The brackets in this case make a character class of digits, where any digit from 0 to 9 is valid. The first method uses the ‘+’ character, which is referenced in the table. It looks for one or more repetitions of whatever comes directly before it. In this case, since there are two digits in a row, the ‘+’ tells the search function to keep matching digits after the first one. The second method simply repeats the character class twice, so it matches exactly two digits.
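A short sketch on the same string shows how the character class and ‘+’ interact. The letter ranges `[a-z]` and `[A-Z]` are not used in the examples above. They are added here as an assumption that they behave like `[0-9]`, which they do for plain ASCII letters.

```python
import re

A = "AB#$_+*87ba_seven"

one_digit = re.search("[0-9]", A).group()   # a single digit
digits = re.search("[0-9]+", A).group()     # one or more digits in a row
lower = re.search("[a-z]+", A).group()      # first run of lowercase letters
upper = re.search("[A-Z]+", A).group()      # first run of uppercase letters

print(one_digit)  # 8
print(digits)     # 87
print(lower)      # ba
print(upper)      # AB
```

Without the ‘+’, the class matches exactly one character. With it, the match extends for as long as the following characters stay inside the class.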

Another important metacharacter is the period. The dot (.) is a wildcard that matches any character except a newline. To see how this works, let’s look at another example.

Code input:
print(re.search('123.abc','123abc'))
print(re.search('123.abc','123\nabc'))
print(re.search('123.abc','123/abc'))
Code Output:
None
None
<re.Match object; span=(0, 7), match='123/abc'>

As you can see, the dot (.) metacharacter returns a result only when there is a valid character at the equivalent position in the larger string. No character at all, or a newline (\n), returns nothing, but as you can see in the third example, any other character will work.
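The sketch below runs the same pattern against a few more candidates to show which characters the dot accepts. The candidate strings are made up for illustration. Note that a real backslash has to be written as `\\` in an ordinary Python string, because sequences such as `\n` or `\a` are interpreted by Python before the regex engine ever sees them.

```python
import re

pattern = "123.abc"

candidates = [
    "123abc",     # nothing between '123' and 'abc'
    "123\nabc",   # a newline in the wildcard position
    "123/abc",    # a forward slash
    "123 abc",    # a space
    "123\\abc",   # a single literal backslash (written as \\ in Python)
]

# Map each candidate to True/False depending on whether the pattern matched.
results = {c: re.search(pattern, c) is not None for c in candidates}

for candidate, matched in results.items():
    print(repr(candidate), matched)
```

Only the first two candidates fail: one has no character in the wildcard position, and the other has a newline there. The slash, the space, and the literal backslash all match.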

Metacharacters \d, \D, \w, \W, \s, \S

These metacharacters are used to identify particular types of characters. \d matches any decimal digit, and \D matches any character that is not a decimal digit. The same idea applies to alphanumeric characters with \w. For ASCII text, \w is equivalent to [a-zA-Z0-9_], written with the character class syntax we discussed earlier. Like \D, \W is the opposite of its lowercase equivalent. \s and \S use the same idea for whitespace characters. Let’s see some examples.

Code input:
print(re.search(r'\w', '#@! .h3.&'))
print(re.search(r'\W', '#@! .h3.&'))
print(re.search(r'\d', '#@! .h3.&'))
print(re.search(r'\D', '#@! .h3.&'))
print(re.search(r'\s', '#@! .h3.&'))
print(re.search(r'\S', '#@! .h3.&'))
Code Output:
<re.Match object; span=(5, 6), match='h'>
<re.Match object; span=(0, 1), match='#'>
<re.Match object; span=(6, 7), match='3'>
<re.Match object; span=(0, 1), match='#'>
<re.Match object; span=(3, 4), match=' '>
<re.Match object; span=(0, 1), match='#'>

As you can see, using these special characters we can confirm whether this string contains alphanumeric characters and whitespace. If we do find these characters, we also learn their exact positions.
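These shorthand classes can be combined with the ‘+’ metacharacter from earlier to match whole runs of characters instead of a single one. The following is a small sketch on a made-up string. The patterns are written as raw strings (the `r` prefix) so that Python passes the backslash through to the regex engine untouched.

```python
import re

text = "Order 66 shipped"

number = re.search(r"\d+", text).group()  # first run of digits
word = re.search(r"\w+", text).group()    # first run of word characters
gap = re.search(r"\s", text).span()       # position of the first whitespace

print(number)  # 66
print(word)    # Order
print(gap)     # (5, 6)
```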

The \ character

The backslash is a very special and powerful tool in the regex toolbox. As we have seen before, the backslash character can be used to introduce special character classes like alphanumeric characters or whitespace. It is also used for anchors, which are another type of metacharacter. Finally, it can be used to escape metacharacters so that they are treated as ordinary characters.
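Escaping is easiest to see with the dot. In the sketch below, the strings are invented for illustration, and `re.escape()` is a helper from the `re` module that this post does not otherwise cover. It is shown only as a convenient way to add the backslashes automatically.

```python
import re

# Unescaped, the dot is a wildcard, so it also matches the '1' in '3150'.
wildcard = re.search("3.50", "3150")
print(wildcard)      # <re.Match object; span=(0, 4), match='3150'>

# Escaped with a backslash, the dot only matches a literal period.
escaped_miss = re.search(r"3\.50", "3150")
escaped_hit = re.search(r"3\.50", "3.50")
print(escaped_miss)  # None
print(escaped_hit)   # <re.Match object; span=(0, 4), match='3.50'>

# re.escape() adds the backslashes for you.
auto = re.escape("3.50")
print(auto)          # 3\.50
```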

Anchors \A, \Z, \b

\A is a useful anchor that attaches a query to the beginning of the searched string. Thus, the search will only return a result if the beginning of the searched string exactly matches the query.

Code Input:
print(re.search(r'\Achoc', 'chocolate bar'))
print(re.search(r'\Abar', 'chocolate bar'))
Code Output:
<re.Match object; span=(0, 4), match='choc'>
None

\Z is essentially the exact opposite of \A. In this case, the search function will only return a result if the end of the searched string exactly matches the query.

Code Input:
print(re.search(r'bar\Z', 'chocolate bar'))
print(re.search(r'late\Z', 'chocolate bar'))
Code Output:
<re.Match object; span=(10, 13), match='bar'>
None

As you can see above, only the first case returns a result because the searched string’s ending characters match the query. In the second case, ‘late’ does appear in the string but not at its end, so the search function returns None.
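Used together, \A and \Z can check that an entire string fits a pattern rather than just part of it. The helper below, `is_all_digits`, is a hypothetical name chosen for this sketch. It combines both anchors with the digit class and ‘+’ from earlier.

```python
import re

def is_all_digits(text):
    """Return True only if the whole string consists of one or more digits."""
    # \A pins the match to the start, \Z pins it to the end,
    # so nothing else is allowed on either side of the digits.
    return re.search(r"\A[0-9]+\Z", text) is not None

print(is_all_digits("2024"))     # True
print(is_all_digits("2024 AD"))  # False
print(is_all_digits("x2024"))    # False
```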

\b is very useful as it anchors a match to a word boundary. So \b needs there to be a word boundary for there to be a result. \b asserts that the current position of the parser is either the beginning or the end of a word. This may be hard to understand in words, so let’s look at an example.

Code Input:
print(re.search(r'\bbar', 'chocolate bar'))
print(re.search(r'\bbar', 'chocolatebar'))
Code Output:
<re.Match object; span=(10, 13), match='bar'>
None

As you can see, when there is a boundary, as in the first case (whitespace), using \b does give us a result from the search function. However, in cases like the second, where ‘bar’ sits in the middle of a word with no boundary before it, the search returns None.
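A boundary can be required on both sides of a word as well. The sketch below uses a made-up sentence to compare a one-sided and a two-sided boundary. It also shows why the `r` prefix matters here: in an ordinary Python string, `"\b"` is a backspace character rather than a regex boundary.

```python
import re

text = "a barn, a bar, and a crowbar"

loose = re.search(r"\bbar", text)     # boundary required only before 'bar'
strict = re.search(r"\bbar\b", text)  # boundary required on both sides
no_raw = re.search("\bbar", text)     # without r, "\b" is a backspace character

print(loose.span())   # (2, 5): the 'bar' inside 'barn'
print(strict.span())  # (10, 13): the standalone word 'bar'
print(no_raw)         # None
```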

Conclusion

The regex library is very powerful, and queries can become very complex. The goal of this blog was to give beginners an idea of what regex is and how they can use it. To keep things simple and concise, a lot of the more complicated queries and methods were omitted from this entry. I recommend that anyone who found this helpful look into more regex documentation online, as there is plenty.
