
For the longest time, I used regular expressions by copy-pasting code from Stack Overflow and never bothered to understand it, so long as it worked.
However, I soon realized that to tackle software and data-related tasks such as web scraping, sentiment analysis, and string manipulation, regex was a must-have tool.
This article aims to demystify regex characters, and the python functions for handling them, by tackling them one at a time while providing clear and simple examples.
What is a regular expression? Simply put, it is a sequence of characters that makes up a pattern used to find, replace, and extract textual data. Regex has its own syntax and special characters, which vary slightly across programming languages.
Pattern – This refers to a regular expression string, and contains the information we are looking for in a long string. It could be a word, a series of regex special symbols, or a combination of both.
Match – If the pattern is found in the string, we call this substring a match, and say that the pattern has been matched. There can be many matches in one operation.
Advantages of regular expressions:
- Regex operations can be faster to execute than equivalent hand-written string code, especially for complex patterns.
- Most modern programming languages include a regex engine for handling regular expressions, and largely the same logic applies across languages.
- Regex offers a way to quickly write patterns that will tackle complex string operations, saving you time.
Regular expressions in Python
Python’s engine for handling regular expressions is the built-in library called re.
Once you import the library into your code, you can use any of its powerful functions mentioned below.
These functions take as arguments:
- the pattern to search for,
- the string of text to search in,
- an optional flags parameter that alters the behavior of the regex engine. An example is flags=re.I, which makes the pattern case-insensitive. Other commonly used flags include re.M, re.S, and re.L.
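As a rough sketch of how flags change matching, the snippet below uses a made-up two-line string (an assumption for illustration only) to show re.M and re.S, and how flags can be combined with the | operator. It relies on the symbols ^, \w, + and the dot, which are explained later in this article.

```python
import re

# Hypothetical two-line sample text
text = "first line\nSecond line"

# Default: ^ matches only at the very start of the string
start_default = re.findall(r"^\w+", text)
# re.M (re.MULTILINE): ^ also matches right after every newline
start_multiline = re.findall(r"^\w+", text, flags=re.M)

# Default: the dot does not match a newline
dot_default = re.findall(r"first.+", text)
# re.S (re.DOTALL): the dot matches newlines too
dot_all = re.findall(r"first.+", text, flags=re.S)

# Flags are combined with the | operator
combined = re.findall(r"^s\w+", text, flags=re.I | re.M)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'Second']
print(dot_default)      # ['first line']
print(dot_all)          # ['first line\nSecond line']
print(combined)         # ['Second']
```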
Using the re library, you can:
- Check for the presence of a pattern in a string.
- Return the number of times the pattern is present.
- Get the position(s) of the pattern.
Note: Regex searches text from left to right. Once matched, that part of the string is used up and cannot be matched again in the same operation.
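A minimal sketch of these three uses, and of the left-to-right behavior described in the note, might look like this. It uses functions that are introduced one by one in the rest of this section, and the variable names are only illustrative.

```python
import re

text = "There was further decline of the UHC"

# 1. Presence: re.search returns a match object, or None if nothing is found
found = re.search("the", text) is not None

# 2. Count: number of non-overlapping matches
count = len(re.findall("the", text))

# 3. Positions: start index of every match
positions = [m.start() for m in re.finditer("the", text)]

print(found, count, positions)  # True 2 [13, 29]

# Matched text is "used up": 'aaaa' holds two non-overlapping 'aa', not three
pairs = re.findall("aa", "aaaa")
print(pairs)  # ['aa', 'aa']
```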
re.findall(pattern, text) – This function returns all the matched strings in a list. The code below searches for ‘the’ in the text and returns a list of all matched strings.
import re
text = 'There was further decline of the UHC'
re.findall("the", text)
###Output
['the', 'the']
The regex pattern is case-sensitive, so use the parameter flags=re.I or flags=re.IGNORECASE for a case-insensitive search.
re.findall("the", text, flags=re.I)
###Output
['The', 'the', 'the']
Note how parts of ‘There’ and ‘further’ are also matched, because the pattern looks for this exact sequence of characters regardless of what comes before or after it.
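If only the standalone word is wanted, one option is to wrap the pattern in \b word boundaries, a metacharacter covered later in this article. A small sketch of the difference:

```python
import re

text = "There was further decline of the UHC"

# Plain pattern: also matches inside 'There' and 'further'
anywhere = re.findall(r"the", text, flags=re.I)

# With word boundaries: only the standalone word 'the'
whole_word = re.findall(r"\bthe\b", text, flags=re.I)

print(anywhere)    # ['The', 'the', 'the']
print(whole_word)  # ['the']
```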
re.search(pattern, text) – Returns the first occurrence of the match as a match object.
re.search("the", text)
###Output
<re.Match object; span=(13, 16), match='the'>
The match object contains information about the matched string, such as its span (start and end position in the text), and the match string itself. You can further extract these details by calling its .group(), .span(), .start(), and .end() methods as shown below.
match_obj = re.search("the", text)
#index span of matched string
print(match_obj.span())
### (13, 16)
#the matched string
print(match_obj.group())
### the
#start position of match
print(match_obj.start())
### 13
#end position of match
print(match_obj.end())
### 16
re.match(pattern, text) – This checks the beginning of the text for the pattern. If the pattern is present at the very start, a match object is returned; otherwise, None is returned. In the code below, ‘the’ (case-sensitive) does not appear at the beginning, and therefore no output is printed.
re.match('the', text)
### No output
We can also use an if-else statement that prints a custom message if a pattern is present or not.
text = 'The focus is on 2022'
is_match = re.match('the', text, re.I)
if is_match:
    print(f"'{is_match.group()}' appears at {is_match.span()}")
else:
    print(is_match) #None
###Output
'The' appears at (0, 3)
re.finditer(pattern, text) – This returns an iterator of match objects that we then wrap with a list to display them.
text = 'There was further decline of the UHC'
match = re.finditer('the', text,
flags=re.I)
list(match)
###
[<re.Match object; span=(0, 3), match='The'>,
<re.Match object; span=(13, 16), match='the'>,
<re.Match object; span=(29, 32), match='the'>]
re.sub(pattern, repl, text) – This replaces the matched substring(s) with the 'repl' string.
text = 'my pin is 4444'
re.sub('4', '*', text)
###Output
'my pin is ****'
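Two further options of re.sub() are worth a quick sketch: the count argument, which limits how many matches are replaced, and passing a function as repl, which receives each match object and returns the replacement text.

```python
import re

text = "my pin is 4444"

# count=2 replaces only the first two matches
first_two = re.sub("4", "*", text, count=2)
print(first_two)  # my pin is **44

# repl can be a function that is called once per match
shouted = re.sub("pin", lambda m: m.group().upper(), text)
print(shouted)  # my PIN is 4444
```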
re.split(pattern, text) – This splits the text at the position of the match(es), into elements in a list.
text = "wow! nice! love it! bye! "
re.split("!", text)
###Output
['wow', ' nice', ' love it', ' bye', ' ']
Regex metacharacters
Metacharacters are symbols that have a special meaning in regular expression language, and this is where the power of regex shines.
In this section, we’ll explore the different metacharacters, and use re.findall() to check for the presence of a pattern in a string and return all the matched substrings.
Using r (Python raw string): Preceding a pattern with an r tells Python to keep every backslash in the string as a literal character, instead of treating it as the start of a string escape such as '\n' (a newline). The backslashes then reach the regex engine intact, so it can interpret its own special sequences, and as you will see, the backslash features prominently.
Escaping regex characters with a backslash: When you want to search for one of the regex symbols below as a literal character in a text, you have to escape it with a backslash (while also using the r raw string) so that it loses its special regex meaning.
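The short sketch below illustrates both points with made-up strings: what the r prefix changes, and the difference between an unescaped and an escaped dot. It also shows re.escape(), a helper in the re library that adds the backslashes for you.

```python
import re

# In a normal string "\b" is one character (a backspace);
# in a raw string it stays as two characters: a backslash and a b.
normal_len = len("\b")
raw_len = len(r"\b")
print(normal_len, raw_len)  # 1 2

# An unescaped dot is the wildcard; an escaped dot matches only a period
wildcard = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b")
print(wildcard)     # ['a', '.', 'b']
print(literal_dot)  # ['.']

# re.escape() escapes the special characters in a string
escaped = re.escape("3.50")
print(escaped)  # 3\.50
```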
- . (the dot character, or wildcard) – This matches any single character in the string except a newline. This could be a digit, white space, letter, or punctuation.
pattern = r'.'
re.findall(pattern,
"Wow! We're now_25")
###Output
['W', 'o', 'w', '!', ' ', 'W', 'e', "'", 'r', 'e', ' ', 'n', 'o', 'w', '_', '2', '5']
- \w (lowercase w) – Any word character (letter, digit, or underscore).
pattern = r'\w'
re.findall(pattern,
"Wow! We're now_25")
###Output
['W', 'o', 'w', 'W', 'e', 'r', 'e', 'n', 'o', 'w', '_', '2', '5']
- \W (uppercase w) – Anything that is not matched by \w, such as spaces and special characters.
pattern = r'\W'
re.findall(pattern,
"Wow! We're now_25")
###Output
['!', ' ', "'", ' ']
- \d – Any digit, 0 to 9.
pattern = r'\d'
re.findall(pattern,
"Wow! We're now_25")
###Output
['2', '5']
- \D – Any non-digit. Negates \d.
pattern = r'\D'
re.findall(pattern,
"Wow! now_25")
###Output
['W', 'o', 'w', '!', ' ', 'n', 'o', 'w', '_']
- \s (lowercase s) – A white space character.
pattern = r'\s'
re.findall(pattern,
"Wow! We're now_25")
###Output
[' ', ' ']
- \S (uppercase s) – Negates \s. Returns anything that is not a white space.
pattern = r'\S'
re.findall(pattern,
"Wow! Now_25")
###Output
['W', 'o', 'w', '!', 'N', 'o', 'w', '_', '2', '5']
Character sets
- [] matches any of the characters inside the square brackets. For example, the pattern ‘[abc]’ looks for either a or b or c in the text, and can also be written as ‘a|b|c’. You can also define a range inside the brackets using a dash, instead of writing down every single character. For example, [a-fA-F] matches any lowercase or uppercase letters from a to f. The code below returns any vowels.
pattern = r'[aeiou]'
re.findall(pattern,
"Wow! We're now_25")
###Output
['o', 'e', 'e', 'o']
- [^] Having a hat ^ character right after the opening square bracket negates the character set. It returns the opposite of the characters or ranges inside the square brackets. The code below returns everything except the letters m to z.
#Any char except letters m to z
pattern = r'[^m-zM-Z]'
re.findall(pattern,
"Wow! We're now_25")
###Output
['!', ' ', 'e', "'", 'e', ' ', '_', '2', '5']
Repetition regex patterns
Also called quantifiers, these special characters are written right after a pattern or character to tell the regex engine how many times to match it.
- + (once or more) – Matches if the previous pattern appears one or more times. The code below matches 'hell' followed by one or more 'o' characters.
#match o in hello once or many times
text = 'hell hello ago helloo hellooo'
pattern = r'hello+'
re.findall(pattern, text)
###Output
['hello', 'helloo', 'hellooo']
- * (zero or more) – Matches if the previous pattern appears zero or more times.
#match o in hello zero or many times
text = 'hell hello ago helloo hellooo'
pattern = r'hello*'
re.findall(pattern, text)
['hell', 'hello', 'helloo', 'hellooo']
- ? (zero or once) – Matches if the previous pattern appears zero times or once.
#match o in hello zero times or once
text = 'hell hello ago helloo hellooo'
pattern = r'hello?'
re.findall(pattern, text)
['hell', 'hello', 'hello', 'hello']
- {n} – Defines the exact number of times to match the previous character or pattern, e.g. 'd{3}' matches 'ddd'.
#Extract years
text = '7.6% in 2020 now 2022/23 budget'
pattern = r'\d{4}'
re.findall(pattern, text)
['2020', '2022']
- {min,max} – Defines the minimum (min) and maximum (max) number of times to match the previous pattern, e.g. 'd{2,4}' matches 'dd', 'ddd' and 'dddd'.
#Dot followed by 2 to 5 word chars
text = '[email protected] [email protected] [email protected]'
pattern = r'\.\w{2,5}'
re.findall(pattern, text)
['.com', '.me', '.biz']
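To try the {min,max} quantifier yourself, here is a self-contained sketch. The addresses are hypothetical placeholders on example domains, chosen only so that the endings are .com, .me and .biz; they are not the addresses used in the original example.

```python
import re

# Hypothetical placeholder addresses (assumption, for illustration only)
text = "sue@example.com joe@example.me ann@example.biz"

# A literal dot followed by 2 to 5 word characters
pattern = r"\.\w{2,5}"
endings = re.findall(pattern, text)
print(endings)  # ['.com', '.me', '.biz']

# The maximum caps the match: only 5 word characters are taken
capped = re.findall(pattern, "site.technology")
print(capped)  # ['.techn']
```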
- {min,} – Matches the previous element at least 'min' times.
#Long words
text = 'universal healthcare is low'
pattern = r'\w{5,}'
re.findall(pattern, text)
['universal', 'healthcare']
Greedy quantifiers – All the above quantifiers are said to be greedy, in that they attempt to take up as many characters as possible for every match, resulting in the longest match that still satisfies the pattern. For example, re.findall('b+', 'bbbb') returns one match, ['bbbb'], which is the longest possible match, even though the shorter matches ['b', 'b', 'b', 'b'] would also satisfy the pattern.
Non-greedy (lazy) – You can make a quantifier non-greedy by adding a question mark (?) after the quantifier. This means that the regex engine will return the fewest characters possible per match. The image below shows a comparison of the quantifiers’ behaviors in greedy vs non-greedy modes.
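The same comparison can be sketched in code. The strings below are made up for illustration; each pattern is run once in greedy mode and once with the extra ? that makes it lazy.

```python
import re

text = "bbbb"

greedy_plus = re.findall(r"b+", text)        # as many b's as possible
lazy_plus = re.findall(r"b+?", text)         # as few b's as possible
greedy_range = re.findall(r"b{2,3}", text)   # prefers 3 b's
lazy_range = re.findall(r"b{2,3}?", text)    # prefers 2 b's

print(greedy_plus)   # ['bbbb']
print(lazy_plus)     # ['b', 'b', 'b', 'b']
print(greedy_range)  # ['bbb']
print(lazy_range)    # ['bb', 'bb']

# A common practical case: matching tags
html = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.+>", html)
lazy_tags = re.findall(r"<.+?>", html)

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```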

Boundary/ anchors
- ^ – matches only the start of a text, and therefore ^ is written as the first character in the pattern. Note that this is different from [^..] which negates the pattern enclosed in square brackets.
#Starts with two digits
text = '500,000 units'
pattern = r'^\d\d'
re.findall(pattern, text)
###Output
['50']
- $ – matches the end of the string and is therefore written at the end of a pattern.
#Ends with two digits
text = '500,000 units'
pattern = r'\d\d$'
re.findall(pattern, text)
###Output
[]
- \b (word boundary) – Matches the boundary right before or after a word, that is, the empty string between a \w and a \W character.
pattern = r'\b'
re.findall(pattern,
"Wow! We're now_25")
###Output
['', '', '', '', '', '', '', '']
To see the boundaries, use the re.sub() function to replace \b with the ~ symbol.
pattern = r'\b'
re.sub(pattern,
'~',
"Wow! We're now_25")
###Output
"~Wow~! ~We~'~re~ ~now_25~"
Groups
- () – When you write a regex pattern, you can define groups using parentheses. This is useful for extracting and returning details from a string. Note that the parentheses do not change what the pattern matches; rather, they divide the match into sections that you can retrieve separately.
text = 'Yvonne worked for von'
pattern = r'(.o.)'
re.findall(pattern, text)
###Output
['von', 'wor', 'for', 'von']
Accessing matched groups using m.group() – In the code below, we capture different parts of an email into 3 groups. m.group() and m.group(0) both return the entire matched string. m.group(1), m.group(2), and m.group(3) return the respective groups. The parentheses are a convenient way to access groups in a match.
text = 'this is @sue email sue@gmail.com'
pattern = r'(\w+)@(\w+)\.(\w+)\b'
m = re.search(pattern, text)
#match object
print(m)
### <re.Match object; span=(19, 32), match='sue@gmail.com'>
#full match
m.group(0)
### 'sue@gmail.com'
m.group(1)
### 'sue'
m.group(2)
### 'gmail'
m.group(3)
### 'com'
Referencing groups using \group_num – A regex pattern can contain several groups, as seen in the previous email example that contains 3 groups. You can use \1, \2… in the regex pattern to match again whatever a group captured, with groups numbered by position starting from 1. The code below looks for substrings where a character has been repeated.
text = 'hello, we need 22 books'
pattern = r'(\w)\1'
list(re.finditer(pattern, text))
###Output
[<re.Match obj; span=(2,4), match='ll'>,
<re.Match obj; span=(11,13), match='ee'>,
<re.Match obj; span=(15,17), match='22'>,
<re.Match obj; span=(19,21), match='oo'>]
Naming and accessing captured groups using ?P<name> and ?P=name respectively – You can assign a name to a group to access it later. This is better than the \grp_number notation when you have many groups, and it increases the readability of your regex. To access matched groups, use m.group('name').
text = '08 Dec'
pattern = r'(?P<day>\d{2})\s(?P<month>\w{3})'
m = re.search(pattern, text)
m.group('day')
###
'08'
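As a small sketch of the ?P=name half, which the example above does not exercise: (?P=name) is written inside the pattern and matches the same text the named group captured earlier. The repeated-word sentence is made up for illustration. m.groupdict() is another way to read all named groups at once.

```python
import re

# All named groups as a dictionary
m = re.search(r"(?P<day>\d{2})\s(?P<month>\w{3})", "08 Dec")
date_parts = m.groupdict()
print(date_parts)  # {'day': '08', 'month': 'Dec'}

# (?P=word) must match the same text that the group 'word' captured
text = "it is is what it is"
pattern = r"\b(?P<word>\w+)\s(?P=word)\b"
repeated = re.search(pattern, text)
print(repeated.group())        # is is
print(repeated.group("word"))  # is
```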
Non-capturing groups
- ?: – Matches but doesn’t capture the group. Include ?: in the group you wish to omit. The code below matches numbers with percentage signs and returns the numbers only.
text = 'date 23 total 42% date 17 total 35%'
pattern = r'(\d+)(?:%)'
re.findall(pattern, text)
###
['42', '35']
- | (or) – This returns all matches of either one pattern or another.
text = 'the sunny sun shines'
re.findall(r'sun|shine', text)
###Output
['sun', 'sun', 'shine']
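Two details about | are easy to miss, so here is a brief sketch with made-up strings: without parentheses the | splits the whole pattern, and alternatives are tried from left to right, so the first one that fits wins.

```python
import re

text = "grey gray griy"

# Grouped: 'gr', then a or e, then 'y'
grouped = re.findall(r"gr(?:a|e)y", text)
# Ungrouped: either 'gra' or 'ey'
ungrouped = re.findall(r"gra|ey", text)

print(grouped)    # ['grey', 'gray']
print(ungrouped)  # ['ey', 'gra']

# Alternatives are tried left to right
short_first = re.findall(r"sun|sunny", "sunny")
long_first = re.findall(r"sunny|sun", "sunny")

print(short_first)  # ['sun']
print(long_first)   # ['sunny']
```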
Pandas and regular expressions
Pandas is a powerful Python library for analyzing and manipulating datasets. Regular expressions are handy when searching and cleaning text-based columns in Pandas.
Pandas contains several functions that support pattern-matching with regex, just as we saw with the re library. Below are three major functions we’ll use in this tutorial. Read about other Pandas regex functions here.
- Series.str.contains(pattern) – This function checks for a pattern in a column (Series) to return True and False values (a mask) where the pattern matches. The mask can then be applied to the entire dataframe to only return True rows.
- Series.str.extract(pattern, expand, flags) – To use this function, we must define groups using parentheses inside the pattern. The function extracts the matches and returns the groups as columns in a dataframe. When you have only one group in the pattern, use expand=False to return a series instead of a dataframe object.
- Series.str.replace(pattern, repl, flags) – Similar to re.sub(), this function replaces matches with the repl string. Pass regex=True so that the pattern is treated as a regular expression (the default is regex=False in pandas 2.0 and later).
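Before moving to the real dataset, the three functions can be tried on a tiny Series. The names below are invented for illustration and are not rows from the Titanic data; the sketch assumes a reasonably recent pandas version.

```python
import re
import pandas as pd

# Hypothetical sample values (not from the Titanic dataset)
s = pd.Series(["Smith, Mr. John", "Doe, Miss. Jane", "no title here"])

# str.contains: boolean mask of rows where the pattern is found
mask = s.str.contains(r"\bMr\.")
print(mask.tolist())  # [True, False, False]

# str.extract: one group + expand=False gives a Series (NaN where no match)
titles = s.str.extract(r"\s(\w+)\.", expand=False)
print(titles.tolist())

# str.replace: regex=True makes pandas treat the pattern as a regex
replaced = s.str.replace(r"miss\.", "Ms.", flags=re.I, regex=True)
print(replaced.tolist())
```

The last row has no title, so str.extract() returns a missing value for it rather than raising an error.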
In this section, we’ll tackle seven regular expression tasks to perform the following actions:
- filter data to return rows that match certain criteria.
- extract substrings and other details from a column.
- replace values in a column.
To illustrate this, I used the titanic dataset from Kaggle available here under GNU Free Documentation License. In a new Jupyter notebook, import re and pandas, then load the data.
import re
import pandas as pd
df = pd.read_csv('titanic.csv')

Filtering a dataframe – s.str.contains(pattern)
Task 1: Filter the dataframe to return rows where the ticket numbers had C and A.
Below is a list of all the ‘C A’ ticket variations that must all be matched.

The regex pattern starts with capital C, followed by an optional dot, then capital A, followed by an optional dot. We have escaped the dot symbol to exactly match a period, not the wildcard regex character.
pattern = r'C\.?A\.?'
mask = df['Ticket'].str.contains(pattern)
df[mask].head(10)
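One way to sanity-check a pattern like this before applying it to a whole column is to run it with the re library over a few sample strings. The ticket strings below are hypothetical stand-ins in the style of the column, not an exact copy of the dataset. The sketch also shows why the escaping matters.

```python
import re

pattern = r"C\.?A\.?"

# Hypothetical ticket strings (assumption, for illustration only)
tickets = ["C.A. 2315", "CA 2144", "C.A 33112", "CA. 2343",
           "PC 17599", "A/5 21171"]

# re.search looks for the pattern anywhere in the string
matched = [t for t in tickets if re.search(pattern, t)]
print(matched)  # ['C.A. 2315', 'CA 2144', 'C.A 33112', 'CA. 2343']

# With unescaped dots, any character may sit between C and A
loose_hit = bool(re.search(r"C.?A.?", "SC/AH 3085"))
strict_hit = bool(re.search(pattern, "SC/AH 3085"))
print(loose_hit, strict_hit)  # True False
```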

To get the total number of rows with CA, use mask.sum(), which sums up all the True values (counted as 1) and False values (counted as 0).
mask.sum()
### 42
Extracting data – s.str.extract(patt)
Task 2: Extract all unique titles such as Mr, Miss, and Mrs from passenger names.

Regex pattern: this will search for a white space, followed by a sequence of word characters (enclosed in parentheses), then a dot character. We use parentheses to group the substring we want to capture and return. expand=False returns a Series object, enabling us to call value_counts() on it.
pattern = r'\s(\w+)\.'
all_ts = df['Name'].str.extract(
pattern,
expand=False)
unique_ts = all_ts.value_counts()

Task 3a: From the ‘Name’ column, extract the titles, first names, and last names, and return them as columns in a new dataframe.

Regex pattern: Each name contains a ‘sequence of one or many word characters’ (last name), then a comma, a white space, another sequence of word characters (title), a period, a space, another sequence of word characters (first name), then zero to many other characters.
pattern = r'(\w+), (\w+\.) (\w+).*'
df_names = df['Name'].str.extract(
pattern,
flags=re.I)
df_names
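To see what the three groups capture on a single value, the same pattern can be tried with re.search(). The names below are invented for illustration. The second one hints at a limitation: with a multi-word last name, only the word right before the comma ends up in the first group.

```python
import re

pattern = r"(\w+), (\w+\.) (\w+).*"

# Hypothetical names (not taken from the dataset)
simple = re.search(pattern, "Smith, Mr. John Adam").groups()
print(simple)  # ('Smith', 'Mr.', 'John')

# Multi-word last name: only the last word before the comma is captured
multi_word = re.search(pattern, "van der Berg, Mr. Jan").groups()
print(multi_word)  # ('Berg', 'Mr.', 'Jan')
```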

Task 3b: Clean up the dataframe above with named and ordered columns.
As mentioned earlier, named groups are useful for capturing and accessing groups. With Pandas, this is especially convenient, as these names will now be the columns-names in our new dataframe.
The regex below is the same as above, except that the groups are named using (?P<name>).
The code extracts the three named columns, then we use df.reindex() to reorder them.
pattern = r'(?P<lastname>\w+),\s(?P<title>\w+\.)\s(?P<firstname>\w+).*'
df_named = df['Name'].str.extract(
    pattern,
    flags=re.I)
df_clean = df_named.reindex(
    columns=['title',
             'firstname',
             'lastname'])
df_clean.head()

Replacing values in a column – s.str.replace(pattern, repl)
Task 4a: Replace all the titles with capital letters.

The regex pattern searches for a white space, then one or many word characters (enclosed in parentheses), then a period and a space. We then replace each match with its uppercase version.
pattern = r'\s(\w+)\. '
#regex=True is needed for a function repl (the default is False in pandas 2.0+)
df['Name'].str.replace(pattern,
    lambda m: m.group().upper(),
    regex=True)
The lambda function above means that for every match, take the matched text and convert it into uppercase.

Task 4b: Capitalize only Mr. and Mrs. titles. In this case, we use | inside parentheses.
pattern = r'\s(Mr|Mrs)\.\s'
df['Name'].str.replace(pattern,
    lambda m: m.group().upper(),
    flags=re.I,
    regex=True)

Task 5: Clean the dates in the column below by inserting dashes to separate the day, month, and year. We want to preserve the other words in the column, and therefore cannot directly call [pd.to_datetime()](https://www.w3resource.com/pandas/to_datetime.php).

The pattern: Search for two digits, then two digits again, then four digits, and use parentheses to capture them in three groups. The lambda function means that for every match, join the groups with a dash after the first and the second group.
pattern = r'(\d{2})(\d{2})(\d{4})'
d['A'].str.replace(pattern,
    lambda m: m.group(1)+'-'+
              m.group(2)+'-'+
              m.group(3),
    regex=True)
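The same replacement can be sketched with the re library on a single string. The sentence is made up for illustration. As an alternative to the lambda, the replacement string can refer to the captured groups directly with \1, \2 and \3.

```python
import re

pattern = r"(\d{2})(\d{2})(\d{4})"

# Hypothetical cell value (assumption, for illustration only)
text = "paid on 25122021 in full"

# Backreferences in the replacement string
with_backrefs = re.sub(pattern, r"\1-\2-\3", text)
print(with_backrefs)  # paid on 25-12-2021 in full

# Equivalent version using a function
with_function = re.sub(pattern, lambda m: "-".join(m.groups()), text)
print(with_function)  # paid on 25-12-2021 in full
```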

Conclusion
In this article, we learned about regular expressions and used the Python re library and the Pandas string functions to explore the different special regex metacharacters.
We also learned that it’s important to precede every pattern with an r raw string to remove any special string meanings, and to escape special regex characters with a backslash when exactly matching them in a string. You can find all the code here on GitHub.
Remember that most regular expressions are unique to the situation, so just experiment with every possible scenario to avoid false matches and misses. Most times, there are many ways to write a regex, so choose the easiest and most sensible to you. You can use websites such as regex101.com to test until satisfied.
I hope you enjoyed the article. Thank you for reading!