
This blog post was born out of my own frustration with, and long avoidance of, regular expressions (regex).
For months, I put off learning regex because, let’s be honest, they can look extremely daunting when you first encounter them. I mean, a string of characters tied together with seemingly no logic behind them whatsoever: nobody’s got time for that!
It wasn’t until I was given a task at work that involved retrieving the elements of a string that I finally gained an appreciation for the power of regular expressions. And as it turns out, they are actually not that bad once you understand the fundamentals.
So, in this article, I will explain what regular expressions are, introduce some basic regex characters, and most importantly demonstrate, using several practical examples, how to use regex in the R programming language. Specifically, we will be discussing the concept of capturing groups in regular expressions.
If you are more of a Python enthusiast, you can find the Python version of the code on my GitHub here.
What are regular expressions?
A regular expression is neither a library nor a programming language. Instead, it is a sequence of characters that specifies a search pattern to be matched against a piece of text (a string).
A text can consist of pretty much anything from letters to numbers, space characters to special characters. As long as the string follows some sort of pattern, regex is robust enough to be able to capture this pattern and return a specific part of the string.
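To make this concrete, here is a minimal sketch, assuming the stringr package (part of the tidyverse) is installed. The sentence is made up for illustration; the pattern simply asks for one or more digits.

```r
library(stringr)

# Hypothetical sentence, made up for illustration
text <- "Order 66 was shipped on Monday"

# "[0-9]+" describes a pattern: one or more digits in a row
order_number <- str_extract(text, "[0-9]+")

order_number
```

str_extract returns the first part of the string that fits the pattern, which here is the order number.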
Basic regex characters you need to know
Now, before we get into the nitty-gritty, I think it is crucial that we first go over some of the basics of Regular Expressions.
The examples later on in this article will be building off some of the main concepts illustrated here, namely: characters, groupings, and quantifiers.
Characters
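- Escape character: \
- Any character: .
- Digit: \d
- Not a digit: \D
- Word character: \w
- Not a word character: \W
- Whitespace: \s
- Not whitespace: \S
- Word boundary: \b
- Not a word boundary: \B
- Beginning of a string: ^
- End of a string: $
Note that inside an R string literal, the backslash itself must be escaped, so the digit class \d is written as "\\d".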
Groupings
- Matches characters in brackets: [ ]
- Matches characters not in brackets: [^ ]
- Either or: |
- Capturing group: ( )
Quantifiers
- 0 or more: *
- 1 or more: +
- 0 or 1: ?
- An exact number of characters: {n}
- Range of number of characters: {Minimum,Maximum}
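As a quick illustration of how characters, groupings, and quantifiers combine, here is a small sketch using str_detect from stringr. The test strings are made up, and the expected results are shown as comments.

```r
library(stringr)

# Hypothetical strings, made up for illustration
x <- c("cat", "coat", "ct", "cat9")

# Character: "." stands for exactly one character of any kind
dot <- str_detect(x, "^c.t$")               # TRUE FALSE FALSE FALSE

# Grouping + quantifier: one or more of the letters a or o
one_or_more <- str_detect(x, "^c[ao]+t$")   # TRUE TRUE FALSE FALSE

# Quantifier: "*" also allows zero occurrences, so "ct" now matches
zero_or_more <- str_detect(x, "^c[ao]*t$")  # TRUE TRUE TRUE FALSE

# Character class: "\\d" is a digit (backslash doubled inside an R string)
has_digit <- str_detect(x, "\\d")           # FALSE FALSE FALSE TRUE
```

The ^ and $ anchors force the whole string to fit the pattern; without them, str_detect would be satisfied by a match anywhere inside the string.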
Regex examples
Don’t worry if the regex characters above don’t make much sense to you now – they merely serve as references for the examples that we are about to go through.
In this section, we will be focusing on 6 different examples that will hopefully reinforce your understanding of regular expressions. Effectively, we will be looking at:
- 2 examples with numbers (phone number and date)
- 2 examples with letters (names and URLs)
- 2 examples with both numbers and letters (email address and street address)
Before we begin, make sure you have the tidyverse package installed and loaded into your working environment.
# Install tidyverse package
install.packages("tidyverse")
# Load tidyverse package
library(tidyverse)
1. Phone number
Suppose we have a data frame called phone, containing a list of phone numbers as follows:

We would like to break up these phone numbers into 3 individual components: area code (first 3 digits), exchange (next 3 digits), and line number (last 4 digits).
As we can see, the number patterns here are not always consistent i.e. they have inconsistent parentheses, hyphens, and spaces. However, with the help of regular expressions, we can easily capture the number groups.
First, we will need to define a regex pattern.
phone_pattern = ".?(\\d{3}).*(\\d{3}).*(\\d{4})"
Note that the backslashes are doubled because R itself treats a backslash as an escape character inside strings.
How exactly do we interpret this? Well, let’s take this step by step, going from left to right:
- .? matches 0 or 1 character to account for the optional opening parenthesis
- (\\d{3}) matches 3 digit characters (first capture group i.e. first 3 digits)
- .* matches 0 or more characters to account for the optional closing parenthesis, hyphen, and space characters
- (\\d{3}) matches 3 digit characters (second capture group i.e. next 3 digits)
- .* matches 0 or more characters to account for the optional hyphen and space characters
- (\\d{4}) matches 4 digit characters (third capture group i.e. last 4 digits)
We can then use the str_match function to retrieve the capture groups using the regex pattern we have defined and put them into individual columns in the data frame.
phone$area_code = str_match(phone$original_number, phone_pattern)[, 2]
phone$exchange = str_match(phone$original_number, phone_pattern)[, 3]
phone$line_number = str_match(phone$original_number, phone_pattern)[, 4]
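Since the original data frame is not reproduced here, the following is a self-contained sketch with made-up phone numbers (the values are my own assumption, not the original data). It also calls str_match only once: the function returns a character matrix whose first column is the full match and whose remaining columns are the capture groups.

```r
library(stringr)

# Hypothetical data, made up for illustration (not the original data frame)
phone <- data.frame(
  original_number = c("(541) 471 3918", "(603)281-0308", "603-567-3392"),
  stringsAsFactors = FALSE
)

phone_pattern <- ".?(\\d{3}).*(\\d{3}).*(\\d{4})"

# Column 1 is the full match; columns 2 to 4 hold the three capture groups
m <- str_match(phone$original_number, phone_pattern)

phone$area_code   <- m[, 2]
phone$exchange    <- m[, 3]
phone$line_number <- m[, 4]

phone
```

Storing the matrix in m avoids running the same match three times, which matters little here but adds up on larger data frames.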

2. Date
Suppose we have another data frame called date, which consists of dates with inconsistent delimiters and we want to extract the days, months, and years.

Using a very similar approach to the one we just saw with phone numbers, we need to first define a regex pattern, then match the pattern to the original date column, and finally create a new column for each capture group.
First, define the regex pattern for dates.
date_pattern = "(\\d{2}).(\\d{2}).(\\d{4})"
Here’s the code explanation:
- (\\d{2}) matches 2 digit characters (first capture group i.e. day)
- . matches any single character, which accounts for the different delimiters
- (\\d{2}) matches 2 digit characters (second capture group i.e. month)
- . matches any single character, which accounts for the different delimiters
- (\\d{4}) matches 4 digit characters (third capture group i.e. year)
Now, we can match the pattern and create individual columns for day, month and year.
date$day = str_match(date$original_date, date_pattern)[, 2]
date$month = str_match(date$original_date, date_pattern)[, 3]
date$year = str_match(date$original_date, date_pattern)[, 4]
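Here is the same idea as a runnable sketch. The dates are made up for illustration and assume a day, month, year order with a single delimiter character between the parts.

```r
library(stringr)

# Hypothetical data, made up for illustration
date <- data.frame(
  original_date = c("25/12/2021", "01-02-2020", "31.10.1999"),
  stringsAsFactors = FALSE
)

date_pattern <- "(\\d{2}).(\\d{2}).(\\d{4})"

# Match once, then pull each capture group out of the result matrix
m <- str_match(date$original_date, date_pattern)

date$day   <- m[, 2]
date$month <- m[, 3]
date$year  <- m[, 4]

date
```

Because an unescaped . accepts any character, including a letter or a digit, a stricter variant could replace it with a bracket group such as [/.-] if only those delimiters are expected.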

3. Names
So far, we have explored two examples of strings that contain only digits and special characters. Let’s now learn how to capture words and letters.
Here I have a data frame called names, with people’s family names, titles, and given names.

Let’s break them up so that they each have their own individual columns.
name_pattern = "(\\w+),\\s(Mr|Ms|Mrs|Dr)\\.?\\s(\\w+)"
And this is the interpretation:
- (\\w+) matches 1 or more word characters (first capture group i.e. family name)
- , matches a comma
- \\s matches a single whitespace character
- (Mr|Ms|Mrs|Dr) matches Mr, Ms, Mrs or Dr (second capture group i.e. title)
- \\.? matches 0 or 1 full stop after the title
- \\s matches a single whitespace character
- (\\w+) matches 1 or more word characters (third capture group i.e. given name)
Now, putting them into individual columns.
names$family_name = str_match(names$full_name, name_pattern)[, 2]
names$title = str_match(names$full_name, name_pattern)[, 3]
names$given_name = str_match(names$full_name, name_pattern)[, 4]
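A self-contained version with made-up names might look like the sketch below. The names and the "Family, Title Given" layout are assumptions for illustration.

```r
library(stringr)

# Hypothetical data, made up for illustration
names <- data.frame(
  full_name = c("Smith, Mr. John", "Davis, Ms Nicole",
                "Robinson, Mrs. Rebecca", "Armstrong, Dr Sam"),
  stringsAsFactors = FALSE
)

name_pattern <- "(\\w+),\\s(Mr|Ms|Mrs|Dr)\\.?\\s(\\w+)"

# Match once, then pull each capture group out of the result matrix
m <- str_match(names$full_name, name_pattern)

names$family_name <- m[, 2]
names$title       <- m[, 3]
names$given_name  <- m[, 4]

names
```

Note how "Mrs." is still captured as Mrs even though Mr appears first in the alternation: when the rest of the pattern fails after Mr, the regex engine backtracks and tries the other alternatives.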

4. URLs
Let’s look at another example of strings with words and letters.

By now, you should already be familiar with the process.
url_pattern = "(https?)://(www)?\\.?(\\w+)\\.(\\w+)/?(\\w+)?"
The interpretation:
- (https?) matches http or https (first capture group i.e. the URL scheme, stored as schema in the code that follows)
- :// matches this specific string of special characters
- (www)? matches an optional www (second capture group i.e. subdomain)
- \\.? matches 0 or 1 full stop
- (\\w+) matches 1 or more word characters (third capture group i.e. second-level domain)
- \\. matches a single full stop
- (\\w+) matches 1 or more word characters (fourth capture group i.e. top-level domain)
- /? matches 0 or 1 forward slash
- (\\w+)? matches an optional run of 1 or more word characters (fifth capture group i.e. subdirectory)
Separating the capture groups into individual columns, we get:
url$schema = str_match(url$full_url, url_pattern)[, 2]
url$subdomain = str_match(url$full_url, url_pattern)[, 3]
url$second_level_domain = str_match(url$full_url, url_pattern)[, 4]
url$top_level_domain = str_match(url$full_url, url_pattern)[, 5]
url$subdirectory = str_match(url$full_url, url_pattern)[, 6]
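The sketch below applies the pattern to a few made-up URLs (my own examples, not the original data). It also shows what happens to optional groups: when (www)? or the trailing (\\w+)? does not take part in the match, str_match returns NA for that column.

```r
library(stringr)

# Hypothetical data, made up for illustration
url <- data.frame(
  full_url = c("https://www.google.com/maps", "http://github.com",
               "https://www.example.org/"),
  stringsAsFactors = FALSE
)

url_pattern <- "(https?)://(www)?\\.?(\\w+)\\.(\\w+)/?(\\w+)?"

# Match once, then pull each capture group out of the result matrix
m <- str_match(url$full_url, url_pattern)

url$schema              <- m[, 2]
url$subdomain           <- m[, 3]  # NA when there is no www
url$second_level_domain <- m[, 4]
url$top_level_domain    <- m[, 5]
url$subdirectory        <- m[, 6]  # NA when nothing follows the domain

url
```

This pattern only covers simple URLs like the ones above; hyphenated domains, ports, or deeper paths would need a more elaborate expression.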

5. Email address
Using the knowledge that we have gained so far about regular expressions, let us now look at two final string examples that contain both letters and numbers.
Suppose we have a list of emails in a data frame called email:

Now, generate a regex pattern to match the username, domain name, and domain.
email_pattern = "([a-zA-Z0-9_.-]+)@([a-zA-Z]+)\\.(.+)"
Let’s have a closer look at the regex and decipher its meaning.
- ([a-zA-Z0-9_.-]+) matches 1 or more lowercase letters, uppercase letters, digits, and special characters including underscore, full stop, and hyphen (first capture group i.e. username). The hyphen is placed last inside the brackets so that it is read as a literal hyphen rather than as a range
- @ matches the at symbol
- ([a-zA-Z]+) matches 1 or more lowercase and uppercase letters (second capture group i.e. domain name)
- \\. matches a single full stop
- (.+) matches 1 or more characters (third capture group i.e. domain)
Then, we apply this regex pattern to the list of emails:
email$username = str_match(email$full_email, email_pattern)[, 2]
email$domain_name = str_match(email$full_email, email_pattern)[, 3]
email$domain = str_match(email$full_email, email_pattern)[, 4]
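Here is a runnable sketch with made-up email addresses (assumed for illustration). It writes the bracket group with the hyphen in last position so that the hyphen is treated literally.

```r
library(stringr)

# Hypothetical data, made up for illustration
email <- data.frame(
  full_email = c("john.smith@gmail.com", "jane_doe99@yahoo.com.au",
                 "sam-lee@outlook.com"),
  stringsAsFactors = FALSE
)

# The hyphen sits last in the bracket group so it is not read as a range
email_pattern <- "([a-zA-Z0-9_.-]+)@([a-zA-Z]+)\\.(.+)"

# Match once, then pull each capture group out of the result matrix
m <- str_match(email$full_email, email_pattern)

email$username    <- m[, 2]
email$domain_name <- m[, 3]
email$domain      <- m[, 4]

email
```

Because the last group is (.+), a multi-part ending such as com.au is kept together in the domain column.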

6. Address
And of course, I left the best example for last. This example closely mirrors what I was doing at work.
In an effort to recreate that piece of work, I have made up a data frame called address, with hypothetical addresses. The goal is to retrieve the house number, street name, suburb, state, and postcode.

As usual, we need to first define a regex pattern.
address_pattern = "(\\d*)\\s?(.+),\\s(.+)\\s([A-Z]{2,3})\\s(\\d{4})"
And the code explanation:
- (\\d*) matches 0 or more digit characters because some addresses do not have house numbers (first capture group i.e. house number)
- \\s? matches 0 or 1 whitespace character
- (.+) matches 1 or more characters (second capture group i.e. street name)
- , matches a comma
- \\s matches a single whitespace character
- (.+) matches 1 or more characters (third capture group i.e. suburb)
- \\s matches a single whitespace character
- ([A-Z]{2,3}) matches 2 or 3 uppercase letters (fourth capture group i.e. state)
- \\s matches a single whitespace character
- (\\d{4}) matches 4 digit characters (fifth capture group i.e. postcode)
Matching this pattern to the list of addresses, we get:
address$house_number = str_match(address$full_address, address_pattern)[, 2]
address$street_name = str_match(address$full_address, address_pattern)[, 3]
address$suburb = str_match(address$full_address, address_pattern)[, 4]
address$state = str_match(address$full_address, address_pattern)[, 5]
address$postcode = str_match(address$full_address, address_pattern)[, 6]
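To try this end to end, here is a sketch with made-up Australian-style addresses. The values and the "number street, suburb STATE postcode" layout are assumptions for illustration, not the data from my work.

```r
library(stringr)

# Hypothetical data, made up for illustration
address <- data.frame(
  full_address = c("12 George Street, Sydney NSW 2000",
                   "Collins Street, Melbourne VIC 3000",
                   "221 Queen Street, Surry Hills NSW 2010"),
  stringsAsFactors = FALSE
)

address_pattern <- "(\\d*)\\s?(.+),\\s(.+)\\s([A-Z]{2,3})\\s(\\d{4})"

# Match once, then pull each capture group out of the result matrix
m <- str_match(address$full_address, address_pattern)

address$house_number <- m[, 2]  # no digits captured when the number is missing
address$street_name  <- m[, 3]
address$suburb       <- m[, 4]
address$state        <- m[, 5]
address$postcode     <- m[, 6]

address
```

The suburb group (.+) is greedy, but the fixed tail of the pattern (a space, 2 or 3 capitals, a space, 4 digits) forces it to give back the state and postcode, which is why a two-word suburb such as Surry Hills is still captured whole.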

I hope, through the 6 examples that I have demonstrated in this blog post, you have not only gained a better understanding of how regular expressions work but more importantly, an appreciation for their flexibility in matching complex string patterns.
If you aspire to become a data analyst or are just keen to improve upon your data wrangling skills, I would highly encourage adding regex into your toolkit. For further practice, I recommend checking out regex101 or regexone.
Thank you so much for reading and happy learning!