
Regex is one of the most potent techniques for cleaning and extracting data. If you’ve ever worked with a large text dataset, you know how time-consuming and energy-draining that cleaning can be.
I often use regex to clean phone numbers and emails and to standardize addresses. But there are more complex use cases as well.
In a recent data pipeline, we noticed an inconsistent office column coming from a specific data source. We only needed the office code from this column: two or three letters, followed by a colon and a two-digit number. Earlier, we used a simple replace operation to map the column to our desired values. But as new data broke that assumption, we had to change our strategy. Since we could be sure the pattern itself was consistent, we used regex to clean the column. This way, we never have to worry about the column’s values changing.
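Here’s a minimal sketch of that idea; the column name office and the sample values are made up for illustration:
import pandas as pd
# hypothetical sample values; the real source column was messier
df = pd.DataFrame({"office": ["NY:01 (HQ)", "Branch - LA:22", "SFO:10 / West"]})
# keep only the office code: two or three letters, a colon, and two digits
df["office"] = df["office"].str.extract(r"([A-Za-z]{2,3}:\d{2})", expand=False)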
If your dataset is significantly large and you need the extracted values stored in new columns next to each row, you might be tempted to use the map or apply methods in Pandas. But Pandas natively provides excellent APIs for string operations, and almost all of them support regex.
# With apply
import re
df.office.apply(lambda x: re.search(r'\d+', x).group(0))
# With native string operation
df.office.str.extract(r'(\d+)')
That’s what I mean. Now, before we discuss the benefits of these native APIs over the map/apply method, let’s compare the two.
Comparing Pandas string extract with the usual regex
The following code generates a synthetic dataset using the Faker package. It creates 100K fake addresses and stores them in a Pandas Series. You can adjust the size by changing n to whatever value your computer can support.
import pandas as pd
from faker import Faker
faker = Faker()
n = 100000
address = pd.Series([faker.address() for i in range(n)])
"""
| | address |
|---:|:-----------------------------|
| 0 | 548 Small Parkways Suite 832 |
| | South Brianborough, DC 50474 |
| 1 | 94291 Jerry Pass Suite 992 |
| | East Rebecca, PR 87881 |
| 2 | 3973 Wise Spring |
| | Grantfort, AS 52002 |
| 3 | 62589 David Island |
| | East Kathleenville, OH 45208 |
| 4 | 0415 Jimenez Hill Apt. 642 |
| | Gambleland, WA 99356 |
"""
Our goal is to extract the state and zip codes into separate columns. This task is easy enough in spreadsheet software, but let’s keep it for our discussion and assume we’re using regex.
Here is my usual way of mapping or applying a regex to the series.
(
(address)
.map(lambda x: re.search(r"\w{2} \d{5}", x).group())
.str.split(" ", expand=True)
.rename(columns={0: "state", 1: "zip_code"})
)
"""
| | state | zip_code |
|---:|:--------|-----------:|
| 0 | DC | 50474 |
| 1 | PR | 87881 |
| 2 | AS | 52002 |
| 3 | OH | 45208 |
| 4 | WA | 99356 |
"""
The code above is easy to understand. We match two letters followed by a whitespace and five digits. Then we split the string and expand it into separate columns. Finally, we name the columns ‘state’ and ‘zip_code.’
But here’s Pandas’s way of doing it.
address.str.extract(r"(?P<state>w{2}) (?P<zip_code>d{5})")
"""
| | state | zip_code |
|---:|:--------|-----------:|
| 0 | DC | 50474 |
| 1 | PR | 87881 |
| 2 | AS | 52002 |
| 3 | OH | 45208 |
| 4 | WA | 99356 |
"""
This is unarguably more elegant than the previous code. We use named groups in our regex pattern, and they become the column names.
On a separate note, you can make part of a regex pattern a group by wrapping it in parentheses. You can name a group by adding ?P<group_name> right after the opening parenthesis, before you start describing the pattern.
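The same named groups work in plain Python regex, too, where you can read them back with groupdict:
import re
match = re.search(r"(?P<state>\w{2}) (?P<zip_code>\d{5})", "East Rebecca, PR 87881")
print(match.groupdict())  # {'state': 'PR', 'zip_code': '87881'}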
Okay, the native method is excellent in readability. But what about performance?
I’ve used the %timeit utility in a Jupyter notebook to record the execution times. On the raw match alone, the native way doesn’t have the upper hand on performance; mapping is faster.
But a single map call doesn’t finish our desired output. We need additional steps to get there, and the whole set of operations ends up costing slightly more time than the extract method.
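For reference, here’s roughly how you could record the timings in Jupyter; treat it as a sketch, since the exact numbers depend on your machine and data:
# run each line in its own notebook cell
%timeit address.map(lambda x: re.search(r"\w{2} \d{5}", x).group()).str.split(" ", expand=True).rename(columns={0: "state", 1: "zip_code"})
%timeit address.str.extract(r"(?P<state>\w{2}) (?P<zip_code>\d{5})")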

Besides readability, the two methods aren’t too different in cost. But the difference becomes significant when you’re working with a vast dataset.
Also, if your code runs in a resource-constrained environment, you must choose carefully. That’s often my case, as I mostly build data pipelines, and I need to ensure I’m using the optimal code for faster and cheaper processing of live data.
We know there is more than one way to do wrangling in Pandas. If you anticipate re-running the scripts in the future, you may have to spend some time experimenting with different alternatives.
Useful Pandas string methods with regex
Now that we know how easy it is to run regex operations directly, without mapping or applying a function, here are some methods I frequently use.
We’ve already seen one example of using the extract API in the previous section. It is handy with regex patterns, and it’s perhaps the one I use most.
Here are three other techniques I often use.
1. Split text into separate columns with a complex pattern.
Let’s suppose a column contains state and zip codes, and we need them separated into individual columns. Since the data is sourced from a free-form input form, the separator is not always a whitespace or a comma.
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame(
{
"location": [
"New York 10001",
"California 90001",
"Texas-75201",
"Georgia 30301",
"Oregon97205",
"Arizona 85001",
"Illinois 60601",
"Florida 33101",
"Ohio 44101",
"Pennsylvania-19104",
]
}
)
df["location"].str.split(r"[s|-]*(d{5})", expand=True)
'''
| | 0 | 1 |
|---:|:-------------|------:|
| 0 | New York | 10001 |
| 1 | California | 90001 |
| 2 | Texas | 75201 |
| 3 | Georgia | 30301 |
| 4 | Oregon | 97205 |
| 5 | Arizona | 85001 |
| 6 | Illinois | 60601 |
| 7 | Florida | 33101 |
| 8 | Ohio | 44101 |
| 9 | Pennsylvania | 19104 |
'''
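One caveat: because the pattern contains a capture group, the underlying re.split keeps the group and can leave a trailing empty column after the zip code, which you may want to drop. An alternative that sidesteps this is extract with two named groups; this is a sketch with my own pattern, not part of the original example:
# everything (lazily) before the zip code, an optional separator, then the zip itself
df["location"].str.extract(r"(?P<state>.+?)[\s-]*(?P<zip_code>\d{5})")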
2. Filter records that contain a text pattern somewhere in the middle.
I had a dataset with an office serial number that followed a pattern. The first two letters denote the country’s shortcode. A three-digit location code follows the country code. Then comes a hyphen and a department ID, which is another three-digit number.
Suppose we need to filter all the records of the finance department (ID 006) for the UK, India, and Australia. We can do something like this.
import pandas as pd
# Create a sample dataframe
data = {
"office_serial_number": [
"US101-001",
"UK201-006",
"CA301-003",
"AU401-004",
"UK202-005",
"IN302-006",
"IR102-007",
"AU402-006",
"SL303-009",
"UK203-010",
"FR403-011",
"US103-012",
]
}
df = pd.DataFrame(data)
# the non-capturing group (?:...) avoids Pandas's warning about match groups in str.contains
df[df.office_serial_number.str.contains(r"^(?:UK|IN|AU)\d{3}-006")]
'''
| | office_serial_number |
|---:|:-----------------------|
| 1 | UK201-006 |
| 5 | IN302-006 |
| 7 | AU402-006 |
'''
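As a side note, str.contains also accepts regex flags if, say, the serial numbers come in mixed case; this is a sketch assuming the same dataframe:
import re
# case-insensitive variant of the same filter
df[df.office_serial_number.str.contains(r"^(?:uk|in|au)\d{3}-006", flags=re.IGNORECASE)]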
This would be a tough task without regex, and the code wouldn’t be very readable, either.
3. Replace patterns with a new string.
Replace is a frequent string operation. Even in Excel, we do it a lot. But some replacement operations are more complex than a straightforward find and replace: we need to find patterns and replace them with new strings.
Take a phone number column, for example. Suppose you need to remove the country codes from the column. Some records have a country code and some don’t, and even the ones with a country code come in different formats.
Here’s a simple regex example of doing this.
import pandas as pd
import re
# create a sample dataframe (dummy)
df = pd.DataFrame({'phone': ["+1 555-555-5555", "+44 20 7123 4567", "+81 3-1234-5678", "0049 30 12345678", "+61 2 1234 5678", "+33 1 23 45 67 89", "+86 10 1234 5678", "011 52 55 1234 5678", "+971 4 123 4567", "+49 911 12345678", "(+81) 3-1234-5678"]})
# define a regular expression pattern to match the country code
pattern = r'^\(?\+?\d{1,4}\)?[\s-]?'  # optional '(' and '+', one to four digits, an optional ')', then a space or hyphen
# apply the pattern to the 'phone' column and replace the matches with an empty string
df['phone'] = df['phone'].apply(lambda x: re.sub(pattern, '', x))
print(df)
phone
0 555-555-5555
1 20 7123 4567
2 3-1234-5678
3 30 12345678
4 2 1234 5678
5 1 23 45 67 89
6 10 1234 5678
7 52 55 1234 5678
8 4 123 4567
9 911 12345678
10 3-1234-5678
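Since the theme here is native string APIs, it’s worth noting that the same substitution works without apply; Series.str.replace accepts a regex pattern directly:
# native equivalent of the apply + re.sub version above
df['phone'] = df['phone'].str.replace(pattern, '', regex=True)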
Conclusion
You’d be fine with your spreadsheet software for basic string manipulations. But for more advanced use cases, a programming language can save hours of your time.
Some operations are complex to handle even with the basic APIs of a programming language, especially the ones where patterns are involved. This is where regex comes to the rescue.
If you’re a Pandas user, you can use regex directly in its native APIs. This gives you a cleaner codebase with fewer lines, and that’s the focus of this post.
I’ve discussed some of my favorite regex tricks in Pandas. Although there aren’t any significant performance improvements, I still prefer these methods because they are straightforward.
Thanks for reading, friend! Say Hi to me on LinkedIn, Twitter, and Medium.