The world’s leading publication for data science, AI, and ML professionals.

A Gentle Introduction to Regular Expressions with Python

Regular expressions are the data scientist's most formidable weapon against unstructured text

Tutorial | Python

Photo by Marius Masalar on Unsplash

We live in a data-centric age. Data has been described as the new oil. But just like oil, data isn’t always useful in its raw form. One form of data that is particularly hard to use in its raw form is unstructured data.

A lot of data is unstructured data. Unstructured data doesn’t fit nicely into a format for analysis, like an Excel spreadsheet or a pandas DataFrame. Text data is a common type of unstructured data and this makes it difficult to work with. Enter regular expressions, or regex for short. They may look a little intimidating at first, but once you start to use them, you’ll be as comfortable as this snake in no time 🐍 !

More comfortable with R? Try my tutorial for using regex with R instead:

A Gentle Introduction to Regular Expressions with R

The Regex Module (re)

We’ll use the regular expressions module. To import this into your python project, use the following command:
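Assuming a standard Python installation, the import is a single line. The print call is only a sanity check added for this sketch:

```python
# re ships with Python's standard library, so a plain import is enough
import re

# Sanity check: the two functions discussed below are available
print(callable(re.findall), callable(re.sub))
```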

See how easy that is? The re module is built into python, so there is no need to install it. Let’s take a look at a couple of the functions we have available to us in this module:

  1. re.findall(pattern, string): This function returns a list containing all instances of pattern in string
  2. re.sub(pattern, repl, string): This function returns string with instances of pattern in string replaced with repl
Photo by Tim Collins on Unsplash

You may have already used these functions. They have pretty straightforward applications without adding regex. Think back to the times before social distancing and imagine a nice picnic in the park. Here’s an example string with what everyone is bringing to the picnic. We can use it to demonstrate the basic usage of the regex functions:

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

If I want to pull every instance of one person’s name from this string, I would simply pass the name and basic_string to re.findall():
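One way to write that call is sketched here. The string is repeated so the snippet runs on its own, and the result is stored in basic_find, the name used in the text:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# The pattern is just the literal name; findall returns every match in a list
basic_find = re.findall('Drew', basic_string)
print(basic_find)
```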

The result will be a list with all occurrences of the pattern. Using this example, basic_find will be a list with one item:

['Drew']

Now let’s imagine that Alex left his 4 hamburgers unattended at the picnic and they were stolen by Shawn. re.sub() can replace any instances of Alex with Shawn:
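A minimal sketch of the replacement follows. The variable name basic_sub is an assumption made for this example:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# Argument order is pattern, replacement, string
# basic_sub is a hypothetical name chosen for this sketch
basic_sub = re.sub('Alex', 'Shawn', basic_string)
print(basic_sub)
```

Note that re.sub() returns a new string; basic_string itself is left unchanged.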

The resulting string will show that Shawn now has 4 hamburgers. What a lucky guy 🍔 .

Drew has 3 watermelons, Shawn has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels

The examples so far are pretty basic. There is a time and place for them, but what if we want to know how many total food items there are at the picnic? Who are all the people with items? What if we need this data in a pandas dataframe for further analysis? This is where you will start to see the benefits of regex.

Regex Vocab

There are several concepts that drive regex:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Capture Groups

This is not an exhaustive list, but is plenty to help us hit the ground running.

Character Sets

Character sets represent options inside of brackets, with regex matching only one of the options. There are multiple things we can do with character sets:

  • Match a group of characters: We can find all of the vowels in our string by putting every vowel in brackets, for example, [aeiou]

    ['e', 'a', 'a', 'e', 'e', 'o', 'e', 'a', 'a', 'u', 'e', 'a', 'i', 'a', 'a', 'a', 'a', 'e', 'a', 'a', 'a', 'o', 'e', 'e']
  • Match a range of characters: We can find any capital letter from "A" to "Z" by using a hyphen, [A-Z]. Character sets are case sensitive, so [A-Z] is not the same as [a-z]

    ['D', 'A', 'K', 'A']
  • Match a range of numbers: We can also put a range of digits in a character set. For example, [0-9] finds any single digit.

    ['3', '4', '1', '2', '6']

Character sets can contain everything from this section simultaneously, so something like [A-Ct-z7-9] is still valid. It will match every character from capital "A" to capital "C," lowercase "t" to lowercase "z," and 7 through 9.
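The character sets from this section can be tried against the picnic string as follows. This is a sketch: the variable names are assumptions made for illustration, and the narrower [A-F] range is included only for comparison:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# Each character set matches exactly one character per match
vowels = re.findall('[aeiou]', basic_string)     # lowercase vowels only
capitals = re.findall('[A-Z]', basic_string)     # ['D', 'A', 'K', 'A']
a_to_f = re.findall('[A-F]', basic_string)       # ['D', 'A', 'A'], K is outside A-F
digits = re.findall('[0-9]', basic_string)       # ['3', '4', '1', '2', '6']

# Several ranges can share one pair of brackets
mixed = re.findall('[A-Ct-z7-9]', basic_string)

print(len(vowels), capitals, a_to_f, digits, mixed)
```

The capital "A" in Alex and Anna does not appear in the vowel list, because [aeiou] only contains lowercase letters.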

So far we can’t answer any of the questions posed earlier with bracket groups. Let’s add some more weapons to our regex arsenal.

Meta Characters

Meta characters represent a type of character. They typically begin with a backslash (\). Each one matches a single character. Here are some of the most important ones in action:

  • \s: This meta character represents whitespace. It matches each space, tab, and newline character. You may also specify \t and \n for the tab and newline characters respectively. Side note: our example string does not have any tabs, but be cautious when looking for them. Many integrated development environments, or IDEs, have a setting that replaces all tabs with spaces while you are typing. In the example string, \s returns a list of 17 spaces, the exact number of spaces in our example string!

    [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
  • \w: This meta character represents word characters: the letters a-z, capital and lowercase, the numbers 0–9, and the underscore. For plain ASCII text this is equivalent to the character set [A-Za-z0-9_], just much quicker to write. Remember that the \w meta character on its own only captures a single character, not entire words or numbers. You’ll see that in the example. Don’t worry, we’ll get to how to handle that soon.

    ['D', 'r', 'e', 'w', 'h', 'a', 's', '3', 'w', 'a', 't', 'e', 'r', 'm', 'e', 'l', 'o', 'n', 's', 'A', 'l', 'e', 'x', 'h', 'a', 's', '4', 'h', 'a', 'm', 'b', 'u', 'r', 'g', 'e', 'r', 's', 'K', 'a', 'r', 'i', 'n', 'a', 'h', 'a', 's', '1', '2', 't', 'a', 'm', 'a', 'l', 'e', 's', 'a', 'n', 'd', 'A', 'n', 'n', 'a', 'h', 'a', 's', '6', 's', 'o', 'f', 't', 'p', 'r', 'e', 't', 'z', 'e', 'l', 's']
  • \d: This meta character represents numeric digits. Using our picnic example from earlier, see how it only finds the digits in the string. You’ll notice that like bracket groups, it picks up 5 digits instead of the 4 numbers we expect. This is because it is looking for each individual digit, not groups of digits. We’ll see how to fix that with quantifiers next.

    ['3', '4', '1', '2', '6']
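Here is a sketch of the three meta characters applied to the picnic string. The variable names are assumptions, and the patterns are written as raw strings (r'...') so that Python passes the backslash through to the regex engine untouched:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

spaces = re.findall(r'\s', basic_string)       # one list entry per whitespace character
word_chars = re.findall(r'\w', basic_string)   # single letters and digits, no commas or spaces
digits = re.findall(r'\d', basic_string)       # ['3', '4', '1', '2', '6']

print(len(spaces))       # 17
print(len(word_chars))   # 78
print(digits)

# \w also matches the underscore, which the picnic string does not contain
print(re.findall(r'\w', 'a_b'))   # ['a', '_', 'b']
```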

Quantifiers

As we saw in the previous section, a single meta character can have somewhat limited functionality. When it comes to words or numbers, we usually want to find more than 1 character at a time. This is where quantifiers come in. They allow you to quantify how many of a character you are expecting. They always come after the character they are quantifying and come in a few flavors:

  • + quantifies 1 or more matches. Let’s look at a new example to develop some intuition about what each quantifier will return: quant_example

    When we use the + quantifier on quant_example, it will return 4 matches. This is a good point to mention that regex looks for non-overlapping matches. In this case, it looks at each B and the character that follows it. Since we used the + quantifier, it continues to match until it reaches the end of a group of B’s.

    ['B', 'BB', 'BBB', 'BBBB']
  • {} quantifies a specific number or range of matches. When written like {2}, it will match exactly 2 of the preceding character. The results are interesting: it picks up 4 matches. This is because it is looking for each non-overlapping group of 2 B’s. There is no match in the 1st group (a single B), 1 match in the 2nd group, only 1 non-overlapping match in the 3rd group, and 2 non-overlapping matches in the 4th.

    ['BB', 'BB', 'BB', 'BB']

When written like {2,4}, it will match any run of 2 to 4 B’s. Note that putting a space inside the braces, as in {2, 4}, will NOT work. Python then treats the braces as literal text, so for quant_example it returns an empty list.

['BB', 'BBB', 'BBBB']

We can also write this quantifier and omit the upper bound like {2,}. This will match 2 or more instances. For quant_example, it will return the exact same result as {2,4}.

  • * quantifies zero or more matches. This can be helpful when we are looking for something that may or may not be in our string.

The * quantifier returns some strange matches when used by itself, so we can omit an example with quant_example. We will see in a following example how it can be applied when someone at our picnic is bringing a food item with a multiple word name. Without it, we wouldn’t correctly capture that Anna is bringing soft pretzels!
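The quantifiers above can be tried side by side. The definition of quant_example is not shown in the text, so the string below is an assumption chosen to be consistent with the outputs listed; the other variable names are hypothetical too:

```python
import re

# Assumed example string: four groups of B's, from 1 to 4 letters long
quant_example = 'B BB BBB BBBB'

one_or_more = re.findall('B+', quant_example)       # ['B', 'BB', 'BBB', 'BBBB']
exactly_two = re.findall('B{2}', quant_example)     # ['BB', 'BB', 'BB', 'BB']
two_to_four = re.findall('B{2,4}', quant_example)   # ['BB', 'BBB', 'BBBB']
two_or_more = re.findall('B{2,}', quant_example)    # same result as {2,4} here
with_space = re.findall('B{2, 4}', quant_example)   # [] because the braces are read literally

# * on its own also matches "zero B's", so empty strings show up in the result
zero_or_more = re.findall('B*', quant_example)

print(one_or_more, exactly_two, two_to_four, two_or_more, with_space, zero_or_more)
```

The empty strings in zero_or_more are the "strange matches" mentioned above: wherever there is no B, the pattern still succeeds by matching nothing.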

Let’s combine what we know so far about character sets, meta characters, and quantifiers to answer some questions about our picnic string. We want to know all of the words that are in the string and also the numbers in the string.

For words, we can use a character set with all upper and lower case letters, adding a + quantifier to it. This will find any length of alpha characters grouped together. Said another way, it finds all of the words. Regex is starting to look much more helpful.
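A sketch of that pattern, assuming the character set [A-Za-z] for "all upper and lower case letters" (the variable name words is an assumption):

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# One or more letters in a row; digits, spaces, and commas end a match
words = re.findall('[A-Za-z]+', basic_string)
print(words)
print(len(words))   # 14
```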

['Drew', 'has', 'watermelons', 'Alex', 'has', 'hamburgers', 'Karina', 'has', 'tamales', 'and', 'Anna', 'has', 'soft', 'pretzels']

To find the quantity of each food item, we can use the \d meta character and the quantifier {1,2}. This will find the groups of digits that are 1 or 2 characters long. This is a much more useful output as we have the same number of quantities as we have food items and people!
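A minimal sketch of the quantity search. The variable names are assumptions, and the sum is added here only to answer the earlier question about the total number of food items:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# Runs of 1 or 2 digits, so '12' is kept together instead of split into '1' and '2'
quantities = re.findall(r'\d{1,2}', basic_string)
print(quantities)

# findall returns strings, so convert before doing arithmetic
total_items = sum(int(q) for q in quantities)
print(total_items)   # 25
```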

['3', '4', '12', '6']

To find the quantity and name of each food item, we can combine quantifiers with meta characters. We know that each number has a food item directly after it, so we can just add on to the previous example. We know there is a space and a word (\s\w+) that could be followed by another word, like how "soft pretzels" appears. To specify the second word might not be there, we can use the * quantifier with the second word. Just like that we have a list containing the quantity and name of every item at our picnic.
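One pattern consistent with that description is sketched below. The exact regex is an assumption: it makes both the second space and the second word optional with *.

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# \d{1,2}  the quantity
# \s\w+    a space and the first word of the item
# \s*\w*   an optional space and an optional second word
food_items = re.findall(r'\d{1,2}\s\w+\s*\w*', basic_string)
print(food_items)
```

The commas in the string are what keep the optional part from running into the next person's name: neither \s nor \w matches a comma, so the match stops there.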

['3 watermelons', '4 hamburgers', '12 tamales', '6 soft pretzels']

Capture Groups

Capture groups allow you to look for entire phrases and only return parts of them. With our example, I want each person’s name, what they are bringing, and how much of it they are bringing.
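As a starting point before adding any groups, here is a sketch of a pattern that matches each full phrase. The regex is an assumption pieced together from the building blocks used so far, and the result is stored in capture_group1, the name used in the text:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# name, then " has ", then the quantity, then a one or two word item
capture_group1 = re.findall(r'[A-Z][a-z]+\s\w+\s\d{1,2}\s\w+\s*\w*', basic_string)
print(capture_group1)
```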

['Drew has 3 watermelons',
 'Alex has 4 hamburgers',
 'Karina has 12 tamales',
 'Anna has 6 soft pretzels']

The regex we used in capture_group1 looks for a name, which starts with a capital letter followed by one or more lowercase letters ([A-Z][a-z]+). It then matches the pattern space, word, space (\s\w+\s), which covers " has ". Next it looks for a 1 to 2 digit number followed by a space and a word (\d{1,2}\s\w+). You can see in the output we get a string with the details for each person.

Now this is a big step up from where we started, but we don’t really care about the word "has", and we want to be able to make a pandas dataframe out of the quantities. Let’s add in capture groups. By using capture groups, we can return tuples containing only the desired information. We’ll create capture groups containing each name, quantity, and item. Capture groups are simply sections of the regex that you wrap in parentheses.
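A sketch of the same kind of pattern with three capture groups added. The regex and the name capture_group2 are assumptions made for this example. When a pattern contains more than one group, re.findall() returns a list of tuples holding only the grouped parts:

```python
import re

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# Parentheses mark the parts to keep: (name), (quantity), (item)
# The \s\w+\s between the first two groups still has to match "has", but it is not returned
capture_group2 = re.findall(r'([A-Z][a-z]+)\s\w+\s(\d{1,2})\s(\w+\s*\w*)', basic_string)
print(capture_group2)
```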

[('Drew', '3', 'watermelons'), ('Alex', '4', 'hamburgers'), ('Karina', '12', 'tamales'), ('Anna', '6', 'soft pretzels')]

Just like that we now have a list of tuples containing the exact information that we want!

Combining our Text into a Dataframe

When doing data analysis, one of the most useful python data structures is a pandas dataframe. No doubt you already knew this if you clicked on this article. Dataframes enable things like calculating column statistics and plotting data. Since we have a list of tuples with all of the information we want in our dataframe, we can just iterate over the list, building our dataframe.
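One way to build the dataframe is sketched below. It assumes pandas is installed; the pattern, the variable names, and the column names (taken from the table that follows) are choices made for this example rather than the only option:

```python
import re

import pandas as pd

basic_string = 'Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels'

# Assumed pattern with three capture groups: name, quantity, item
pattern = r'([A-Z][a-z]+)\s\w+\s(\d{1,2})\s(\w+\s*\w*)'
rows = re.findall(pattern, basic_string)

# Iterate over the tuples, converting each quantity from text to an integer
records = []
for name, quantity, item in rows:
    records.append({'Name': name, 'Quantity': int(quantity), 'Item': item})

picnic_df = pd.DataFrame(records, columns=['Name', 'Quantity', 'Item'])
print(picnic_df)

# With numeric quantities, column statistics work as expected
print(picnic_df['Quantity'].sum())   # 25
```

Converting Quantity to int during the loop matters: left as strings, the column could not be summed as numbers.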

| Name   | Quantity | Item          |
| ------ | -------- | ------------- |
| Drew   | 3        | watermelons   |
| Alex   | 4        | hamburgers    |
| Karina | 12       | tamales       |
| Anna   | 6        | soft pretzels |

Conclusion and Further Learning

We only covered a small subset of how regex can help handle unstructured text data. This is a good foundation to get started, but before long you will need to know concepts like how to find everything BUT a character (negation) or find something immediately before or after something else (lookarounds). Check out my other post on those concepts.

Anchors Away! More Python Regular Expressions You Wish You Knew

Here are some more resources to help you learn more about these other concepts in regex:

  • The official re documentation: While documentation can seem intimidating, learning how to read it will only help you later while you program
  • w3schools References: A huge knowledge base of coding and scripting language references, including Python. Many of their examples can be run right from your browser by clicking on the "Try it Yourself" button
  • Datacamp Courses (PAID link): An online learning community dedicated to Data Science, machine learning, and data visualization. Check out their course "Regular Expressions in Python." The first chapter of every course on the site is free!

Join Medium with my referral link – Drew Seewald

