Best String Super Skills you must have: REGEX

Get out of trouble when it comes to text-string processing issues in Python with this super easy intro.

Photo by Prateek Katyal on Unsplash

When working with data, sooner or later you will have to deal with text. Be prepared: when that time comes, you will be able to find, process and handle alphanumeric strings with the help of a powerful friend built into Python.

I am talking about REGEX. Regex stands for Regular Expression and it describes a special sequence of characters used to search and manipulate words, digits, or other characters in text strings.

This introductory piece aims to give you a gentle approach to the subject and some preliminary knowledge that I consider essential for getting familiar with Regex.

Because Regex is so powerful, vast, and complex, I bet you will come to share my view that an almost endless number of Pythonic possibilities opens up before us. Let’s take a look at some basic commands.

To begin with, besides Pandas, let’s import the re module, which unlocks everything related to Regex in Python.

import pandas as pd
import re

We also need to define a text to work on in the examples throughout this article.

I selected a sentence written by Mat Velloso in a very curious attempt to distinguish between Machine Learning and Artificial Intelligence, in simple terms.

Mat Velloso is a Technical Advisor to the CEO at Microsoft and developer advocate at heart. https://twitter.com/matvelloso

Let’s cut and stick to a shorter part of the text to make it simpler:

text = "If it is written in PYTHON, it's probably machine learning"

re.findall()

The re module has a set of functions that allow us to search for, return and replace a string or any part of a string. We start with the findall() function, which returns a list containing all matches.

rx = re.findall(r'.*PYTHON', text)
print(rx)

['If it is written in PYTHON']

Understanding: firstly, the findall() function returns a list of the matches in the order in which it finds them. Secondly, the 'r' prefix marks the pattern as a "raw string", so Python passes backslashes to the regex engine untouched.
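To see why the raw-string prefix matters, here is a small sketch. The sample string is made up for illustration. Without the prefix, Python turns \b into a backspace character before the regex engine ever sees it; with the prefix, the engine reads \b as a word boundary.

```python
import re

sample = "a cat sat"  # hypothetical sample string, not from the article

# Without r'', '\b' is a backspace character, so the pattern looks for
# backspace + "cat" + backspace and finds nothing
no_raw = re.findall('\bcat\b', sample)

# With r'', the regex engine receives \b and treats it as a word boundary
with_raw = re.findall(r'\bcat\b', sample)

print(no_raw)    # []
print(with_raw)  # ['cat']
```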

Moving on to the '.*PYTHON' part. We want to return everything up to and including the word PYTHON. The dot . matches any single character (letters, digits, symbols or spaces, but not a newline), and the star * repeats the preceding element zero or more times. Together, .* works like a wildcard that grabs as much text as it can before the word PYTHON.
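A quick sketch of how the dot behaves with and without the star. The sample string is hypothetical and contains the word twice on purpose, to show that .* runs up to the last occurrence.

```python
import re

sample = "PYTHON one, PYTHON two"  # hypothetical string with two occurrences

# A single dot: exactly one character, followed by PYTHON
one_char = re.findall(r'.PYTHON', sample)

# Dot plus star: as many characters as possible, followed by PYTHON
greedy = re.findall(r'.*PYTHON', sample)

print(one_char)  # [' PYTHON']
print(greedy)    # ['PYTHON one, PYTHON']
```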

If we move .* to the other side of the word, we receive the other half of the sentence:

rx = re.findall(r'PYTHON.*', text)
print(rx)

["PYTHON, it's probably machine learning"]

Setting the flags argument to re.IGNORECASE makes the pattern match regardless of upper or lower case.

rx = re.findall(r'python.*', text, flags=re.IGNORECASE)
print(rx)

["PYTHON, it's probably machine learning"]

From this point on, we can build a series of possibilities.

rx = re.findall(r'written.*machine', text)
print(rx)

["written in PYTHON, it's probably machine"]

rx = re.findall(r'tt.*bl', text)
print(rx)

["tten in PYTHON, it's probabl"]

Moving on to other symbols: ^ checks whether a string starts with a specific pattern, and $ checks whether it ends with one.

  • ^ Matches the start of the string (without the re.MULTILINE flag it behaves the same as \A)
  • \w+ Matches one or more consecutive word characters (letters, digits or the underscore)

If we remove the symbol + we receive only the first character.

text = "If it is written in PYTHON, it's probably machine learning"

rx = re.findall(r'^\w+', text)
print(rx)

['If']

rx = re.findall(r'learning$', text)
print(rx)

['learning']

If the pattern doesn’t match, we receive an empty list.
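The three cases just described can be put side by side in a minimal sketch that reuses the article's text. The variable names are mine.

```python
import re

text = "If it is written in PYTHON, it's probably machine learning"

# No + after \w: only a single character at the start of the string
first_char = re.findall(r'^\w', text)

# With +: one or more word characters, i.e. the whole first word
first_word = re.findall(r'^\w+', text)

# 'learning' is at the end, not at the start, so nothing matches
no_match = re.findall(r'^learning', text)

print(first_char)  # ['I']
print(first_word)  # ['If']
print(no_match)    # []
```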

When a quantifier matches as much as it can, it is said to be greedy. The symbol ? on its own makes the preceding element optional, matching it zero or one time. Placed right after another quantifier, as in *? or +?, it specifies the non-greedy version of * and +, which matches as little as possible.

rx = re.findall(r' .*? ', text)
print(rx)

[' it ', ' written ', ' PYTHON, ', ' probably ']
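To make the greedy versus non-greedy contrast easier to see, here is a hedged sketch on two made-up strings (neither comes from the original text). It also shows ? in its other role, making a single character optional.

```python
import re

tags = "<b>bold</b> and <i>italic</i>"  # hypothetical sample string

# Greedy: .* runs all the way to the last '>'
greedy = re.findall(r'<.*>', tags)

# Non-greedy: .*? stops at the first '>' it can
lazy = re.findall(r'<.*?>', tags)

# ? after a plain character makes that character optional
optional_u = re.findall(r'colou?r', "color colour")

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']
print(optional_u)  # ['color', 'colour']
```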

Braces {b,n} are used when we want the preceding element to be repeated at least b times and at most n times.

text = "If it is written in PYTHON, it's probably machine learning"

rx = re.findall(r'(t{1,4}|i{1,})', text)
print(rx)

['i', 't', 'i', 'i', 'tt', 'i', 'i', 't', 'i', 'i']

In the next example, we ask for at least 1 and at most 4 t in a row, and the four t in listttt come back as a single match.

We are also asking for at least 1 and at most 3 e in a row. The word compreheeeension contains a lone e followed, further on, by four e in a row. The run of four is split into a group of three, and that is the reason we get a remaining single e at the end.

rx = re.findall(r'(t{1,4}|e{1,3})', 'listttt compreheeeension')
print(rx)

['tttt', 'e', 'eee', 'e']
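The different forms of the braces can be compared in a small sketch. The sample string is invented, and the \b word boundaries are my addition so that each run of letters is judged as a whole.

```python
import re

sample = "a aa aaa aaaa"  # hypothetical sample string

# {2}: exactly two repetitions
exactly_two = re.findall(r'\ba{2}\b', sample)

# {2,3}: at least two and at most three repetitions
two_to_three = re.findall(r'\ba{2,3}\b', sample)

# {3,}: at least three repetitions, no upper limit
at_least_three = re.findall(r'\ba{3,}\b', sample)

print(exactly_two)     # ['aa']
print(two_to_three)    # ['aa', 'aaa']
print(at_least_three)  # ['aaa', 'aaaa']
```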

The use of square brackets [] specifies a set of characters that we want to match. For example, [abc] has 1 match with a, has 3 matches with cab, and no match with hello.

We can also specify a range of values using the symbol - inside square brackets. Thus, [a-d] is the same as [abcd], the range [1-4] is the same as [1234], and so on.

Following the same reasoning, the range [a-z] matches any lowercase letter and [A-Z] any uppercase letter. With the combination [a-zA-Z] we check for both upper and lower case at the same time. Let’s try some examples.

# assigning new text
alpha_num = "Hello 1234"

rx = re.findall(r'[a-z]', alpha_num)
print(rx)

['e', 'l', 'l', 'o']

rx = re.findall(r'[a-zA-Z]', 'Hello 1234')
print(rx)

['H', 'e', 'l', 'l', 'o']

rx = re.findall(r'[a-zA-Z0-9]', 'Hello 1234')
print(rx)

['H', 'e', 'l', 'l', 'o', '1', '2', '3', '4']

What if we add the symbol +, what would happen?

rx = re.findall(r'[a-zA-Z0-9]+', 'Hello 1234')
print(rx)

['Hello', '1234']

Tip: if the first character inside the set is ^, the set is negated and every character that is not in the set will be matched.

rx = re.findall(r'[^a-zA-Z ]+', 'Hello 1234')
print(rx)

['1234']
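Here is one more sketch of sets, ranges and negated sets working together. The sample string is hypothetical and was chosen only because it mixes letters, digits and punctuation.

```python
import re

sample = "Order 66, aisle B-12"  # hypothetical sample string

# A plain set: any one of the listed lowercase vowels
vowels = re.findall(r'[aeiou]', sample)

# A range with +: runs of consecutive digits
digit_runs = re.findall(r'[0-9]+', sample)

# A negated set with +: runs of anything that is not a letter or a space
not_letters_or_spaces = re.findall(r'[^a-zA-Z ]+', sample)

print(vowels)                 # ['e', 'a', 'i', 'e']
print(digit_runs)             # ['66', '12']
print(not_letters_or_spaces)  # ['66,', '-12']
```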

Does all of this make sense to you? Awesome!

Let’s now talk about special sequences. These are written with a backslash followed by a character that determines their meaning.

  • \w - As already seen earlier, matches any word character: letters, digits, and the underscore
  • \W - Matches any character that is not a word character
  • \d - Matches any digit from zero to nine (0–9)

If the star * repeats the preceding element zero or more times, the sign + repeats it one or more times. So what’s the difference? Let us create another string to exemplify and take a closer look.

# assigning new text
letters_numbers = "The letter A, the character * and the numbers 11, 222 and 3456."

rx = re.findall(r'\w', letters_numbers)
print(rx)

['T', 'h', 'e', 'l', 'e', 't', 't', 'e', 'r', 'A', 't', 'h', 'e', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 'a', 'n', 'd', 't', 'h', 'e', 'n', 'u', 'm', 'b', 'e', 'r', 's', '1', '1', '2', '2', '2', 'a', 'n', 'd', '3', '4', '5', '6']

Instead, if we add the symbol +, what would be the difference?

rx = re.findall(r'\w+', letters_numbers)
print(rx)

['The', 'letter', 'A', 'the', 'character', 'and', 'the', 'numbers', '11', '222', 'and', '3456']

rx = re.findall(r'\W', letters_numbers)
print(rx)

[' ', ' ', ',', ' ', ' ', ' ', '*', ' ', ' ', ' ', ' ', ',', ' ', ' ', ' ', '.']

Only extracting digits:

rx = re.findall(r'\d+', letters_numbers)
print(rx)

['11', '222', '3456']

rx = re.findall(r'\d{3,}', letters_numbers)
print(rx)

['222', '3456']
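As a compact recap of the special sequences, here is a sketch on a made-up string that includes an underscore, to show that \w treats it as a word character.

```python
import re

sample = "Room 101, floor_3!"  # hypothetical sample string

# \w+ : runs of letters, digits and underscores
words = re.findall(r'\w+', sample)

# \W+ : runs of everything else (spaces, punctuation)
non_words = re.findall(r'\W+', sample)

# \d+ : runs of digits only
digits = re.findall(r'\d+', sample)

print(words)      # ['Room', '101', 'floor_3']
print(non_words)  # [' ', ', ', '!']
print(digits)     # ['101', '3']
```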

Now imagine that we want to extract the uppercase words from the string in groups of two letters.

upper_extract = "Regex is very NICE for finding and processing text in PYTHON"

rx = re.findall(r'[A-Z]{2}', upper_extract)
print(rx)

['NI', 'CE', 'PY', 'TH', 'ON']
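The quantifier decides whether the uppercase runs come back in pairs or as whole words. A minimal sketch reusing the same string; the \b word boundaries in the second pattern are my addition and not part of the original example.

```python
import re

upper_extract = "Regex is very NICE for finding and processing text in PYTHON"

# {2}: exactly two uppercase letters per match, so words are cut into pairs
pairs = re.findall(r'[A-Z]{2}', upper_extract)

# {2,} between word boundaries: whole words made of two or more uppercase letters
whole_words = re.findall(r'\b[A-Z]{2,}\b', upper_extract)

print(pairs)        # ['NI', 'CE', 'PY', 'TH', 'ON']
print(whole_words)  # ['NICE', 'PYTHON']
```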

re.split()

The split function can be handy: it splits the string wherever it finds a match and returns a list with the resulting pieces.

numbers = 'The air we breath is made up of 78% nitrogen, 21% oxygen and 1% of other stuff.'

rx = re.split(r'\d+', numbers)
print(rx)

['The air we breath is made up of ', '% nitrogen, ', '% oxygen and ', '% of other stuff.']

If the pattern doesn’t match, a list containing only the original string is returned.

A useful option is to limit the number of splits. We can do this by passing the maxsplit argument to re.split().

rx = re.split(r'\d+', numbers, maxsplit=1)
print(rx)

['The air we breath is made up of ', '% nitrogen, 21% oxygen and 1% of other stuff.']

In the next example, we split at each white-space character, but only at the first five occurrences.

rx = re.split(r'\s', numbers, maxsplit=5)
print(rx)

['The', 'air', 'we', 'breath', 'is', 'made up of 78% nitrogen, 21% oxygen and 1% of other stuff.']
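Two details of re.split() are worth a quick sketch on short, made-up strings: what comes back when nothing matches, and how wrapping the pattern in parentheses keeps the separators in the result. The second behaviour is not covered in the text above; I add it here only as an illustration.

```python
import re

# No digits in the string: the result is a list holding the original string
no_match = re.split(r'\d+', 'abc')

# Plain pattern: the digits are dropped
dropped = re.split(r'\d+', 'a1b22c')

# Pattern in a capturing group: the digits are kept as separate items
kept = re.split(r'(\d+)', 'a1b22c')

print(no_match)  # ['abc']
print(dropped)   # ['a', 'b', 'c']
print(kept)      # ['a', '1', 'b', '22', 'c']
```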

re.sub()

Sub stands for substitute: with this function on your side, you can replace every match with any text you like.

The syntax is simple: re.sub(pattern, replacement, string).

Other parameters can be added, such as the maximum number of replacements (count) or flags like re.IGNORECASE.

text = "If it is written in PYTHON, it's probably machine learning"

rx = re.sub(r'written', 'coded', text)
print(rx)

If it is coded in PYTHON, it's probably machine learning

rx = re.sub(r'it', 'THAT', text)
print(rx)

If THAT is wrTHATten in PYTHON, THAT's probably machine learning

In the next example, all we want is to replace ‘it’ with ‘THAT’, but only at the first occurrence.

rx = re.sub(r'it', 'THAT', text, count=1)
print(rx)

If THAT is written in PYTHON, it's probably machine learning

In the next example, we match the word ‘PYTHON,’ together with the white-space characters before and after it, and replace everything with ‘ code, ’. The flag is set to ignore case, so it does not matter that we typed ‘PYthon’ this way.

rx = re.sub(r'\sPYthon,\s', ' code, ', text, flags=re.IGNORECASE)
print(rx)

If it is written in code, it's probably machine learning
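A side effect visible in the earlier output is that ‘it’ was also replaced inside ‘written’. One possible way around it, sketched here as a suggestion rather than the only approach, is to wrap the pattern in \b word boundaries so that only the standalone word is replaced.

```python
import re

text = "If it is written in PYTHON, it's probably machine learning"

# \b on both sides: 'it' must be a word of its own, so 'written' is left alone
whole_word = re.sub(r'\bit\b', 'THAT', text)

# Ignoring case while replacing: the lowercase pattern still finds 'PYTHON'
recased = re.sub(r'python', 'Python', text, flags=re.IGNORECASE)

print(whole_word)  # If THAT is written in PYTHON, THAT's probably machine learning
print(recased)     # If it is written in Python, it's probably machine learning
```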

re.subn()

re.subn() does the same as re.sub(), except that it returns a tuple containing the new string and the number of replacements made.

rx = re.subn(r'it', 'THAT', text)
print(rx)

("If THAT is wrTHATten in PYTHON, THAT's probably machine learning", 3) 
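Since the result is a tuple, it can be unpacked straight into two variables. A short sketch with names of my own choosing:

```python
import re

text = "If it is written in PYTHON, it's probably machine learning"

# Unpack the (new_string, number_of_replacements) tuple
new_text, n_replacements = re.subn(r'it', 'THAT', text)

print(new_text)        # If THAT is wrTHATten in PYTHON, THAT's probably machine learning
print(n_replacements)  # 3
```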

Conclusion

As a first look at what Regex is capable of, I hope that by now you are considering applying these techniques more often when dealing with alphanumeric strings.

Regex can have a big impact on your productivity in Python, so I hope you continue to investigate it and invest a bit of your time in it. Once you feel comfortable with it, you will realize how many possibilities it opens up.

Feel free to get in touch with me; I would be pleased to hear from you. Thank you!

Good readings, great codings!

