
Pandas Beyond Numbers: Working with Textual Data

Explained with examples.

Photo by Aaron Burden on Unsplash

Pandas is one of the most widely used Python libraries for data analysis and manipulation. Since data does not always come in a clean numerical representation, a data analysis library should be able to handle other formats as well. Pandas does, and it does it well.

In this post, we will explore the capabilities of Pandas with textual data (i.e. strings). Let’s start with the data types for textual data.


Object vs String

Before pandas 1.0, only the "object" data type was used to store strings. This has some drawbacks because non-string data can also be stored using the "object" data type. For instance, a column with the object data type can hold numbers, text, dates, and lists at the same time, which is not ideal for data analysis.

Pandas 1.0 introduced a new data type specific to string data, StringDtype. As of now, we can still use either object or StringDtype to store strings, but in the future we may be required to use only StringDtype.

As of now, "object" is still the default data type to store strings. In order to use StringDtype, we need to explicitly state it.

import pandas as pd
pd.Series(['aaa','bbb','ccc']).dtype
dtype('O')
pd.Series(['aaa','bbb','ccc'], dtype='string').dtype
StringDtype
pd.Series(['aaa','bbb','ccc'], dtype=pd.StringDtype()).dtype
StringDtype

StringDtype is used if "string" or pd.StringDtype() is passed as an argument to the dtype parameter.
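To make the difference concrete, here is a minimal sketch. The sample values and the variable names (mixed, typed, converted) are my own and only serve as an illustration.

```python
import pandas as pd

# An object column can hold anything, so pandas cannot assume it is text.
mixed = pd.Series(["aaa", 1, [1, 2]])
print(mixed.dtype)

# With the dedicated string dtype, a missing value is stored as <NA>.
typed = pd.Series(["aaa", "bbb", None], dtype="string")
print(typed.dtype)
print(typed.isna().tolist())

# An existing series can be converted with astype.
converted = pd.Series(["aaa", "bbb"]).astype("string")
print(converted.dtype)
```

The exact text printed for the string dtype may differ between pandas versions.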


We need textual data to work with. I will first get a paragraph from the Wikipedia page of Pandas and store it in a list.

text = ["Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR he was able to convince management to allow him to open source the library."]

The variable text is a list with one item. I will convert it to a Pandas series that contains each word as a separate item.

A = pd.Series(text).str.split().explode().reset_index(drop=True)
A[:5]
0    Developer 
1          Wes 
2     McKinney 
3      started 
4      working 
dtype: object

The str accessor gives access to string operations. The split function, as the name suggests, splits the data on a specified separator (the default is any whitespace). We then used the explode function to turn each separated word into its own item in the series.

If we did not reset the index, all items in the series A would have the index 0, because explode repeats the index of the original row for every item it produces. It is worth mentioning that there are other ways to create a series of words from a text, which is one of the nice things about Pandas: it almost always offers multiple ways to do a task.
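The effect of reset_index is easier to see on a tiny input. The following sketch uses a made-up three-word sentence instead of the Wikipedia paragraph.

```python
import pandas as pd

# Hypothetical one-row series, used only for illustration.
sentence = pd.Series(["one two three"])

# split produces one list per row; explode gives each list item its own row
# but keeps the index label of the row it came from.
words = sentence.str.split().explode()
print(words.index.tolist())

# reset_index(drop=True) replaces the repeated labels with 0, 1, 2, ...
words_reset = words.reset_index(drop=True)
print(words_reset.index.tolist())
```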


Some words start with a capital letter, which is usually not desirable in data analysis because "Pandas" and "pandas" would be treated as different values. We want the text to be as standardized as possible, so we can convert all words to either uppercase or lowercase letters.

A.str.upper()[0]
'DEVELOPER'
A.str.lower()[0]
'developer'

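All items are converted, but we only accessed the first item. The outputs in the rest of this post are lowercase, so they assume the lowercase result was assigned back to A (A = A.str.lower()).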
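Here is a small sketch of the case-conversion methods on a few sample words. The series name words is hypothetical. Note that these methods return a new series and leave the original untouched.

```python
import pandas as pd

words = pd.Series(["Developer", "Wes", "McKinney"])

lowered = words.str.lower()      # every character in lowercase
uppered = words.str.upper()      # every character in uppercase
capped = words.str.capitalize()  # first character upper, the rest lower

print(lowered.tolist())
print(uppered.tolist())
print(capped.tolist())

# The original series still holds the mixed-case words.
print(words.tolist())
```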


The len function returns the length of an object. If it is a list, len returns the number of items in the list. If len is used with the str accessor, it returns the length of each string in the series.

A.str.len()[:5]
0    9 
1    3 
2    8 
3    7 
4    7
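One possible use of the lengths is filtering. The sketch below keeps only the words longer than three characters. The sample words and the threshold are arbitrary choices for illustration.

```python
import pandas as pd

words = pd.Series(["developer", "wes", "mckinney", "on"])

# Length of every string in the series.
lengths = words.str.len()
print(lengths.tolist())

# A boolean mask built from the lengths selects the longer words.
long_words = words[lengths > 3]
print(long_words.tolist())
```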

If you want to combine all the strings in this series back into a single string, cat is the function you are likely to use:

A.str.cat(sep=" ")
developer wes mckinney started working on pandas in 2008 while at aqr capital management out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. before leaving aqr he was able to convince management to allow him to open source the library.
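As a quick check that split and cat undo each other, here is a round trip on a short sentence of my own. It assumes the words are separated by single spaces.

```python
import pandas as pd

sentence = "pandas handles text well"  # hypothetical sample
words = pd.Series(sentence.split())

# Joining with a space restores the original sentence.
rebuilt = words.str.cat(sep=" ")
print(rebuilt)

# Without sep, the strings are concatenated with nothing in between.
no_sep = words.str.cat()
print(no_sep)
```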

The string operations offer a great deal of flexibility. For instance, replace can swap a whole string for another one, and it can also replace only part of a string (i.e. it works on substrings).

A.str.replace('the', 'not-a-word')[-3:]
45        source 
46    not-a-word 
47      library.
A.str.replace('deve', ' ')[:3]
0       loper 
1         wes 
2    mckinney
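Because replace works on substrings, it also changes words that merely contain the pattern. The sketch below contrasts a literal replacement with a regex that only matches the whole word. The sample words are my own. The default value of the regex parameter has changed between pandas versions, so I pass it explicitly here.

```python
import pandas as pd

words = pd.Series(["the", "there", "library."])

# Literal replacement: every occurrence of the substring is replaced.
literal = words.str.replace("the", "X", regex=False)
print(literal.tolist())

# Regex anchored at both ends: only strings that are exactly "the" change.
whole_word = words.str.replace(r"^the$", "X", regex=True)
print(whole_word.tolist())

# With regex=False the dot is a plain character, not "any character".
no_dot = words.str.replace(".", "", regex=False)
print(no_dot.tolist())
```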

The str accessor provides indexing on characters of strings. For instance, we can get the first 3 characters only. It comes in handy if there are some redundant characters at the end or beginning of a string.

first_three = A.str[:3]
first_three[:2]
0    dev 
1    wes
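The same slicing syntax as for plain Python strings applies, including negative positions. Here is a sketch with made-up codes that carry a redundant trailing character.

```python
import pandas as pd

# Hypothetical codes with an unwanted ";" at the end.
codes = pd.Series(["ID-001;", "ID-002;", "ID-003;"])

trimmed = codes.str[:-1]   # drop the last character
numbers = codes.str[3:6]   # characters at positions 3, 4 and 5
last = codes.str[-1]       # only the last character

print(trimmed.tolist())
print(numbers.tolist())
print(last.tolist())
```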

Pandas string methods are also compatible with regular expressions (regex). Regex is too broad a topic to cover here in full, but we will do a few simple examples. Suppose we have strings that consist of a letter followed by a number, so the pattern is letter-number. We can use this pattern to extract parts of the strings.

B = pd.Series(['a1','b4','c3','d4','e3'])
B.str.extract(r'([a-z])([0-9])')
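The result of extract is a dataframe with one column per capture group. The sketch below repeats the call and adds a variant with named groups. The group names letter and digit are my own choice.

```python
import pandas as pd

B = pd.Series(['a1', 'b4', 'c3', 'd4', 'e3'])

# Two capture groups give two columns, labelled 0 and 1.
parts = B.str.extract(r'([a-z])([0-9])')
print(parts)

# Named groups become the column names.
named = B.str.extract(r'(?P<letter>[a-z])(?P<digit>[0-9])')
print(named.columns.tolist())
```

The extracted digits are still strings, so they need to be converted before doing arithmetic on them.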

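We may also want to check whether each string contains the pattern. The contains function returns True or False for every item, so a string that does not follow the pattern (such as '4b' below) stands out.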

C = pd.Series(['a1','4b','c3','d4','e3'])
C.str.contains(r'[a-z][0-9]')
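Here is a sketch of what the check returns and how to reduce it to a single answer. Keep in mind that contains looks for the pattern anywhere in the string. If the whole string has to follow the pattern, fullmatch (available in recent pandas versions) is stricter. The 'xa1y' example is my own.

```python
import pandas as pd

C = pd.Series(['a1', '4b', 'c3', 'd4', 'e3'])

# One boolean per string: does it contain a letter followed by a digit?
matches = C.str.contains(r'[a-z][0-9]')
print(matches.tolist())

# all() answers whether every string contains the pattern.
all_match = matches.all()
print(all_match)

# contains accepts extra characters around the match, fullmatch does not.
loose = pd.Series(['xa1y']).str.contains(r'[a-z][0-9]')
strict = pd.Series(['xa1y']).str.fullmatch(r'[a-z][0-9]')
print(loose.tolist(), strict.tolist())
```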

We can also count the number of occurrences of a particular character in each string.

B = pd.Series(['aa','ab','a','aaa','aaac'])
B.str.count('a')
0    2 
1    1 
2    1 
3    3 
4    3
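The argument of count is treated as a regular expression, so it can also count several characters at once. Special characters such as the dot need to be escaped. A short sketch with sample values of my own:

```python
import pandas as pd

B = pd.Series(['aa', 'ab', 'a', 'aaa', 'aaac'])

# A character class counts every "a" or "c".
a_or_c = B.str.count(r'[ac]')
print(a_or_c.tolist())

# An unescaped dot would match any character, so escape it to count dots.
dots = pd.Series(['a.b.c']).str.count(r'\.')
print(dots.tolist())
```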

In some cases, we may need to filter strings based on a condition. For instance, the startswith and endswith functions can be used with the str accessor.

B = pd.Series(['aa','ba','ad'])
B.str.startswith('a')
0     True 
1    False 
2     True
B.str.endswith('d')
0    False 
1    False 
2     True
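The boolean results are most useful as masks. The sketch below uses them to select the matching strings, and combines two conditions with the & operator.

```python
import pandas as pd

B = pd.Series(['aa', 'ba', 'ad'])

# Keep only the strings that start with "a".
starts_a = B[B.str.startswith('a')]
print(starts_a.tolist())

# Both conditions must hold: starts with "a" and ends with "d".
a_and_d = B[B.str.startswith('a') & B.str.endswith('d')]
print(a_and_d.tolist())
```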

Some machine learning algorithms require categorical variables to be encoded as numbers. The str accessor supports the get_dummies function, which creates one indicator column per distinct value.

cities = ['New York', 'Rome', 'Madrid', 'Istanbul', 'Rome']
pd.Series(cities).str.get_dummies()
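To show what the result looks like, here is a sketch that inspects the columns. The second part, with several labels in one string, uses made-up tags and relies on the default separator "|".

```python
import pandas as pd

cities = ['New York', 'Rome', 'Madrid', 'Istanbul', 'Rome']

# One column per distinct city; a 1 marks the city of each row.
dummies = pd.Series(cities).str.get_dummies()
print(dummies.columns.tolist())
print(dummies['Rome'].tolist())

# Strings holding several labels are split on the separator first.
tags = pd.Series(['red|blue', 'blue']).str.get_dummies(sep='|')
print(tags)
```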

Pandas string operations are not limited to what we have covered here, but the functions and methods we discussed should help you process string data and speed up data cleaning and preparation.


Thanks for reading. Please let me know if you have any feedback.

