Text Preprocessing with data-describe

Using a new Python EDA package to clean messy, unstructured text

Rishi Sheth
Towards Data Science


Image by Utsav Srestha from Unsplash

Is there anything so messy as unstructured text? Probably not. Whoever invented capitalization, punctuation, and stop words clearly did not consider the implications those things have for topic modeling, and only cared about readability for humans. How inconsiderate of them. Luckily, we do not have to painstakingly clean our own unstructured text, as people before us have written tools that let us do it easily.

data-describe, one such tool, is an open source package meant to help data scientists accelerate their Exploratory Data Analysis process. While the package offers much more than text handling, the remainder of this article explores cleaning up messy text using data-describe and then shows off some of its topic modeling functionality. Full disclosure: I am one of the authors of data-describe, so I am not an unbiased source. However, this is not just a shameless plug, so keep reading to see how useful this tool can truly be.

We can start by importing the text preprocessing functions from data-describe, along with the scikit-learn datasets we will be using. The categories were picked at random, so feel free to experiment and play around to see different results.

from data_describe.text.text_preprocessing import *
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.sport.baseball', 'rec.autos', 'sci.med', 'sci.space']
text_docs = fetch_20newsgroups(subset='train', categories=categories)['data']

Let’s take a look at the first document in our diverse corpus:

Image by Author
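
If you are following along in a notebook, you can reproduce this view by displaying a slice of the raw string (the 500-character cutoff is arbitrary):

text_docs[0][:500]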

Gross. Plenty of newline characters, capital letters, and punctuation marks to get rid of. Our work is cut out for us. Let’s start by “tokenizing” our text, or splitting a single string document into a list of words.

There are two things to bear in mind as we walk through the individual functions of this text preprocessing module:

  1. For each individual preprocessing function displayed, the output will be a list of strings
  2. In the code, the individual preprocessing functions will be wrapped with another function “to_list”. This is because these functions are generators, meaning they iterate through a list to return values. This allows extensibility for more efficient processing such as multiprocessing.

tokens = to_list(tokenize(text_docs))
tokens[0]
Image by Author

So at this point, we’ve turned our document into a list of words, otherwise known as “bag of words” format. Our next step is to make sure everything is lower case.

lower = to_list(to_lower(tokens))
lower[0]
The words that have been changed to be lowercase are all highlighted. (Image by Author)

Progress! Next, we will remove punctuation. This function can either remove or replace punctuation. The default removes punctuation only when it is a trailing/leading instance, but it can also be set to remove all instances of punctuation.

no_punct = to_list(remove_punct(lower))
no_punct[0]
Red blocks are used to identify areas that had previously contained punctuation. (Image by Author)
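
To make the trailing/leading distinction concrete, here is a plain-Python illustration of the two behaviors (this is not data-describe's API, just the underlying idea):

import string

word = 'e.g.'
word.strip(string.punctuation)  # 'e.g' -- only leading/trailing punctuation removed
''.join(ch for ch in word if ch not in string.punctuation)  # 'eg' -- every punctuation character removed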

After removing punctuation, we can move on to removing digits and any words that contain digits.

no_digits = to_list(remove_digits(no_punct))
no_digits[0]
Red blocks are used to identify areas that previously contained digits or words with digits. (Image by Author)

The next function will remove any empty spaces or single characters from the text.

no_single_char_and_space = to_list(remove_single_char_and_spaces(no_digits))
no_single_char_and_space[0]
Red blocks are used to identify areas that previously contained spaces or single characters. (Image by Author)

Next is the removal of stopwords. In topic modeling, the idea is to separate text documents by the important words inside of those documents which are central to the meaning of the document. For this reason, commonly used words that are of little help in this sense, such as “the”, “what”, “those”, etc. are removed from the documents. While the following function removes all NLTK-defined stopwords from your documents, the function can take a list of additional custom stopwords as a parameter to remove those, as well.

no_stopwords = to_list(remove_stopwords(no_single_char_and_space))
no_stopwords[0]
Red blocks are used to identify areas that previously contained stopwords. (Image by Author)
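
For reference, the NLTK stopword list that the function draws on can be inspected directly (assuming NLTK and its stopwords corpus are installed):

import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus
from nltk.corpus import stopwords

stopwords.words('english')[:10]  # ['i', 'me', 'my', 'myself', 'we', ...]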

The subsequent functions can be used to “lemmatize” or “stem” the documents. The idea is to group similar words together so they can be identified as such during topic modeling. For example, lemmatizing the words “breaking” and “broke” would return “break” for each. Similarly, plural and singular forms are treated as the same word. The main difference between the two techniques is that lemmatized words are always still real words, while this is not true for stemmed words. Examples below.

lem_docs = to_list(lemmatize(no_stopwords))
lem_docs[0]
The words that have been lemmatized are highlighted. (Image by Author)
stem_docs = to_list(stem(no_stopwords))
stem_docs[0]
The words that have been stemmed are highlighted. (Image by Author)
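
If you want to see the difference on an individual word without the screenshots, the same comparison can be reproduced with plain NLTK (a standalone sketch; data-describe's internals may differ):

import nltk
nltk.download('wordnet')  # corpus required by the WordNet lemmatizer
from nltk.stem import WordNetLemmatizer, PorterStemmer

WordNetLemmatizer().lemmatize('studies')  # 'study' -- still a real word
PorterStemmer().stem('studies')  # 'studi' -- not a real word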

After all this is done, we can convert the bag-of-words lists back into single-string text documents.

clean_text_docs = to_list(bag_of_words_to_docs(no_stopwords))
clean_text_docs[0]
Image by Author
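
Conceptually, this last step is just joining each token list back together with spaces; a plain-Python equivalent would look something like this (not the library's implementation, just the idea):

rejoined = [' '.join(doc_tokens) for doc_tokens in no_stopwords]
rejoined[0]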

So, we’ve come a long way since our initial unprocessed text documents. At this point, you are probably wondering: is it really necessary to type out ALL of those functions to get this done? Not to worry! There is one function that applies all of these steps with one swift press of a button. But wait! What if you need to apply a custom step somewhere in the preprocessing? Once again, not to worry! Custom functions can be mixed into the pipeline. The next block of code will demonstrate how.

Looking over some of these documents, it seems they might need more than the general preprocessing steps to extract relevant information from them.

Image by Author

Cue the regex.

import re

# Match everything that follows the "Lines:" header field in the email metadata
pattern = re.compile(r"[\s\S]*\nLines:([\s\S]*)")

def extract_lines(docs):
    # Generator: yield only the body text after the "Lines:" header of each document
    for doc in docs:
        extract = pattern.match(doc).group(1)
        yield extract

Let’s take a look at how the document looks after matching on the regex pattern above.
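
One quick way to produce that view yourself is to materialize the generator and display a slice of the first result (a sketch using the helpers defined above; the 300-character cutoff is arbitrary):

extracted = to_list(extract_lines(text_docs))
extracted[0][:300]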

Image by Author

Removing the subject and sender from each document (spoiler alert: the documents are emails) seems like a good idea to keep the contents relevant for when we get to topic modeling. This brings us to our holy grail preprocessing function (note: functions from data-describe can be specified as strings in the custom pipeline, while custom functions are passed in as function objects).

clean_text_docs = preprocess_texts(text_docs, custom_pipeline=[
extract_lines,
'tokenize',
'to_lower',
'remove_punct',
'remove_digits',
'remove_single_char_and_spaces',
'remove_stopwords',
'stem',
'bag_of_words_to_docs'
])
clean_text_docs[0]
Image by Author

Much better. At this point, we are ready for topic modeling. data-describe can help here as well, with a “widget” that supports your topic model of choice, whether it be LDA, LSA, LSI, NMF, or SVD. The model type, the number of topics, and numerous other parameters are all customizable, but if no parameters are entered the function automatically checks the coherence values for models with two topics all the way through ten topics. The model whose number of topics gives the highest coherence value is the one returned.

from data_describe.text.topic_modeling import topic_model

lda_model = topic_model(clean_text_docs, model_type='LDA')
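
For intuition, here is roughly what that coherence-based selection looks like if you sketch it directly with gensim (data-describe's actual implementation may differ; this only illustrates the idea of scoring models with two through ten topics and keeping the best one):

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Tokenized documents (the cleaned corpus split back into words)
tokenized = [doc.split() for doc in clean_text_docs]
dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Fit models for 2-10 topics and keep the one with the highest coherence
best_model, best_score = None, float('-inf')
for k in range(2, 11):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=1)
    score = CoherenceModel(model=lda, texts=tokenized, dictionary=dictionary,
                           coherence='c_v').get_coherence()
    if score > best_score:
        best_model, best_score = lda, score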

The first thing we can do is look at the most important words per topic, which can be done by showing the model object in a new cell.

lda_model
Image by Author

There are a number of observations we can make here, keeping in mind that these words were stemmed as part of the preprocessing pipeline. Looking at the topics, “Topic 1” is clearly about baseball and “Topic 2” is clearly about outer space. How did the model decide on six as the number of topics? Let’s take a look at the coherence values for models with different numbers of topics.

lda_model.elbow_plot()
Image by Author

As expected, the coherence value is highest for the model with six topics. Next, we can look at the top documents associated with each topic. Since the documents are longer than pandas’ default display width, we first increase the maximum column width for our DataFrame.

import pandas as pd

pd.set_option('display.max_colwidth', 500)
top_docs = lda_model.top_documents_per_topic(clean_text_docs)
top_docs = top_docs[['Topic 1', 'Topic 2']]
top_docs.head(3)
Image by Author

We can take a closer look at the top document for Topic 1 below.

top_docs['Topic 1'].iloc[0][:700]
Image by Author

As expected, this document is about baseball. Looking over our topics, we see that while the first two topics are clear as to their meanings, the others are not as clear. This could be for a number of different reasons. Topic modeling is known to work significantly better on longer documents, and not as well for shorter texts such as the emails examined above. It’s possible more regex processing is required to deal with the quoted threads and reply-chains in emails. Maybe more tuning is required, beginning by looking at word frequencies (not yet a feature of data-describe, but we are happily accepting contribution efforts!). EDA is just the first step in the modeling process, and that is what data-describe is here to help with.
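
As a starting point for that kind of tuning, raw word frequencies are easy to compute with the standard library (a quick sketch, not a data-describe feature):

from collections import Counter

# Count tokens across the cleaned corpus and inspect the most common ones
word_counts = Counter(word for doc in clean_text_docs for word in doc.split())
word_counts.most_common(20)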

Everything above is a solid introduction to text preprocessing and topic modeling with data-describe, but it only skims the surface of the package’s features. Others include data summaries, clustering, correlations, heatmaps, distribution plots, feature ranking, scatter plots, dimensionality reduction, and more, along with support for big data and sensitive data. Features and enhancements are constantly being added by the authors (including myself) and the open source community, so please feel free to reach out with any questions or to jump in and start contributing today!
