Natural Language Processing: Intelligent Search through text using spaCy and Python

Extract useful information from text using Python and Machine Learning

Akash Chauhan
Towards Data Science


Searching through text is one of the key focus areas of machine learning applications in natural language processing.

But what if we have to search for multiple keywords in a large document (100+ pages)? And what if we need a contextual search (searching for keywords with a similar meaning) within that document? The conventional ‘CTRL + F’ approach would take hours for the first task, and for a contextual search it would not find any meaningful text at all.

This article shows how we can use machine learning to solve this problem using spaCy (a powerful open-source NLP library) and Python.

Data Pre-Processing

The first step in building any machine learning-based solution is pre-processing the data. In our case, we will pre-process a PDF document using the PyPDF2 package in Python and then convert the entire text into a spaCy document object. For readers who have not worked with spaCy: it is an advanced open-source Python library used for various NLP tasks. To learn more, refer to the documentation at https://spacy.io/

We will first load the PDF document, clean the text, and then convert it into a spaCy document object. The following code can be used to perform this task:

Data Pre-Processing
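A minimal sketch of these steps, assuming PyPDF2 2.0 or later (which provides `PdfReader`) and spaCy v3 with the `en_core_web_lg` model installed. The function names `clean_text`, `load_pdf_text` and `pdf_to_doc` are illustrative, and the cleaning rules are only one reasonable choice; real documents often need more.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize text extracted from a PDF (illustrative rules only)."""
    # Rejoin words hyphenated across a line break: "process-\ning" -> "processing".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def load_pdf_text(path: str) -> str:
    """Concatenate the extracted text of every page in the PDF."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(path)
    # extract_text() can come back empty for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdf_to_doc(path: str, model: str = "en_core_web_lg"):
    """Build a spaCy Doc from the cleaned PDF text."""
    import spacy  # assumes spaCy v3 and an installed model

    nlp = spacy.load(model)
    return nlp(clean_text(load_pdf_text(path)))
```

Calling `pdf_to_doc("report.pdf")` (a placeholder file name) would return the document object used in the rest of the article. For very long documents, spaCy's `nlp.max_length` limit (1,000,000 characters by default) may need to be raised, or the text processed in chunks.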

First we load spaCy’s ‘en_core_web_lg’ model, a pre-trained English language model. spaCy also supports many other languages (see the documentation linked above). The English models come in several sizes (small, medium and large); here we use the large model because we need word vectors. The small model does not ship with static word vectors, while the medium and large models do, with the large model carrying the bigger vector table.

The ‘setCustomBoundaries()’ function is used as a custom sentence segmentation method in place of the default one. It can be modified to suit the document we are dealing with.

Once we have the spaCy document object ready, we can move on to handling the input query (keywords) that we want to search for in the document.
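As an illustration of what such a segmentation rule could look like in spaCy v3, the sketch below starts a new sentence after every semicolon. The rule itself, the snake_case name `set_custom_boundaries` and the helper `build_pipeline` are assumptions made for this example; the right rule depends on the document.

```python
def set_custom_boundaries(doc):
    """Start a new sentence after every semicolon (hypothetical rule)."""
    for token in doc[:-1]:  # the last token has no successor
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc


def build_pipeline(model: str = "en_core_web_lg"):
    """Load a model and run the custom rule before the dependency parser."""
    import spacy  # assumes spaCy v3
    from spacy.language import Language

    # spaCy v3 adds components by registered name, not by function object.
    Language.component("set_custom_boundaries", func=set_custom_boundaries)
    nlp = spacy.load(model)
    # Boundaries set before the parser runs are respected by the parser.
    nlp.add_pipe("set_custom_boundaries", before="parser")
    return nlp
```

Placing the component before the parser matters: the parser keeps the boundaries that are already set and only decides the remaining ones.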

Handling Query — Finding Similar Keywords

Before moving on to the coding part, let's look at the broader approach we are following in order to get more accurate search results from the document we are searching.

Process Workflow

Up to the data pre-processing stage, we have converted the text of our PDF document into a spaCy document object. Now we also convert our keywords into spaCy document objects, take their vector form (shape (300,)), and find similar keywords using cosine similarity. At the end we will have an extended list of similar keywords along with the original keywords, which we can then search for in our document to get more accurate results.

Refer to the code below to perform this task:

Generate Similar Keywords
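A minimal sketch of the idea, assuming spaCy v3 and a model with word vectors. The first two functions show cosine-similarity ranking in plain Python over a small dictionary of vectors. The third, `similar_keywords`, is a hypothetical helper that applies the same ranking to spaCy's own vector table through `Vectors.most_similar`, which is one possible way to do it and not necessarily the approach of the original listing.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vec, candidates: dict, top_n: int = 5) -> list:
    """Rank candidate words by cosine similarity to query_vec."""
    scored = [(w, cosine_similarity(query_vec, v)) for w, v in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def similar_keywords(nlp, keyword: str, top_n: int = 5) -> list:
    """Look up the nearest neighbours of a keyword in spaCy's vector table."""
    # Doc.vector is the average of the token vectors, with shape (300,).
    query = nlp(keyword).vector.reshape(1, -1)
    keys, _rows, _scores = nlp.vocab.vectors.most_similar(query, n=top_n)
    words = []
    for key in keys[0]:
        # Neighbours often include the keyword itself and case variants.
        word = nlp.vocab.strings[int(key)].lower()
        if word not in words:
            words.append(word)
    return words
```

With a loaded pipeline `nlp`, `similar_keywords(nlp, "engine", top_n=10)` would return up to ten lowercased neighbours, which can be appended to the original keyword list.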

Now that we have found contextually similar words to our original keywords, let's work on the final searching part.

Searching Keywords through Text

For searching, we will use spaCy’s PhraseMatcher class from the spacy.matcher module. At this point, it is important to remember that a spaCy document object is not the same as a plain Python string, so we cannot simply use string comparisons in an if/else to find the results.

Refer to the code below to perform this task:

Searching through text
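A minimal sketch of such a search, assuming spaCy v3. The function name `search_keywords` and the shape of the returned rows (dictionaries with `keyword` and `sentence` keys) are choices made for this example.

```python
def search_keywords(nlp, doc, keywords):
    """Return one row per keyword match, with the sentence that contains it."""
    from spacy.matcher import PhraseMatcher  # assumes spaCy v3

    # attr="LOWER" makes the matching case-insensitive.
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    # make_doc only tokenizes, which is all the patterns need.
    matcher.add("KEYWORDS", [nlp.make_doc(k) for k in keywords])

    rows = []
    # Sort by start offset so rows follow document order.
    for _match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
        span = doc[start:end]
        # span.sent needs sentence boundaries (a parser or sentencizer).
        rows.append({"keyword": span.text.lower(), "sentence": span.sent.text})
    return rows
```

The rows can be passed straight to `pandas.DataFrame(rows)` if a table is more convenient.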

The above code searches the entire text for every keyword we have and returns the full sentence wherever it finds a match.

The above code will generate the following output:

Search Output Screenshot

You can increase or decrease the number of similar keywords found for each original keyword. Also, once you have the results in a dataframe, you can add logic for ranking them (for example, giving more weight to exact keyword matches).

Note: Increasing the number of similar keywords to a large value increases the computational cost of the overall program, so it should be chosen carefully.

So this is how you can create your own ML-based Python program for searching through any text.

For any other input source (photographs, web pages, etc.), you only need to customize the data pre-processing part (OCR, web scraping, etc.); the rest of the logic should work as is.
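The ranking idea mentioned above could be sketched as follows. This assumes each search result is a dictionary with `keyword` and `sentence` keys; the function name `rank_results` and the weights 2.0 and 1.0 are arbitrary choices for illustration.

```python
def rank_results(rows, original_keywords, exact_weight=2.0, similar_weight=1.0):
    """Score each sentence by summing the weights of the keywords found in it."""
    originals = {k.lower() for k in original_keywords}
    scores = {}
    for row in rows:
        # Matches on an original keyword count more than matches on a similar one.
        is_exact = row["keyword"].lower() in originals
        weight = exact_weight if is_exact else similar_weight
        scores[row["sentence"]] = scores.get(row["sentence"], 0.0) + weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A sentence that contains both an original keyword and a similar one ends up above a sentence that contains only a similar keyword.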
