Using Full Text Search

If you’ve used SQL to perform a text search, you would have probably used the like command. But the limitation with like command is that it looks for exact matches. Luckily for us, SQL offers a feature – SQL FULL TEXT INDEX – that offers fuzzy text search capability on any column that contains raw text. this is a god sent for NLP projects.
I for one, am a big fan of NLP libraries offered by python – scikit learn and spaCy.
But before one steps into the deep waters of the text processing, its good to dip your toe into the shallow end with SQL’s full text Index.
In this article we are going to see how one can use the Text search capabilities of MS SQL to perform an elastic search. Yes, a full fledged elastic search.
Dataset
We will be using the Amazon reviews dataset available at:
Lets focus on one product category

Import the downloaded json into python and have a look at the data.

Push the data into SQL

Setting up the Full Text search feature
First a little housekeeping.
Within SQL, check if the feature Full text is installed in the SQL Server
If its not installed, use the following link to do so.
Check if its enabled.
Create a Full Text catalog on the AmazonReviews table.
Think of this as the place where the index will be populated..
Populated… What does that mean..?
We’ll find out shortly…
For now, look at the primary key of the table AmazonReviews.

Hmm..
The table had ID as the primary key.
In the index, its named as PK_AmazonRe_3212…
Let us rename it for easy access later.

There, that looks better.
Now, the part we’ve been waiting for…
Populate, i.e. fill the Full text Index. This enables us to perform super fast searches on the reviews text.
Lets have a look at the contents of the Full text.

This is what we meant by populating the index.
Its has 121995 entries. It has a track of which term occurs in which column, and how many times.
Think of it like the vocabulary from Count Vectorizer we create in scikit learn.
And finally, lets perform a search operation on the reviewText column.
Lo and Behold…!

28000 rows with matching results.
Lets explore further…

Rows that rank higher contain exact match with the search term "low battery". While on the contrary, rows that rank lower contain partial matches – like only the word "low" or only the word "battery".
Additionally, notice that as we scroll through the results, the length of the reviews increases.
How does Ranking work?
At this point its interesting to understand how the rows are ranked.
Its ranked based on the OKAPI BM formula.
The formula in its entirety is:

But really it boils down to the below factors:
· Number of words in query
· Number of rows in DB matching the queried words
· Average length of the rows in DB
· Frequency of word in DB
· Frequency of word in the query
That explains why the matched row with 12330 characters has the lowest rank score and the matched row with 141 characters has the highest rank score.
And that there is a quick and easy to perform an elastic search using Full Text search in SQL. This ability to perform fuzzy search and rank and sort the results forms the core of any search engine. The best part of this feature is the index can be setup to auto-update when the base data updates.
In addition to using it as for its search capabilities, I have personally used the Full text index at the beginning of many NLP projects. For example, going through the index and the count of terms gives me an idea of the domain specific stop words. Scrolling through the index and looking at misspelled words or typos, gives me a clue to the kind of text cleaning steps I need to perform.
Thank you for making it this far. I’m going to be writing more such posts. Let me know what you think of this post and what other topics you’d like to see covered by messaging me on LinkedIn. For future posts, follow me on Medium.