The world’s leading publication for data science, AI, and ML professionals.

Build a search engine with SQL…

Using Full Text Search

Photo by Marten Newhall on Unsplash
Photo by Marten Newhall on Unsplash

If you’ve used SQL to perform a text search, you would have probably used the like command. But the limitation with like command is that it looks for exact matches. Luckily for us, SQL offers a feature – SQL FULL TEXT INDEX – that offers fuzzy text search capability on any column that contains raw text. this is a god sent for NLP projects.

I for one, am a big fan of NLP libraries offered by python – scikit learn and spaCy.

But before one steps into the deep waters of the text processing, its good to dip your toe into the shallow end with SQL’s full text Index.

In this article we are going to see how one can use the Text search capabilities of MS SQL to perform an elastic search. Yes, a full fledged elastic search.


Dataset

We will be using the Amazon reviews dataset available at:

Amazon review data

Lets focus on one product category

Screenshot captured by author
Screenshot captured by author

Import the downloaded json into python and have a look at the data.

Screenshot captured by Author
Screenshot captured by Author

Push the data into SQL

Screenshot captured by Author
Screenshot captured by Author

Setting up the Full Text search feature

First a little housekeeping.

Within SQL, check if the feature Full text is installed in the SQL Server

If its not installed, use the following link to do so.

Hands on Full-Text Search in SQL Server

Check if its enabled.

Create a Full Text catalog on the AmazonReviews table.

Think of this as the place where the index will be populated..

Populated… What does that mean..?

We’ll find out shortly…

For now, look at the primary key of the table AmazonReviews.

Hmm..

The table had ID as the primary key.

In the index, its named as PK_AmazonRe_3212…

Let us rename it for easy access later.

Primary key of the table
Primary key of the table

There, that looks better.

Now, the part we’ve been waiting for…

Populate, i.e. fill the Full text Index. This enables us to perform super fast searches on the reviews text.

Lets have a look at the contents of the Full text.

Contents of the FT index
Contents of the FT index

This is what we meant by populating the index.

Its has 121995 entries. It has a track of which term occurs in which column, and how many times.

Think of it like the vocabulary from Count Vectorizer we create in scikit learn.

And finally, lets perform a search operation on the reviewText column.

Lo and Behold…!

Exact Matches rank higher
Exact Matches rank higher

28000 rows with matching results.

Lets explore further…

Lower end of the matches
Lower end of the matches

Rows that rank higher contain exact match with the search term "low battery". While on the contrary, rows that rank lower contain partial matches – like only the word "low" or only the word "battery".

Additionally, notice that as we scroll through the results, the length of the reviews increases.

How does Ranking work?

At this point its interesting to understand how the rows are ranked.

Its ranked based on the OKAPI BM formula.

The formula in its entirety is:

Formula from : https://docs.microsoft.com/en-us/sql/relational-databases/search/limit-search-results-with-rank?view=sql-server-ver15
Formula from : https://docs.microsoft.com/en-us/sql/relational-databases/search/limit-search-results-with-rank?view=sql-server-ver15

But really it boils down to the below factors:

· Number of words in query

· Number of rows in DB matching the queried words

· Average length of the rows in DB

· Frequency of word in DB

· Frequency of word in the query

That explains why the matched row with 12330 characters has the lowest rank score and the matched row with 141 characters has the highest rank score.


And that there is a quick and easy to perform an elastic search using Full Text search in SQL. This ability to perform fuzzy search and rank and sort the results forms the core of any search engine. The best part of this feature is the index can be setup to auto-update when the base data updates.

In addition to using it as for its search capabilities, I have personally used the Full text index at the beginning of many NLP projects. For example, going through the index and the count of terms gives me an idea of the domain specific stop words. Scrolling through the index and looking at misspelled words or typos, gives me a clue to the kind of text cleaning steps I need to perform.


Thank you for making it this far. I’m going to be writing more such posts. Let me know what you think of this post and what other topics you’d like to see covered by messaging me on LinkedIn. For future posts, follow me on Medium.


Written By

Topics:

Related Articles