Semantic Search Engine

Case study of Stack Overflow Questions and Answers

Jayabalambika R
Towards Data Science


  1. Introduction
  2. Business Problem
  3. Mapping to Machine Learning Problem
  4. Understanding the Data
  5. Data Pre-Processing
  6. Exploratory Data Analysis
  7. Approach to Search Problem
  8. Pre-requisites
  9. Architectures
  10. Deployments
  11. Conclusions and Future Work
  12. Profile
  13. References

1. Introduction

Ontology (Information Science)

In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject [Wikipedia].

Semantic Search

Semantic search belongs to this branch of ontology. A well-known implementation is Google's Hummingbird algorithm, which places more emphasis on natural-language queries and considers context and meaning rather than individual keywords.

Semantic technologies are meaning-centred. They encompass the following [Wikipedia]:

· encode/decode of semantic representation

· knowledge graph embedding relationships

· auto-recognition of topics and concepts

· information and meaning extraction

· semantic data integration, and

· taxonomies/classification

Therefore, given any question, they directly search topics, concepts and associations that span across various resources.

2. Business Problem

Screenshot of the Stack Overflow search webpage

Stack Overflow is a question-and-answer forum: a website on software programming for professional and enthusiast programmers.

The key challenge for any company like Stack Overflow, which holds a humongous amount of data, is to search for similar questions across the entire archive within a limited turnaround time and with optimum resource utilization.

The business problem: given a repository of questions and answers from Stack Overflow, quickly build and deploy, within seven working days, a data-driven application that can search and retrieve the top k questions similar to a search string in under 500 ms, at low computational/server cost.

3. Mapping to Machine Learning Problem

The above business problem can be mapped to a machine learning problem: a k-nearest-neighbours search. Given a query vector Q*, we need to find its k nearest neighbours in the given set of questions {Q1, Q2, Q3, Q4, …, Qn-1, Qn}.

We can compute the nearest neighbours using distance measures. The simplest distance to compute is the Euclidean distance between the query vector and each data point in the given set of questions; it is the L2 norm of the difference between the two vectors.

Euclidean Distance

There is another pair of measures used to find similarity: cosine distance and cosine similarity. The cosine similarity between two vectors lies in the range [-1, 1]; as cosine distance increases, cosine similarity decreases, and vice versa.
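As a quick illustration, here is a minimal NumPy sketch of both measures; the corpus and query below are random stand-ins for the encoded questions, not the case study's actual data:

```python
import numpy as np

# toy corpus: n question vectors of dimension d (random stand-ins)
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 768))
query = rng.normal(size=768)

# Euclidean (L2) distance between the query and every question vector
euclidean = np.linalg.norm(corpus - query, axis=1)

# cosine similarity: dot product of L2-normalised vectors, in [-1, 1]
cosine = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

# top-5 nearest neighbours under each measure
top5_l2 = np.argsort(euclidean)[:5]
top5_cos = np.argsort(-cosine)[:5]
```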

4. Understanding the Data

screenshot highlighting fields in posts_questions table
screenshot highlighting posts_answers

Let us dissect the Stack Overflow webpage, which is the end-user's view of the data stored in the backend, by carefully exploring the data available in the Stack Overflow archives.

To experiment with the simplest approach, we would first need to acquire data. Stack Overflow data is available from multiple sources. In this case study, we have extracted the data from the publicly available datasets on Google Cloud Platform.

For the sake of understanding let us relate the key components of interest in this way.

Stackoverflow: Schema

posts_questions: table

posts_answers: table

Within each table, the individual columns are called fields.

The key fields in the questions table are: id, title, tags and body.

Similarly, the answers table contains: id, body and parent_id.

We first created our credentials on GCP; then we could access BigQuery on the Google Cloud Platform.

We can retrieve the data by running SQL commands in the BigQuery editor. This blog [1] from Google explains how to connect to BigQuery using credentials stored as JSON and download the data as JSON or CSV files.

BigQuery Editor

Using BigQuery SQL, we retrieve 100k records from the posts_questions and posts_answers tables, joining accepted_answer_id in posts_questions to the answer id in posts_answers. This design is a simple one: we fetch only rows where one question has one corresponding accepted answer (a one-to-one mapping).

Below is how the Stack Overflow data is extracted with sample SQL after configuring the BigQuery client. The result is stored in a JSON file first; we then used a Python library to convert it from JSON to CSV so the data could be migrated easily and accurately.
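A minimal sketch of this extraction step, assuming a service-account JSON downloaded from GCP; the credential filename and the exact column selection are illustrative, not the case study's verbatim query:

```python
from google.cloud import bigquery

# hypothetical credentials file created on GCP
client = bigquery.Client.from_service_account_json("bq_credentials.json")

sql = """
SELECT q.id            AS question_id,
       q.title,
       q.tags,
       q.body          AS question_body,
       a.body          AS answer_body,
       a.creation_date AS answer_created_dt,
       a.score         AS answer_score
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
JOIN `bigquery-public-data.stackoverflow.posts_answers`  AS a
  ON q.accepted_answer_id = a.id
LIMIT 100000
"""

df = client.query(sql).to_dataframe()
df.to_json("stackoverflow_qa.json", orient="records")  # JSON first, CSV later
```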

5. Data Pre-Processing

In data pre-processing, we need to clean and format the text before using it in machine learning models. Below are the first 5 rows of the data we extracted from the Stack Overflow dataset using BigQuery.

The questionbody and answerbody data need to be cleaned and formatted. In particular, we need to remove HTML tags and special characters, and de-contract the text (e.g. expand "can't" to "cannot").

Similarly, we convert answercreateddt to a datetime using pandas `pd.to_datetime`, so that we can extract just the year of the answer for our data analysis.

Earlier, while discussing the columns of the posts_questions table, we encountered the tags field. This data is stored in the database as "|"-separated values, so in the pre-processing step we remove the "|" characters and concatenate the tags into a single string. The final pre-processed data, after concatenating the question's title and tags with the question id, is stored in a CSV file.
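A minimal sketch of these cleaning steps; the column names follow the text above, the contraction map is deliberately tiny, and the file names are hypothetical:

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

# a few common contractions; a real mapping would be much larger
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not", "'re": " are"}

def clean_text(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()   # strip HTML tags
    for short, full in CONTRACTIONS.items():               # de-contract text
        text = text.replace(short, full)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)            # drop special characters
    return re.sub(r"\s+", " ", text).strip().lower()

df = pd.read_csv("stackoverflow_qa.csv")                   # hypothetical file name
df["questionbody"] = df["questionbody"].apply(clean_text)
df["answerbody"] = df["answerbody"].apply(clean_text)
df["answercreateddt"] = pd.to_datetime(df["answercreateddt"])
df["year"] = df["answercreateddt"].dt.year                 # keep only the year for EDA
df["tags"] = df["tags"].str.replace("|", " ", regex=False) # "|"-separated tags -> one string
df.to_csv("preprocessed.csv", index=False)
```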

Final preprocessed CSV file

6. Exploratory Data Analysis

After pre-processing, we perform exploratory data analysis on the resultant dataset. From the posts_answers table, we have plotted answer_created_dt against answer_score for the years 2008 to 2010. A bar plot reveals that the most answers were created in 2009, with the highest summed answer_score, and the fewest answers were created in 2010.

A scatter plot of answer_id versus answer_score was also plotted. It is clear from the plot that ids in the range 1000k to 2000k had the highest distribution of answer scores (above 80), whereas ids beyond 3000k had the lowest distribution of scores (below 20).

7. Approach to Search Problem

A question Qi in the set of questions is made up of a sequence of words.

This is a search problem, and we could use a hash table to look up the query Q* in the data corpus, since a hash table offers O(1) search; we would need to design a dictionary with keys and values. However, standard hashing-based techniques may not work very well for this kind of search.

Instead, we use the inverted index method for this search [3]. If there are k words in the query Q*, we can retrieve the result in O(k). It is called an inverted index because we construct an index with words as the keys and, as the values, the documents in which each word occurs together with its frequency in those documents. This is simply a keyword-based search.
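A toy sketch of such an inverted index in plain Python; the corpus and whitespace tokenisation are deliberately simplistic:

```python
from collections import Counter, defaultdict

# toy corpus: document id -> text
docs = {
    0: "python list comprehension syntax",
    1: "jquery event handler plugins",
    2: "python dictionary vs list performance",
}

# inverted index: word -> {doc id: frequency of the word in that doc}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for word, freq in Counter(text.split()).items():
        index[word][doc_id] = freq

def search(query: str):
    """O(k) lookup for a k-word query: union of the posting lists."""
    hits = Counter()
    for word in query.split():
        for doc_id, freq in index.get(word, {}).items():
            hits[doc_id] += freq
    return [doc_id for doc_id, _ in hits.most_common()]

print(search("python list"))  # -> [0, 2]
```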

Elasticsearch is an open-source search engine that implements an inverted index with TF-IDF scoring, is distributed, and works in real time. In this case, it performs a simple keyword search using an inverted index.

In semantic search, we instead construct vector representations of the sentences and find the similarity between the query vector and the vectors in the given corpus.

It has been observed in earlier approaches that semantic search using cosine similarity in Elasticsearch's script_score is not scalable for very large datasets.
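For reference, such a script_score query looks roughly like the sketch below; the index name, the dense_vector field name 'vector' and the placeholder embedding are assumptions, and the older body= client style is used for brevity:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical local cluster
query_vector = [0.1] * 768                    # placeholder sentence embedding

body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # cosineSimilarity works only on dense_vector fields;
                # +1.0 keeps the score non-negative, as Elasticsearch requires
                "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
}
hits = es.search(index="questions", body=body)["hits"]["hits"]
```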

8. Pre-requisites

The reader of this article should be familiar with the basics of Python programming, Machine Learning and building data-driven applications.

9. Architectures

The following architectures were experimented with on a 100k dataset, using a low-end laptop with an i5 processor and 8 GB of RAM:

  • Clustering datapoints using K-Means with GloVe vectors and fastText for embeddings
  • FAISS
  • Disk Approximate Nearest Neighbors

a. Clustering datapoints using K-Means with GloVe vector embeddings

To use the K-Means clustering algorithm, we used the elbow plot to determine an appropriate number of clusters. On the reduced dataset of 30k, the elbow plot pointed to 200 clusters.
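A minimal sketch of the elbow procedure with scikit-learn; the vector file is a hypothetical stand-in for the 30k GloVe sentence vectors:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.load("question_vectors.npy")  # hypothetical: 30k sentence vectors

inertias = []
ks = range(50, 401, 50)              # candidate cluster counts
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=3, random_state=0).fit(X)
    inertias.append(km.inertia_)     # within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow plot")              # look for the elbow where gains flatten
plt.show()
```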

Simple keyword search in this case returned the search results in 0.07 seconds.

But with semantic search, words not present in glove_vectors returned zero vectors, and the cosine-similarity score failed.

We used fastText to get embeddings for the words not found in the GloVe vectors. With this approach, the silhouette score measured on the 30k dataset was 0.05.
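The silhouette score can be computed with scikit-learn; a minimal sketch, reusing X from the elbow example above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=200, init="k-means++", n_init=3, random_state=0).fit(X)
# the score lies in [-1, 1]; values near 0 mean points sit on or very
# close to the decision boundaries between neighbouring clusters
print(silhouette_score(X, km.labels_))
```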

Since the silhouette score is near zero, the samples lie on or very close to the decision boundaries of neighbouring clusters. Searching inside a single cluster might therefore not work well for these datapoints, so we move on to the next architecture, FAISS.

b. FAISS

This is the nearest-neighbour search library built by Facebook Research for billion-scale datasets, reported to be 8.5 times faster than the previous SOTA methods for nearest-neighbour search.

For similarity search, we need the following operations on billions of vectors

1) Given a query vector, we need to find the list of vectors that are its nearest neighbours using Euclidean distance.

2) Given a query vector, find the list of vectors that return the highest dot product.

Semantic similarity search basically involves two operations.

a) Indexing

b) Searching

a) Indexing: The FAISS library is built around the Index object. The list of vectors is first preprocessed so that it can be searched faster. The dimensionality of the vectors is fixed by the sentence-transformer model we choose for encoding them. In our case, we chose a sentence-transformer model: a modification of the pre-trained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. The detailed code to create the index is available on GitHub.
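A minimal sketch of this indexing step; the 768-dimensional checkpoint name is a hypothetical choice, and the titles are assumed to come from the pre-processed dataframe built earlier:

```python
import faiss
from sentence_transformers import SentenceTransformer

# a 768-dimensional SBERT-style checkpoint (hypothetical choice)
model = SentenceTransformer("bert-base-nli-mean-tokens")
sentences = df["title"].tolist()      # pre-processed question titles from earlier
embeddings = model.encode(sentences)  # float32 array of shape (n, 768)

d = embeddings.shape[1]               # dimensionality fixed by the encoder
index = faiss.IndexFlatL2(d)          # exact L2 index
index.add(embeddings)                 # preprocess/store the vectors
print(index.ntotal)                   # number of vectors stored
```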

We can serialize and deserialize the index for transferring over a network.

To read a serialized index from a pickle file, first read the bytes from the file and then deserialize them.
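A sketch of that round trip using FAISS's serialize/deserialize helpers:

```python
import pickle
import faiss

# serialize: index -> numpy uint8 buffer -> pickle file
buf = faiss.serialize_index(index)
with open("faiss_index.pkl", "wb") as f:
    pickle.dump(buf, f)

# deserialize: read the bytes from the file first, then rebuild the index
with open("faiss_index.pkl", "rb") as f:
    index_restored = faiss.deserialize_index(pickle.load(f))
```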

There are two methods for generating an index: a CPU index and a GPU index. The difference lies in the resources used while executing the index-generation code: the former uses CPU resources while the latter uses GPU resources.

b) Searching: The basic operation that can be performed on the constructed index is the k-nearest-neighbour search: for a query vector, find the k nearest neighbours in the corpus.

1. Normal brute-force search on GPU

Invoking the knn function on the query_vector and the encoded_document vectors returns results in 0.1 s on a GPU. In this case the index is constructed using IndexFlatL2, a simple L2 distance computation, which does not work well on this dataset.

2. Constructing the index using the pure inner product

The index was constructed using IndexFlatIP, which computes the inner product (equivalent to cosine similarity once the vectors are L2-normalized), and was found to give good results on this dataset.
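A sketch of this variant, reusing model, embeddings and d from the indexing sketch above; normalising the vectors first makes the inner product equal to cosine similarity:

```python
import faiss

faiss.normalize_L2(embeddings)          # in place: rows scaled to unit length
index_ip = faiss.IndexFlatIP(d)
index_ip.add(embeddings)

query = model.encode(["jQuery plugins for the event"])
faiss.normalize_L2(query)
scores, ids = index_ip.search(query, 5) # top-5 most similar questions
```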

The experiment was conducted on both CPU and GPU resources (using Colaboratory). The search part of the code remains the same across all resources.

3. Locality-Sensitive Hashing

We construct the index by passing the dimensionality and the number of bits as arguments; this study used d = 768 and nbits = 1024. Choosing nbits < 1024 degraded the returned results, so a minimum of 128 bytes was required to represent each document vector. Instead of projections being calculated directly, the calculations are optimized using random rotations of the vectors. This is an exhaustive search too, and it returns the k nearest neighbours in 0.01 s.
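A sketch of the LSH index with those parameters, reusing d, embeddings and query from the sketches above:

```python
import faiss

nbits = 1024                           # 1024 bits = 128 bytes per binary code
index_lsh = faiss.IndexLSH(d, nbits)   # binary codes from randomly rotated projections
index_lsh.train(embeddings)            # fits the rotation; cheap for this index
index_lsh.add(embeddings)
scores, ids = index_lsh.search(query, 5)
```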

If we want to reduce the memory footprint of the index and its storage, we need to perform a lossy compression called Product Quantization.

The trade-off for the saved memory is accuracy.

After experimenting with the ProductQuantizer using inverted lists (IVF) on the vectors, the index constructed for the big 10-million dataset was close to 3 GB; we were unable to load it into RAM and the Colaboratory session crashed. In addition, the IVF form of indexing retrieves only approximate distances.

An index to which the Product Quantizer is applied has very low accuracy, so a re-ranking with exact distance computations needs to be performed, which is a costly operation with limited RAM.
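A hedged sketch of an IVF+PQ index in FAISS; nlist, m and the bit width are illustrative, and embeddings, d and query are reused from the sketches above (training needs a reasonably large sample):

```python
import faiss

nlist, m, bits = 100, 64, 8                # 64 sub-quantizers of 8 bits: 64 bytes/vector
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer for the inverted lists
index_pq = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index_pq.train(embeddings)                 # learns coarse centroids + PQ codebooks
index_pq.add(embeddings)
index_pq.nprobe = 10                       # inverted lists visited at query time
scores, ids = index_pq.search(query, 5)    # approximate distances only
```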

Comparison of latency across different architectures

Comparison of various search results

For the query string "jQuery plugins for the event": in the case of model 1, built using clustering initialised with k-means++, the query_vector was predicted to belong to cluster[52]. Searching for the 5 nearest neighbours within that cluster gave the following results.

Clearly, our clustering model failed in this case, which is reflected in the silhouette score calculated earlier and the overlapping decision boundaries between points. To verify that the clustering model failed, we plotted a scatter plot of two clusters, say 52 and 54; the points overlap heavily.

Observations

Model 3: Disk approximate nearest neighbours. A sample node input was given, and the pruned-nodes output with a limited node list was obtained during the unit test of the code. Please refer to the sample output at the end of the next section.

From the above results for the query, we found that IndexFlatIP worked best in our case study, compared to brute force or LSH. Though LSH returned the results in 0.01 s, the results from IndexFlatIP are more similar to the query. Hence, we deployed the CPU version of FAISS with IndexFlatIP on Streamlit; the application demo is available in the Deployments section.

c. DISKANN: Fast, Accurate Billion Point Nearest Neighbor Search on a Single Node

This is an approximate nearest-neighbour search. The main advantage of this approach is that it can index, store and search a billion-point dataset on a single machine with 64 GB RAM and an SSD.

We have modified the dataset used for the earlier approaches. Instead of the one-question-to-one-accepted-answer mapping, we used one question to many answers to construct the Directed Acyclic Graph as detailed in the research paper.

This paper was presented by the Microsoft Research team at the NeurIPS conference [8]. The algorithm offers faster in-memory search, faster indexing and better scalability.

The Vamana algorithm first performs a greedy search, followed by robust pruning of the nodes.
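As an illustration only, here is a minimal Python sketch of the greedy-search phase; the graph and vector interfaces are assumptions for exposition, not the repository's actual code:

```python
import numpy as np

def greedy_search(graph, vectors, start, query, k, L):
    """Minimal sketch of Vamana-style greedy search (assumed interfaces):
    graph maps node id -> list of out-neighbour ids, vectors is an (n, d)
    array, L is the size of the candidate search list."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    candidates = {start: dist(start)}           # candidate list with distances
    visited = set()
    while True:
        frontier = [n for n in candidates if n not in visited]
        if not frontier:
            break
        p = min(frontier, key=candidates.get)   # closest unvisited candidate
        visited.add(p)
        for nbr in graph[p]:                    # expand its out-neighbours
            candidates.setdefault(nbr, dist(nbr))
        if len(candidates) > L:                 # keep only the L best candidates
            best = sorted(candidates, key=candidates.get)[:L]
            candidates = {n: candidates[n] for n in best}
    top_k = sorted(candidates, key=candidates.get)[:k]
    return top_k, visited                       # the visited set feeds robust pruning
```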

The code for this algorithm is implemented in Python, and the complete code is available at the GitHub link.

It has been tested on sample nodes, and a simple unit test was run on the greedy_search and robust_prune functions. joblib was used to parallelise the work across processes in Python.

10. Deployments

Streamlit is a lightweight framework used to create data apps. The streamlit library is written entirely in Python. It is built upon three simple principles: 1. Python scripting, 2. widget/component interaction, 3. instant deployment.

The FAISS CPU semantic-search model was used to build the data-driven application in Streamlit. Below are the EDA and the word cloud of the dataset used to build the app.

Exploratory data analysis on the dataset

Word cloud

Please find the link to the webcast of the simple app on Streamlit.
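For orientation, a minimal Streamlit app of this kind might look as follows; the checkpoint, index file and titles file are hypothetical names, not the deployed app's actual code:

```python
import faiss
import streamlit as st
from sentence_transformers import SentenceTransformer

@st.cache_resource                      # load model and index once per session
def load_resources():
    model = SentenceTransformer("bert-base-nli-mean-tokens")  # hypothetical checkpoint
    index = faiss.read_index("questions_ip.index")            # hypothetical saved IndexFlatIP
    titles = open("titles.txt").read().splitlines()           # hypothetical id -> title list
    return model, index, titles

model, index, titles = load_resources()
st.title("Stack Overflow Semantic Search")
query = st.text_input("Search for a question")
if query:
    q = model.encode([query])
    faiss.normalize_L2(q)               # IndexFlatIP + unit vectors = cosine similarity
    scores, ids = index.search(q, 5)    # top-5 similar questions
    for score, i in zip(scores[0], ids[0]):
        st.write(f"{score:.3f} - {titles[i]}")
```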

11. Conclusions and Future Work

We can conclude that FAISS works well for large datasets of up to about 100k, but beyond that RAM again becomes a limitation. Without any doubt, DiskANN outperforms these architectures for very large datasets in terms of index construction, storage on SSD and search.

I am pursuing the following ideas:

  • Hyperparameter tuning for the DiskANN algorithm code.
  • Using other fields in the dataset for search and computing metrics for comparison.

12. Profile

The full source code is available at the GitHub link.

If you have any questions or suggestions on any of the above, please connect with me on LinkedIn.


LinkedIn account: www.linkedin.com/in/mypage1. PGD in AI/ML (AppliedRoots and UoH), Full Stack Programmer, HuggingFace open-source library contributor.