Finding Duplicate Quora Questions Using Machine Learning

Arun Mohan
Towards Data Science
10 min read · Sep 20, 2020


Photo by Evan Dennis on Unsplash

Quora is an amazing platform where questions are asked, answered, followed, and edited by a community of internet users. This empowers people to learn from each other and to better understand the world. About 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. It is not ideal for Quora to keep asking its writers to answer the same question again and again. So it would be better to have a system that can detect when a new question is similar to questions that have already been answered.

So our problem statement is to predict whether a pair of questions are duplicates or not. We will use various machine learning techniques to come up with a solution. This blog is not a complete code walkthrough, but I will explain the various approaches I used to solve the problem. You can have a look at the code in my GitHub repository.

Some business constraints

  • The cost of misclassification can be very high. That is, if a user asks a particular question and we surface the answer to a different question, it hurts the user experience and the business. This is the most important constraint.
  • We want the probability that a pair of questions is duplicate, so that we can choose any threshold we like. Depending on the use case, the threshold can be changed.
  • We don't have any strict latency requirements.
  • Interpretability is partially important. That is, users do not need to be shown why a pair of questions was flagged as duplicate, but it is better if we ourselves can understand why.

Performance metric

Here we have a binary classification task: we want to predict whether a pair of questions is duplicate or not. Since we are predicting a probability value, log loss makes sense as a metric, and it is our primary KPI (key performance indicator). We will also use the confusion matrix to measure performance.

Log loss is nothing but the negative log of the product of likelihoods, i.e. the average negative log-likelihood of the true labels under the predicted probabilities.
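For a binary problem this works out to a simple formula. Here is a minimal sketch of the computation (sklearn's `log_loss` gives the same value; the example numbers are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Negative average log-likelihood of the true labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]
print(binary_log_loss(y_true, y_prob))  # manual computation
print(log_loss(y_true, y_prob))         # sklearn's implementation matches
```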

Exploratory Data Analysis

A sample of the data

We have about 404290 data points and 6 columns. The 6 columns/features are:

  • id: A unique id for the question pair
  • qid1: id of the first question.
  • qid2: id of the second question
  • question1: the first question
  • question2: second question
  • is_duplicate: Whether both are duplicate or not.

As an initial approach, I checked for missing values. I found missing values in 3 rows (in question1 and question2), so I dropped those rows. Next, I checked for duplicate rows, but there were none.
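This check is only a couple of lines of pandas. A minimal sketch (assuming the data sits in a file named train.csv with the columns listed above):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

print(df.isnull().sum())                            # missing values per column
df = df.dropna(subset=["question1", "question2"])   # drop the rows with missing questions

print(df.duplicated().sum())                        # check for fully duplicated rows
```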

Analysis On Target Variable

distribution of target

Clearly we have imbalanced data: the non-duplicate pairs outnumber the duplicate pairs (roughly 63% vs 37%).

Analysis On Questions

After analyzing the questions, I came up with the following observations:

unique vs repeated questions
Total number of unique questions: 537929
Number of questions repeated more than once: 111778 (about 20.78%)
Maximum number of times a single question occurs: 157

After that, I plotted a histogram of the number of question occurrences, with the count of questions on a log scale. We can see that most questions occur fewer than about 60 times. There is 1 question that occurs 157 times, 1 question that occurs about 120 times, 1 that occurs about 80 times, and so on.

log histogram of question appearances
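These statistics and the histogram can be reproduced with a few lines of pandas and matplotlib. A sketch, assuming `df` is the dataframe loaded earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Pool both question id columns and count how often each question appears
qids = pd.Series(df["qid1"].tolist() + df["qid2"].tolist())
counts = qids.value_counts()

print("Total number of unique questions:", counts.shape[0])
print("Questions repeated more than once:", (counts > 1).sum())
print("Maximum occurrences of a single question:", counts.max())

# Histogram of question occurrences, with a log-scaled y-axis
plt.hist(counts.values, bins=160)
plt.yscale("log")
plt.xlabel("Number of occurrences")
plt.ylabel("Number of questions (log scale)")
plt.show()
```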

Now we have a broad understanding of the data. Next, I created some basic features before preprocessing the data.

Basic Feature Engineering

I created the following features (a sketch of how they can be computed is given after the list):

  • freq_qid1 = Frequency of qid1, i.e. the number of times question1 occurs in the dataset.
  • freq_qid2 = Frequency of qid2, i.e. the number of times question2 occurs in the dataset.
  • q1len = Length of q1
  • q2len = Length of q2
  • q1_n_words = Number of words in Question 1
  • q2_n_words = Number of words in Question 2
  • word_Common = Number of common unique words in Question 1 and Question 2
  • word_Total = Total number of words in Question 1 + Total number of words in Question 2
  • word_share = word_Common / word_Total
  • freq_q1+freq_q2 = sum total of the frequency of qid1 and qid2
  • freq_q1-freq_q2 = absolute difference of frequency of qid1 and qid2

Analysis On engineered features

word share among similar and dissimilar questions

We can see that as word_share increases, there is a higher chance that the questions are similar. For probability density functions (pdfs), the more they overlap, the less information they carry for distinguishing the classes. From the histogram, we can see that word_share carries some information that separates the similar and dissimilar classes.

common_words among similar and dissimilar questions

We can see that word_Common does not carry enough information to separate the classes. The histograms of word_Common for duplicate and non-duplicate questions overlap heavily, so not much can be learned from this feature on its own.

Advanced Feature Engineering

Now we will create some advanced features from the data. Before that, we clean the text: as part of preprocessing, I removed stopwords, punctuation, and special characters like "₹", "$", "€", and applied stemming for better generalization. A minimal sketch of this cleaning step is shown below. After cleaning, I engineered the following features (a sketch of computing the fuzzy ones follows the list).
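One possible implementation of this cleaning step, using NLTK (a sketch; the exact substitutions in my notebook may differ):

```python
import re
from nltk.corpus import stopwords   # may require nltk.download("stopwords") first
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    text = str(text).lower()
    # replace common currency symbols with words before stripping special characters
    text = text.replace("₹", " rupee ").replace("$", " dollar ").replace("€", " euro ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)                 # drop punctuation / special chars
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stopwords
    return " ".join(stemmer.stem(t) for t in tokens)           # stem for generalization

df["question1_clean"] = df["question1"].apply(preprocess)
df["question2_clean"] = df["question2"].apply(preprocess)
```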

Note: In the features below, tokens are the words obtained by splitting the text, and words are tokens that are not stopwords.

  • cwc_min : Ratio of common_word_count to the minimum word count of Q1 and Q2
    cwc_min = common_word_count / min(len(q1_words), len(q2_words))
  • cwc_max : Ratio of common_word_count to the maximum word count of Q1 and Q2
    cwc_max = common_word_count / max(len(q1_words), len(q2_words))
  • csc_min : Ratio of common_stop_count to the minimum stopword count of Q1 and Q2
    csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
  • csc_max : Ratio of common_stop_count to the maximum stopword count of Q1 and Q2
    csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
  • ctc_min : Ratio of common_token_count to the minimum token count of Q1 and Q2
    ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
  • ctc_max : Ratio of common_token_count to the maximum token count of Q1 and Q2
    ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
  • last_word_eq : Checks whether the last words of both questions are equal
    last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
  • first_word_eq : Checks whether the first words of both questions are equal
    first_word_eq = int(q1_tokens[0] == q2_tokens[0])
  • abs_len_diff : Absolute difference in token counts
    abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
  • mean_len : Average token count of the two questions
    mean_len = (len(q1_tokens) + len(q2_tokens)) / 2
  • fuzz_ratio : Here comes the interesting part. The fuzz ratio is based on the Levenshtein distance. Intuitively, if the number of edits required to turn one sentence into the other is large, the fuzz ratio will be small; the more similar the strings, the higher the fuzz ratio.
eg: s1 = "mumbai is a great place"   s2 = "mumbai is a nice place"
fuzz ratio = 91
  • fuzz_partial_ratio : In certain cases fuzz ratio cannot solve the issue.
fuzz.ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 75

Here "YANKEES" and "NEW YORK YANKEES" refer to the same thing, yet their fuzz ratio is low. In such cases we compute the ratio on the best matching partial substring, which will be high. This is the fuzz partial ratio.

fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60
  • token_sort_ratio: In some other cases even fuzz partial ratio will fail.

For example:

fuzz.partial_ratio("MI vs RCB","RCB vs MI") ⇒ 72

Both strings mean the same thing, but the partial ratio still gives a low score. A better approach is to sort the tokens and then apply the fuzz ratio.

fuzz.token_sort_ratio("MI vs RCB","RCB vs MI") ⇒ 100
  • token_set_ratio: There is another variant of the fuzz ratio that helps even in cases where all of the above fail: the token set ratio.

For that we have to first find the following:

t0 -> the sorted intersection of the tokens of sentence 1 and sentence 2
t1 -> t0 + the remaining tokens of sentence 1
t2 -> t0 + the remaining tokens of sentence 2

token_set_ratio = max(fuzz_ratio(t0, t1), fuzz_ratio(t0, t2), fuzz_ratio(t1, t2))

  • longest_substr_ratio : Ratio of the length of the longest common substring to the minimum token count of Q1 and Q2.
s1 -> hai, today is a good day
s2 -> No, today is a bad day

Here the longest common substring is "today is a", so longest_substr_ratio = 3 / min(6, 6) = 0.5
longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
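As an illustration, here is a minimal sketch of how these fuzzy features can be computed with the fuzzywuzzy and distance libraries (token handling is simplified, empty questions are not handled, and the longest common substring here is character-level rather than the token-level count used in the worked example above):

```python
from fuzzywuzzy import fuzz
import distance  # pip install distance; provides lcsubstrings

def fuzzy_features(q1, q2):
    q1_tokens, q2_tokens = q1.split(), q2.split()
    feats = {
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "abs_len_diff": abs(len(q1_tokens) - len(q2_tokens)),
        "mean_len": (len(q1_tokens) + len(q2_tokens)) / 2,
    }
    # Character-level longest common substring ratio; +1 avoids division by zero
    substrs = list(distance.lcsubstrings(q1, q2))
    lcs_len = len(substrs[0]) if substrs else 0
    feats["longest_substr_ratio"] = lcs_len / (min(len(q1), len(q2)) + 1)
    return feats

print(fuzzy_features("mumbai is a great place", "mumbai is a nice place"))
```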

More data analysis

Now we will plot word clouds for duplicate and non-duplicate question pairs. I plotted them after removing stopwords to get a clearer picture.

Word cloud for duplicate sentences
Word cloud for nonduplicate sentences

The larger a word appears, the more frequently it occurs in the corpus. We can see that words like "Donald Trump", "rupee", and "best way" repeat mostly in duplicate pairs, while words like "difference", "India", "use", etc. repeat mostly in non-duplicate pairs.

Pair plot for ‘ctc_min’, ‘cwc_min’, ‘csc_min’, ‘token_sort_ratio’

pair plot

From the pair plot, we can see that all of these features carry some information that separates duplicate and non-duplicate pairs, with token_sort_ratio and ctc_min doing a better job than the others.

The absolute difference in the number of words between questions

The absolute difference in the number of words between questions

We can see that most question pairs differ by only one word; only a very small number of pairs show a large difference.

TSNE Visualization

Next, I tried to visualize the data in a lower-dimensional space. I randomly sampled 5000 data points and used t-SNE. I used only the recently engineered features to see their impact. We can see that in some regions the classes are well separated, so we can say that our features carry enough information for a good classification.

Note: You can always experiment with more data points and different perplexity values (if you have enough computation power), as that may reveal more structure.

TSNE visualization
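A hedged sketch of this visualization with scikit-learn and seaborn (`feature_cols` is assumed to be the list of engineered feature column names; the perplexity value here is just the default):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

sample = df.sample(5000, random_state=42)             # random subsample of 5000 points
X = MinMaxScaler().fit_transform(sample[feature_cols])  # scale engineered features

X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=sample["is_duplicate"], s=10)
plt.show()
```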

Train test split

We did a 70:30 split, i.e. 70% of the data points were used for training and the remaining 30% for testing.
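A minimal sketch of this split with scikit-learn (stratification keeps the class ratio the same in both splits; whether to stratify is a design choice):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["is_duplicate"])
y = df["is_duplicate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```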

Vectorizing text data

Before creating models using the data, we have to vectorize the text data. For that, we used 2 approaches

  • TFIDF vectorization
  • TFIDF weighted GLOVE vectorization

We saved separate files after merging these vectors with the initial features we created.
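As an illustration, here is one possible way to compute a TFIDF weighted GLOVE vector for a question (a hedged sketch: `glove` is assumed to be a dict mapping each word to its pre-trained 300-dimensional GloVe vector, and `question1_clean` / `question2_clean` are the preprocessed question columns):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TFIDF on all questions to get per-word IDF weights
tfidf = TfidfVectorizer()
tfidf.fit(df["question1_clean"].tolist() + df["question2_clean"].tolist())
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # get_feature_names() in older sklearn

def tfidf_weighted_glove(sentence, dim=300):
    """TFIDF-weighted average of the GloVe vectors of a sentence's words."""
    vec, weight_sum = np.zeros(dim), 0.0
    for word in sentence.split():
        if word in glove and word in idf:
            w = idf[word]
            vec += w * glove[word]
            weight_sum += w
    return vec / weight_sum if weight_sum > 0 else vec
```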

Machine Learning Models

Let us get into the most interesting part of this blog: creating machine learning models. We now have two data frames for training, one using TFIDF vectors and the other using TFIDF weighted GLOVE vectors.

Logistic regression

TFIDF features:

On training logistic regression on the TFIDF data, we end up with a log loss of about 0.43 for train and 0.53 for test. Our confusion, precision, and recall matrices look as follows:
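For reference, a minimal sketch of how such a model might be trained (`X_train_tfidf` / `X_test_tfidf` are the merged TFIDF feature matrices; the hyperparameters I actually tuned are not shown here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

clf = LogisticRegression(C=1.0, max_iter=1000)  # C would normally be chosen by tuning
clf.fit(X_train_tfidf, y_train)

print("train log loss:", log_loss(y_train, clf.predict_proba(X_train_tfidf)))
print("test  log loss:", log_loss(y_test, clf.predict_proba(X_test_tfidf)))
```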

confusion matrix / precision matrix/ recall matrix

Most readers will already know how to interpret a normal confusion matrix, so I will not explain it. Let us see how to interpret the precision matrix, the second one in the figure above.

Intuitively, precision means: of all the points predicted as positive, how many are actually positive. Here we can see that of all the points predicted as class 1, 24.6% actually belong to class 2 and the remaining 75.4% belong to class 1. Similarly, of all the points predicted as class 2, 69.2% belong to class 2 and 30.8% belong to class 1. So the precision for class 1 is 0.754 and for class 2 is 0.692.

The third matrix is the recall matrix. Intuitively, recall means: of all the points that belong to a class, how many are predicted correctly. In the recall matrix, of all the points belonging to class 1, 86.5% were predicted as class 1 and 13.5% as class 2. Similarly, of all the points belonging to class 2, 59.9% were predicted as class 2 and the rest as class 1.
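For clarity, here is a small sketch of how the precision and recall matrices can be derived from the confusion matrix (`y_pred` is the model's predicted labels on the test set):

```python
from sklearn.metrics import confusion_matrix

C = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class

precision_matrix = C / C.sum(axis=0, keepdims=True)  # each column sums to 1
recall_matrix = C / C.sum(axis=1, keepdims=True)     # each row sums to 1
```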

TFIDF Weighted GLOVE:

After hyperparameter tuning, we end up with a log loss of about 0.39 for the test and 0.38 for the train.

confusion matrix/ precision matrix / recall matrix

From both recall matrices, we can see that the recall is low. Let us see how a linear SVM performs.

Linear SVM

TFIDF features:

Train log loss: 0.3851

Test log loss: 0.3942

Confusion matrix:

confusion matrix / precision matrix / recall matrix

TFIDF Weighted GLOVE:

Train log loss: 0.3876

Test log loss: 0.395

Confusion matrix:

confusion matrix/ precision matrix/ recall matrix

We can see that for both logistic regression and linear SVM we end up with a low recall. But neither model overfits much, so I suspected a high-bias problem. If we opt for a boosting method, we may overcome this and get a better recall. With this intention, I tried out XGBoost on both the TFIDF and the TFIDF weighted GLOVE features.

XGBOOST

TFIDF features:

Train log loss: 0.457

Test log loss: 0.516

Confusion matrix:

confusion matrix/precision matrix/recall matrix

TFIDF Weighted GLOVE features:

Train log loss: 0.183

Test log loss: 0.32

Confusion matrix:

confusion matrix/precision matrix/recall matrix

We can see that with XGBoost, the recall of both classes improves slightly. Out of all these models, XGBoost on the TFIDF weighted GLOVE vectors performs best.
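For reference, a minimal sketch of how such an XGBoost model might be set up (the hyperparameter values shown are illustrative, not the tuned ones; `X_train_glove` / `X_test_glove` are the merged TFIDF weighted GLOVE feature matrices):

```python
import xgboost as xgb
from sklearn.metrics import log_loss

model = xgb.XGBClassifier(
    n_estimators=400,       # illustrative values only
    max_depth=6,
    learning_rate=0.1,
    eval_metric="logloss",  # optimize and report log loss
)
model.fit(X_train_glove, y_train)

print("train log loss:", log_loss(y_train, model.predict_proba(X_train_glove)))
print("test  log loss:", log_loss(y_test, model.predict_proba(X_test_glove)))
```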

Summary of results

For a complete code walkthrough, you can visit my GitHub repository.

