Quora Insincere Questions Classification

A case study on a Kaggle competition hosted by Quora to improve its online conversations.

Ronak Vijay
Towards Data Science

--

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions — those founded upon false premises, or intended to make a statement rather than to look for helpful answers.

Quora Insincere Questions Classification was the second Kaggle competition hosted by Quora, with the objective of developing more scalable methods to detect toxic and misleading content on its platform.

Problem Statement

Detect toxic content to improve online conversations.

Data Overview

The dataset contains a training set of over 1,300,000 labeled examples and a test set of over 300,000 unlabeled examples. Each example in the training set has a unique id, the text of the question, and a label of ‘0’ (sincere) or ‘1’ (insincere).

What is an insincere question?

An insincere question is defined as a question intended to make a statement rather than to look for helpful answers. An insincere question typically:

  • Has a non-neutral tone.
  • Is disparaging or inflammatory.
  • Isn’t grounded in reality.
  • Uses sexual content (incest, bestiality, pedophilia) for shock value rather than to seek genuine answers.

As this was a kernels-only competition, external data sources were not allowed. We had to submit a Kaggle kernel (either a notebook or a script) with all the code, producing output predictions in the specific format described in the submission requirements.

Along with the questions’ text data, Quora also provided four embedding files, trained on large corpora, that can be used in the models. The given embedding files are as follows:

  • GoogleNews-vectors-negative300: word2vec embeddings trained on Google News
  • glove.840B.300d: GloVe embeddings trained on Common Crawl
  • paragram_300_sl999: Paragram embeddings
  • wiki-news-300d-1M: fastText embeddings trained on Wikipedia and news data

In each of these embedding files, words are represented as 300-dimensional vectors. This representation captures the semantic meaning of words: words with similar meanings have similar vector representations.

Machine Learning Problem

It is a binary classification problem: for a given question, we need to predict whether it is insincere or not.

Metric

The evaluation metric for this competition was F1-score, which is the harmonic mean of precision and recall. Precision is the percentage of questions classified as insincere that are actually insincere, and recall is the percentage of actually insincere questions that were correctly classified as insincere.
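
In symbols, with precision P and recall R:

    F_1 = \frac{2 \cdot P \cdot R}{P + R}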

F1-Score (Source)

F1-score is a better metric to use when we need to strike a balance between precision and recall. It is also preferable to accuracy when there is an uneven class distribution.

Exploratory Data Analysis

In this section we will explore and visualize the given dataset.

First, let’s load the training and test datasets.
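
A minimal loading sketch, assuming the competition’s train.csv and test.csv sit in the working directory:

    import pandas as pd

    # Load the competition data (file names as provided by Kaggle)
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")

    print(train_df.shape, test_df.shape)
    train_df.head()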

The training dataset contains the columns qid, question_text, and target (a binary value). All observations are unique, with no null values.

Distribution Plot for training set

Data Distribution
  • The dataset is highly imbalanced, with only 6.2% insincere questions.
  • F1-score seems the right choice over accuracy here because of the class imbalance; a quick check is sketched below.
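
As a quick sanity check on the imbalance, a one-line sketch (train_df comes from the loading step above):

    # Fraction of each class in the training set
    print(train_df["target"].value_counts(normalize=True))
    # ~0.938 for class 0 (sincere), ~0.062 for class 1 (insincere)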

Word cloud for both sincere and insincere questions

Word cloud for sincere questions
Word cloud for insincere questions
  • As we can see, insincere questions contain many offensive words.
  • Most of the insincere questions in the provided training set are related to people, Muslims, women, Trump, etc. (a word-cloud sketch follows).
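
A minimal sketch of how such a word cloud can be generated with the wordcloud package (column names follow the loading step above):

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Concatenate all insincere questions and render a word cloud
    insincere_text = " ".join(train_df.loc[train_df["target"] == 1, "question_text"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(insincere_text)

    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()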

Feature Engineering (before cleaning)

Let’s construct some basic features (a pandas sketch follows the list):

  • Number of words
  • Number of capital letters
  • Number of special characters
  • Number of unique words
  • Number of numerics
  • Number of characters
  • Number of stopwords
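
A sketch of these counts using pandas string operations (NLTK’s English stopword list is an assumption; any stopword list would do):

    from nltk.corpus import stopwords  # requires nltk.download("stopwords")

    STOPWORDS = set(stopwords.words("english"))

    df = train_df  # the same features are computed for the test set
    df["num_words"] = df["question_text"].str.split().str.len()
    df["num_capital_letters"] = df["question_text"].str.count(r"[A-Z]")
    df["num_special_chars"] = df["question_text"].str.count(r"[^\w\s]")
    df["num_unique_words"] = df["question_text"].apply(lambda q: len(set(q.split())))
    df["num_numerics"] = df["question_text"].str.count(r"\d")
    df["num_chars"] = df["question_text"].str.len()
    df["num_stopwords"] = df["question_text"].apply(
        lambda q: sum(w.lower() in STOPWORDS for w in q.split())
    )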

Box plots for generated features
  • Insincere questions seem to have more words and characters.
  • Insincere questions also have more unique words compared to sincere questions.

Machine Learning Based Approach

Data Preprocessing and Cleaning

I followed the standard data preprocessing techniques used for NLP tasks, such as spelling correction and removing stopwords, punctuation, and other tags, followed by lemmatization.

Lemmatization is the process of replacing a word with its root word, using a known word dictionary.
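
A minimal sketch of such a cleaning pipeline, assuming NLTK’s stopword list and WordNet lemmatizer (the exact steps and their order are a judgment call):

    import re
    from nltk.corpus import stopwords        # requires nltk.download("stopwords")
    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def clean_text(text: str) -> str:
        text = text.lower()
        text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation, digits, and tags
        tokens = [t for t in text.split() if t not in STOPWORDS]
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)

    train_df["clean_text"] = train_df["question_text"].apply(clean_text)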

More details on data preprocessing and other NLP approaches can be found in this blog.

TF-IDF

It stands for term frequency — inverse document frequency.

TF-IDF (Source)

Term frequency: the probability of finding a word in a given document.

Inverse document frequency: a measure of how unique a word is across the whole corpus.

TF-IDF is the product of the TF and IDF values. It gives more weight to words that occur often in a document but rarely across the corpus.
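
In the standard formulation, for a term t in document d, where N is the number of documents in the corpus and df(t) is the number of documents containing t:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}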

Logistic Regression

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a sigmoid function.

Using logistic regression as a simple baseline model, with the questions’ TF-IDF-encoded text plus the basic handcrafted statistical features, gave an F1-score of 0.58.
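
A hedged sketch of this baseline with scikit-learn (the hyperparameters here are illustrative, not the ones used in the actual kernel):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X_tr, X_val, y_tr, y_val = train_test_split(
        train_df["clean_text"], train_df["target"],
        test_size=0.2, stratify=train_df["target"], random_state=42,
    )

    vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
    X_tr_tfidf = vectorizer.fit_transform(X_tr)
    X_val_tfidf = vectorizer.transform(X_val)

    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr_tfidf, y_tr)
    print("Validation F1:", f1_score(y_val, clf.predict(X_val_tfidf)))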

Confusion matrix for logistic regression

Deep Learning Based Approach

Unlike the machine learning based approach, here I haven’t removed stopwords; instead, I let the model learn them. I also added spaces around punctuation so that more tokens match the pretrained embedding vocabularies.

A word embedding is a way of representing words using dense vectors. I used the given pretrained word embedding files to create the embedding matrix.

Also, I used the spaCy tokenizer to construct word and lemma dictionaries. These two dictionaries were used when creating the embedding matrix.

To increase embedding coverage, I used a combination of word stemming, lemmatization, capitalization, lowercasing, and uppercasing, as well as the embedding of the nearest word found by a spell checker, to get embeddings for all the words in the vocabulary (sketched below).
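
A sketch of that fallback lookup (embeddings_index is assumed to be a dict mapping word to vector; the stemmer, lemmatizer, and spell-correction hook stand in for whichever implementations the kernel actually used):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    def get_vector(word, embeddings_index, spell_correct=None):
        """Try progressively weaker variants of a word until one has an embedding."""
        for candidate in (
            word, word.lower(), word.upper(), word.capitalize(),
            stemmer.stem(word), lemmatizer.lemmatize(word),
        ):
            if candidate in embeddings_index:
                return embeddings_index[candidate]
        # Last resort: embedding of the nearest word from a spell checker
        if spell_correct is not None:
            corrected = spell_correct(word)
            if corrected in embeddings_index:
                return embeddings_index[corrected]
        return None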

I created two separate embedding matrices from the GloVe and Paragram embedding files, then took a weighted average of them, giving higher weight to GloVe.

This step creates an embedding matrix of size (number of words in the vocabulary, 300), where each word is represented as a 300-dimensional vector.
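
A minimal sketch of the matrix construction, reusing get_vector from above (the 0.7/0.3 weighting is an illustrative assumption; word_index maps each vocabulary word to its integer id):

    import numpy as np

    EMBED_DIM = 300

    def build_embedding_matrix(word_index, glove_index, para_index, w_glove=0.7):
        matrix = np.zeros((len(word_index) + 1, EMBED_DIM), dtype=np.float32)
        for word, idx in word_index.items():
            g = get_vector(word, glove_index)
            p = get_vector(word, para_index)
            if g is not None and p is not None:
                matrix[idx] = w_glove * g + (1.0 - w_glove) * p
            elif g is not None:
                matrix[idx] = g
            elif p is not None:
                matrix[idx] = p
        return matrix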

After creating the embedding matrix, I built an ensemble of three different model architectures to capture different aspects of the dataset and thus increase the overall F1-score. The models are as follows:

Model 1: Bidirectional RNN (LSTM/GRU)

A Recurrent Neural Network (RNN) is a type of neural network where the output from previous steps is fed as input to the current step, so the network remembers some information about the sequence. It has limitations, such as difficulty remembering longer sequences.

LSTMs/GRUs are improved versions of the RNN, specialized in remembering information for an extended period using a gating mechanism, which a plain RNN fails to do.

Bidirectional RNN (Source)

Unidirectional RNNs only preserve information from the past, because the only inputs they have seen are from the past. A bidirectional RNN runs the inputs in two ways, one from past to future and one from future to past, allowing it to preserve contextual information from both directions at any point in time.
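
A hedged Keras sketch of Model 1 (layer sizes and the sequence length are illustrative choices, not the kernel’s exact values):

    from tensorflow.keras import layers, models

    MAX_LEN = 70  # assumed padded sequence length

    def build_model_1(embedding_matrix):
        inp = layers.Input(shape=(MAX_LEN,))
        x = layers.Embedding(
            embedding_matrix.shape[0], embedding_matrix.shape[1],
            weights=[embedding_matrix], trainable=False,
        )(inp)
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
        x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
        x = layers.GlobalMaxPooling1D()(x)
        x = layers.Dense(16, activation="relu")(x)
        out = layers.Dense(1, activation="sigmoid")(x)
        model = models.Model(inputs=inp, outputs=out)
        model.compile(loss="binary_crossentropy", optimizer="adam")
        return model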

Model 2: Bidirectional GRU with Attention Layer

Model 2 consists of a Bidirectional GRU with attention and pooling layers.

Attention in Neural Networks is (very) loosely based on the visual attention mechanism found in humans. Human visual attention is able to focus on a certain region of an image with “high resolution” while perceiving the surrounding image in “low resolution”.

Textual Attention example (Source)

Attention is used for machine translation, speech recognition, reasoning, image captioning, summarization, and the visual identification of objects.

Attention Weighted Average

The attention layer uses a weighting mechanism in which every word in a sequence is given a weight (attention value). This weight defines the importance of the word in the sequence, capturing important words and giving more attention to the important parts of the sequence. These weights are trainable.

Bidirectional LSTM model with Attention (Source)

The attention layer will let the network sequentially focus on a subset of the input, process it and then change its focus to some other part of the input.
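
A minimal sketch of such a trainable attention-weighted average as a custom Keras layer (simplified; a production version, like the one referenced below, would also handle masking and a bias term):

    import tensorflow as tf
    from tensorflow.keras import layers

    class AttentionWeightedAverage(layers.Layer):
        """Softmax-weighted average over the time dimension."""

        def build(self, input_shape):
            # One trainable projection vector of shape (features, 1)
            self.w = self.add_weight(
                name="att_weight", shape=(input_shape[-1], 1),
                initializer="glorot_uniform", trainable=True,
            )
            super().build(input_shape)

        def call(self, x):
            # x: (batch, time, features) -> scores: (batch, time, 1)
            scores = tf.tensordot(x, self.w, axes=1)
            weights = tf.nn.softmax(scores, axis=1)
            # Weighted sum over time: (batch, features)
            return tf.reduce_sum(x * weights, axis=1)

It would be used in place of (or alongside) a pooling layer, e.g. context = AttentionWeightedAverage()(gru_output).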

A more in-depth explanation of the attention layer’s Keras code can be found in this document.

Model 3: Bidirectional LSTM with Convolution Layers

Model 3 consists of a Bidirectional LSTM followed by convolutional and pooling layers.

The idea behind using CNNs in NLP is to make use of their ability to extract features. CNNs are applied to the embedding vectors of a given sentence in the hope that they will manage to extract useful features (such as phrases and relationships between words that are closer together in the sentence) that can be used for text classification.

Unlike traditional CNNs, which use 2D convolutions, an NLP CNN is usually made up of 3 or more 1D convolutional and pooling layers. These help reduce the dimensionality of the text and act as a summary of sorts, which is then fed to a series of dense layers.

CNN for textual data (Source)

In computer vision tasks, the filters of a CNN slide over patches of an image, whereas in NLP tasks, the filters slide over the sentence matrix, a few words at a time.
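
A hedged Keras sketch of Model 3, reusing the imports and MAX_LEN from the Model 1 sketch (the filter count and kernel size are illustrative):

    def build_model_3(embedding_matrix):
        inp = layers.Input(shape=(MAX_LEN,))
        x = layers.Embedding(
            embedding_matrix.shape[0], embedding_matrix.shape[1],
            weights=[embedding_matrix], trainable=False,
        )(inp)
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
        x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)
        x = layers.GlobalMaxPooling1D()(x)
        x = layers.Dense(16, activation="relu")(x)
        out = layers.Dense(1, activation="sigmoid")(x)
        model = models.Model(inputs=inp, outputs=out)
        model.compile(loss="binary_crossentropy", optimizer="adam")
        return model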

AdamW

For all three models, a customized Adam optimizer called AdamW (Adam with decoupled weight decay) is used, which fixes weight decay regularization in Adam.

This paper points out that the popular deep learning frameworks (TensorFlow, PyTorch) have implemented weight decay in Adam incorrectly. The authors made the following observations:

  • L2 regularization and weight decay are not the same.
  • L2 regularization is not effective in Adam.
  • Weight decay is equally effective in both Adam and SGD.

They propose AdamW, which decouples the weight decay and L2 regularization steps.
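
Schematically, the decoupled update from the paper looks like this, where the hatted m and v terms are Adam’s bias-corrected moment estimates, α is the learning rate, λ is the weight-decay coefficient, and η_t is an optional schedule multiplier:

    \theta_t \leftarrow \theta_{t-1} - \eta_t \left( \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)

The key point is that the decay term acts directly on the weights instead of being folded into the gradient, which is what plain L2 regularization inside Adam would effectively do.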

Comparison between Adam and AdamW (Source)

More details on AdamW can be found here.

Training and Submission

I trained the three models with StratifiedKFold cross-validation and then took the mean of their predictions as the final ensemble prediction.

An “out-of-fold” prediction array was created from all the validation folds and then used to find the optimal decision threshold for the ensemble model (sketched below).
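
A minimal sketch of that threshold search (y_true and oof_preds are assumed to come from the cross-validation loop):

    import numpy as np
    from sklearn.metrics import f1_score

    def find_best_threshold(y_true, oof_preds):
        best_t, best_f1 = 0.5, 0.0
        for t in np.arange(0.1, 0.9, 0.01):
            f1 = f1_score(y_true, (oof_preds > t).astype(int))
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        return best_t, best_f1

    # y_true, oof_preds: stacked validation labels/predictions from the folds
    best_t, best_f1 = find_best_threshold(y_true, oof_preds)
    print(f"Best threshold: {best_t:.2f}, F1: {best_f1:.4f}")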

Validation score with optimal threshold

Finally, applying the optimal threshold to the test predictions, the submission file is generated.

Conclusion

This Kaggle competition focused on detecting toxic content on the Quora platform using machine learning and deep learning methods.

I learned a lot from the Kaggle forums and public solutions. Finding more word embeddings and model ensembling were the key factors in improving the F1-score.

Kaggle private and public scores for this solution are as follows.

Kaggle Final Score

Further improvements could be achieved by trying other pretrained models like BERT for word embeddings.
