Kaggle 3rd Place Solution — Jigsaw Multilingual Toxic Comment Classification

Approach, Learnings, and Code

Moiz Saifee
Towards Data Science


I recently participated in the Jigsaw Multilingual Toxic Comment Classification challenge on Kaggle, and our team (ACE team) secured 3rd place on the final leaderboard. In this blog, I describe the problem statement, our approach, and the learnings we took away from the competition. A link to our full code is provided towards the end of the blog.

Problem Description

The goal of the competition was to detect toxicity (for example rudeness, disrespect, or threats) in user comments posted during online interactions. This is a very real problem given the rise of trolling and hatred on social media in recent years.

The problem was set up as a vanilla text classification task: one had to predict the probability of a (text) comment being toxic, and submissions were evaluated on the ROC AUC metric.
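
For reference, ROC AUC can be computed with scikit-learn; the labels and scores below are made up purely to illustrate the metric:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]                  # ground-truth toxicity labels (toy data)
y_pred = [0.10, 0.80, 0.65, 0.30, 0.90]   # predicted probabilities of toxicity

# 1.0 here, since every toxic comment is ranked above every non-toxic one
print(roc_auc_score(y_true, y_pred))
```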

The most interesting part of the problem was that toxicity had to be identified in six different languages: no training data at all was provided for three of them, and only very little for the remaining three.

The competition was organized by Jigsaw, a Google company that builds tools to deal with problems like toxicity, disinformation, harassment, and radicalization in online content. This was the third toxicity detection competition conducted by Jigsaw (although the first one that was multilingual in nature).

Why is the problem challenging?

Detecting toxicity is a lot more than just detecting abusive words in the text. As an example, consider the following comment, which doesn’t contain any abusive words but is still toxic:

I just wanted to have a little fun, captain buzz kill. I guess you never had a life when you were young. that was probably like, 100000000000000000 years old

In addition to this, if you are trying to solve the problem across multiple languages, it becomes quite a handful.

Solution Overview

The figure above shows a high-level overview of the various components of our final solution, along with their individual scores (ROC AUC) and blend weights, which led to our final score of 0.9523.

The final solution was a blend of diverse underlying models trained in a variety of ways. “Roberta” in the figure above represents models derived from the XLM-RoBERTa architecture, and “Mono Bert” represents BERT base models pre-trained on text of a specific language.

Technical Overview

The flowchart below gives a high-level technical overview of our solution. I’ll share the finer details in the sections that follow.

Training Data / Pre-processing

We used the following datasets for model training in the competition:

  • Labeled English data provided by the organizer from the past competitions
  • The translated version of labeled English data in each target language
  • Small validation data provided by organizers in the target language
  • Open Subtitles data with labels generated using pseudo labeling
  • Test dataset with pseudo labels*

Training data was prepared from the above sources using stratified sampling to ensure class balance.
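
As a rough illustration of the idea (not our exact pipeline), here is a minimal sketch of class-balanced sampling with pandas; the file names and the `toxic` label column are assumptions made for the example:

```python
import pandas as pd

def balanced_sample(df, n_per_class, seed=42):
    """Sample (up to) an equal number of toxic and non-toxic comments."""
    parts = [
        group.sample(n=min(n_per_class, len(group)), random_state=seed)
        for _, group in df.groupby("toxic")
    ]
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle rows

# Hypothetical file names standing in for the sources listed above.
sources = ["english_labeled.csv", "translated_labeled.csv", "pseudo_labeled.csv"]
train = pd.concat(
    balanced_sample(pd.read_csv(path), n_per_class=100_000) for path in sources
)
```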

Translating the labeled English data into the target languages helps in the absence of labeled data in those languages, but it is not as good as native data in the target language, because translation strips the text of some of the subtlety that is at times important for detecting toxicity.

Pre-trained Models

Transformer-based pre-trained (transfer learning) models have been all the rage in NLP since 2018. Conceptually, they are deep learning models trained on huge amounts (terabytes) of text, through which they learn very good numerical representations of language (words, text). These representations are useful in a variety of downstream tasks. You can read more about Transformers and transfer learning in one of my earlier blogs.

XLM-RoBERTa, a variant of Google’s BERT in terms of architecture and pre-trained on 100+ languages, was the workhorse of the competition. It performs surprisingly well out of the box and did decently on the other languages even when fine-tuned only on English training data.
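
For readers who want to try this, here is a minimal sketch of loading XLM-RoBERTa for sequence classification with the Hugging Face transformers library; the checkpoint name, max length, and example comments are illustrative rather than our exact settings:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "xlm-roberta-large"  # swap in a language-specific checkpoint for the "Mono Bert" models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

batch = tokenizer(
    ["Have a nice day", "You are a complete idiot"],
    padding=True, truncation=True, max_length=192, return_tensors="pt",
)
with torch.no_grad():
    p_toxic = torch.softmax(model(**batch).logits, dim=-1)[:, 1]

# The classification head is randomly initialized, so these scores are
# meaningless until the model is fine-tuned on labeled data.
print(p_toxic)
```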

In addition to the multilingual model described above, we also used BERT models pre-trained on the specific languages of interest. As one would expect, these monolingual models performed better on the individual language they were trained on than the multilingual models did (for a given model capacity, per-language performance goes down as the number of languages increases). However, this approach is slightly less elegant in practice, as it is cumbersome to maintain language-specific models in production if you are dealing with many languages.

Model Training

This is where a lot of art and experience come into play. We used various strategies for model training:

  • Multi-stage training: Training the model on progressively harder tasks, for example starting with just the labeled English data, then moving to harder data such as the pseudo-labels, and finally doing a last round (epoch) of training on the validation data, as that was the closest to the test data (a simplified sketch follows this list)
  • Unsupervised training on domain-specific data: Large language models like RoBERTa are typically trained on generic datasets such as news and Wikipedia, which may be quite different from the content, structure, and vocabulary of your task-specific data. While getting labeled data is always an expensive proposition, unlabeled data (just the raw comment text in this case) is often far more plentiful and can be used for further pre-training. This helps performance when doing the task-specific supervised training later on
  • Multi-fold averaging: Putting together all the different datasets described earlier, the number of training examples ran into the millions. We observed that a model saturated quickly during training and stopped improving after roughly 200K samples, so we trained multiple models on different random subsets of the data and averaged their predictions on the test dataset
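
Below is a simplified, self-contained sketch of the multi-stage idea, using a toy PyTorch model and random tensors in place of the real datasets; it only illustrates the ordering of stages, not our actual training code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(768, 2)                     # stand-in for a transformer + classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

def toy_stage(n_samples):                     # random features standing in for a real dataset
    features = torch.randn(n_samples, 768)
    labels = torch.randint(0, 2, (n_samples,))
    return DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

stages = [
    ("labeled English data + translations", toy_stage(512), 2),
    ("pseudo-labeled data",                 toy_stage(512), 1),
    ("in-language validation data",         toy_stage(128), 1),  # closest to the test set, so it goes last
]

for name, loader, epochs in stages:
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
```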

Prediction and Blending / Ensembling

It is a well-known fact that blending predictions from diverse models tends to outperform the underlying individual predictions, and it is often the easiest way to improve the final score.

Our final solution was a blend of predictions from the 4 families of models described in “Solution Overview”. We trained multi-fold models (trained on different subsets of data) within each family, ending up with roughly 30 (!) models in total.

We took a straight average of the multi-fold predictions within each family and a weighted average of the predictions across the different families. We did not really have a quality dataset, unseen by the underlying models, on which to fit the cross-family weights, so we used intuition, leaderboard feedback, and experimentation to determine them.
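
Here is a small sketch of that two-level blend with NumPy; the fold counts, family names, and weights are illustrative, not our final values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 1000

# Per-fold test predictions for two of the model families (random numbers here).
family_preds = {
    "xlm_roberta": [rng.random(n_test) for _ in range(8)],
    "mono_bert":   [rng.random(n_test) for _ in range(6)],
}
family_weights = {"xlm_roberta": 0.6, "mono_bert": 0.4}   # illustrative weights

# Straight average within each family, then a weighted average across families.
family_avg = {name: np.mean(preds, axis=0) for name, preds in family_preds.items()}
final_pred = sum(family_weights[name] * avg for name, avg in family_avg.items())
```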

Test Time Augmentation (TTA) is a technique where we make predictions not just on a test example but also on some of its variants, to end up with a better final prediction. We employed this technique in some of our underlying models and it gave us a minor boost. Here is how it worked (a sketch follows the list):

  • Make a prediction on the original comment in the test dataset
  • Make predictions on the translation of the original comment in various languages
  • Take a weighted average of predictions on original comment and its translations
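
A minimal sketch of this weighting scheme, with a dummy `predict` function and made-up weights standing in for a real model:

```python
import numpy as np

def predict(texts):
    """Stand-in for a trained model's toxicity scores."""
    return np.random.rand(len(texts))

original     = ["<comment in, say, Turkish>"]
translations = [["<its English translation>"], ["<its Spanish translation>"]]

scores = np.vstack([predict(original)] + [predict(t) for t in translations])
weights = np.array([0.5, 0.25, 0.25])   # original comment weighted highest
tta_score = np.average(scores, axis=0, weights=weights)
```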

Hardware

It is no secret that deep learning models run a lot faster on a GPU than on a CPU, but Google’s TPUs are way faster than even GPUs. They are phenomenal in terms of the memory (allowing larger batch sizes) and speed (allowing quick experimentation) they provide, and they make the training process a lot less painful.

We extensively used the free TPU quota provided by Kaggle and Google Colab during the competition to train our models.
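
For completeness, here is the standard TPU setup pattern on Kaggle and Colab with recent TensorFlow versions (a generic sketch, not necessarily our exact training script):

```python
import tensorflow as tf

try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # auto-detects the TPU on Kaggle/Colab
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except ValueError:                                                  # no TPU available, fall back to CPU/GPU
    strategy = tf.distribute.get_strategy()

print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():  # training is replicated across the TPU cores, enabling large batch sizes
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])
```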

Code

Here is the link to our complete code.

