Data for Change

How Can You Use AI to Prevent Cyberbullying?

Let's learn by example and train a Neural Network with PyTorch that is able to recognize toxicity in online conversations.

Photo by Austin Pacheco on Unsplash

Cyberharassment is a form of bullying using electronic means. It has become increasingly common, especially among teenagers, as the digital sphere has expanded and technology has advanced.

Three years ago, the Toxic Comment Classification Challenge was published on Kaggle. The main aim of the competition was to develop tools that would help improve online conversations:

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

This article starts with a theoretical part in which you’ll learn about basic concepts of text processing with neural networks. Then it continues with an example of how to train a Convolutional Neural Network that detects toxic comments.

By reading the theoretical part you’ll learn:

  • What is NLP?
  • What is BERT?
  • What is a Convolutional Neural Network (CNN)?
  • How to transform text to embeddings?
  • What is KimCNN?

By reading the practical part you’ll learn:

  • How to load the data
  • How to define the train, validation and test set
  • How to train the Convolutional Neural Network with PyTorch
  • How to test the model

At the end of the article, I also share a link to all the code so that you can run it yourself.

Let’s start with the theory

Photo by Egor Myznik on Unsplash

What is NLP?

Photo by Tim Mossholder on Unsplash

Natural Language Processing (NLP) is a subfield of linguistics, computer science, information engineering, and Artificial Intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

What is BERT?

Photo by Rajeshwar Bachu on Unsplash

Bidirectional Encoder Representations from Transformers (BERT) is a language model that was created and published in 2018 by Jacob Devlin and Ming-Wei Chang from Google [3]. BERT replaces the sequential nature of Recurrent Neural Networks with a much faster Attention-based approach. BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms – an encoder that reads the text input and a decoder that produces a prediction for the task.

BERT achieved state-of-the-art results in a wide variety of NLP tasks. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. To learn more about BERT, read BERT Explained: State of the art language model for NLP by Rani Horev.

In this article, we are going to use BERT as an encoder and a separate CNN as a decoder that produces predictions for the task.

We could use BERT for this task directly (as described in Multilabel text classification using BERT – the mighty transformer), but we would need to train a multi-label classification layer on top of the Transformer so that it could identify hate speech.

What is a Convolutional Neural Network?

Photo by Alina Grubnyak on Unsplash

In deep learning, Convolutional Neural Networks are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. CNNs have been successful in identifying faces, objects and traffic signs apart from powering vision in robots and self-driving cars.

Because of these successes, many researchers try to apply them to other problems, like NLP.

To learn more about CNNs, read this great article about CNNs: An Intuitive Explanation of Convolutional Neural Networks.

How to transform text to embeddings?

Photo by Max Chen on Unsplash

To make the Convolutional Neural Network work with textual data, we need to transform each word of a comment into a vector.

Huggingface developed a Natural Language Processing (NLP) library called transformers that can transform words to vectors (among many other things). It also supports multiple state-of-the-art language models for NLP, like BERT.

With BERT each word of a comment is transformed into a vector of size [1 x 768] (768 is the length of a BERT embedding).

A comment consists of multiple words, so we get a matrix [n x 768], where n is the number of words in a comment. There are actually more than n tokens, as BERT inserts a [CLS] token at the beginning of the first sentence and a [SEP] token at the end of each sentence.

In this article, we are going to use a smaller BERT language model, which has 12 attention layers and uses a vocabulary of 30522 words.

BERT uses a tokenizer to split the input text into a list of tokens that are available in the vocabulary. It handles words that are not in the vocabulary by splitting them into subwords.
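To see the tokenizer in action, here is a small, self-contained example using the Huggingface transformers library (treat the outputs as illustrative – the exact subword split depends on the vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# common words map to single tokens
print(tokenizer.tokenize('Hello, how are you?'))
# ['hello', ',', 'how', 'are', 'you', '?']

# words missing from the vocabulary are split into subword pieces,
# marked with the '##' prefix
print(tokenizer.tokenize('cyberbullying'))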

What is KimCNN?

Photo by Moritz Kindler on Unsplash

The KimCNN [1] was introduced in the paper Convolutional Neural Networks for Sentence Classification by Yoon Kim of New York University in 2014. At the time, it improved the accuracy of multiple NLP tasks. The KimCNN uses an architecture similar to the networks used for analyzing visual imagery.

Steps of KimCNN [2]:

  1. Take the word embeddings as input [n x m], where n represents the maximum number of words in a sentence and m represents the length of the embedding.
  2. Apply convolution operations on the embeddings. KimCNN uses multiple convolutions of different sizes: [2 x m], [3 x m] and [4 x m]. The intuition behind this is to model combinations of 2 words, 3 words, etc. Note that the convolution width is m, the size of the embedding. This differs from CNNs for images, which use square convolutions like [5 x 5]. Since [1 x m] represents a whole word, it doesn’t make sense to run a convolution with a smaller kernel width (e.g. a convolution on half of a word).
  3. Apply Rectified Linear Unit (ReLU) to add the ability to model nonlinear problems.
  4. Apply 1-max pooling to down-sample the input representation and to help to prevent overfitting. Fewer parameters also reduce computational cost.
  5. Concatenate vectors from previous operations to a single vector.
  6. Add a dropout layer to deal with overfitting.
  7. Apply a softmax function to distribute the probability between classes. Our network differs here because we are dealing with a multilabel classification problem – each comment can have multiple labels (or none). We use a sigmoid function instead, which scales logits between 0 and 1 for each class independently. This means that multiple classes can be predicted at the same time (see the short example after this list).
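To make the difference between softmax and sigmoid concrete, here is a minimal PyTorch snippet (illustrative values, not part of the model code):

import torch

logits = torch.tensor([2.0, -1.0, 0.5])  # raw scores for 3 classes

# softmax: probabilities compete and sum to 1 – suited to single-label problems
print(torch.softmax(logits, dim=0))  # ≈ [0.786, 0.039, 0.175]

# sigmoid: each probability is independent – suited to multilabel problems
print(torch.sigmoid(logits))  # ≈ [0.881, 0.269, 0.622]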

Let’s continue with the practical part

Photo by Brett Jordan on Unsplash

Initialize the libraries

Before we start, we need to import all the libraries that we’re going to use to develop the AI model.

In case any library is missing on your machine, you can install it with:

pip install library_name

Let’s import the libraries:

%matplotlib inline
import logging
import time
from platform import python_version
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from sklearn.metrics import roc_auc_score
from torch.autograd import Variable

Load the data

Toxic Comment Classification Challenge – image from lamamatropicana

Go to Toxic Comment Classification Challenge and download the data (unzip it and rename the folder to data).

We’ll train and test the model with train.csv because entries in test.csv are without labels and are intended for Kaggle submissions.

Let’s load the data.

df = pd.read_csv('data/train.csv')

We set the random seed to make the experiment repeatable and shuffle the dataset.

np.random.seed(42)

Shuffling data serves the purpose of reducing variance and making sure that the model won’t overfit to the sequence of samples in the training set.

df = df.sample(frac=1)
df = df.reset_index(drop=True)

The dataset consists of comments and different types of toxicity like threats, obscenity and insults. This problem is in the domain of multi-label classification because each comment can be tagged with multiple types of toxicity (or none).

df.head()
First few samples in the dataset

Let’s display the first comment – don’t worry, it is not a toxicity threat 🙂

df.comment_text[0]

Geez, are you forgetful! We’ve already discussed why Marx was not an anarchist, i.e. he wanted to use a State to mold his ‘socialist man.’ Ergo, he is a statist – the opposite of an anarchist. I know a guy who says that, when he gets old and his teeth fall out, he’ll quit eating meat. Would you call him a vegetarian?

E.g. the comment with id 103 is marked as toxic, severe_toxic, obscene, and insult (the comment_text is intentionally hidden). The model should be able to flag comments like this.

Targets for 103rd sample in the dataset

Define the datasets

Photo by Mika Baumeister on Unsplash

We limit the size of the training set to 10,000 comments, as we train the Neural Network (NN) on the CPU.

The validation set (1,000 comments) is used to measure the accuracy of the model during training, and the test set (2,000 comments) is used to measure the accuracy after the model is trained.
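The splitting is a straightforward slice of the shuffled dataframe; a minimal sketch consistent with the sizes above (target_columns lists the label columns of the Kaggle dataset):

train_size, val_size, test_size = 10_000, 1_000, 2_000

df_train = df[:train_size]
df_val = df[train_size:train_size + val_size]
df_test = df[train_size + val_size:train_size + val_size + test_size]
df_test = df_test.reset_index(drop=True)  # align indices with predictions later

# label columns of the Toxic Comment Classification dataset
target_columns = ['toxic', 'severe_toxic', 'obscene',
                  'threat', 'insult', 'identity_hate']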

Let’s load the BERT model and tokenizer with the bert-base-uncased pre-trained weights.
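A minimal sketch of this step with the transformers library (class names as in recent versions of the library):

from transformers import BertModel, BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()  # we only use BERT for inference, we don't fine-tune it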

We transform each comment into a 2D matrix. Matrices have a predefined size, but some comments have more words than others. To transform a comment into a matrix, we need to:

  • limit the length of a comment to 100 words (100 is an arbitrary number),
  • pad a comment with fewer than 100 words (add 0 vectors to the end).

BERT doesn’t simply map each word to an embedding, as is the case with context-free pre-trained language models (Word2Vec, FastText or GloVe). To calculate contextual embeddings, we need to feed the comments to the BERT model.

In the code below, we tokenize, pad and convert comments to PyTorch Tensors. Then we use BERT to transform the text to embeddings. This process takes some time, so be patient.
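A simplified, unbatched sketch of this step (the hypothetical text_to_embedding helper below is slow but easy to follow; for simplicity it skips the attention mask, so padded positions also receive embeddings):

max_seq_len = 100  # limit comments to 100 tokens, as discussed above

def text_to_embedding(texts):
    embeddings = []
    with torch.no_grad():  # no gradients needed for precomputed embeddings
        for text in texts:
            # encode() adds [CLS]/[SEP] and maps tokens to vocabulary ids
            ids = bert_tokenizer.encode(text, max_length=max_seq_len,
                                        truncation=True)
            ids = ids + [0] * (max_seq_len - len(ids))  # pad with zeros
            # the last hidden state has shape [1 x max_seq_len x 768]
            hidden_states = bert_model(torch.tensor([ids]))[0]
            embeddings.append(hidden_states.squeeze(0))
    return torch.stack(embeddings)

x_train = text_to_embedding(df_train.comment_text.values)
x_val = text_to_embedding(df_val.comment_text.values)
x_test = text_to_embedding(df_test.comment_text.values)

y_train = torch.tensor(df_train[target_columns].values, dtype=torch.float32)
y_val = torch.tensor(df_val[target_columns].values, dtype=torch.float32)
y_test = torch.tensor(df_test[target_columns].values, dtype=torch.float32)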

This is the first comment transformed into word embeddings with BERT. It has a [100 x 768] shape.

x_train[0]

The first comment is not toxic, so its target vector contains only zeros.

y_train[0]

Train the model with PyTorch

Photo by Markus Winkler on Unsplash

The code below defines the KimCNN with the PyTorch library.
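Here is a sketch of the KimCNN class following the steps from the theoretical part (a common PyTorch rendition of the paper, not necessarily identical to the author's original code; embed_num is unused in the forward pass because the embeddings are precomputed with BERT, but it is kept for interface compatibility):

class KimCNN(nn.Module):
    def __init__(self, embed_num, embed_dim, class_num, kernel_num,
                 kernel_sizes, dropout, static):
        super(KimCNN, self).__init__()
        self.static = static
        # one convolution per kernel size, each spanning the full
        # embedding width (step 2 of the KimCNN description)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, kernel_num, (k, embed_dim)) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(len(kernel_sizes) * kernel_num, class_num)

    def forward(self, x):
        # x: [batch x embed_num x embed_dim], already embedded with BERT
        if self.static:
            x = x.detach()  # keep the embeddings fixed (no gradients)
        x = x.unsqueeze(1)  # add a channel dimension: [batch x 1 x n x m]
        # convolution + ReLU: a list of [batch x kernel_num x (n - k + 1)]
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        # 1-max pooling over time: a list of [batch x kernel_num]
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]
        x = torch.cat(x, 1)  # concatenate to a single vector
        x = self.dropout(x)
        logits = self.fc(x)
        return torch.sigmoid(logits)  # independent per-label probabilities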

Let’s set the parameters of the model:

  • embed_num represents the maximum number of words in a comment (100 in this example).
  • embed_dim represents the size of BERT embedding (768).
  • class_num is the number of toxicity threats to predict (6).
  • kernel_num is the number of filters for each convolution operation (e.g. 3 filters for the [2 x m] convolution).
  • kernel_sizes of convolutions, e.g. [2, 3, 4] looks at combinations of 2 words, 3 words and 4 words.
  • dropout is the fraction of hidden units that are randomly set to 0 at each update during the training phase. Tip: make sure you disable dropout during the test/validation phase to get deterministic output.
  • static set to True means that we don’t calculate gradients of the embeddings, so they stay static. Setting it to False would increase the number of parameters the model needs to learn, and it could overfit.

embed_num = x_train.shape[1]
embed_dim = x_train.shape[2]
class_num = y_train.shape[1]
kernel_num = 3
kernel_sizes = [2, 3, 4]
dropout = 0.5
static = True
model = KimCNN(
    embed_num=embed_num,
    embed_dim=embed_dim,
    class_num=class_num,
    kernel_num=kernel_num,
    kernel_sizes=kernel_sizes,
    dropout=dropout,
    static=static,
)

We train the model for 10 epochs with batch size set to 10 and the learning rate to 0.001. We use Adam optimizer with the BCE loss function (binary cross-entropy). Binary cross-entropy loss allows our model to assign independent probabilities to the labels, which is a necessity for multilabel classification problems.

n_epochs = 10
batch_size = 10
lr = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss_fn = nn.BCELoss()

The code below generates batches of data for training.
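A minimal sketch of such a generator, matching how it is unpacked later (x_batch, y_batch, batch):

def generate_batch_data(x, y, batch_size):
    # yield (x_batch, y_batch, batch_number) tuples over the dataset
    for batch, i in enumerate(range(0, len(x), batch_size)):
        yield x[i:i + batch_size], y[i:i + batch_size], batch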

Let’s train the model.
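A sketch of a training loop consistent with the setup above; it records train_losses and val_losses for the plot below:

train_losses, val_losses = [], []

for epoch in range(n_epochs):
    start_time = time.time()

    model.train()  # enable dropout during training
    train_loss, n_batches = 0.0, 0
    for x_batch, y_batch, batch in generate_batch_data(x_train, y_train, batch_size):
        optimizer.zero_grad()
        y_pred = model(x_batch)
        loss = loss_fn(y_pred, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        n_batches += 1
    train_losses.append(train_loss / n_batches)

    model.eval()  # disable dropout for a deterministic validation pass
    with torch.no_grad():
        val_loss, n_batches = 0.0, 0
        for x_batch, y_batch, batch in generate_batch_data(x_val, y_val, batch_size):
            val_loss += loss_fn(model(x_batch), y_batch).item()
            n_batches += 1
        val_losses.append(val_loss / n_batches)

    print(f"Epoch {epoch + 1}: train loss {train_losses[-1]:.4f}, "
          f"val loss {val_losses[-1]:.4f} ({time.time() - start_time:.1f}s)")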

Training the model multiple epochs

In the image below, we can observe that the training and validation losses converge after 10 epochs.

plt.plot(train_losses, label="Training loss")
plt.plot(val_losses, label="Validation loss")
plt.legend()
plt.title("Losses")
Training and validation loss

Test the model

Photo by JESHOOTS.COM on Unsplash

The model is trained. We evaluate the model performance with the Area Under the Receiver Operating Characteristic Curve (ROC AUC) on the test set. scikit-learn’s implementation of AUC supports the binary and multilabel indicator format.

Let’s use the model to predict the labels for the test set.

model.eval() # disable dropout for deterministic output
with torch.no_grad(): # deactivate autograd engine to reduce memory usage and speed up computations
    y_preds = []
    batch = 0
    for x_batch, y_batch, batch in generate_batch_data(x_test, y_test, batch_size):
        y_pred = model(x_batch)
        y_preds.extend(y_pred.cpu().numpy().tolist())
    y_preds_np = np.array(y_preds)

The model outputs 6 values (one for each toxicity label) between 0 and 1 for each comment. We could use 0.5 as a threshold to transform all values greater than 0.5 into predicted toxicity labels, but let’s calculate the AUC first.

y_preds_np

We extract the true toxicity labels for the test set. The true labels are binary values.

y_test_np = df_test[target_columns].values
y_test_np[1000:]

The AUC of a model is equal to the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example. The higher the AUC, the better (although it is not that simple, as we will see below). When AUC is close to 0.5, it means that the model has no label separation capacity whatsoever. When AUC is close to 0, it means that we need to invert predictions and it should work well 🙂

Let’s calculate the AUC for each label.

auc_scores = roc_auc_score(y_test_np, y_preds_np, average=None)
df_accuracy = pd.DataFrame({"label": target_columns, "auc": auc_scores})
df_accuracy.sort_values('auc')[::-1]

In the table above, we can observe that the model achieves a high AUC for every label. Note that AUC can be a misleading metric when working with an imbalanced dataset.

Is the dataset imbalanced?

Photo by Aziz Acharki on Unsplash

We say that a dataset is balanced when roughly 50% of the labels belong to each class, and imbalanced when the ratio is closer to 90% to 10%. A known problem with models trained on imbalanced datasets is that they report deceptively high accuracies. E.g. if the model always predicts 0, it can achieve 90% accuracy.

Let’s check if we have an imbalanced dataset.

positive_labels = df_train[target_columns].sum().sum()
positive_labels  # Output: 2201

all_labels = df_train[target_columns].count().sum()
all_labels  # Output: 60000

positive_labels / all_labels  # Output: 0.03668333333333333

Only 2201 out of 60,000 labels are positive (roughly 3.7%). The dataset is imbalanced, so the reported AUC scores above shouldn’t be taken too seriously.

Sanity check

Photo by Antoine Dautry on Unsplash

Let’s do a sanity check to see if the model simply predicts all comments as non-toxic.

df_test_targets = df_test[target_columns]
df_pred_targets = pd.DataFrame(y_preds_np.round(), columns=target_columns, dtype=int)
df_sanity = df_test_targets.join(df_pred_targets, how='inner', rsuffix='_pred')
df_test_targets.sum()
df_pred_targets.sum()

We can observe that the model predicted 3 types of toxicity: toxic, obscene and insult, but it never predicted severe_toxic, threat and identity_hate. This doesn’t seem great, but at least it didn’t mark all comments with zeros.

df_sanity[df_sanity.toxic > 0][['toxic', 'toxic_pred']]

We see that the model correctly predicted some comments as toxic.

Conclusion

Photo by Dawid Zawiła on Unsplash

We trained a CNN with BERT embeddings to identify hate speech. We used a relatively small dataset to make computation faster. Instead of BERT, we could use Word2Vec, which would speed up the transformation of words to embeddings. We spent zero time optimizing the model, as that is not the purpose of this post, so the reported accuracies shouldn’t be taken too seriously. More important are the pitfalls we outlined with imbalanced datasets, AUC and the dropout layer.

Instead of using novel tools like BERT, we could go old school with TF-IDF and Logistic Regression. Would you like to read a post about it? Let me know in the comments below.

To run the code on your machine, download this Jupyter notebook.

Before you go

- 50% Off All AI Nanodegree Programs! [Course]
- Data Science for Business Leaders [Course]
- Free skill tests for Data Scientists & Machine Learning Engineers
- Labeling and Data Engineering for Conversational AI and Analytics

Some of the links above are affiliate links and if you go through them to make a purchase I’ll earn a commission. Keep in mind that I link courses because of their quality and not because of the commission I receive from your purchases.

Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.

References

[1] Yoon Kim, Convolutional Neural Networks for Sentence Classification (2014), https://arxiv.org/pdf/1408.5882.pdf

[2] Ye Zhang, A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification (2016), https://arxiv.org/pdf/1510.03820.pdf

[3] Jacob Devlin, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), https://arxiv.org/abs/1810.04805

