Use Pre-trained Word Embeddings to Detect Real Disaster Tweets

End-2-End Approach

Zeineb Ghrib
Towards Data Science


(Cover photo from https://unsplash.com/)

In this post we will go through the overall text classification pipeline, and especially the data pre-processing steps; we will be using a GloVe pre-trained word embedding.
Processing textual features is a bit trickier than processing numerical or categorical ones. In fact, machine learning algorithms deal with scalars and vectors rather than characters or words, so we have to convert the text input into numbers, and the keystone 🗝 element is finding the best representation of the input words. This is the main idea behind Natural Language Processing.

We will use a dataset from a Kaggle competition called Real or Not? NLP with Disaster Tweets. The task consists of predicting whether or not a given tweet is about a real disaster. To address this text classification task we will use a word embedding transformation followed by a recurrent deep learning model. Other less sophisticated but still efficient solutions are also possible, such as combining tf-idf encoding with a naive Bayes classifier (check out my last post).

I will also include some handy Python code that can be reused in other NLP tasks. The overall source code is accessible in this Kaggle notebook.

Introduction

Models such as LSTMs or CNNs are more efficient at capturing word order and the semantic relationships between words, which are usually critical to a text's meaning. Here is a sample from our dataset that is labelled as a real disaster:

'#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires'

It is obvious that word order matters in the example above.

On the other hand, we need to convert the input text to a machine-readable format. Many techniques exist, such as:

  • One-hot encoding: each text input is represented in a d-dimensional space, where d is the size of the dataset vocabulary. Each term gets 1 if it is present in the document and 0 otherwise. With a large corpus, the vocabulary contains tens of thousands of tokens, making one-hot vectors very sparse and inefficient.
  • TF-IDF encoding: words are mapped to numerical values computed with the tf-idf metric. Fast implementations make it possible to keep the tf-idf encoding of all uni-grams and bi-grams without having to apply dimensionality reduction.
  • Word embedding transformation: words are projected into a dense vector space where the semantic distance between words is preserved (see the figure below):
(Figure: semantic relationships preserved in the embedding space, from Google's Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course/images/linear-relationships.svg)
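To make the first two encodings more concrete, here is a minimal sketch (not taken from the competition notebook) that applies them to a toy corpus with scikit-learn:

# Minimal illustration of one-hot and tf-idf encodings on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["forest fire near la ronge",
          "residents asked to evacuate",
          "fire evacuation ordered"]

# One-hot style encoding: 1 if the term appears in the document, 0 otherwise
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(corpus).toarray())

# TF-IDF encoding: terms weighted by their frequency in the document
# relative to their rarity across the corpus (uni-grams and bi-grams kept)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
print(tfidf.fit_transform(corpus).toarray().round(2))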

What are pre-trained word embeddings?

An embedding is a dense vector that represents a word (or a symbol). By default, the embedding vectors are randomly initialized and then gradually improved during the training phase, with the gradient descent algorithm at each back-propagation step, so that similar words, words from the same lexical field, or words with a common stem end up close to each other in the new vector space (see the figure below):

(Figure by Zeineb Ghrib)

Pre-trained word embeddings are an example of transfer learning. The main idea is to reuse public embeddings that have already been trained on large datasets. Specifically, instead of initializing our neural network weights randomly, we set these pre-trained embeddings as the initialization weights. This trick helps to accelerate training and boost the performance of NLP models.

Step 0: Imports & setup

Before anything else, let's import the required libraries and tools that will help us perform the NLP processing and build the deep learning model:

# Data handling
import re
import string
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from tqdm import tqdm

# NLP tooling
import gensim
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Feature extraction and model selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Keras deep learning stack
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from keras.initializers import Constant
from keras.optimizers import Adam

# English stop words used during cleaning
stop = set(stopwords.words('english'))

Step 1: Text cleaning 🧹

Independently of the EDA step, which can reveal noisy elements and help us customize the cleaning code, we can apply some basic cleaning operations that are recurrent with tweets, such as removing punctuation, HTML tags, URLs and emojis, correcting spelling, etc.

Below is some Python code that can be reproduced in other similar use cases 😉
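The exact cleaning code lives in the linked notebook; here is a sketch of typical helpers along the same lines (regex-based removal of URLs, HTML tags, emojis and punctuation), assuming the data has been loaded into a DataFrame df with a 'text' column. Details may differ from the notebook:

# Sketch of tweet-cleaning helpers; assumes df has been loaded from the Kaggle CSVs.
import re
import string

def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_html(text):
    return re.sub(r'<.*?>', '', text)

def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub('', text)

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def clean_text(text):
    # Apply all cleaning steps, then lowercase
    for f in (remove_url, remove_html, remove_emoji, remove_punct):
        text = f(text)
    return text.lower()

df['text'] = df['text'].apply(clean_text)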

Then we will split the dataset into:

  • a training set (80% of the labelled data)
  • a validation set: the remaining 20% of the labelled data, used to validate the model performance at each epoch
  • the test set (optional here): provided by Kaggle to make the prediction submission
train = df[~df['target'].isna()]
X_train, X_val, y_train, y_val = train_test_split(train, train['target'], test_size=0.2, random_state=42)

Step 2: Text pre-processing 🤖

As mentioned before, machine learning algorithms take numbers as inputs, not text, which means that we need to convert the texts into numerical vectors.
We proceed as follows:

1. Tokenization

It consists in splitting the texts into words or smaller sub-texts, which allows us to determine the “vocabulary” of the dataset (the set of unique tokens present in the data). Usually we use a word-level representation. For our example we will use the Keras Tokenizer().

2. Word indexing:

Construct a vocabulary index mapper based on word frequency: the index is inversely proportional to the word's occurrence frequency in the overall dataset, so the most frequent word gets index = 1, and every single word gets a unique index.

These two steps are factorized as follows:
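A minimal sketch of these two steps with the Keras Tokenizer and pad_sequences imported above (the exact code lives in the notebook; the variable names tokenized_train and tokenized_val are kept consistent with the evaluation code later in the post):

# Tokenization + word indexing, then padding every sequence to the same length
MAX_SEQUENCE_LENGTH = 50

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train['text'])    # build the word -> index mapping
word_index = tokenizer.word_index          # e.g. {'the': 1, 'fire': 2, ...}

train_sequences = tokenizer.texts_to_sequences(X_train['text'])
val_sequences = tokenizer.texts_to_sequences(X_val['text'])

tokenized_train = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH,
                                padding='post', truncating='post')
tokenized_val = pad_sequences(val_sequences, maxlen=MAX_SEQUENCE_LENGTH,
                              padding='post', truncating='post')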

Some explanation about the Keras Tokenizer:

  1. fit_on_texts() method 🤖: it creates a vocabulary index based on word frequency.
    Example: “the ghost in the shell” would generate word_index[“the”] = 1, word_index[“ghost”] = 2, etc.
    -> So every word gets a unique integer value, starting from 1 (0 is reserved for padding), and the more frequent the word, the lower its index.
    (PS: the first few entries are often stop words because they appear a lot, but it is recommended to drop them during data cleaning.)
  2. texts_to_sequences() method 📟: transforms each text into a sequence of integers, where each word is mapped to its index in the word_index dictionary.
  3. pad_sequences() method 🎞: in order to standardize the output shape, we define a unique sequence length (in our example MAX_SEQUENCE_LENGTH is fixed to 50): any longer sequence is truncated and any shorter sequence is 0-padded.

Step 3: Construct an embedding matrix 🧱

First of all, we will download the GloVe pre-trained embeddings from the official site (because of some technical constraints, I had to download them via code):
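Here is a sketch of one way to do it, assuming the glove.6B archive and its 100-dimensional vectors (the notebook may use a different file or dimension):

# Fetch the GloVe vectors from code and load them into a word -> vector dict
import os
import urllib.request
import zipfile

GLOVE_URL = 'http://nlp.stanford.edu/data/glove.6B.zip'
EMBEDDING_DIM = 100   # assumption: 100-d vectors from the glove.6B archive

if not os.path.exists('glove.6B.100d.txt'):
    urllib.request.urlretrieve(GLOVE_URL, 'glove.6B.zip')
    with zipfile.ZipFile('glove.6B.zip') as z:
        z.extract('glove.6B.100d.txt')

embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embedding_index[word] = np.asarray(values[1:], dtype='float32')

print('Found %s word vectors.' % len(embedding_index))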

Then we will create an embedding matrix that maps each word index to its corresponding embedding vector:
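A sketch of this step, reusing word_index from the tokenizer and embedding_index from the previous step:

# Row i of the matrix holds the GloVe vector of the word with index i in
# word_index (row 0 is kept for padding). Words missing from GloVe stay zero.
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))

for word, i in tqdm(word_index.items()):
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector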

(Figure: mapping word indices to their embedding vectors, from Google's text classification guide: https://developers.google.com/machine-learning/guides/text-classification/images/EmbeddingLayer.png)

Step 4: Create and train the model


We will create a recurrent neural network using a Keras Sequential model that will contain (a sketch follows the list):

  1. An Embedding layer with the embedding matrix as initial weights
  2. A dropout layer to avoid over-fitting (check out this excellent post about dropout layers in neural networks and their benefits)
  3. An LSTM layer, made of long short-term memory cells
  4. A Dense output layer with a sigmoid activation (the model is trained with the binary_crossentropy loss function)
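A sketch of such a model, using a Constant initializer to inject the GloVe matrix into the Embedding layer (the layer sizes and dropout rates are indicative choices, not necessarily those of the notebook):

# Recurrent model with pre-trained embeddings as initial (trainable) weights
model = Sequential()
model.add(Embedding(num_words,
                    EMBEDDING_DIM,
                    embeddings_initializer=Constant(embedding_matrix),
                    input_length=MAX_SEQUENCE_LENGTH,
                    trainable=True))
model.add(SpatialDropout1D(0.2))                    # dropout over whole embedding channels
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))           # binary output: disaster or not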

If we want to compute, in addition to the accuracy, the precision, recall and F1-score of our binary Keras classifier, we have to calculate them manually, because these metrics are no longer natively supported by Keras since version 2.0.

(solution from here)
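A commonly used batch-wise implementation of these metrics with the Keras backend looks like this (a sketch; it approximates the global scores by averaging per-batch values):

# Custom precision / recall / F1 metrics computed on each batch
from keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * (precision * recall) / (precision + recall + K.epsilon())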

Now compile and train the model:
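A sketch of the compile and fit calls, wiring in the custom metrics defined above (the learning rate, batch size and number of epochs are indicative choices):

# Compile with the custom metrics and train on the padded sequences
model.compile(loss='binary_crossentropy',
              optimizer=Adam(lr=1e-4),
              metrics=['accuracy', f1_m, precision_m, recall_m])

history = model.fit(tokenized_train, y_train,
                    batch_size=32,
                    epochs=10,
                    validation_data=(tokenized_val, y_val),
                    verbose=2)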

To get the validation performance results, use the evaluate() method:

loss, accuracy, f1_score, precision, recall = model.evaluate(tokenized_val, y_val, verbose=0)

Let's check out the results:

(Results screenshot by Zeineb Ghrib, from here)

These results seem pretty good, but of course they can be improved by fine-tuning the neural network hyper-parameters, or by using auto-ml tools such as Prevision, which apply many other transformations in addition to word embeddings, such as n-gram tokenization, tf-idf, or more advanced techniques such as BERT transformers.

Conclusion:

In this post I showed you, step by step, how to apply a word-embedding transformation based on GloVe pre-trained embeddings, and how to use it to train a recurrent neural network. Please note that the approach and the code can be reused in other similar use cases. The overall source code can be found in this Kaggle notebook.
I also applied a completely different approach to the same dataset: tf-idf encoding with a naive Bayes classifier. If you want more information, visit my last post.

I intend to write a post about how to use a breakthrough algorithm called BERT and compare it with other NLP algorithms.

Thanks for reading my post 🤗!! If you have any questions, you can find me in the chat session of the Prevision cloud instance or send me an email at zeineb.ghrib@prevision.io
