Text Classification on Disaster Tweets with LSTM and Word Embedding

How word embeddings affect text classification accuracy

Emmanuella Anggi
Towards Data Science


Image source: https://www.istockphoto.com/photo/disaster-area-with-tornado-gm172448852-23567577

This was my first Kaggle notebook and I thought why not write it on Medium too?

Full code is on my GitHub.

In this post, I will elaborate on how to use fastText and GloVe as word embeddings in an LSTM model for text classification. I became interested in word embeddings while working on my paper on Natural Language Generation: using an embedding matrix as the weights of the embedding layer improved the model's performance. But since it was NLG, the evaluation was subjective, and I only used fastText. So in this article, I want to see how each approach (with fastText, with GloVe, and without pre-trained embeddings) affects the predictions. In my GitHub code, I also compare the results with a CNN. The dataset I use here comes from a Kaggle competition; it consists of tweets labelled according to whether a tweet uses disaster-related words to report a real disaster or merely uses them metaphorically. Honestly, on first seeing this dataset, I immediately thought about BERT and its ability to understand language far better than what I propose in this article (further reading on BERT).

But anyway, in this article I will focus on fastText and GloVe.

Let’s go?

Data + Pre-Processing

The data consists of 7,613 tweets (column text) with a label (column target) indicating whether each tweet is about a real disaster or not: 3,271 rows report a real disaster and 4,342 rows do not. The data was shared for a Kaggle competition, and if you want to learn more about it you can read about it here.
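As a quick sanity check, you can verify the class balance directly from the CSV. This is a minimal, self-contained sketch that assumes the competition's train.csv with its text and target columns:

import pandas as pd

# Load the competition training file and count labels
# (1 = real disaster, 0 = not a real disaster)
data = pd.read_csv('train.csv', sep=',', header=0)
print(data['target'].value_counts())
# expected: 0 -> 4342, 1 -> 3271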

Example of a disaster word used in a text about a real disaster:

“ Forest fire near La Ronge Sask. Canada “

Example of a disaster word used in a text that is not about a real disaster:

“These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittens”

The data will be divided into training (6,090 rows) and testing (1,523 rows) sets and then pre-processed. We will only use the text and target columns.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('train.csv', sep=',', header=0)
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)

Pre-processing steps used here:

  1. Case Folding
  2. Cleaning Stop Words
  3. Tokenizing
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

MAX_NB_WORDS = 100000   # vocabulary cap for the tokenizer (example value)
max_seq_len = 32        # padded sequence length (example value)
stop_words = set(stopwords.words('english'))

raw_docs_train = train_df['text'].tolist()
raw_docs_test = test_df['text'].tolist()

# Remove stop words from the tokenized train and test texts
processed_docs_train = []
for doc in tqdm(raw_docs_train):
    tokens = word_tokenize(doc)
    filtered = [word for word in tokens if word not in stop_words]
    processed_docs_train.append(" ".join(filtered))
processed_docs_test = []
for doc in tqdm(raw_docs_test):
    tokens = word_tokenize(doc)
    filtered = [word for word in tokens if word not in stop_words]
    processed_docs_test.append(" ".join(filtered))

# Case folding and tokenizing: build the word index on train + test texts
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs_train + processed_docs_test)
word_seq_train = tokenizer.texts_to_sequences(processed_docs_train)
word_seq_test = tokenizer.texts_to_sequences(processed_docs_test)
word_index = tokenizer.word_index
# Pad every sequence to the same length
word_seq_train = sequence.pad_sequences(word_seq_train, maxlen=max_seq_len)
word_seq_test = sequence.pad_sequences(word_seq_test, maxlen=max_seq_len)

Word Embedding

Step 1. Download the Pre-trained Models

The first step in working with both fastText and GloVe is downloading the pre-trained models. I used Google Colab to avoid using up memory on my laptop, so I downloaded them with the requests library and unzipped them directly in the notebook.

I used the biggest pre-trained model of each word embedding. The fastText model provides 2 million word vectors trained on Common Crawl (600B tokens), and the GloVe model provides 2.2 million word vectors trained on Common Crawl (840B tokens).

fastText pre-trained download

import requests, zipfile, io

# fastText: 2M word vectors trained on Common Crawl (crawl-300d-2M)
zip_file_url = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

GloVe pre-trained download

import requests, zipfile, io

zip_file_url = "http://nlp.stanford.edu/data/glove.840B.300d.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

Step 2. Load the Pre-trained Models into Word Vectors

fastText provides a snippet for loading its word vectors, and since the GloVe file uses the same plain-text format, I used the same code to load both models.

import codecs
import numpy as np

embeddings_index = {}
f = codecs.open('crawl-300d-2M.vec', encoding='utf-8')
# for GloVe
# f = codecs.open('glove.840B.300d.txt', encoding='utf-8')
for line in tqdm(f):
    values = line.rstrip().rsplit(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

Step 3. Embedding Matrix

The embedding matrix will be used in the embedding layer as the weight for each word in the training data. It is built by enumerating each unique word in the tokenizer's word index and looking up the corresponding embedding weight from fastText or GloVe (more about embedding matrix).

But some words may not be in the pre-trained vectors, such as typos, abbreviations, or usernames. Those words are stored in a list, so we can compare how well fastText and GloVe cover the vocabulary.

nb_words = min(MAX_NB_WORDS, len(word_index) + 1)
embed_dim = 300  # dimensionality of the pre-trained vectors

words_not_found = []
embedding_matrix = np.zeros((nb_words, embed_dim))
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        embedding_matrix[i] = embedding_vector
    else:
        words_not_found.append(word)
print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

The number of null word embeddings is 9,175 with fastText and 9,186 with GloVe, so we can assume fastText covers slightly more of this vocabulary even though its pre-trained model was built from fewer tokens.
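To get a feel for what is being missed, you can also peek at the words_not_found list collected above. This is just an illustrative sketch using the variables from the previous snippet:

import numpy as np

# How many tokens have no pre-trained vector, and what do they look like?
print('tokens without a pre-trained vector:', len(words_not_found))
print('sample:', np.random.choice(words_not_found, 10))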

Long Short-Term Memory (LSTM)

You could fine-tune the hyper-parameters or the architecture, but I am going to use a very simple model with an Embedding layer, an LSTM layer, a Dense layer, and a Dropout layer.

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

y_train = train_df['target'].values  # binary labels for the training split

model = tf.keras.Sequential()
model.add(Embedding(nb_words, embed_dim, input_length=max_seq_len,
                    weights=[embedding_matrix], trainable=False))
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
es_callback = EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(word_seq_train, y_train, batch_size=256, epochs=30,
                    validation_split=0.3, callbacks=[es_callback], shuffle=False)

Result

fastText gave the best performance, with an accuracy of about 83%, while GloVe gave 81%. The difference between the two is not large, but compared with the model without pre-trained word embeddings (68%), we can see how much using an embedding matrix as the embedding-layer weights helps.
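As a rough sketch of how the held-out test accuracy can be checked (assuming y_test is built from test_df the same way y_train was built from train_df above):

# Evaluate on the 20% test split created earlier
y_test = test_df['target'].values
loss, acc = model.evaluate(word_seq_test, y_test, verbose=0)
print('test accuracy: %.2f%%' % (acc * 100))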

Accuracy with fastText Word Embedding
Accuracy with GloVe Word Embedding
Accuracy Without Word Embedding
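The accuracy curves above come from the history object returned by model.fit. Here is a minimal sketch to reproduce that kind of plot (note: on older Keras versions the metric keys are 'acc' and 'val_acc' instead of 'accuracy' and 'val_accuracy'):

import matplotlib.pyplot as plt

# Training vs. validation accuracy per epoch
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()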

For more about the training performance and the detailed code, or if you want to apply this to a different dataset, you can see the full code on my GitHub.

Thank you for reading!
