
Detecting Disaster from Tweets (classical ML and LSTM approach)

A Classification task using NLP and comparing two approaches.

Photo by Chris J. Davis on Unsplash

In this article, I am going to apply two different approaches to the same classification task. First, I will train a classic Machine Learning model using a Gradient Boosting Classifier. Later in the code, I will use the LSTM technique to train an RNN model. Since we are dealing with tweets, this is an NLP task, and I will share some techniques along the way so you get more familiar with common steps in most NLP projects.

I will use the data from the Kaggle challenge called "Natural Language Processing with Disaster Tweets". You can find the "train.csv" file under the "Data" section of the link below.

Natural Language Processing with Disaster Tweets

The dataset has 5 columns. Column "target" is the label column, which means I am going to train a model that predicts the value of "target" using the other columns such as "text", "location" and "keyword". First, let’s understand what each column means:

  • id – a unique identifier for each tweet
  • text – the text of the tweet
  • location – the location the tweet was sent from (may be blank)
  • keyword – a particular keyword from the tweet (may be blank)
  • target – in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

For this task, I will be using the Sklearn and Keras libraries to train classifier models: Sklearn for the Gradient Boosting Classifier and Keras for the LSTM model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import nltk
nltk.download('stopwords')
nltk.download('punkt')  # required by word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

from sklearn import model_selection, metrics, preprocessing, ensemble
from sklearn.feature_extraction.text import CountVectorizer

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Conv1D, Bidirectional, LSTM, Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam

Understanding the data:

For this task, we only use ‘train.csv‘ and split it into a train and a test dataset. I am going to load the data into a Pandas DataFrame and take a look at the first few rows.

# Read the train dataset
file_path = "./train.csv"
raw_data = pd.read_csv(file_path)
print("Data points count: ", raw_data['id'].count())
raw_data.head()

First, I would like to get more familiar with the dataset to understand the features (columns). Column "target" is the one our model is going to learn to predict. As it has only two unique values, 0 and 1, this is a binary classification task. I would like to know the ratio of tweets labeled 0 vs 1, so let’s plot the data based on column "target".
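The plotting code itself is not shown in the article; a minimal sketch with seaborn might look like this:

# Plot the number of tweets per target label
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=raw_data)
plt.title("Distribution of target labels")
plt.show()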

Image by author

As you can see, there are more data points with label 0, meaning tweets that are not about a disaster, and fewer data points with label 1, meaning tweets related to a disaster. For data with skewed labels like this, it is usually recommended to use the F-score instead of accuracy for model evaluation; we will come back to that at the end of this article.

Next, I would like to know how much data is missing in each column. The heatmap below shows that column "keyword" has very few missing values, so I will fill in the missing data points and use this column as a feature. Column "location" has many missing values and the data quality is poor, with values that are not related to locations at all, so I decided to drop this column. Column "text", the main column containing the actual tweet, needs to be processed and cleaned; it has no missing data.
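A sketch of how such a missing-value heatmap can be drawn, using a common seaborn idiom (not necessarily the exact code behind the figure):

# Highlight null cells per column; bright cells indicate missing values
plt.figure(figsize=(8, 4))
sns.heatmap(raw_data.isnull(), cbar=False, yticklabels=False)
plt.title("Missing values per column")
plt.show()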

Image by author

I have also noticed that some tweets contain fewer than 3 words, and I suspect that two-word sentences may not transfer the meaning very well. To get a sense of how many words the sentences are made of, let’s look at the histogram of word count per sentence.
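One way to produce this histogram (a sketch; the article does not show the exact code):

# Count the words in each raw tweet and plot the distribution
word_counts = raw_data['text'].apply(lambda t: len(t.split()))
plt.hist(word_counts, bins=30)
plt.xlabel("Words per tweet")
plt.ylabel("Number of tweets")
plt.show()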

Image by author

As we can see, the majority of tweets are between 11 and 19 words long, so I decided to remove tweets with fewer than 3 words; I believe sentences with at least 3 words say enough about the tweet. It might also be a good idea to remove tweets with more than 25–30 words, as they might slow down the training. One way to apply this filter is sketched below.
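A minimal sketch, assuming we filter on the raw "text" column before cleaning:

# Keep only tweets with at least 3 words (the threshold is a judgment call)
raw_data = raw_data[raw_data['text'].apply(lambda t: len(t.split()) >= 3)]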

Data Cleaning and Preprocessing:

Common steps for data cleaning on the NLP task dealing with tweets are removing special characters, removing stop words, removing URLs, removing numbers, and doing word stemming. But let’s first get more familiar with some NLP data preprocessing concepts:

Vectorization:

Word vectorization is a technique for mapping words to real numbers, or rather, vectors of real numbers. I have used vectorization from both the Sklearn and Keras libraries.
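As a toy illustration of what Sklearn’s CountVectorizer does (just a demo, not part of the article’s pipeline):

# Each sentence becomes a vector of word counts over the learned vocabulary
demo = CountVectorizer()
vectors = demo.fit_transform(["forest fire near town", "no fire today"])
print(demo.get_feature_names_out())  # the vocabulary learned from the two sentences
print(vectors.toarray())             # one row of counts per sentence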

Tokenization:

Tokenization is the task of breaking a text, which can be anything from a sentence to a whole paragraph, into smaller sections such as a series of words, characters, or subwords; these sections are called tokens. One use of tokenization is to generate tokens from a text and later convert the tokens to numbers (vectorization).
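For example, with NLTK’s word_tokenize (this relies on the "punkt" resource downloaded in the imports above):

# Break one tweet into word and punctuation tokens
print(word_tokenize("Forest fire near La Ronge Sask. Canada"))
# -> a list of tokens such as ['Forest', 'fire', 'near', ...]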

Padding:

Neural network models require their inputs to have the same shape and size, meaning all tweets fed into the model must have exactly the same length; that is where padding is useful. Every tweet in the dataset has a different number of words, so we set a maximum number of words per tweet: if a tweet is longer, we drop some of its words, and if it has fewer words than the maximum, we fill either the beginning or the end of the tweet with a fixed value such as 0.
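A small demonstration with Keras’ pad_sequences (a toy example; the real tweets are tokenized and padded later):

# Two "tweets" already converted to integers, with different lengths
sequences = [[5, 12, 7], [3, 9]]
print(pad_sequences(sequences, maxlen=5))
# [[ 0  0  5 12  7]
#  [ 0  0  0  3  9]]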

Stemming:

Stemming is the task of reducing a word to its root or base form by removing extra characters. For example, stemming turns both "working" and "worked" into "work".

I used the Snowball Stemmer, a stemming algorithm also known as the Porter2 stemming algorithm. It is an improved version of the Porter Stemmer, with several of its issues fixed.
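A quick demonstration:

# Both inflections reduce to the same stem
stemmer = SnowballStemmer("english")
print(stemmer.stem("working"), stemmer.stem("worked"))  # work work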

Word Embedding:

A word embedding is a learned representation of text in which words with similar meanings have similar representations. Each word is mapped to one vector, and the vector values are learned during training, much like the weights of a neural network.

Now let’s look at the entire data cleaning code:

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

def clean_text(each_text):

    # remove URLs from the text
    each_text_no_url = re.sub(r"http\S+", "", each_text)

    # remove numbers from the text
    text_no_num = re.sub(r'\d+', '', each_text_no_url)

    # tokenize the text
    word_tokens = word_tokenize(text_no_num)

    # remove special characters
    clean_tokens = []
    for word in word_tokens:
        clean_tokens.append("".join([e for e in word if e.isalnum()]))

    # remove stop words and lowercase
    # (lowercase before the stop-word check, since the stop-word list is lowercase)
    text_with_no_stop_word = [w.lower() for w in clean_tokens if w.lower() not in stop_words]

    # do stemming
    stemmed_text = [stemmer.stem(w) for w in text_with_no_stop_word]

    return " ".join(" ".join(stemmed_text).split())

raw_data['clean_text'] = raw_data['text'].apply(clean_text)
raw_data['keyword'] = raw_data['keyword'].fillna("none")
raw_data['clean_keyword'] = raw_data['keyword'].apply(clean_text)

To use both the "text" and "keyword" columns, there are various possible approaches; one simple approach that I applied was to combine the two features into a new feature called "keyword_text".

# Combine column 'clean_keyword' and 'clean_text' into one
raw_data['keyword_text'] = raw_data['clean_keyword'] + " " + raw_data["clean_text"]

I have used Sklearn’s "train_test_split" function to do a train/test split with data shuffling.

feature = "keyword_text"
label = "target"

# split train and test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    raw_data[feature],
    raw_data[label],
    test_size=0.3,
    random_state=0,
    shuffle=True)

As mentioned in the vectorization section, we have to convert the text to numbers, since machine learning models can only work with numbers, so we use "CountVectorizer" here. We fit and transform on the train data and only transform on the test data; make sure no fitting happens on the test data.

# Vectorize text
vectorizer = CountVectorizer()
X_train_GBC = vectorizer.fit_transform(X_train)
X_test_GBC = vectorizer.transform(X_test)

GradientBoostingClassifier:

Gradient Boosting Classifier is a machine learning algorithm that combines many weak learners, such as decision trees, to create a strong predictive model.

model = ensemble.GradientBoostingClassifier(learning_rate=0.1,                                            
                                            n_estimators=2000,
                                            max_depth=9,
                                            min_samples_split=6,
                                            min_samples_leaf=2,
                                            max_features=8,
                                            subsample=0.9)
model.fit(X_train_GBC, y_train)

A good metric to evaluate the performance of our model is the F-score. Before calculating the F-score, let’s get familiar with Precision and Recall.

Precision: out of the data points we labeled positive, how many are actually positive.

Recall: out of the data points that are actually positive, how many we labeled positive.

F-score: the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall).

# Evaluate the model
predicted_prob = model.predict_proba(X_test_GBC)[:, 1]
predicted = model.predict(X_test_GBC)

accuracy = metrics.accuracy_score(y_test, predicted)
print("Test accuracy: ", accuracy)
print(metrics.classification_report(y_test, predicted, target_names=["0", "1"]))
print("Test F-score: ", metrics.f1_score(y_test, predicted))
Test accuracy:  0.7986784140969163
              precision    recall  f1-score   support

           0       0.79      0.88      0.83      1309
           1       0.81      0.69      0.74       961

    accuracy                           0.80      2270
   macro avg       0.80      0.78      0.79      2270
weighted avg       0.80      0.80      0.80      2270

Test F-score:  0.7439775910364146
Image by author

A confusion matrix is a table that shows the performance of a classification model by comparing predicted classes against actual classes. As we can see in the plot, our model performed better at detecting target value "0" than target value "1".
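A sketch of how such a confusion-matrix plot can be produced for the Gradient Boosting model, reusing the predictions from the evaluation code above:

# Rows are true labels, columns are predicted labels
cm = metrics.confusion_matrix(y_test, predicted)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()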


LSTM:

LSTM stands for Long Short-Term Memory network, a kind of RNN (Recurrent Neural Network) that is capable of learning long-term dependencies; thanks to its internal memory design, it can remember information for long periods of time.

I have talked about word embeddings above; now it is time to use them for our LSTM approach. I used the GloVe embeddings from Stanford, which you can download from here. After reading the GloVe embedding file, we create an embedding layer using Keras.
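The embedding code below relies on a vocabulary built from the training data, which this article does not show explicitly. Here is a minimal sketch of that step; the sequence length, embedding dimension, and GloVe file path are my assumptions, not values fixed by the article:

# Assumed choices (adjust to your setup)
sequence_len = 30                           # maximum tweet length in tokens
embedding_dim = 100                         # must match the GloVe file dimension
path_to_glove_file = "./glove.6B.100d.txt"  # assumed local path to the GloVe file

# Build the vocabulary from the training tweets
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

# Convert tweets to padded integer sequences
# (this overwrites the raw text splits, matching the evaluation code later)
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=sequence_len)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=sequence_len)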

# Read word embeddings
embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
# Define embedding layer in Keras
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = tf.keras.layers.Embedding(vocab_size,
                                            embedding_dim,
                                            weights=[embedding_matrix],
                                            input_length=sequence_len,
                                            trainable=False)

For the LSTM model, I started with an embedding layer to generate an embedding vector for each input sequence, then used a convolutional layer to lower the number of features, followed by a bidirectional LSTM layer. The last layer is a dense layer; because this is binary classification, we use sigmoid as its activation function.

# Define model architecture
sequence_input = Input(shape=(sequence_len, ), dtype='int32')
embedding_sequences = embedding_layer(sequence_input)

x = Conv1D(128, 5, activation='relu')(embedding_sequences)
x = Bidirectional(LSTM(128, dropout=0.5, recurrent_dropout=0.2))(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(sequence_input, outputs)
model.summary()

For model optimization, I have used the Adam optimizer with binary_crossentropy as the loss function.

# Optimize the model (the learning rate value is an assumption; the article does not specify it)
learning_rate = 1e-4
model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss='binary_crossentropy',
              metrics=['accuracy'])
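The training call itself is not shown in the article; a minimal sketch, assuming a batch size of 32, 10 epochs, and a 10% validation split:

# Train the model, keeping the History object for the learning curves below
history = model.fit(X_train, y_train,
                    batch_size=32,
                    epochs=10,
                    validation_split=0.1)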

After the model training is done, I wanted to see the learning curves of training accuracy and loss. The plot shows that the model’s accuracy increases and its loss decreases over each epoch, as expected.
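A sketch of how such a plot can be produced from the History object returned by model.fit:

# Plot the learning curves recorded during training
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.xlabel("Epoch")
plt.legend()
plt.show()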

Image by author

Now that the model is trained, it is time to evaluate its performance. I will compute the model’s accuracy and F-score on the test data. Because the predicted value is a float between 0 and 1, I used 0.5 as the threshold to separate "0" and "1".

# Evaluate the model
predicted = model.predict(X_test, verbose=1, batch_size=10000)

y_predicted = [1 if each > 0.5 else 0 for each in predicted]

score, test_accuracy = model.evaluate(X_test, y_test, batch_size=10000)

print("Test Accuracy: ", test_accuracy)
print(metrics.classification_report(list(y_test), y_predicted))

Test Accuracy:  0.7726872
              precision    recall  f1-score   support

           0       0.78      0.84      0.81      1309
           1       0.76      0.68      0.72       961

    accuracy                           0.77      2270
   macro avg       0.77      0.76      0.76      2270
weighted avg       0.77      0.77      0.77      2270

As we can see in the confusion matrix, the RNN approach performed very similarly to the Gradient Boosting Classifier approach. The model did a better job of detecting "0" than detecting "1".

Image by author

Conclusion:

As you can see, the outputs of both approaches were very close to each other, but the Gradient Boosting Classifier trained much faster than the LSTM model.

There are many ways to improve the model’s performance such as modifying input data, applying different training approaches, or using hyperparameter search algorithms such as GridSearch or RandomizedSearch to find the best values for hyperparameters.

You can find the entire code here:


Reference:

Keras documentation: Using pre-trained word embeddings

