Applying a supervised or semi-supervised ULMFiT model to the Twitter US Airlines Sentiment Dataset

Aadit Kapoor
Towards Data Science

Our task is to apply a supervised/semi-supervised technique, ULMFiT (Howard and Ruder, 2018), to the Twitter US Airlines sentiment analysis data.
The approach can be called semi-supervised because it starts with unsupervised language-model training and then fine-tunes the network with a classifier added on top.

We use the Twitter US Airlines dataset (https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

https://unsplash.com/photos/rf6ywHVkrlY

We will start by:

  • Exploring the dataset, preprocessing and preparing it for the model
  • Exploring a bit of history in sentiment analysis
  • Exploring Language Models and why they are important
  • Setting the baseline model
  • Exploring the techniques to perform text classification
  • A brief overview of ULMFiT
  • Applying ULMFiT to the Twitter US Airlines data
  • Results and Prediction
  • Conclusion and Future Direction

The Dataset

We will start by exploring the dataset statistics and performing all the mandatory feature transformations.

  • As this is a multiclass classification problem, we will encode the target variable.
  • We will change the order in which the columns are presented.
  • We will compute basic statistics to get some insight from the data.
  • Finally, we will split the new data frame into df_train, df_val, and df_test.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder

    # Loading dataset
    df = pd.read_csv(DATA_DIR)

    # LabelEncoder to change positive, negative and neutral to numbers (classes)
    labelEncoder = LabelEncoder()

    def cleanAscii(text):
        """Remove non-ASCII characters from the dataset.

        Arguments:
            text: str
        """
        return ''.join(i for i in text if ord(i) < 128)

    def gather_texts_and_labels(df=None, test_size=0.15, random_state=42):
        """Gathers the text and the corresponding labels from the dataset and splits it.

        Arguments:
            df: Pandas DataFrame
            test_size: represents the test size
            random_state: represents the random state

        Returns:
            (df_train, df_test, df_val, new_df)
        """
        # texts
        texts = df["text"].values
        # encoding labels (positive, neutral, negative)
        df['airline_sentiment'] = labelEncoder.fit_transform(df['airline_sentiment'])
        labels = df['airline_sentiment'].values
        # changing the order for fastai tokenizers to capture data.
        new_df = pd.DataFrame(data={"label": labels, "text": texts})
        df_train, df_test = train_test_split(new_df, stratify=new_df['label'], test_size=test_size, random_state=random_state)
        df_train, df_val = train_test_split(df_train, stratify=df_train['label'], test_size=test_size, random_state=random_state)
        print("Training: {}, Testing: {}, Val: {}".format(len(df_train), len(df_test), len(df_val)))
        return df_train, df_test, df_val, new_df

    def describe_dataset(df=None):
        """Describes the dataset.

        Arguments:
            df: Pandas DataFrame
        """
        print(df["airline_sentiment"].value_counts())
        print(df["airline"].value_counts())
        print("\nMean airline_sentiment_confidence is {}".format(df.airline_sentiment_confidence.mean()))

    # Optional
    # Note: negativereason is generally only filled in for negative tweets,
    # so appending it gives the model a strong hint about the label.
    def add_negativereason_to_text(df=None):
        # change negativereason to "" if NaN else remain as is.
        df['negativereason'] = df['negativereason'].apply(lambda x: "" if pd.isna(x) else x)
        # add negativereason to text
        df['text'] = df['text'] + df['negativereason']

    add_negativereason_to_text(df)
    df['text'] = df['text'].apply(cleanAscii)
    describe_dataset(df)
    df_train, df_test, df_val, new_df = gather_texts_and_labels(df)

stats for the data
Some essential functions
Some visual stats
Some more stats

We will rely on several metrics to measure the performance of the model (precision, recall, and F1 score).
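
To make these metrics concrete, here is a minimal pure-Python sketch that computes per-class precision, recall, and F1 from two label lists. The helper name `per_class_metrics` and the toy labels are illustrative assumptions and are not part of the original notebook, which uses scikit-learn's `classification_report` later on.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} for every class seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class was t
            fn[t] += 1  # an example of class t was missed
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics

# Toy labels (hypothetical): 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]

metrics = per_class_metrics(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

for c, (p, r, f1) in metrics.items():
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

Averaging the per-class F1 scores (macro F1) treats every class equally, which is why it is a useful complement to accuracy when one class dominates the data.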

History

Before ULMFiT (2018), or transfer learning in NLP to be precise, we used word embeddings such as word2vec or GloVe to represent words as dense vectors. Generally, we used an embedding layer as the first layer in the model and then attached a classifier according to our needs. This made the system very difficult to train, as it required huge amounts of data. These embeddings followed the early statistical LMs, which used probability distributions to represent words, and they rest on the idea that a word is known “by the company it keeps”.

  • ULMFiT, BERT, Universal Sentence Encoder, and OpenAI GPT-2 use neural language models to represent words in a distributed fashion and allow us to fine-tune a large pre-trained language model for our own tasks.
  • Specifically, ULMFiT (2018) introduced three novel techniques to fine-tune pre-trained language models.
  • Fine-tuning was already a popular method in computer vision, and while it had been tried in NLP, the earlier approaches did not work well before ULMFiT.

Further in the article, we will see an overview of the language model and the classifier.

Setting the baseline

Before any machine learning experiment, we should always set up a baseline and compare our results with it.

To set up the baseline, we will use a word2vec-style embedding matrix to try to predict sentiment.

  • To hold the word embeddings, we will use an embedding layer, followed by a basic feed-forward NN to predict sentiment.

We could also have loaded pre-trained word2vec or GloVe embeddings into our embedding layer.
We could have used an LSTM or CNN after the embedding layer, followed by a softmax activation.

    # Assuming the Keras bundled with TensorFlow
    from tensorflow import keras
    from tensorflow.keras.models import Sequential

    # Clean the texts used to fit the tokenizer
    texts = df['text'].apply(cleanAscii).values
    tokenizer = keras.preprocessing.text.Tokenizer(num_words=5000, oov_token='<OOV>')
    # fitting
    tokenizer.fit_on_texts(texts)
    vocab_size = len(tokenizer.word_index) + 1

    # max length to be padded (batch_size, 100)
    max_length = 100
    train_text = tokenizer.texts_to_sequences(df_train['text'].values)
    test_text = tokenizer.texts_to_sequences(df_test['text'].values)

    # getting the padded length of 100
    padded_train_text = keras.preprocessing.sequence.pad_sequences(train_text, max_length, padding='post')
    padded_test_text = keras.preprocessing.sequence.pad_sequences(test_text, max_length, padding='post')
    labels_train = keras.utils.to_categorical(df_train['label'].values, 3)
    labels_test = keras.utils.to_categorical(df_test['label'].values, 3)

    # Note: Accuracy() checks predictions and labels for exact equality;
    # with one-hot labels and softmax outputs, keras.metrics.CategoricalAccuracy()
    # is the usual choice.
    metrics = [keras.metrics.Accuracy()]

    net = Sequential()
    # return 50 dimension embedding representation with input_length as 100
    net.add(keras.layers.Embedding(vocab_size, 50, input_length=max_length))
    net.add(keras.layers.Flatten())
    net.add(keras.layers.Dense(512, activation='relu'))
    net.add(keras.layers.Dense(3, activation='softmax'))
    net.compile(optimizer='adam', loss=keras.losses.categorical_crossentropy, metrics=metrics)
    net.summary()

Model summary
Training

    # test the baseline model
    def test_baseline_sentiment(text):
        """Test the baseline model.

        Arguments:
            text: str
        """
        padded_text = keras.preprocessing.sequence.pad_sequences(
            tokenizer.texts_to_sequences([text]), max_length, padding='post')
        print(net.predict(padded_text).argmax(axis=1))

    net.evaluate(padded_test_text, labels_test)
    preds = net.predict(padded_test_text).argmax(axis=1)

As you can see, with a simple feed-forward NN and an embedding layer, we barely reach an accuracy of 12%.
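
One caveat about this number: the baseline is compiled with `keras.metrics.Accuracy()`, which measures how often predictions exactly equal the labels, element by element. With one-hot labels and softmax probabilities, that is not the same as the usual "argmax" accuracy reported by `keras.metrics.CategoricalAccuracy()`, so the figure may understate the baseline. The pure-Python sketch below mimics both computations on made-up numbers; the helper names and values are assumptions for illustration, not output of the model.

```python
def elementwise_accuracy(y_true, y_pred):
    """Fraction of individual entries where the prediction equals the label exactly."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
    return sum(t == p for t, p in pairs) / len(pairs)

def argmax_accuracy(y_true, y_pred):
    """Fraction of rows where the most probable class matches the one-hot label."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    return sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical one-hot labels and softmax-like outputs for four tweets
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0], [0.3, 0.6, 0.1]]

exact = elementwise_accuracy(y_true, y_pred)
by_argmax = argmax_accuracy(y_true, y_pred)
print(f"element-wise equality: {exact:.2f}")
print(f"argmax accuracy:       {by_argmax:.2f}")
```

In this toy case three of the four rows pick the right class, yet only the fully saturated row counts under exact equality.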

Loading the Language Model and Fine Tuning

fastai provides us with an easy-to-use language model (AWD-LSTM) trained on Wikipedia text.

We will start by loading the LM data and initializing it with the required data.

    # fastai v1 API
    from fastai.text import *

    data_lm = TextLMDataBunch.from_df(train_df=df_train, valid_df=df_val, path="")

    # Saving the data_lm as a backup
    data_lm.save("data_lm_twitter.pkl")

    # Loading the language model (AWD_LSTM)
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    print(learn)

Our sample data

As you can see, the fastai library uses a spaCy tokenizer, so we do not perform any preprocessing on the data except removing non-ASCII characters. The tokenization process is well tested empirically by the authors of ULMFiT.

Training

    # Finding the optimal learning rate
    learn.lr_find(start_lr=1e-8, end_lr=1e2)
    learn.recorder.plot()

    # Fit using one cycle policy
    learn.fit_one_cycle(1, 1e-2)

    # Unfreeze all layers
    learn.unfreeze()

    # fit one cycle for 10 epochs
    learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))

    # save the encoder; the classifier needs the encoder in particular
    learn.save_encoder('fine_tuned_enc')

Model progress

Text Classification

We now add our classifier on top of the network (fine-tuning). This is the last step: adding the task-specific classifier to the pre-trained language model.

This is the gradual unfreezing step.

    # Preparing the classifier data
    data_clas = TextClasDataBunch.from_df(path="", train_df=df_train, valid_df=df_val, test_df=df_test, vocab=data_lm.train_ds.vocab)

    # Building the classifier
    learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)

    # loading the saved encoder from the LM
    learn.load_encoder('fine_tuned_enc')

    # initial fine-tuning pass with the one cycle policy
    learn.fit_one_cycle(3, 1e-2, moms=(0.8, 0.7))  # you can of course train more, Jeremy promises it's hard to overfit here :D

    # fine-tuning the network layer group by layer group to preserve as much information as possible
    learn.freeze_to(-2)  # unfreeze last 2 layer groups
    learn.fit_one_cycle(2, slice(1e-2/(2.6**4), 1e-2), moms=(0.8, 0.7))

    learn.freeze_to(-3)  # unfreeze last 3 layer groups
    learn.fit_one_cycle(2, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

    learn.freeze_to(-4)  # unfreeze last 4 layer groups
    learn.fit_one_cycle(2, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

    learn.freeze_to(-5)  # unfreeze last 5 layer groups
    learn.fit_one_cycle(2, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

    # Unfreezing all the layers and training
    learn.unfreeze()
    learn.fit_one_cycle(3, slice(1e-3/(2.6**4), 1e-3), moms=(0.8, 0.7))

We achieve an accuracy of 94%.
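
In the classifier code above, the `slice(lr/(2.6**4), lr)` arguments are how discriminative fine-tuning shows up: the last layer group trains at `lr`, and each earlier group trains at a rate 2.6 times smaller, the factor suggested in the ULMFiT paper. As far as I understand fastai's behaviour, the rates are spread geometrically between the two ends of the slice. A minimal sketch of that spacing, assuming five layer groups (which is what the exponent 4 implies) and a hypothetical helper name `discriminative_lrs`:

```python
def discriminative_lrs(max_lr, n_groups=5, factor=2.6):
    """Learning rate per layer group, ordered from the earliest group to the last.

    The last group gets max_lr; each earlier group gets the rate of the
    group above it divided by `factor`.
    """
    return [max_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(1e-2)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.6f}")
```

The lower layers hold the most general language knowledge, so giving them smaller steps changes them less than the task-specific layers near the top.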

Brief Overview of ULMFiT

Overview of the ULMFiT process

https://arxiv.org/abs/1801.06146

The stages of the process are as follows:

  • LM pretraining: This is the step where we use unsupervised learning to capture the semantic and probabilistic representations of a large corpus (WikiText-103).
  • LM fine-tuning: This is the step where we fine-tune the LM using certain novel techniques. As each layer of the AWD-LSTM (the pretrained model) captures different information about the corpus, we first fine-tune the last layer, as it contains the least general information, while all other layers are frozen. We then unfreeze all other layers and retrain the model on the specified task. In this way, we do not lose the information learned during pretraining. Training uses the slanted triangular learning rate (a cyclic learning rate in triangular mode).
  • Classifier fine-tuning: This is the last step, where the classifier is attached to the top of the model and trained using gradual unfreezing, that is, by unfreezing the model layer by layer.
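
The slanted triangular learning rate mentioned above raises the rate linearly for a short fraction of training and then lowers it linearly for the rest. Below is a minimal sketch of the schedule as given in the ULMFiT paper, using the paper's suggested defaults (`cut_frac=0.1`, `ratio=32`, `eta_max=0.01`). The function name is my own, and note that the fastai code in this article calls `fit_one_cycle`, which is a related but not identical schedule.

```python
import math

def slanted_triangular_lr(t, total_steps, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t (0-based) out of total_steps.

    Assumes total_steps * cut_frac >= 1 so that the warm-up phase is not empty.
    """
    cut = math.floor(total_steps * cut_frac)  # step at which the rate peaks
    if t < cut:
        p = t / cut  # linear warm-up from 0 to 1
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay back towards 0
    return eta_max * (1 + p * (ratio - 1)) / ratio

total = 1000
schedule = [slanted_triangular_lr(t, total) for t in range(total)]
for step in (0, 50, 100, 500, 999):
    print(f"step {step:4d}: lr = {schedule[step]:.6f}")
```

The rate starts at `eta_max / ratio`, peaks at `eta_max` after the first 10% of the steps, and then decays slowly, which lets the model move quickly to a good region and then refine its parameters.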

The three techniques introduced by ULMFiT are:

  • Discriminative fine-tuning
  • Slanted triangular learning rates
  • Gradual unfreezing
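
The last of these techniques can be sketched independently of any library: start with only the last layer group trainable, and at each stage make one more group trainable, working from the top down. The layer-group names below are hypothetical placeholders; in the fastai code the same effect is obtained with `learn.freeze_to(-2)`, `learn.freeze_to(-3)`, and so on.

```python
def gradual_unfreezing_stages(layer_groups):
    """Yield the list of trainable layer groups for each training stage.

    Stage 1 trains only the last group; each later stage also unfreezes
    the group just below, until every group is trainable.
    """
    for n_unfrozen in range(1, len(layer_groups) + 1):
        yield layer_groups[-n_unfrozen:]

# Hypothetical names for an encoder with a classifier head on top
groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]
stages = list(gradual_unfreezing_stages(groups))

for i, trainable in enumerate(stages, start=1):
    print(f"stage {i}: train {trainable}")
```

Training the top first and the bottom last keeps the general knowledge in the lower layers from being overwritten early on.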

ULMFiT on Twitter US Airlines Sentiment (Prediction and Accuracy)

    import numpy as np
    from sklearn.metrics import classification_report, confusion_matrix

    # Class index -> sentiment name, taken from the fitted LabelEncoder
    # (assumption: `mapping` is not defined in the code shown in this article)
    mapping = dict(enumerate(labelEncoder.classes_))

    def get_sentiment(text: str):
        """Get the sentiment of text.

        Arguments:
            text: the text whose sentiment is to be predicted
        """
        index = learn.predict(text)[2].numpy().argmax()
        print("Predicted sentiment: {}".format(mapping[index]))

    def evaluate():
        """Evaluates the network on the test set.

        Arguments:
            None

        Returns:
            (preds, labels)
        """
        texts = df_test['text'].values
        labels = df_test['label'].values
        preds = []
        for t in texts:
            preds.append(learn.predict(t)[1].numpy())
        preds = np.array(preds)
        acc = (labels == preds).mean() * 100
        print("Test Accuracy: {}".format(acc))
        return preds, labels

    get_sentiment("This is amazing")
    preds, labels = evaluate()
    print(classification_report(labels, preds, labels=[0, 1, 2]))
    print(confusion_matrix(labels, preds))

model results
confusion matrix
  • As you can see, our model is good, but it can be improved by experimenting with the hyperparameters.
  • Looking at the confusion matrix, we can see that our model classifies most of the examples correctly.
  • Black represents 0 and, from the plot, most of the predictions come out as black.
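
For readers who want to see how such a matrix is built, here is a small pure-Python sketch with toy labels. Rows are true classes and columns are predicted classes, the same layout scikit-learn's `confusion_matrix` uses. The helper name and the data are illustrative assumptions, not results from the model.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """matrix[i][j] = number of examples with true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels (hypothetical): 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]

cm = confusion_matrix_counts(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)

# Correct predictions sit on the diagonal
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")
```

Off-diagonal cells show which classes get confused with each other, which accuracy alone does not reveal.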

Conclusion and Future Direction

To conclude, we achieve the following results:

  • We train a model to predict the sentiment of a tweet using the Twitter US Airlines dataset.
  • We use ULMFiT (Howard and Ruder, 2018) to train our model with the novel techniques described above.
  • We use the popular fastai library to train the model, as it contains the pre-trained weights for AWD-LSTM.
  • We achieve a test accuracy of 94%, and as our dataset is imbalanced, we also use metrics such as the F1 score.
  • We get an F1 score of 89.
  • We further examine our model’s performance using a confusion matrix.

To build a better model, we can also use other language models and techniques such as BERT, USE, Transformers, XLNet, etc.

Colab Notebook: https://colab.research.google.com/drive/1eiSmiFjg1aeNgepSfSEB55BJccioP5PQ?usp=sharing
