
Predicting Tweet Sentiment With Word2Vec Embeddings

Implementing Word2Vec Embeddings and Using Them to Predict Tweet Sentiment

Photo by Ravi Sharma on Unsplash

As the year comes to a close, I’ve been revisiting previous work, mainly for fun and to have things in order for the beginning of the new year. Previously, I wrote an introduction to sentiment analysis, then started a project where I used Machine Learning to predict the sentiment of a tweet, and lastly used that same project to determine a systematic way of improving our machine learning models.

Getting Started With Sentiment Analysis

Predicting Tweet Sentiment with Machine Learning

Systematically Improving Your Machine Learning Model

Using our systematic strategy to improve our models, we were able to raise our score on the Kaggle leaderboard from 0.71008 to 0.79374. That is a decent jump; however, these scores are still quite low from a Kaggle point of view, so we would want to improve on them.

Error analysis is very beneficial for determining what sort of errors our algorithm is making, hence providing us with good insight into what we can improve to get the most significant improvement in our model. On the other hand, error analysis is incapable of distinguishing whether something like stemming, or how we vectorize the text, is beneficial to the final solution; the only way to know is to try it and evaluate the change on the evaluation metric. – An extract from Systematically Improving Your Machine Learning Model

Vector Representations

At first, we used the counts of word occurrences to vectorize the tweets, then used the Naive Bayes classifier to predict which class each tweet belongs to.

Algorithms From Scratch: Naive Bayes Classifier

The issue with this way of vectorizing our tweets is that it doesn’t take the context of the tweet into account, and this could be detrimental to our classifier. For example, imagine this is the tweet we want to classify as a disaster or not: "I had a good day today, but the weather was not good". A human could easily tell that this is a negative tweet, but remember that a computer doesn’t see words, it sees numbers, so we have to vectorize the tweet. If we vectorize it using the counts of each word, we are assuming that each word is independent of every other word, which isn’t true, even though our Naive Bayes classifier did well on this task.
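For context, here is a minimal sketch of that earlier count-based approach. The data and variable names below are placeholders for illustration, not the actual code from the earlier project:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder data standing in for the cleaned tweets and their labels
tweets = [
    "I had a good day today, but the weather was not good",
    "Forest fire spreading fast, evacuations underway",
]
labels = [0, 1]  # 0 = not a disaster, 1 = disaster

# Bag-of-words counts: each tweet becomes a vector of word frequencies,
# with no notion of word order or context
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

# Multinomial Naive Bayes then treats these counts as conditionally
# independent given the class
classifier = MultinomialNB().fit(X, labels)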

We can arrive at better vector representations by using word embeddings. Word embeddings (a.k.a. word vectors) are numerical representations of words that capture the semantic and syntactic underpinnings of each word. The encodings are learned by taking into consideration the context in which the words appear, such that words that appear in similar contexts have similar word vectors. For instance, "tiger" and "lion" would be close together, but both would be far away from "planet" and "castle".

What’s great about these word vectors is that we can find relationships between words by doing mathematical operations on them. If you compute king – man + woman, the resulting vector is closest to the vector for queen.

Figure 1: Mathematical operations used to find relationships between vectors (Image By Author)
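As a quick illustration of this arithmetic, here is roughly how you could check it with Gensim and a set of pre-trained vectors. The specific model name below is just one of the options available through Gensim’s downloader; any pre-trained word vectors would do:

import gensim.downloader as api

# Download and load a set of pre-trained GloVe vectors (roughly a 130 MB download)
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman should land closest to "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Words that appear in similar contexts sit close together...
print(wv.similarity("tiger", "lion"))
# ...while unrelated words are further apart
print(wv.similarity("tiger", "castle"))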

Word2Vec

In an attempt to improve our model, we will be implementing the Word2Vec algorithm, which allows us to arrive at a distributed representation of words. There are two model architectures underlying the Word2Vec algorithm: the Continuous Bag of Words (CBOW) and the Continuous Skip-Gram.

Though similar in the sense that they both use a shallow two-layer neural network and both take in a large corpus of text to produce a vector space, with each word in the corpus corresponding to a vector in that space, the two architectures arrive at these vectors in quite different ways.

The CBOW model architecture predicts the current word from a window of surrounding context words which makes the assumption that the order of the context words does not influence the prediction.

Figure 2: The CBOW model architecture (Source: https://arxiv.org/pdf/1301.3781.pdf, Mikolov et al.)

Alternatively, the Continuous Skip-Gram model architecture uses the current word to predict the surrounding window of context words, giving more weight to nearby context words than to more distant ones.

Figure 3: The Skip-gram model architecture (Source: https://arxiv.org/pdf/1301.3781.pdf, Mikolov et al.)

Note: According to the authors, CBOW is faster, while skip-gram is slower but does a better job with infrequent words. We will implement both and see how each model does.
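In Gensim, the library I use later in this post, switching between the two architectures comes down to a single flag (sg), so trying both is cheap. A minimal sketch on a toy, already-tokenized corpus, assuming Gensim 4.x:

from gensim.models import Word2Vec

# Toy tokenized corpus, just to show the API
sentences = [
    ["forest", "fire", "near", "the", "town"],
    ["what", "a", "lovely", "sunny", "day"],
]

# sg=0 -> CBOW: predict the current word from its surrounding context
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the surrounding context from the current word
skip_gram_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)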

The simplest way to use Word2Vec embeddings is to load pre-trained word embeddings. Pre-trained models are simply word embeddings that were trained on another dataset – for example, the entire corpus of Wikipedia or the Google News dataset – and we can load these embeddings and use them for our task.

The most obvious advantage of this is that we can leverage embeddings built from massive datasets containing billions of words, and it removes the need for us to extract, clean, and process those large datasets ourselves. On the other hand, pre-trained embeddings may not capture the peculiarities of the language in our specific domain. People generally use informal language on Twitter (e.g. "Wyu2", "OMW", etc.) and these terms may not be captured by the pre-trained models.

Nonetheless, we will be training our own word embeddings from our corpus. I’ve used Gensim, a popular natural language processing and topic modeling framework, to apply the Word2Vec algorithm.
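The training step itself only takes a few lines with Gensim. The sketch below is illustrative rather than the exact code from the repo: the file name, column names, and hyperparameter values are assumptions on my part, and it assumes Gensim 4.x (where the dimensionality argument is called vector_size):

import pandas as pd
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Load the training tweets (file and column names assumed; see the repo for the real data handling)
train_df = pd.read_csv("train.csv")

# Tokenize each tweet; simple_preprocess lowercases and strips punctuation
tokenized_tweets = [simple_preprocess(tweet) for tweet in train_df["text"]]

# Train skip-gram embeddings on the tweet corpus (sg=0 would give CBOW instead)
w2v_model = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=2,      # ignore very rare tokens
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=4,
)

# The learned vectors live on the model's KeyedVectors object
print(w2v_model.wv.most_similar("fire", topn=5))

Because the embeddings are learned from the tweets themselves, the informal Twitter language ends up in the vocabulary, which is the main reason for training our own vectors rather than relying on a pre-trained model.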

Please visit my GitHub portfolio for the full script.

kurtispykes/twitter-sentiment-analysis

Evaluating our Word Embeddings

There are two ways we can evaluate our word embeddings to determine their quality: intrinsic and extrinsic.

Intrinsic evaluation of word embeddings is when we evaluate the generated embeddings on a specific intermediate subtask such as analogy completion. Analogy completion consists of example term pairs and a query, for instance, "London is to England as Paris is to ___", and the task is to correctly fill in the blank.

Extrinsic evaluation of word embeddings is when we evaluate the word vectors by applying them to the task at hand. In our case, that task is sentiment analysis: once we generate our embeddings, we pass them into a classifier and evaluate the outcome. This form of evaluation is typically slower to compute and more elaborate than intrinsic evaluation.
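To make the extrinsic route concrete, here is a rough sketch that continues from the training snippet above (it reuses train_df, tokenized_tweets, and w2v_model). Averaging the word vectors of each tweet is an assumption on my part about how the tweet-level features are built, a common choice but not necessarily what the full script does; the target column name is also assumed:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def tweet_vector(tokens, wv, dim=100):
    # Average the vectors of a tweet's in-vocabulary tokens (zeros if there are none)
    vectors = [wv[token] for token in tokens if token in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

X = np.array([tweet_vector(tokens, w2v_model.wv) for tokens in tokenized_tweets])
y = train_df["target"].values

# 5-fold cross-validation scored with F1, mirroring the results reported below
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    print(f"Fold {fold}")
    print("Train f1:", f1_score(y[train_idx], clf.predict(X[train_idx])))
    print("Val f1:", f1_score(y[val_idx], clf.predict(X[val_idx])))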

With that said, we are going to evaluate our word embeddings extrinsically, using the F1 score as the evaluation metric. Here is how our Word2Vec embeddings, specifically those produced by the Continuous Skip-Gram architecture, performed on the training data with a Logistic Regression:

Logistic Regression
Fold 1
Train f1: 0.7321764582897763
Val f1: 0.7062706270627062
Fold 2
Train f1: 0.7255189767246802
Val f1: 0.7187765505522514
Fold 3
Train f1: 0.715187210769878
Val f1: 0.7360532889258952
Fold 4
Train f1: 0.7226537896283856
Val f1: 0.7138157894736842
Fold 5
Train f1: 0.7307772889168238
Val f1: 0.6845637583892618

Let’s also check how the model does with a Random Forest Classifier:

Random Forest Classifier
Fold 1
Train f1: 0.9922898997686971
Val f1: 0.7015437392795884
Fold 2
Train f1: 0.9917132395451916
Val f1: 0.7080419580419579
Fold 3
Train f1: 0.9922839506172839
Val f1: 0.7291311754684837
Fold 4
Train f1: 0.9930635838150289
Val f1: 0.6848381601362862
Fold 5
Train f1: 0.9923017705927637
Val f1: 0.6888694127957932

and lastly, SVM:

SVM
Fold 1
Train f1: 0.8672164948453608
Val f1: 0.725925925925926
Fold 2
Train f1: 0.8634655532359081
Val f1: 0.7072961373390558
Fold 3
Train f1: 0.8673998754928408
Val f1: 0.7487603305785123
Fold 4
Train f1: 0.8674147963424771
Val f1: 0.7133105802047782
Fold 5
Train f1: 0.8687202811660121
Val f1: 0.7007672634271099

For more information on these models, visit the Algorithms From Scratch series…

Algorithms from Scratch: Logistic Regression

Algorithms From Scratch: Decision Tree

Algorithms From Scratch: Support Vector Machine

Note: I wasn’t able to get around an error I was receiving when trying to use Naive Bayes – ValueError: Negative Values in Data Passed to MultinomialNB (Input X) – so I ended up scrapping the Naive Bayes classifier. (MultinomialNB expects non-negative, count-like features, whereas word embeddings contain negative values.)

Given these results, I decided to submit the SVM output to Kaggle…

Figure 4: SVM leaderboard score

We weren’t able to improve on our leaderboard score. In the future, I will try out more powerful Machine Learning models such as LightGBM and Neural Networks, as well as extract the word embeddings using different techniques. Once I’ve found a model I am happy with, I will work on some feature engineering to try to squeeze the most out of the model before considering stacking and blending to push the score that little bit further.

Note: This is all about the leaderboard and I am not trying to build a real-world tool.

Thank you for reading to the end, connect with me on LinkedIn to keep in touch:

Kurtis Pykes – Data Scientist – Upwork | LinkedIn

