Twitter Sentiment Analysis using fastText

Sanket Doshi
Towards Data Science
9 min read · Mar 5, 2019


In this blog, we’ll analyze the sentiment of various tweets using the fastText library, which is easy to use and fast to train.


What is fastText?

FastText is an NLP library developed by Facebook AI. It is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices.

Why fastText?

The main disadvantage of deep neural network models is that they take a long time to train and test. Here, fastText has an advantage: it takes very little time to train and can be trained on our home computers at high speed.

As per the Facebook AI blog on fastText, the accuracy of this library is on par with deep neural networks, while requiring much less time to train.

comparison between fastText and other deep learning based models

Now that we know what fastText is and why we’re using it, let’s see how to use this library for sentiment analysis.

Get Dataset

We’ll be using the dataset available on betsentiment.com. Each tweet has one of four labels: positive, negative, neutral, or mixed. We’ll ignore all tweets with the mixed label.

We’ll use the teams tweet dataset as the training set and the players tweet dataset as the validation set.
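As a quick sketch of this step, we can load both files with pandas and drop the mixed tweets. The teams file name appears later in this post; the players file name is my assumption based on the same naming pattern, so adjust it to match your download.

import pandas as pd

# 'teams' file name is used later in this post; 'players' is assumed to follow the same pattern.
train_df = pd.read_csv('betsentiment-EN-tweets-sentiment-teams.csv', encoding='latin1')
valid_df = pd.read_csv('betsentiment-EN-tweets-sentiment-players.csv', encoding='latin1')

# Keep only the three labels we care about.
train_df = train_df[train_df['sentiment'].str.upper() != 'MIXED']
valid_df = valid_df[valid_df['sentiment'].str.upper() != 'MIXED']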

Cleaning dataset

As with any model, we need to clean the data before training, and that holds here too.

We’ll clean tweets based on these rules:

  1. Remove all hashtags, as hashtags do not affect sentiment.
  2. Remove mentions, as they also carry no weight in sentiment analysis.
  3. Replace emojis and emoticons with the text they represent, as they play an important role in expressing sentiment.
  4. Replace contractions with their full forms.
  5. Remove any URLs present in tweets, as they are not significant in sentiment analysis.
  6. Remove punctuation.
  7. Fix misspelled words (in a very basic way, as this is a very time-consuming step).
  8. Convert everything to lowercase.
  9. Remove HTML tags if present.


We’ll demonstrate these steps on the following tweet:

tweet = '<html> bayer leverkusen goalkeeeeper bernd leno will not be #going to napoli. his agent uli ferber to bild: "I can confirm that there were negotiations with napoli, which we have broken off. napoli is not an option." Atletico madrid and Arsenal are the other strong rumours. #b04 #afc </html>'

Remove HTML tags

Sometimes the Twitter response contains HTML tags, which we need to remove.

We’ll be using the BeautifulSoup package for this purpose.

If no HTML tags are present, it will return the same text.

from bs4 import BeautifulSoup

tweet = BeautifulSoup(tweet, 'html.parser').get_text()
#output
'bayer leverkusen goalkeeeeper bernd leno will not be #going to napoli. his agent uli ferber to bild: "I can confirm that there were negotiations with napoli, which we have broken off. napoli is not an option." Atletico madrid and Arsenal are the other strong rumours. #b04 #afc'

We’ll be using regular expressions to match the patterns to be removed or replaced. For this, the re package will be used.

Remove hashtags and mentions

The regex @[A-Za-z0-9]+ matches mentions and #[A-Za-z0-9]+ matches hashtags. We’ll be replacing every word matching these patterns with a space.

import re

tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
#output
'bayer leverkusen goalkeeeeper bernd leno will not be to napoli. his agent uli ferber to bild: "I can confirm that there were negotiations with napoli, which we have broken off. napoli is not an option." Atletico madrid and Arsenal are the other strong rumours.'

Remove URLs

The regex \w+:\/\/\S+ matches all URLs starting with http:// or https://, and we replace them with a space.

tweet = ' '.join(re.sub("(\w+:\/\/\S+)", " ", tweet).split())
#output
'bayer leverkusen goalkeeeeper bernd leno will not be to napoli. his agent uli ferber to bild: "I can confirm that there were negotiations with napoli, which we have broken off. napoli is not an option." Atletico madrid and Arsenal are the other strong rumours.'

Remove punctuations

Replace all punctuation marks such as . , ! ? : ; - = with a space.

tweet = ' '.join(re.sub("[\.\,\!\?\:\;\-\=]", " ", tweet).split())
#output
'bayer leverkusen goalkeeeeper bernd leno will not be napoli his agent uli ferber to bild "I can confirm that there were negotiations with napoli which we have broken off napoli is not an option " Atletico madrid and Arsenal are the other strong rumours'

Lower case

To avoid case-sensitivity issues:

tweet = tweet.lower()
#output
'bayer leverkusen goalkeeeeper bernd leno will not be napoli his agent uli ferber to bild "i can confirm that there were negotiations with napoli which we have broken off napoli is not an option " atletico madrid and arsenal are the other strong rumours'

Replace contractions

Replace contractions with their full forms. There is no universal list of contractions to replace, so we build one for our purpose.

CONTRACTIONS = {"mayn't": "may not", "may've": "may have", ...}

tweet = tweet.replace("’", "'")  # normalize curly apostrophes first
words = tweet.split()
reformed = [CONTRACTIONS[word] if word in CONTRACTIONS else word for word in words]
tweet = " ".join(reformed)
#input
'I mayn’t like you.'
#output
'I may not like you.'

Fix misspelled words

Here we are not actually building any complex function to correct misspelled words; we simply ensure that each character occurs no more than twice in a row in every word (for example, 'goalkeeeeper' becomes 'goalkeeper'). It’s a very basic misspelling check.

import itertools

tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
#output
'bayer leverkusen goalkeeper bernd leno will not be napoli his agent uli ferber to bild "i can confirm that there were negotiations with napoli which we have broken off napoli is not an option " atletico madrid and arsenal are the other strong rumours'

Replace emojis or emoticons

As emojis and emoticons play a significant role in expressing sentiment, we need to replace them with the expressions they represent in plain English.

For emojis we’ll be using the emoji package, and for emoticons we’ll build our own dictionary.

SMILEYS = {":-(": "sad", ":-)": "smiley", ...}

words = tweet.split()
reformed = [SMILEYS[word] if word in SMILEYS else word for word in words]
tweet = " ".join(reformed)
#input
'I am :-('
#output
'I am sad'

For emojis

The emoji package returns a given emoji as text such as :flushed_face:, so we need to remove the : characters from the output.

import emoji

tweet = emoji.demojize(tweet)
tweet = tweet.replace(":", " ")
tweet = ' '.join(tweet.split())
#input
'He is 😳'
#output
'He is flushed_face'

So, we’ve cleaned our data.
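The formatting code later in this post calls a helper named tweet_cleaning_for_sentiment_analysis, which is never shown in full. Here is a minimal sketch of it, chaining the steps above. The exact ordering is my assumption, except that emoticons and emojis must be handled before punctuation is stripped, since they are made of punctuation characters; CONTRACTIONS and SMILEYS are the dictionaries built above.

import itertools
import re
from bs4 import BeautifulSoup
import emoji

def tweet_cleaning_for_sentiment_analysis(tweet):
    # 1. Strip HTML tags.
    tweet = BeautifulSoup(tweet, 'html.parser').get_text()
    # 2. Remove hashtags and mentions.
    tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
    # 3. Remove URLs.
    tweet = ' '.join(re.sub("(\w+:\/\/\S+)", " ", tweet).split())
    # 4. Replace emoticons before punctuation is removed.
    words = tweet.split()
    tweet = " ".join(SMILEYS[word] if word in SMILEYS else word for word in words)
    # 5. Replace emojis with their names and drop the surrounding colons.
    tweet = emoji.demojize(tweet)
    tweet = ' '.join(tweet.replace(":", " ").split())
    # 6. Lowercase everything.
    tweet = tweet.lower()
    # 7. Expand contractions (normalize curly apostrophes first).
    tweet = tweet.replace("’", "'")
    words = tweet.split()
    tweet = " ".join(CONTRACTIONS[word] if word in CONTRACTIONS else word for word in words)
    # 8. Remove punctuation.
    tweet = ' '.join(re.sub("[\.\,\!\?\:\;\-\=]", " ", tweet).split())
    # 9. Collapse characters repeated more than twice in a row.
    tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
    return tweet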

Why not use NLTK stop words?

Removing stop words is an efficient way of cleaning data: it removes all the insignificant words, which are usually the most common words in each sentence. To see all the stop words present in the NLTK library:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)
NLTK stop words

We can see that if the NLTK stop words are used, all the negative contractions will be removed, and these play a significant role in sentiment analysis.
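A quick illustration of the problem, using a made-up sentence:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
sentence = "i do not like this player"
print(' '.join(w for w in sentence.split() if w not in stop_words))
#output
'like player'

The negation disappears entirely, so a clearly negative sentence ends up looking positive.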

Formatting the Dataset

We need to format the data in the form fastText requires for supervised learning.

FastText assumes the labels are words that are prefixed by the string __label__.

The input to the fastText model should look like

__label__NEUTRAL _d i 'm just fine i have your fanbase angry over
__label__POSITIVE what a weekend of football results & hearts

We can format our data using

import csv
import nltk

def transform_instance(row):
    cur_row = []
    # Prefix the label with the string __label__
    label = "__label__" + row[4]
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(tweet_cleaning_for_sentiment_analysis(row[2].lower())))
    return cur_row

def preprocess(input_file, output_file):
    i = 0
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        with open(input_file, 'r', newline='', encoding='latin1') as csvinfile:
            csv_reader = csv.reader(csvinfile, delimiter=',', quotechar='"')
            for row in csv_reader:
                if row[4] != "MIXED" and row[4].upper() in ['POSITIVE', 'NEGATIVE', 'NEUTRAL'] and row[2] != '':
                    row_output = transform_instance(row)
                    csv_writer.writerow(row_output)
                    i = i + 1
                    if i % 10000 == 0:
                        print(i)

Here, we ignore tweets with labels other than POSITIVE, NEGATIVE, and NEUTRAL.

nltk.word_tokenize() splits a string into independent tokens.

nltk.word_tokenize('hello world!')
#output
['hello', 'world', '!']

Upsampling the dataset

In our dataset, the data is not divided equally among the labels: around 72% of the tweets carry the neutral label. So our model will tend to be overwhelmed by the large class and ignore the small ones.

import pandas as pd
import seaborn as sns

df = pd.read_csv('betsentiment-EN-tweets-sentiment-teams.csv', encoding='latin1')
df['sentiment'].value_counts(normalize=True) * 100
#output: percentage of tweets for each label

sns.countplot(x="sentiment", data=df)
#output: countplot of the sentiment labels

As the NEUTRAL class makes up a large portion of the dataset, the model will tend to always predict the NEUTRAL label, since that alone guarantees 72% accuracy. To prevent this we need an equal number of tweets for each label. We can achieve this by adding new tweets to the minority classes. This process of adding new tweets to the minority labels is known as upsampling.

We’ll achieve upsampling by repeating the tweets in each minority label until the number of tweets is equal across all labels.

def upsampling(input_file, output_file, ratio_upsampling=1):
    # Create a file with an equal number of tweets for each label.
    # input_file: path to the input file
    # output_file: path to the output file
    # ratio_upsampling: ratio of each minority class vs. the majority one;
    #   1 means each class ends up with as many tweets as the majority class.
    i = 0
    counts = {}
    dict_data_by_label = {}

    # GET LABEL LIST AND GET DATA PER LABEL
    with open(input_file, 'r', newline='') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',', quotechar='"')
        for row in csv_reader:
            counts[row[0].split()[0]] = counts.get(row[0].split()[0], 0) + 1
            if not row[0].split()[0] in dict_data_by_label:
                dict_data_by_label[row[0].split()[0]] = [row[0]]
            else:
                dict_data_by_label[row[0].split()[0]].append(row[0])
            i = i + 1
            if i % 10000 == 0:
                print("read " + str(i))

    # FIND MAJORITY CLASS
    majority_class = ""
    count_majority_class = 0
    for item in dict_data_by_label:
        if len(dict_data_by_label[item]) > count_majority_class:
            majority_class = item
            count_majority_class = len(dict_data_by_label[item])

    # UPSAMPLE MINORITY CLASSES
    data_upsampled = []
    for item in dict_data_by_label:
        data_upsampled.extend(dict_data_by_label[item])
        if item != majority_class:
            items_added = 0
            items_to_add = count_majority_class - len(dict_data_by_label[item])
            while items_added < items_to_add:
                data_upsampled.extend(dict_data_by_label[item][:max(0, min(items_to_add - items_added, len(dict_data_by_label[item])))])
                items_added = items_added + max(0, min(items_to_add - items_added, len(dict_data_by_label[item])))

    # WRITE ALL
    i = 0
    with open(output_file, 'w') as txtoutfile:
        for row in data_upsampled:
            txtoutfile.write(row + '\n')
            i = i + 1
            if i % 10000 == 0:
                print("written " + str(i))

Repeating the same tweets again and again could cause our model to overfit the dataset, but thanks to the large size of our dataset, this is not a problem here.

Training

Install fastText by cloning the git repository rather than using pip.
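For reference, this is the installation route described in the fastText repository (the exact commands may vary slightly across versions):

git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .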

We’ll be using the supervised training method.

import datetime
import fastText

hyper_params = {"lr": 0.01,
                "epoch": 20,
                "wordNgrams": 2,
                "dim": 20}

print(str(datetime.datetime.now()) + ' START=>' + str(hyper_params))

# Train the model.
model = fastText.train_supervised(input=training_data_path, **hyper_params)
print("Model trained with the hyperparameters \n {}".format(hyper_params))

lr is the learning rate, epoch is the number of epochs, wordNgrams is the maximum length of word n-grams, and dim is the size of the word vectors.

train_supervised is a function used to train the model using supervised learning.

Evaluate

We need to evaluate the model to find its accuracy.

model_acc_training_set = model.test(training_data_path)
model_acc_validation_set = model.test(validation_data_path)

# DISPLAY ACCURACY OF TRAINED MODEL
text_line = str(hyper_params) + ",accuracy:" + str(model_acc_training_set[1]) + ",validation:" + str(model_acc_validation_set[1]) + '\n'
print(text_line)

We’ll evaluate our model on both the training and the validation datasets.

test returns the precision and recall of the model rather than accuracy. In our case, both values are almost the same, so we’ll be using precision only.
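Concretely, test returns a tuple; a small sketch of reading it (assuming the bindings installed above, where the tuple holds the sample count, precision, and recall):

n_samples, precision, recall = model.test(validation_data_path)
print("N={} P@1={:.3f} R@1={:.3f}".format(n_samples, precision, recall))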

Overall the model gives an accuracy of 97.5% on the training data, and 79.7% on the validation data.

Predict

We’ll predict the sentiment of text passed to our trained model.

model.predict(['why not'],k=3)
model.predict(['this player is so bad'],k=1)

predict lets us predict the sentiment of the passed string, and k represents the number of labels to return, each with a confidence score.
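predict returns the labels together with their scores; here is a sketch of unpacking them (the labels shown in the comments are illustrative, not actual output):

labels, probabilities = model.predict(['this player is so bad'], k=1)
print(labels)         # e.g. [['__label__NEGATIVE']]
print(probabilities)  # the corresponding confidence scores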

Quantize the model

Quantizing helps us reduce the size of the model. Here, cutoff limits the number of words and n-grams retained, retrain fine-tunes the embeddings after the cutoff is applied, and qnorm quantizes the vector norms separately.

model.quantize(input=training_data_path, qnorm=True, retrain=True, cutoff=100000)

Save model

We can save our trained model and then use it any time, rather than training it again every time.

model.save_model(os.path.join(model_path,model_name + ".ftz"))
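Later, we can load the saved model back instead of retraining; a minimal sketch, where model_path and model_name are the same values used when saving above:

import os
import fastText

model = fastText.load_model(os.path.join(model_path, model_name + ".ftz"))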

Conclusion

We learned how to clean data and pass it to a trained model to predict the sentiment of tweets. We also learned how to implement a sentiment analysis model using fastText.
