BERT in Computing the 2019 Cities Happiness Index
Africa over time has been riddled with negative connotations. I’m in no position to defend or support such assertions as, in most cases, I believe such views are solely in the eyes of the proponents. That aside, I chose to take a data perspective in looking at the positives in the African continent, more so the overall happiness of its citizenry. Before I delve deep into the analyses, please take some time and read through the top five 2019 positive stories from the continent as per the BBC. Teacher Tabichi had to be there; maybe I’m just being too Kenyan. The African gods won’t be happy if I fail to mention my favorite choir’s performance on AGT, the Ndlovu Youth Choir, as well as the crowning of the Springboks as World Rugby champions.
In my quest to unravel this happiness theory, I penned down a few "research questions" that I believed would lead me to the end result:
1. Is it possible to get significant data about African countries to determine their happiness?
2. What is the best computational methodology for the same?
3. Are the results verifiable by any means? Any validation sets?
Happiness index data exists for most countries, but it usually involves a combination of several metrics, and I’m not willing to go that route, at least for now. Hopefully, that data will form the validation set in future, fingers crossed. I have trained many analytics models, so the methodology wasn’t hard to settle on with the right data.
Data from the Largest Cities by Population
I ended up getting randomized tweets from different cities in Africa based on their population and my personal preferences. Therefore, expect a little skewness in the data, but hey, the results are still definitive. The most populous cities in this instance are listed here. I ended up blending in a few East African cities despite their population sizes. Abidjan, Addis Ababa, Bujumbura, Cairo, Dar es Salaam, Johannesburg, Kampala, Kigali, Kinshasa, Lagos, Luanda, Mogadishu and Nairobi were of interest in the analysis.
Tweet Collection
Geolocation is vital in the collection of tweets disseminated from these cities. Twitter has an advanced search feature, so this collection is possible using the keyword "near". Take a look at Jefferson’s GetOldTweets repo for an in-depth search and collection of past tweets. You’ll thank me later. Collecting tweets disseminated from, say, Nairobi is as simple as the below Python snippet.
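A minimal sketch, assuming the GetOldTweets3 fork of that repo; the date range and the tweet cap mirror the description below:

```python
# a minimal sketch, assuming the GetOldTweets3 fork of Jefferson's repo
# (pip install GetOldTweets3); the dates and the cap are illustrative
import GetOldTweets3 as got

criteria = (
    got.manager.TweetCriteria()
    .setNear("Nairobi, Kenya")   # the "near" geolocation keyword
    .setSince("2019-01-01")
    .setUntil("2019-12-31")
    .setMaxTweets(10000000)      # up to 10M tweets
)
tweets = got.manager.TweetManager.getTweets(criteria)
```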
With the above command, you’ll be able to collect up to 10M tweets disseminated in 2019 from/near Nairobi, Kenya.
I was interested in a smaller set of tweets from the cities, so I ended up collecting 16,942 random, unique tweets. The challenge was that the tweets were in different languages, contrary to the requirements of my training data and methodology of choice.
Tweet Pre-processing and Translation
Unlike conventional English texts, tweets are unique: they are shortened to fit character limits, and the language of expression is unconventional in most cases. Therefore, pre-processing them is different. I used Pandas to convert the CSV to a dataframe for easier manipulation.
The "text" column which is the tweet itself was of interest. Therefore, I dropped all other columns, removed duplicates, stop words as well as empty tweets. This happens a lot with tweets. The below code will take care of all those processes.
Unfortunately, this process will leave several short tweets empty once again, as shown below.
The workaround is to drop the nulls once again, as below, because it would be a waste of resources to train a model on empty fields.
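One way to do that, continuing with the same dataframe:

```python
import numpy as np

# stop-word removal leaves some short tweets as empty strings;
# convert them to NaN and drop them
df["text"].replace("", np.nan, inplace=True)
df = df.dropna(subset=["text"]).reset_index(drop=True)
```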
Output of the above process is below.
This makes sense. Unfortunately, our model was trained on English texts, hence the need to translate these tweets to English, more so those from Francophone cities like Abidjan and Kinshasa. I made use of a Google Translate API key to translate all tweets to English. It takes some time but will get the job done using the below code:
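A minimal sketch with the official google-cloud-translate client; the credential setup is an assumption:

```python
# assumes google-cloud-translate is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at a valid key file
from google.cloud import translate_v2 as translate

client = translate.Client()

def to_english(text):
    # auto-detects the source language and returns the English text
    return client.translate(text, target_language="en")["translatedText"]

df["text"] = df["text"].apply(to_english)
```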
Output was as below after concatenation with the cities field.
BERT in Training Our Model
Bidirectional Encoder Representations from Transformers (BERT) is a Natural Language Processing (NLP) technique developed by Google based on transformers. From Google’s blog, "this breakthrough was the result of Google research on transformers: models that process words in relation to all the other words in a sentence, rather than one-by-one in order. BERT models can therefore consider the full context of a word by looking at the words that come before and after it – particularly useful for understanding the intent behind search queries."
Since BERT models are already trained on millions of sentences and features, they perform quite well on many out-of-the-box NLP tasks. I used Fast-Bert, an excellent, simple wrapper for the PyTorch BERT modules, to train the model on a dataset of tweets, with the text as the feature and the sentiment category (0 to 4) as the label. The model was then tested on our translated set of tweets.
Prerequisites
Unfortunately, BERT models are huge, with millions of parameters, so the computational power to run them was a big issue for me. I tried several cloud options, and below are my thoughts on the few I used.
- Kaggle – A free platform with GPU support, but memory allocation worked against me; the training process was killed because of memory-related issues.
- Google’s Colab – Worked well, but it too shut down repeatedly after many hours of training, though it was better for this task than Kaggle’s platform. I did not try the TPU capabilities as PyTorch was not yet supported.
- Google Cloud Platform – I ended up settling on this, as I was able to use the $300 free credit they offer. I used only a small fraction of it, so this is the best platform for anyone interested in training such a model. For a start, do not pay for GCP. Since you’ll be using virtual machines (VMs), just be content with ones high on CPUs and memory; do not pay for GPU instances. I tried this and ended up training nothing, in addition to paying $68. Just follow this article to set up your notebooks on GCP. You’ll save lots of time.
Additional Tools:
- Python 3
- Fast-Bert
- PyTorch – required by Fast-Bert to train the model
- apex – a PyTorch extension used by Fast-Bert for distributed training; ignore this if using a CPU
- Ubuntu
I chose DistilBERT, a smaller, faster and cheaper model to train, and set up the CPU version as I had no access to a GPU with the Google credits. The below code imports all necessary packages for model training, in addition to segmenting the training and validation sets.
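A sketch of that setup; the file names and the 80/20 split are assumptions:

```python
# a sketch; the file names and the 80/20 split are assumptions
import pandas as pd
from sklearn.model_selection import train_test_split
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
from fast_bert.prediction import BertClassificationPredictor

train_df = pd.read_csv("training_tweets.csv")  # columns: text, label
train, val = train_test_split(train_df, test_size=0.2, random_state=42)
train.to_csv("data/train.csv", index=False)
val.to_csv("data/val.csv", index=False)
```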
The Databunch
The first modeling step is to create the databunch. This is simply an object that takes the training, validation and test CSV files and converts them into an internal representation for BERT and its variants like DistilBERT. It also instantiates the correct data loaders based on the device profile, batch_size and max_sequence_length.
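A sketch of that object for DistilBERT; the directory layout, batch size and sequence length are assumptions:

```python
databunch = BertDataBunch(
    "data/",                  # holds train.csv and val.csv
    "labels/",                # holds labels.csv with the classes 0-4
    tokenizer="distilbert-base-uncased",
    train_file="train.csv",
    val_file="val.csv",
    label_file="labels.csv",
    text_col="text",
    label_col="label",
    batch_size_per_gpu=16,
    max_seq_length=128,       # tweets are short
    multi_gpu=False,
    multi_label=False,
    model_type="distilbert",
)
```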
The learner
The learning step follows, with the databunch and related parameters as input. Below is a representation of the same in Python, where the learner encapsulates the key logic for the lifecycle of the model, such as training, validation and inference.
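A sketch of that learner on CPU, with the output path as an assumption:

```python
import logging
import torch

logger = logging.getLogger()
device = torch.device("cpu")  # no GPU with the free credits

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path="distilbert-base-uncased",
    metrics=[{"name": "accuracy", "function": accuracy}],
    device=device,
    logger=logger,
    output_dir="output/",
    is_fp16=False,            # apex is not needed on CPU
    multi_gpu=False,
    multi_label=False,
)
```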
The model is then trained as below.
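A sketch using Fast-Bert’s fit call, with the single epoch discussed next:

```python
learner.fit(
    epochs=1,
    lr=6e-5,
    validate=True,                  # evaluate on the validation set
    schedule_type="warmup_cosine",
    optimizer_type="lamb",
)
learner.save_model()                # persist so retraining isn't needed
```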
Factoring in training time and resources, I set the number of epochs to just 1. You can increase this number for better accuracy (not guaranteed) if you have the resources. With 16 vCPUs and 64GB of memory, training took about 19 hours. Saving the model is advised so that retraining is not needed. The predictor object, as below, takes the model and labels as input.
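A sketch of that predictor; model_out is Fast-Bert’s default save folder under the output directory:

```python
predictor = BertClassificationPredictor(
    model_path="output/model_out",  # written by learner.save_model()
    label_path="labels/",
    multi_label=False,
    model_type="distilbert",
    do_lower_case=True,
)
```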
Performing Sentiment Analysis on the Test Set
I opted for batch sentiment prediction, using the below code.
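Roughly as follows, assuming test_df holds the translated test tweets:

```python
texts = test_df["text"].tolist()    # the cleaned, translated tweets
predictions = predictor.predict_batch(texts)
```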
The output, as shown below, will be the label and the probability of the tweet falling into each sentiment category. The sum of the probabilities for each tweet’s sentiment output should be equal to 1.
Therefore, the label with the highest probability was chosen as the predicted label. Based on the output above, the first tweet’s predicted label is 0, with a 25% probability. To select the labels with the highest probability in the list, the first line in the code below worked; the output was then converted to a dataframe in the second line.
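A two-line sketch of that selection and conversion:

```python
# keep the highest-probability (label, probability) pair per tweet
best = [max(pred, key=lambda p: p[1]) for pred in predictions]
pred_df = pd.DataFrame(best, columns=["label", "probability"])
```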
Concatenating the original test set and the per-tweet predictions resulted in the below:
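For example, something like:

```python
# align the indices, then join the predictions onto the test set
results = pd.concat([test_df.reset_index(drop=True), pred_df], axis=1)
```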
The bitter-sweet moment
By now, you must have a notion of what happiness entails in the output. A sentiment label of 4 shows contentment to a larger extent than the rest, while label 0 shows dissatisfaction and is thus interpreted as a sad feeling. Therefore, taking the mean of the sentiment scores, grouped by city, is a simple but plausible way to compute the overall happiness of a city. The below code nails that.
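A sketch of that aggregation; the "city" column name is an assumption:

```python
# labels come back as strings, so cast before averaging
results["label"] = results["label"].astype(int)
happiness = results.groupby("city")["label"].mean().sort_values(ascending=False)
print(happiness)
```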
Kigali, Rwanda takes the title of the happiest city of the 13. Surprisingly, Mogadishu, Somalia comes in second despite the negative stories we read about Somalia as a country. Nairobi, Kenya, my capital, is the second most unhappy city in East Africa after Kampala, Uganda.
Conclusion
The above results are, to some extent, an eye-opener when it comes to using data-driven solutions to derive the happiness of people in certain countries or cities. A few points as I conclude:
- The choice of cities was purely based on population and personal preferences, especially with the choice of East African cities. It’s the best/only way I can tell the African story. With time and more resources, I’ll be able to get more tweets from most capitals or countries in Africa and make better comparisons.
- The training set of 200K random tweets (though balanced across classes) is a little on the low side. The DistilBERT model too has fewer parameters (66M), which to a small extent compromises accuracy. I plan on implementing the same with a full multilingual model (110M parameters) and the entire 1.6M-tweet training set. Hopefully, there will be some notable difference.
- Translation, especially of shortened and contextually challenging words in tweets, is a challenge. The original context may be lost, but the translator worked quite well on the few samples that human evaluators looked at.
- Validating our results is the tricky part. We based our analysis on social data, more so tweets. Tweets depict the daily chatter in the respective cities and are thus well placed to measure the overall happiness of their disseminators. However, the methodologies currently in place factor in several other variables when determining happiness. For example, the UN World Happiness Report is one such methodology, so using it as ground truth would be valid only if those other factors were incorporated into the model. This is open for discussion.