
Analyzing commonly used slang words on TikTok using Twitter

How we can use Sentiment Analysis and Topic Modeling in Python & Tableau to discover the meaning and usage of a slang word

Image by the author – Tableau Dashboard

TikTok has been on the rise globally in recent years, especially among the Gen-Z population born between 1997 and 2015. Slang words like LOL (laugh out loud), ROFL (rolling on the floor laughing), and FOMO (fear of missing out) are ones most of us know, but has the rise of TikTok brought about another set of slang words? Indeed, according to TheSmartLocal (Singapore’s leading travel and lifestyle portal), 9 TikTok slang words were identified as commonly used by Gen-Z kids. What do words like bussin, no cap & sheesh refer to when used on the platform? Do common English words like bestie, fax, or slaps still mean the same?

The article explained what each of the 9 slang words means and how they can be used in conversation. In this article, we are going to analyze each of these slang words from a data science perspective, using Python to apply Natural Language Processing (NLP) techniques like sentiment analysis and topic modeling. This will give us a better idea of the words commonly used together with each slang word, the sentiment of the messages, and the topics discussed alongside these words.

Dataset

We are going to make use of another social media platform, Twitter, where tweets containing any of the 9 slang words were captured over one month to form our dataset. The Twitter public API library for Python, twitter, allows us to collect tweets from the last 7 days once we authenticate with our consumer key and access token. After initializing the connection to Twitter, we set the latitude, longitude, and maximum search radius, and set the search query to the slang word. As I’m interested in tweets posted by users in Singapore, I’ve set the geographical location to Singapore. The favorite counts and retweet counts were also collected from Twitter.

# import libraries
from twitter import *
import pandas as pd
from datetime import datetime
# store the slang words in a list
slangs = ['fax', 'no cap', 'ceo of', 'stan', 'bussin', 'slaps', 'hits different', 'sheesh', 'bestie']
# twitter authentication
consumer_key = '...'
consumer_secret = '...'
access_token = '...'
access_token_secret = '...'
twitter = Twitter(auth = OAuth(access_token, access_token_secret, consumer_key, consumer_secret))
# set latitude & longitude to Singapore, maximum radius 20km
latitude = 1.3521
longitude = 103.8198
max_range = 20
# empty dataframe to store the collected tweets
tweets = pd.DataFrame(columns = ['slang', 'tweets', 'date', 'retweet_count', 'favorite_count'])
# loop through each of the slang words
for each in slangs:
    # extract tweets with query containing the slang word; note max count for standard API is 100
    query = twitter.search.tweets(q = each, geocode = "%f,%f,%dkm" % (latitude, longitude, max_range),
                                  lang = 'en', count = 100)
    # once done, loop through each tweet
    for i in range(0, len(query['statuses'])):
        # store the slang, tweet, created time, retweet count & favorite count as a list variable
        temp_list = [each, query['statuses'][i]['text'],
                     datetime.strptime(query['statuses'][i]['created_at'], '%a %b %d %H:%M:%S %z %Y').strftime('%Y-%m-%d'),
                     query['statuses'][i]['retweet_count'], query['statuses'][i]['favorite_count']]
        # append list to tweets dataframe
        tweets.loc[len(tweets)] = temp_list

Data Cleaning/Tokenisation

As the collected tweets are in raw form and contain items like usernames, emoticons, and punctuation, data cleaning is necessary to remove them. We will start by converting all text to lower case, filtering out single-word responses, and removing punctuation, URLs, and links.

# import libraries for cleaning
import re
import html
# function to clean column, tokenize & consolidate into corpus list
def column_cleaner(column, slang_word):
    # convert all to lower case and store in a list variable
    corpus = column.str.lower().tolist()
    # filter off single word responses, 'nil', 'nan'
    corpus = [x for x in corpus if len(x.split(' ')) > 1]
    corpus = [x for x in corpus if x != 'nan']
    corpus = [x for x in corpus if x != 'nil']
    # remove punctuations, links, urls
    for i in range(len(corpus)):
        x = corpus[i].replace('\n', ' ') # cleaning newline characters from the tweets
        corpus[i] = html.unescape(x)
        corpus[i] = re.sub(r'(@[A-Za-z0-9_]+)|[^\w\s]|#|http\S+', '', corpus[i])

We then extend the above function column_cleaner to tokenize the tweets into individual words (using the RegexpTokenizer function), remove stopwords (from the nltk package), and perform lemmatization using part-of-speech (with the WordNetLemmatizer function).

# empty list to store cleaned corpus
    cleaned_corpus = []
    # extend this slang into stopwords
    stopwords = nltk.corpus.stopwords.words("english")
    stopwords.extend([slang_word])
    # tokenise each tweet, remove stopwords, keep tokens longer than 2 characters, lemmatize using part-of-speech
    for i in range (0, len(corpus)):
        words = [w for w in tokenizer.tokenize(corpus[i]) if w.lower() not in stopwords]
        cleaned_words = [x for x in words if len(x) > 2]
        lemmatized_words = [wordnet_lemmatizer.lemmatize(x, pos = 'v') for x in cleaned_words]
        cleaned_corpus.extend(lemmatized_words)
    return cleaned_corpus
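
For the function above to run, the tokenizer and lemmatizer it references need to be initialised beforehand. Here is a minimal setup sketch; the RegexpTokenizer pattern shown is my own assumption, as the original code does not show it.

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
# one-off downloads for the stopword list and the WordNet lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
# tokenizer that keeps only word characters (assumed pattern)
tokenizer = RegexpTokenizer(r'\w+')
# lemmatizer used with pos = 'v' inside column_cleaner
wordnet_lemmatizer = WordNetLemmatizer()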

This whole function is then run on our collected tweets dataset, tokenizing each tweet so that we can plot the top n words associated with each slang word in our visualization software (i.e. Tableau).

# empty dataframe to store the tokenized words for all slang words
result = pd.DataFrame(columns = ['slang', 'word'])
# loop through each slang word
for each in slangs:
    # filter dataframe to responses with regards to this slang word
    temp_pd = tweets.loc[tweets.slang == each, :]
    # save result in temp pandas dataframe for easy output
    temp_result = pd.DataFrame(columns = ['slang', 'word'])
    # run column_cleaner function on the tweets
    temp_result['word'] = column_cleaner(temp_pd['tweets'], each)
    # add slang to slang column
    temp_result['slang'] = each
    # append temp_result to result
    result = result.append(temp_result, ignore_index = True)
Image by the author – excel output after running column_cleaner function
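
To plot the top n associated words in Tableau, the tokenized output can first be aggregated into word frequencies per slang. The sketch below is my own illustration of how the data could be shaped for export; the top-10 cutoff and output filename are not from the original workflow.

# count how often each word appears for each slang
word_counts = result.groupby(['slang', 'word']).size().reset_index(name = 'count')
# keep the top 10 words per slang (n is arbitrary here)
top_words = word_counts.sort_values('count', ascending = False).groupby('slang').head(10)
# export for plotting in Tableau
top_words.to_csv('top_words.csv', index = False)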

Sentiment Analysis/Polarity Score

We can make use of the Python package textblob to conduct simple Sentiment Analysis, where we label each tweet as Positive if the polarity score is > 0, Negative if the polarity score is < 0, and Neutral otherwise. Note that the column_cleaner function above does not need to be run before conducting sentiment analysis, as the textblob package can extract polarity scores directly from the raw tweets.

from textblob import TextBlob
# empty list to store polarity score
polarity_score = []
# loop through all tweets
for i in range (0, len(tweets)):
    # run TextBlob on this tweet
    temp_blob = TextBlob(tweets.tweets[i])
    # obtain polarity score of this tweet and store in polarity_score list
    # if polarity score > 0, positive. else if < 0, negative. else if 0, neutral.
    if temp_blob.sentiment.polarity > 0:
        polarity_score.append('Positive')
    elif temp_blob.sentiment.polarity < 0:
        polarity_score.append('Negative')
    else:
        polarity_score.append('Neutral')

# create polarity_score column in tweets dataframe
tweets['polarity_score'] = polarity_score
Image by the author – excel output after conducting sentiment analysis
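
With the polarity_score column in place, the sentiment breakdown per slang word (like the percentages discussed later) can be computed directly. A quick sketch, using only the columns created above:

# percentage of positive / neutral / negative tweets for each slang word
sentiment_breakdown = (tweets.groupby('slang')['polarity_score']
                             .value_counts(normalize = True)
                             .mul(100)
                             .round(1))
print(sentiment_breakdown)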

Topic Modelling

Next, we write a function that performs Topic Modeling on our tweets. Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. We will use the common Latent Dirichlet Allocation (LDA) algorithm, which classifies the text in a document under a particular topic and is available in the sklearn library. In our function, we will also generate bigrams and trigrams from the individual tokens, and return the top 3 topics for each slang word.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
# function to conduct topic modelling
def topic_modeller(column, no_topic, slang_word):
    # extend this slang into stopwords
    stopwords = nltk.corpus.stopwords.words("english")
    stopwords.extend([slang_word])
    # set up vectorizer that removes stopwords and generates bigrams/trigrams
    tfidf_vectorizer = TfidfVectorizer(stop_words = stopwords, ngram_range = (2, 3))
    # set the number of topics in the lda model
    lda = LatentDirichletAllocation(n_components = no_topic)
    # create a pipeline that vectorises and then performs LDA
    pipe = make_pipeline(tfidf_vectorizer, lda)
    # run the pipe on the cleaned column
    # (topic_column_cleaner is a variant of column_cleaner that returns the cleaned tweets
    #  as whole documents rather than individual tokens)
    pipe.fit(topic_column_cleaner(column, slang_word))
    # inner function to return the topics and associated words
    def print_top_words(model, feature_names, n_top_words):
        result = []
        for topic_idx, topic in enumerate(model.components_):
            message = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
            result.append(message)
        return result
    return print_top_words(lda, tfidf_vectorizer.get_feature_names(), n_top_words = 3)
Image by the author – excel output after conducting topic modeling
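
The topic_modeller function can then be run over each slang word to obtain the 3 topics shown in the dashboard. A minimal sketch of that loop is below; the topics dataframe and its column names are my own illustration, not from the original code.

# dataframe to collect the modelled topics
topics = pd.DataFrame(columns = ['slang', 'topic'])
# loop through each slang word and model 3 topics from its tweets
for each in slangs:
    temp_pd = tweets.loc[tweets.slang == each, :]
    top_topics = topic_modeller(temp_pd['tweets'], no_topic = 3, slang_word = each)
    # each topic is returned as a list of its top words; join them for readability
    for topic_words in top_topics:
        topics.loc[len(topics)] = [each, ' / '.join(topic_words)]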

Data Visualisation/Analysis

Once our dataset is ready, we can plot our findings in a data visualization tool, i.e. Tableau. As this article focuses on the steps needed to collect the data and generate our insights, I will not discuss how the findings were plotted in Tableau. You can refer to the Tableau dashboard on my Tableau Public profile.

Let’s take the slang word sheesh as an example: we can filter our Tableau dashboard to that slang and the whole dashboard will refresh. Isn’t it a cool idea that an iPhone wireframe was used as the user filter?

Image by the author – Tableau screenshot of word filter

A total of 173 tweets were collected in Singapore during August 2021, and our polarity scores revealed that 31.8% of the tweets were positive, 47.4% neutral, and 20.8% negative. This seems to suggest the slang sheesh carries more of a neutral to positive meaning.

Image by the author – Tableau screenshot of sentiment analysis

In our Python code, we tokenized the tweets so that we could rank words by their frequency across all the tweets containing that slang word. Our visualization shows words like like, schmuck, and sembab (which means swollen in Indonesian), suggesting that sheesh is used to intensify the impact of something.

Image by the author – Tableau screenshot of top 5 associated words

Looking at the 3 topics modeled, our assumption that sheesh is used to intensify the impact of something is further supported by topics like craving murtabak, sounds good, and cute girls.

Image by the author – Tableau screenshot of topic modeling

Indeed, according to the TheSmartLocal article, the word sheesh is used much like damn to express disbelief or exasperation. Looking at some of our tweets, "Sheesh craving for murtabak" and "Sheesh, he is a lucky person" do suggest this meaning.

Ending Note

I hope this article was interesting and gives you some ideas on how Natural Language Processing techniques like Sentiment Analysis and Topic Modeling can help us better understand a collection of documents (i.e. our tweets). Have fun playing with the Tableau dashboard; it was definitely fun putting it together, no cap sheesh!

