TikTok has been on the rise globally in recent years, especially among the Gen-Z population born between 1997 and 2015. Slang words like LOL (laugh out loud), ROFL (rolling on the floor laughing), and FOMO (fear of missing out) are familiar to most of us, but has the rise of TikTok brought about another set of slang words? Indeed, according to TheSmartLocal (Singapore’s leading travel and lifestyle portal), 9 TikTok slang terms were identified as commonly used by Gen-Z kids. What do words like bussin, no cap, and sheesh refer to on the platform? Do common English words like bestie, fax, or slaps still mean the same?
The article explained what each of the 9 slang words means and how they can be used in conversation. In this article, we are going to analyze each of these slang words from a data science perspective, using Python to conduct Natural Language Processing (NLP) techniques like sentiment analysis and topic modeling. This will give us a better idea of which words are commonly used together with these slang words, the sentiments of the messages, and the topics discussed alongside them.
Dataset
We are going to make use of another social media platform, Twitter, where tweets containing any of the 9 slang words will be captured over one month to form our dataset. The Python library for the Twitter public API, twitter, allows us to collect tweets from the last 7 days once we have authenticated with our consumer key and access token. After initializing the connection to Twitter, we set the latitude, longitude, and maximum search radius, and set the search query to the slang word. As I'm interested in tweets posted by users in Singapore, I've set the geographical location to Singapore. The favorite counts and retweet counts were also collected from Twitter.
# import libraries
from twitter import *
import pandas as pd
from datetime import datetime
# store the slang words in a list
slangs = ['fax', 'no cap', 'ceo of', 'stan', 'bussin', 'slaps', 'hits different', 'sheesh', 'bestie']
# twitter authentication
consumer_key = '...'
consumer_secret = '...'
access_token = '...'
access_token_secret = '...'
twitter = Twitter(auth = OAuth(access_token, access_token_secret, consumer_key, consumer_secret))
# set latitude & longitude to Singapore, maximum radius 20km
latitude = 1.3521
longitude = 103.8198
max_range = 20
# empty dataframe to store the collected tweets
tweets = pd.DataFrame(columns = ['slang', 'tweets', 'created_at', 'retweet_count', 'favorite_count'])
# loop through each slang word
for each in slangs:
    # extract tweets with query containing the slang word; note max count for standard API is 100
    query = twitter.search.tweets(q = each, geocode = "%f,%f,%dkm" % (latitude, longitude, max_range),
                                  lang = 'en', count = 100)
    # once done, loop through each tweet
    for i in range(0, len(query['statuses'])):
        # store the slang word, tweet, created date, retweet count & favorite count as a list variable
        temp_list = [each, query['statuses'][i]['text'],
                     datetime.strptime(query['statuses'][i]['created_at'], '%a %b %d %H:%M:%S %z %Y').strftime('%Y-%m-%d'),
                     query['statuses'][i]['retweet_count'], query['statuses'][i]['favorite_count']]
        # append list to tweets dataframe
        tweets.loc[len(tweets)] = temp_list
Data Cleaning/Tokenisation
As the tweets were collected in raw form, containing elements like usernames, emoticons, and punctuation, data cleaning is necessary to remove them. We will start by converting all text to lower case, filtering off single-word responses, and removing punctuation, URLs, and links.
# import libraries for cleaning, tokenisation & lemmatization
import re
import html
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
# initialize tokenizer & lemmatizer
tokenizer = RegexpTokenizer(r'\w+')
wordnet_lemmatizer = WordNetLemmatizer()
# function to clean column, tokenize & consolidate into corpus list
def column_cleaner(column, slang_word):
    # convert all to lower case and store in a list variable
    corpus = column.str.lower().tolist()
    # filter off single word responses, 'nil', 'nan'
    corpus = [x for x in corpus if len(x.split(' ')) > 1]
    corpus = [x for x in corpus if x != 'nan']
    corpus = [x for x in corpus if x != 'nil']
    # remove punctuations, links, urls
    for i in range(len(corpus)):
        x = corpus[i].replace('\n', ' ') # clean newline characters from the tweets
        corpus[i] = html.unescape(x)
        # strip user mentions, punctuation, hashtags and links
        corpus[i] = re.sub(r'(@[A-Za-z0-9_]+)|[^\w\s]|#|http\S+', '', corpus[i])
We then extend the above function column_cleaner to tokenize the tweets into individual words (using the RegexpTokenizer function), remove stopwords (from the nltk package) and short tokens, and perform lemmatization with part-of-speech (using the WordNetLemmatizer function).
    # empty list to store cleaned corpus
    cleaned_corpus = []
    # extend this slang word into stopwords
    stopwords = nltk.corpus.stopwords.words("english")
    stopwords.extend([slang_word])
    # tokenise each tweet, remove stopwords & punctuation, keep words longer than 2 characters, lemmatize using part-of-speech
    for i in range(0, len(corpus)):
        words = [w for w in tokenizer.tokenize(corpus[i]) if w.lower() not in stopwords]
        cleaned_words = [x for x in words if len(x) > 2]
        lemmatized_words = [wordnet_lemmatizer.lemmatize(x, pos = 'v') for x in cleaned_words]
        cleaned_corpus.extend(lemmatized_words)
    return cleaned_corpus
This function is then run on our collected tweets dataset, and each tweet is tokenized so that we can plot the top n words associated with each slang word in our visualization software (i.e. Tableau).
# empty dataframe to store the tokenized words for all slang words
result = pd.DataFrame(columns = ['slang', 'word'])
# loop through each slang word
for each in slangs:
    # filter dataframe to tweets containing this slang word
    temp_pd = tweets.loc[tweets.slang == each, :]
    # save result in temp pandas dataframe for easy output
    temp_result = pd.DataFrame(columns = ['slang', 'word'])
    # run column_cleaner function on the tweets
    temp_result['word'] = column_cleaner(temp_pd['tweets'], each)
    # add slang to slang column
    temp_result['slang'] = each
    # append temp_result to result
    result = result.append(temp_result, ignore_index = True)
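Before heading over to Tableau, we can do a quick sanity check of the most frequent words directly in pandas. This is just a minimal sketch (the top_n value here is arbitrary):
# count word frequencies per slang word and preview the top 10 for each
top_n = 10
word_counts = (result.groupby('slang')['word']
                     .value_counts()
                     .groupby(level = 0)
                     .head(top_n))
print(word_counts)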

Sentiment Analysis/Polarity Score
We can make use of the Python package textblob to conduct simple sentiment analysis, where we will label each tweet as Positive if the polarity score is > 0, Negative if the polarity score is < 0, and Neutral otherwise. Note that the column_cleaner function above need not be run before we conduct sentiment analysis, as the textblob package can extract polarity scores directly from the raw tweets.
from textblob import TextBlob
# empty list to store polarity score
polarity_score = []
# loop through all tweets
for i in range(0, len(tweets)):
    # run TextBlob on this tweet
    temp_blob = TextBlob(tweets.tweets[i])
    # obtain polarity score of this tweet and store in polarity_score list
    # if polarity score > 0, positive. else if < 0, negative. else neutral.
    if temp_blob.sentiment.polarity > 0:
        polarity_score.append('Positive')
    elif temp_blob.sentiment.polarity < 0:
        polarity_score.append('Negative')
    else:
        polarity_score.append('Neutral')
# create polarity_score column in tweets dataframe
tweets['polarity_score'] = polarity_score
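To get the percentage breakdown of sentiments per slang word (like the figures quoted in the analysis below), a simple pandas aggregation will do. A minimal sketch:
# percentage of Positive/Neutral/Negative tweets for each slang word
sentiment_breakdown = (tweets.groupby('slang')['polarity_score']
                             .value_counts(normalize = True)
                             .mul(100)
                             .round(1))
print(sentiment_breakdown)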

Topic Modelling
Next, we write a function to perform topic modeling on our tweets. Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. We will use the common Latent Dirichlet Allocation (LDA) algorithm, which classifies the text in a document under a particular topic and is available in the sklearn library. In our function, we will also generate bigrams and trigrams from the individual tokens, and extract the top 3 topics for each slang word.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
# function to conduct topic modelling
def topic_modeller(column, no_topic, slang_word):
    # extend this slang word into stopwords
    stopwords = nltk.corpus.stopwords.words("english")
    stopwords.extend([slang_word])
    # set up vectorizer that removes stopwords and generates bigrams/trigrams
    tfidf_vectorizer = TfidfVectorizer(stop_words = stopwords, ngram_range = (2, 3))
    # set the number of topics in the lda model
    lda = LatentDirichletAllocation(n_components = no_topic)
    # create a pipeline that vectorises and then performs LDA
    pipe = make_pipeline(tfidf_vectorizer, lda)
    # run the pipe on the cleaned column
    pipe.fit(topic_column_cleaner(column, slang_word))
    # inner function to return the topics and associated words
    def print_top_words(model, feature_names, n_top_words):
        result = []
        for topic_idx, topic in enumerate(model.components_):
            message = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
            result.append(message)
        return result
    return print_top_words(lda, tfidf_vectorizer.get_feature_names(), n_top_words = 3)
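The pipeline above calls a helper topic_column_cleaner that is not shown in the article. Judging from column_cleaner, it is likely a variant that returns one cleaned string per tweet (rather than a flat list of tokens), since the vectorizer expects one document per tweet; the stopwords are already handled by the vectorizer itself. A minimal sketch under that assumption, reusing the re and html imports from earlier:
# assumed helper: clean each tweet and return one string per tweet for the vectorizer
def topic_column_cleaner(column, slang_word):
    # reuse the same lower-casing, filtering and punctuation/url removal as column_cleaner
    corpus = column.str.lower().tolist()
    corpus = [x for x in corpus if len(x.split(' ')) > 1 and x not in ('nan', 'nil')]
    cleaned = []
    for x in corpus:
        x = html.unescape(x.replace('\n', ' '))
        x = re.sub(r'(@[A-Za-z0-9_]+)|[^\w\s]|#|http\S+', '', x)
        cleaned.append(x)
    return cleaned
The function can then be called once per slang word, for example to extract the top 3 topics for sheesh:
# example call: top 3 topics for the tweets containing 'sheesh'
sheesh_tweets = tweets.loc[tweets.slang == 'sheesh', 'tweets']
sheesh_topics = topic_modeller(sheesh_tweets, no_topic = 3, slang_word = 'sheesh')
print(sheesh_topics)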

Data Visualisation/Analysis
Once we get our dataset ready, we can plot our findings using data visualization software, i.e. Tableau. As this article focuses more on the steps needed to collect the data and generate our insights, I shall not discuss how I plotted my findings in Tableau. You can refer to the Tableau dashboard on my Tableau Public profile.
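One straightforward way to hand the prepared data over to Tableau is to export the dataframes as CSV files and use them as data sources (a small sketch; the file names here are arbitrary):
# export the prepared dataframes for use as Tableau data sources
tweets.to_csv('tweets_with_sentiment.csv', index = False)
result.to_csv('tokenized_words.csv', index = False)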
Let’s take the slang word sheesh for example: we can filter our Tableau dashboard to that slang and the whole dashboard refreshes. Isn’t it a cool idea to use an iPhone wireframe as the user filter?

A total of 173 tweets were collected during the period of August 2021 in Singapore, and our polarity scores revealed that 31.8% of the tweets were positive, 47.4% neutral, and 20.8% negative. This suggests that the slang sheesh carries a more neutral-to-positive meaning.

In our Python code, we tokenized the tweets so that we can rank the words by their frequencies across all tweets containing that slang word. Our visualization shows that words like like, schmuck, and sembab (which means swollen in Indonesian) suggest sheesh was used to amplify the impact of something.

Looking at the 3 topics modeled, our assumption that sheesh was used to amplify the impact of something is further supported by topics like craving murtabak, sounds good, and cute girls.

Indeed, according to the TheSmartLocal article, the word sheesh is used similarly to damn to express disbelief or exasperation. Tweets like "Sheesh craving for murtabak" and "Sheesh, he is a lucky person" do reflect this meaning.
Ending Note
I hope this article was interesting and gives you some ideas on how Natural Language Processing techniques like sentiment analysis and topic modeling can help us better understand a collection of documents (i.e. our tweets). Have fun playing with the Tableau dashboard; it was definitely fun putting it together, no cap, sheesh!
