Topic Modeling of 2019 HR Tech Conference Twitter

Peng Wang
Towards Data Science
9 min read · Nov 4, 2019


HR Technology Conference and Expo, the world’s leading and largest conference for HR and IT professionals, just took place in Las Vegas from Oct 1–4, 2019. An incredible number of HR technology topics were covered at the conference. Unfortunately, not everyone could be there, including myself. Is it possible to tell what the buzzwords and topics are without being there? The answer is YES! We dug into Twitter for some quick insights.

We scrape tweets tagged #HRTechConf and build a Latent Dirichlet Allocation (LDA) model to automatically detect and interpret topics in the tweets. Here is the pipeline of the work:

  1. Data gathering — twitter scrape
  2. Text pre-processing
  3. Visualizing keywords in word cloud
  4. Training LDA model
  5. Visualizing topics

We use Python 3.6 and the following packages and methods:

  • TwitterScraper, a Python script to scrape tweets
  • NLTK (Natural Language Toolkit), an NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
  • Gensim, “generate similar”, a popular NLP package for topic modeling
  • Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
  • pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data

Data Gathering

Scraping Twitter through its official API has always been challenging. One of its biggest limitations is that it only allows searching tweets published in the past 7 days, which is a major bottleneck for anyone who needs older data to build a model. Luckily, TwitterScraper has no such limitation.

import datetime
import pandas as pd
from twitterscraper import query_tweets

# Search syntax reference: https://twitter.com/search-advanced
list_of_tweets = query_tweets('HRTechConf',
                              begindate=datetime.date(2019, 9, 26),
                              enddate=datetime.date(2019, 10, 6),
                              lang='en')
# Convert the list of tweet objects to a DataFrame
tweets_df = pd.DataFrame([vars(x) for x in list_of_tweets])

We scrape tweets that contain the keyword ‘HRTechConf’ from 09/26/2019 to 10/06/2019. Over these 10 days there are 7,274 tweets, with most tweets posted from Oct 1–3 (the first 3 days of the conference).

HRTechConf 2019 Number of Tweets

Let’s take a look at some tweets.

Charles Mah of @Workday @ #HRTechConf highlights the essential role of change management in transforming your team and your tech stack. It’s often underestimated and under-resourced but is critical to your success as a TA leader.

@josh_bersin talks about massive growth in #talentacquisition. @jobvite thinks the space is crowded and fragmented with point solutions. check out our TA tech infographic. https://bit.ly/2nQdmVq #HRTechConf

Text Pre-processing

Before we can apply any machine learning algorithm, the Twitter data needs to be pre-processed so that tweets are transformed into a form that is analyzable. Here are the pre-processing steps:

  1. Convert all words to lowercase
  2. Remove non-alphabet characters
  3. Remove short words (length less than 3)
  4. Tokenization: breaking sentences into words
  5. Part-of-speech (POS) tagging: process of classifying words into their grammatical category, in order to understand their roles in a sentence, e.g. verbs, nouns, adjectives, etc. POS tagging provides grammar context for lemmatization.
  6. Lemmatization: converting a word to its base form e.g. car, cars and car’s to car
  7. Remove common English words e.g. a, the, of, etc., and remove common words that add very little value to our analysis, e.g. hrtechconf, much, hr, get, etc.

import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Common words that add little value to the analysis
additional_stop_words = {'hrtechconf', 'much', 'hr', 'get'}

# 'text' holds the raw text of a single tweet
text_clean = text.lower()
# Remove URLs and non-alphabet characters, then split into words
text_clean = re.sub(r'[^a-zA-Z]|(\w+:\/\/\S+)', ' ', text_clean).split()
# Remove short words (length < 3)
text_clean = [w for w in text_clean if len(w) > 2]
# Lemmatize each word with the appropriate POS tag
lemmatizer = WordNetLemmatizer()
text_clean = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in text_clean]
# Filter out English stop words and the additional stop words above
stops = set(stopwords.words('english')).union(additional_stop_words)
text_clean = [w for w in text_clean if w not in stops]

After pre-processing, the aforementioned tweets are transformed to:

[‘charles’, ‘mah’, ‘workday’, ‘highlight’, ‘essential’, ‘role’, ‘change’, ‘management’, ‘transform’, ‘stack’, ‘often’, ‘underestimated’, ‘resourced’, ‘critical’, ‘success’, ‘leader’]

[‘josh’, ‘bersin’, ‘massive’, ‘growth’, ‘talentacquisition’, ‘jobvite’, ‘space’, ‘crowd’, ‘fragment’, ‘point’, ‘solution’, ‘infographic’]

Word Count and Word Cloud

It is difficult to interpret the meaning of a single word. For example, machine could be a mechanical machine, a machine gun, or machine learning. Instead, we use bigrams, pairs of consecutive words, for the word count. Here are the top 20 most frequent phrases in all extracted tweets.

HR Tech Twitter Word Count — Bigrams
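For reference, here is a minimal sketch of how such bigram counts could be produced; it assumes a hypothetical list cleaned_tweets that holds the pre-processed token list of each tweet (like the examples above):

from collections import Counter
# Count bigrams (pairs of consecutive tokens) across all pre-processed tweets
bigram_counts = Counter()
for tokens in cleaned_tweets:
    bigram_counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
# Print the 20 most frequent bigrams
for phrase, count in bigram_counts.most_common(20):
    print(phrase, count)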

What can we tell from this list?

  • “josh bersin” is the most tweeted bigram: Josh Bersin is a keynote speaker.
  • ADP has a lot of Twitter coverage (“adp platform”, “thrill adp”, “action adp”, “adp happens”): ADP is a Gold Sponsor of the conference, and its human capital management (HCM) platform earned some top spotlights at the event.
  • “artificial intelligence” appears in many tweets.
  • “woman summit” is mentioned many times: the conference also features Women in HR Tech.

Here is the word cloud:

HR Tech Twitter Wordcloud
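A word cloud like the one above can be drawn with the wordcloud package; a minimal sketch, reusing the bigram_counts from the previous snippet:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Build the word cloud from the bigram frequencies counted above
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(bigram_counts)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()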

Feature Extraction

Tweets, or any text, must be converted into vectors of numbers that describe the occurrence of words in the text (or corpus). The technique we use is called Bag of Words, a simple method of extracting text features. Here are the steps.

Bag of Words (Bigrams)

Count bigrams

A bigram is a sequence of two words, e.g. machine learning, talent acquisition. Often, a bag-of-bigrams is more powerful and produces more meaningful insights than a bag-of-words (single words, or unigrams).
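One simple way to turn each cleaned tweet into bigram tokens (the token_tweets used in the dictionary step below) is to join consecutive word pairs. A minimal sketch, where cleaned_tweets is again the hypothetical list of pre-processed token lists:

# Turn each cleaned tweet (a list of words) into a list of bigram tokens,
# e.g. ['josh', 'bersin', 'keynote'] -> ['josh bersin', 'bersin keynote']
def to_bigrams(tokens):
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]
token_tweets = [to_bigrams(tokens) for tokens in cleaned_tweets]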

Build a dictionary

We build the dictionary (or vocabulary) of the tweets, in which every unique bigram is given an ID and its frequency count is captured. We use the Gensim library to build the dictionary.

We exclude from the dictionary all tokens that occur in fewer than 10 tweets, so that we only keep the most representative bigrams. Bigrams that occur in more than 50% of the tweets, e.g. near future, next year, are also removed.

from gensim import corpora
tweets_dict = corpora.Dictionary(token_tweets)
tweets_dict.filter_extremes(no_below=10, no_above=0.5)

Rebuild corpus based on the dictionary

Now, using the dictionary above, we generate a word count vector for each tweet, which holds the frequency of every dictionary entry in that particular tweet. Most of the entries are zero, as a single tweet contains only a small fraction of the dictionary.

bow_corpus = [tweets_dict.doc2bow(doc) for doc in token_tweets]

TF-IDF

Bag of Words is based on the principle that a document is represented by the words that occur most frequently in it, i.e. words with high “term frequency” (TF). A problem with this approach is that highly frequent words start to dominate the document representation, yet they may not be as informative to the model as less-frequent words.

One way to fix this is to measure how unique (or how infrequent) a word is across all documents (or tweets), which is called the “inverse document frequency” or IDF. By introducing IDF, frequent words that are also frequent across all documents get penalized with less weight.

In TF-IDF, we calculate TF x IDF as the weight for each word in the dictionary, representing how frequent this word is in the tweet multiplied by how unique the word is across all tweets. The TF-IDF weight highlights the words that are distinct (relevant information) in a tweet.
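As an illustration only, a common textbook formulation of this weight looks like the following (Gensim’s TfidfModel has its own default weighting and normalization, so the exact numbers differ; the helper below is hypothetical):

import math
def tfidf_weight(term_count, doc_length, num_docs, doc_freq):
    """Illustrative TF-IDF weight of one term in one tweet."""
    tf = term_count / doc_length          # how frequent the term is in this tweet
    idf = math.log(num_docs / doc_freq)   # how rare the term is across all tweets
    return tf * idf

In practice, we let Gensim compute the weights: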

from gensim import models
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

Topic Modeling Via LDA

Now we are ready to train our LDA model to learn topics from the tweets.

LDA, an unsupervised machine learning algorithm, is a generative statistical model that takes documents as input and finds topics as output. Each document is modeled as a mixture of topics, and each topic is characterized by a distribution over words. LDA is a popular probabilistic topic modeling algorithm. Here is an introduction to Latent Dirichlet Allocation.

We use the Gensim library to train our LDA model on over 7,000 tweets. Some model hyper-parameters to tune:

  • Number of topics: Each topic is a set of keywords, each contributing a certain weight (i.e. importance) to the topic. Too few topics result in non-cohesive topics containing heterogeneous sets of words, which are difficult to distinguish. Too many topics produce fragmented topics with little semantic meaning. Given that it was a big event and the tweets covered several days, the number of topics across all tweets is likely to be high. We set the number of topics to 30. We also tried 10, 20 and 50, and none showed better results than 30.
  • Alpha (Document-Topic Density): The lower alpha is, the more likely it is that a tweet contains a mixture of just a few topics. The default is 1.0/NUM_TOPICS. We used 0.001, as each tweet is quite short and is very likely to cover only one topic.
  • Eta (Word-Topic Density): The lower eta is, the more likely it is that a topic contains a mixture of just a few words. The default is 1.0/NUM_TOPICS. We used the default.
  • Number of passes: Number of training passes/iterations over all tweets. We set it to 50. We tried bigger numbers and they did not produce better results.

from gensim import models
NUM_TOPICS = 30
NUM_PASSES = 50
ALPHA = 0.001
ETA = 'auto'
# Train LDA model on the TF-IDF weighted corpus
lda_model = models.ldamodel.LdaModel(corpus=tfidf_corpus,
                                     num_topics=NUM_TOPICS,
                                     id2word=tweets_dict,
                                     passes=NUM_PASSES,
                                     alpha=ALPHA,
                                     eta=ETA,
                                     random_state=49)

Below are some example topics. The higher a bigram’s number, the bigger its weight (importance) in the topic.
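Output like the examples below can be printed with Gensim’s print_topics; a minimal sketch:

# Print each learned topic as a weighted combination of bigrams
for topic_id, words in lda_model.print_topics(num_topics=NUM_TOPICS, num_words=10):
    print('Topic:', topic_id)
    print('Words:', words)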

Topic: 20
Words: 0.823*"josh bersin" + 0.055*"opening keynote" + 0.018*"woman opening" + 0.012*"opportunity connect" + 0.007*"amaze event" + 0.006*"rebeccahrexec katieachille" + 0.006*"jeanneachille rebeccahrexec" + 0.006*"steveboese jeanneachille" + 0.005*"event feel" + 0.005*"lucky opportunity"

It seems this topic is about “josh bersin” addressing the “opening keynote” and the “woman opening”. After checking the HR Tech Conf website, we can confirm that Josh Bersin is a keynote speaker and that the Women in HR Tech opening takes place at the same venue. Here is an example tweet:

Checking in from @Josh_Bersin’s opening keynote at #HRTechConf today! Our team loved hearing Josh share his insights on how technology is shaping and creating new opportunities for HR. pic.twitter.com/ku27LKzmcb

Topic: 3 
Words: 0.259*"kronos ceo" + 0.059*"rule follow" + 0.041*"kronos highly" + 0.041*"surprisingly simple" + 0.041*"simple rule" + 0.041*"reveals surprisingly" + 0.041*"replicate success" + 0.041*"workinspired kronos" + 0.041*"follow replicate" + 0.041*"culture reveals"

It seems this topic is about the Kronos CEO revealing some surprisingly simple rules (of culture? for success?). Here is a tweet:

#hrconfes RT KronosInc: In #WorkInspired, Kronos CEO Aron Ain takes you inside Kronos highly admired culture — and reveals the surprisingly simple rules you can follow to start replicating that success. #HRTechConf

Topic: 7
Words: 0.187*"marcus buckingham" + 0.177*"book signing" + 0.138*"free copy" + 0.124*"nine lie" + 0.079*"present lie" + 0.079*"buckingham present" + 0.001*"peopledoc inc" + 0.001*"drive engagement" + 0.001*"around world" + 0.001*"answer question"

It seems this topic is about Marcus Buckingham holding a book signing and giving away free copies (of Nine Lies About Work, presumably). Here is a tweet:

#hrtechconf book signing with Marcus Buckingham #FreeThinkingCoalition #9liesaboutwork https://www.instagram.com/p/B3KhdHWlRgRIFxreqnoIXDaZNthcdr-_TJ8aeo0/?igshid=19rvtzygznze1

Topic Visualization

To make it more interesting, we use pyLDAvis, an interactive LDA visualization package, to plot all generated topics and their keywords. pyLDAvis calculates the semantic distance between topics and projects the topics onto a 2D plane.

import pyLDAvis.gensim
# Prepare the visualization from the trained model, TF-IDF corpus and dictionary
lda_data = pyLDAvis.gensim.prepare(lda_model, tfidf_corpus, tweets_dict, mds='mmds')
pyLDAvis.display(lda_data)

Here is the link to the interactive page.

Each bubble on the left represents a topic. The size of a bubble represents the prevalence of the topic, and the distance between bubbles reflects the similarity between topics: the closer two circles are, the more similar the topics are.

A good topic model has a few dominant, larger bubbles, with smaller ones scattered across the plane. If a plot has many overlapping small circles, it is typically a sign of having too many topics.

The right-hand side shows the top 30 most relevant bigrams for the selected topic. Hovering over a bubble updates the list of bigrams on the right, and selecting a bigram from the list highlights the circle(s) in which it appears.

HR Tech Conf 2019 Topics LDA

Closing Notes

We scraped #HRTechConf tweets, generated a word cloud to show the buzzwords, and built an LDA model to learn the topics of the tweets.

LDA is difficult to train and its results need human interpretation, yet it is powerful and intuitive. Our experiment shows that the words of the learned topics are not perfectly coherent, but they are definitely relevant to their topics.

Future works to improve the LDA results:

  • Optimize hyper-parameters using grid search
  • Formally determine the number of topics that gives the best modeling performance, e.g. via a topic coherence score (see the sketch below)
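For the second point, one common approach is to sweep the number of topics and compare topic coherence scores with Gensim’s CoherenceModel. A minimal sketch, reusing tfidf_corpus, tweets_dict, token_tweets and NUM_PASSES from above:

from gensim.models import CoherenceModel, LdaModel
# Train one model per candidate topic count and compare c_v coherence
for k in (10, 20, 30, 50):
    model = LdaModel(corpus=tfidf_corpus, num_topics=k, id2word=tweets_dict,
                     passes=NUM_PASSES, random_state=49)
    cm = CoherenceModel(model=model, texts=token_tweets,
                        dictionary=tweets_dict, coherence='c_v')
    print(k, cm.get_coherence())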

All code can be found on GitHub.

Happy Machine Learning!
