Hands-on Tutorials

How to Make Topic Models Interpretable: 3 New Ideas

Three Innovative Techniques for Tuning LDA Topic Model Outputs

Ramya Balakrishnan
Towards Data Science
6 min read · Apr 23, 2021


Topic modeling is an unsupervised machine learning approach that scans a set of documents, detects word and phrase patterns within them, and automatically clusters the word groups and similar expressions that best characterize that set of text responses (or documents). To date, Latent Dirichlet Allocation (LDA) has been one of the most popular topic modeling techniques, used widely across different industrial applications. A detailed explanation of LDA is provided here. LDA is a probabilistic model applied over a set of words, and it has two key outputs:

i. A list of topics from a collection of documents. Each topic is a set of frequently co-occurring words, expressed as a probability distribution over the vocabulary.

Image by Divya Dwivedi [Source]

ii. A probability distribution over the identified topics for each document.
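
To make these two outputs concrete, here is a minimal sketch using gensim (the library choice and the toy documents are assumptions of mine; any LDA implementation produces the same two outputs):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy documents standing in for the open-ended challenge responses.
docs = [
    "managing organizational change and talent retention",
    "talent development and succession planning challenges",
    "navigating uncertainty in a dynamic business environment",
    "strategic planning under market uncertainty",
]
tokenized = [doc.split() for doc in docs]

dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Fit LDA; num_topics is a modeling choice (2 for this toy corpus).
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Output (i): each topic as a probability distribution over words.
for topic_id in range(lda.num_topics):
    print(lda.show_topic(topic_id, topn=5))

# Output (ii): each document as a probability distribution over topics.
for bow in corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```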

For a large sample, these probability distributions are huge matrices containing millions of numbers, which makes it challenging for social science researchers to extract key insights. In my case, I analyzed open-ended text responses from the Center for Creative Leadership database of 43,000 leaders collected over a 10-year period, in which we asked them to respond to the following question: what are the 3 most critical leadership challenges you are currently facing? On average, leaders wrote about 100 words across the 3 challenges (Balakrishnan, Young, Leslie, McCauley, & Ruderman, 2021).

For the purposes of this blog, I will be using a subset of the challenge dataset from Top Executive (C-suite) leaders. I used LDA and implemented different techniques to better address the questions that social science researchers often ask about these data.

The key objective of this blog is to understand the relationships between the extracted topics, evaluate sub-group differences, and determine if and how topic prevalence may change over time. A step-by-step procedure for each of the following techniques is provided below.

1. Inter-correlation between Topics (Understand the relationships between the extracted topics)

2. Identifying meaningful and statistically significant differences between the topic proportions of two different subsets in a corpus (Evaluate sub-group differences)

3. The Coffey Feingold Bromberg (CFB) measure to identify topic variability over time (How does topic prevalence change over time?)

Inter-correlation between Topics:

LDA-derived topics can be inter-correlated with each other. For example, a document about a genetics topic is more likely to also be about an infectious diseases topic than about astronomy or politics. To examine this in my dataset, I looked at the relationships between the topics. It is important to remember that the topic probabilities within a document must sum to 100%: if a document is high on a given topic, it should theoretically be lower on the others. To work around this dependence, the document-topic matrix can be dichotomized so that, for each topic, probabilities above the median are converted into 1’s and probabilities at or below the median are converted into 0’s. Correlation coefficients can then be calculated between the topic columns of the dichotomized matrix to determine the inter-correlations between the topics.
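
A minimal sketch of this dichotomization in Python, assuming the document-topic matrix comes from an LDA model (simulated here with Dirichlet draws; in practice you would plug in your model's output):

```python
import numpy as np
import pandas as pd

# Hypothetical (n_documents x n_topics) matrix of LDA topic
# probabilities; each row sums to 1, mimicking a document-topic matrix.
rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(alpha=np.ones(6), size=500)
topics = [f"Topic {i + 1}" for i in range(doc_topic.shape[1])]
df = pd.DataFrame(doc_topic, columns=topics)

# Dichotomize each topic at its median: probabilities above the
# median become 1, those at or below the median become 0.
dichotomized = (df > df.median()).astype(int)

# Pairwise correlations between the dichotomized topic columns.
topic_corr = dichotomized.corr()
print(topic_corr.round(2))
```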

Figure 1: Topic Inter-correlations for Top Executives.

The figure above shows the inter-correlations between the topics, which describe the challenges faced by Top Executive leaders. For example, a leader is more likely to report challenges related to ‘Organizational Readiness Amid Uncertainty’ if they report challenges related to ‘Organizational Talent Issues’, and vice versa.

The ties between topics represent correlations greater than or equal to .16, and the size of each node indicates the topic’s proportion. The value 0.16 was chosen based on Bosco and colleagues’ (2015) correlational benchmarks, which found the median field-level effect size in the applied psychology literature to be .16.
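
A sketch of how such a network could be drawn, continuing the dichotomization sketch above (it reuses df and topic_corr; networkx and the spring layout are my choices, not necessarily what produced Figure 1):

```python
import networkx as nx
import matplotlib.pyplot as plt

THRESHOLD = 0.16  # Bosco et al. (2015) median effect-size benchmark

# Nodes are topics, sized by their average proportion; edges keep
# only inter-correlations at or above the threshold.
G = nx.Graph()
mean_props = df.mean()
for topic in topic_corr.columns:
    G.add_node(topic, size=mean_props[topic])
for i, t1 in enumerate(topic_corr.columns):
    for t2 in topic_corr.columns[i + 1:]:
        r = topic_corr.loc[t1, t2]
        if r >= THRESHOLD:
            G.add_edge(t1, t2, weight=r)

sizes = [G.nodes[t]["size"] * 5000 for t in G.nodes]  # scale for plotting
pos = nx.spring_layout(G, seed=0)
nx.draw_networkx(G, pos, node_size=sizes)
plt.axis("off")
plt.show()
```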

Identifying meaningful and statistically significant differences between the topic proportions of two different subsets in a corpus:

The leadership challenge dataset included a diverse set of leaders from 32 different industries. One of the objectives of my project was to compare differences in topic proportions between industries (e.g., what is the difference between Executives from the Manufacturing (184 leaders) and Finance (211 leaders) industries?). I conducted significance tests along with effect size tests to check whether the differences were meaningful as well as statistically significant. Because of the skewness and unequal sample sizes in the data, I chose the Mann-Whitney U test to determine statistical significance. To determine practical significance, the r effect size measure was calculated; a difference in topic probabilities is practically significant when the r effect size is at least .1, according to Cohen’s standards (Cohen, 1988; 1992). Based on this analysis, 3 out of 6 topics showed significant differences between Education and Financial Services leaders.

Table 1: Topic proportions and statistics for Education vs. Financial Services leaders
Figure 2: Topic probabilities for Education and Financial Services leaders
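
A sketch of the test described above, assuming each group’s probabilities for a single topic are in a NumPy array (the beta-distributed samples are stand-ins for the real data, and the z conversion uses the normal approximation without tie correction):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def mann_whitney_r(x, y):
    """Mann-Whitney U test plus the r effect size (r = |z| / sqrt(N))."""
    u, p_value = mannwhitneyu(x, y, alternative="two-sided")
    n1, n2 = len(x), len(y)
    # Normal approximation of U (no tie correction), converted to z.
    mu = n1 * n2 / 2
    sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, p_value, abs(z) / np.sqrt(n1 + n2)

# Hypothetical topic probabilities for one topic in two industries.
rng = np.random.default_rng(1)
group_a = rng.beta(2, 8, size=184)  # e.g., 184 leaders in one industry
group_b = rng.beta(3, 7, size=211)  # e.g., 211 leaders in another
u, p, r = mann_whitney_r(group_a, group_b)
print(f"U = {u:.0f}, p = {p:.4f}, r = {r:.2f}")
```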

Coffey Feingold Bromberg (CFB) measure to identify topic variability over time:

The CFB score is a measure of variability for a weighted set of proportions. The score ranges from 0 to 1, with higher values indicating higher variability. Formally, the Coffey measure C({p1, …, pn}, {w1, …, wn}) is the observed weighted variance of the proportions, normalized by the largest variance they could have exhibited:

$$C(\{p_1, \ldots, p_n\}, \{w_1, \ldots, w_n\}) = \frac{\sum_{i=1}^{n} w_i \,(p_i - \bar{p})^2}{\sum_{i=1}^{n} w_i \,(p_i^{*} - \bar{p})^2}, \qquad \bar{p} = \frac{\sum_{i=1}^{n} w_i \, p_i}{\sum_{i=1}^{n} w_i},$$

where {p1, …, pn} is the set of proportions, {w1, …, wn} is the set of weights, and {p1*, …, pn*} is a set of proportions in [0, 1] that would yield maximal variance under the constraint that the weighted mean is preserved:

$$\frac{\sum_{i=1}^{n} w_i \, p_i^{*}}{\sum_{i=1}^{n} w_i} = \bar{p}.$$
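
A minimal Python sketch of this computation (my own sketch, not the authors’ code): the denominator below uses the upper bound p̄(1 − p̄) on the maximal weighted variance, which the maximizing set {p1*, …, pn*} attains whenever the weights can be partitioned to hit the mean exactly, so for arbitrary weights it is an approximation:

```python
import numpy as np

def cfb(proportions, weights):
    """Approximate CFB score: weighted variance of the proportions,
    normalized by the maximal variance attainable at the same weighted
    mean. The denominator uses the upper bound p_bar * (1 - p_bar), so
    this is an approximation when the weights cannot be partitioned to
    hit the mean exactly."""
    p = np.asarray(proportions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize weights to sum to 1
    p_bar = np.sum(w * p)                  # weighted mean proportion
    if p_bar <= 0.0 or p_bar >= 1.0:       # no variability is possible
        return 0.0
    variance = np.sum(w * (p - p_bar) ** 2)
    max_variance = p_bar * (1.0 - p_bar)   # upper bound on the denominator
    return variance / max_variance

# Hypothetical use: one topic's yearly prevalence, weighted by the
# number of responses collected each year (made-up numbers).
yearly_share = [0.18, 0.22, 0.19, 0.25, 0.21]
yearly_n = [3200, 4100, 3900, 4600, 4400]
print(round(cfb(yearly_share, yearly_n), 3))
```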

The challenge dataset of 43,000 leaders was collected from 2010 to 2020. To analyze how the challenge topics varied over the past decade, I calculated the CFB measure for each topic. The figure below shows the CFB measure for the challenge topics from 2010–2020.

Figure 3: CFB for Top Executives from 2010–2020

To better understand why the Executives’ CFB scores were more varied, we examined the challenges by year. At the Leading the Organization level, a slight fluctuation was observed for the ‘Dynamic Business Environment’ and ‘Strategic Responsibilities’ challenges from 2014–2018. Though it is difficult to explain why this may have occurred, both challenges remained consistently high for Executives.

Figure 4: Average Topic Proportion in % for Top Executives Model Over Time

The CFB measure was implemented by Bovens, Chatkupt and Smead (2012) to calculate the variability of recognition rates for asylum seekers in the EU. Their implementation is in Mathematica (version 8); click here for a Mathematica notebook containing the functions used to perform the Coffey measure calculations that appear in Bovens, Chatkupt and Smead (2012).

Topic modeling is one of my favorite ways to explore themes in text data, and there are many other ways to analyze topic modeling results in addition to the ideas mentioned above. The techniques explained here were useful for my line of research in I/O psychology. If you have any other interesting ideas or tips for interpreting topic modeling results, please let me know in the comments below.

References:

Balakrishnan, R., Young, S., Leslie, J., McCauley, C., & Ruderman, M. (2021). Leadership Challenge Ladder (LCL) Technical Report. Greensboro, NC: Center for Creative Leadership.

Bosco, F., Aguinis, H., Singh, K., Field, J., & Pierce, C. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449.

Bovens, L., Chatkupt, C., & Smead, L. (2012). Measuring common standards and equal responsibility sharing in EU asylum outcome data. European Union Politics.

Coffey, M. P., Feingold, M., & Bromberg, J. (1988). A normed measure of variability among proportions. Computational Statistics & Data Analysis, 7(2), 127–141.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
