
Determining the Right LDA Topic Model Size, Part II


Photo by Kier In Sight on Unsplash

Despite the large number of blog posts and articles written about LDA topic model implementations, precious few delve into the most important and enigmatic decision that has to be made by the developer: choosing the number of topics to model. This article and its companion piece, Using Metrics to Determine The Right LDA Topic Model Size, provide an in-depth example of the systematic use of metrics to determine the best topic model size for a typical project.

Our heroes so far…

To summarize the first article: the task at hand is to create an LDA topic model that could be used by a general reader to categorize 30,000 news articles into roughly meaningful categories. In the first part of the exercise, thirty topic sizes, from five to one hundred and fifty in steps of five, were evaluated. Each model was run three times (to average out the statistical noise inherent in the stochastic characteristics of the LDA algorithm), and for each run nine metrics were computed: two PMI-based metrics, CV and NPMI, as well as seven metrics that evaluate each model's top topic words. Those interested in details about the corpus (a randomly selected 30,000-article subset of a larger set of news articles made available on Kaggle), as well as in the first steps of the evaluation and other technical details, should review the earlier article. This article continues the process begun there by supplementing the metric-based approach with a sampling of the topics and texts themselves.
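For readers who want a sense of what that sweep looks like in practice, the following is a minimal sketch of how it might be implemented with gensim's LdaModel and CoherenceModel. The library choice, the `load_tokenized_articles` helper, the hyperparameters and the variable names are assumptions for illustration, not the original implementation from Part I.

```python
# Hypothetical sketch of the Part I sweep: thirty topic sizes, three runs each,
# with the two PMI-based coherence metrics (CV and NPMI) computed per run.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

docs = load_tokenized_articles()   # hypothetical placeholder for the Part I preprocessing;
                                   # a list of token lists, one per article

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

coherence_scores = {}
for num_topics in range(5, 155, 5):                  # sizes 5 to 150 in steps of 5
    runs = []
    for seed in range(3):                            # average out stochastic noise
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=num_topics, random_state=seed, passes=10)
        cv = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                            coherence="c_v").get_coherence()
        npmi = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                              coherence="c_npmi").get_coherence()
        runs.append({"c_v": cv, "c_npmi": npmi})
    coherence_scores[num_topics] = runs
```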

In the first step, nine metrics were used to identify a candidate set of topic sizes: 5, 10, 20, 35, 50 and 80. Cursory examination of the resulting models led to the quick removal of the 5-, 50- and 80-topic models, which is where the previous article concluded. Here we will continue the evaluation and add three model sizes to get a clearer picture of the step-wise effects of differing model sizes. The models evaluated below are of sizes 10, 20, 30, 35, 40 and 45.

Topic and Fit Sampling

To perform the semantic analysis, each of the one hundred and eighty topics was rated on a scale of 1–3 across two criteria: cohesiveness and comprehensibility. Cohesiveness is a judgment of how well the words work together, and comprehensibility measures whether or not the top ten words work together to form an easily understood idea, subject or description. Topic clusters scoring a "1" were judged to be incoherent or incomprehensible. Those scoring "2" were judged to be "somewhat" coherent or comprehensible, and a score of "3" was granted to those topics that seemed "mostly" or better representative of their respective criteria. A similar scale was applied to a sampling of eighteen hundred randomly selected documents, ten from each topic for each of the six models.
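Purely as an illustration, the bookkeeping for these manual ratings could be as simple as the sketch below. The topic_ratings.csv file and its column names are hypothetical stand-ins for however the scores were actually recorded.

```python
# Hypothetical sketch: aggregate manual 1-3 topic ratings per model size.
import pandas as pd

# One row per rated topic; columns (assumed): model_size, topic_id,
# cohesiveness, comprehensibility -- each scored 1-3 by the reviewer.
ratings = pd.read_csv("topic_ratings.csv")

# Average the two criteria per model to compare the six candidate sizes.
summary = (ratings
           .groupby("model_size")[["cohesiveness", "comprehensibility"]]
           .mean()
           .round(2))
print(summary)   # average scores for the 10, 20, 30, 35, 40 and 45 topic models
```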

Within each model the average topic coherence and comprehensibility scores were close. Here, and as we will see below, the Twenty and Forty-five models score well. The topic evaluations are the only measures on which the Thirty model is competitive with the Twenty and Forty-five models.

Image by author.

While measuring topic cohesion and comprehensibility is a common way to evaluate topic model output, in this case the small sample sizes and the small relative differences between the models led the author to be hesitant to draw firm conclusions based on these results.

The next measure evaluated was how well a particular topic word cluster represents the semantics of an individual sample document. Here the Twenty and Forty-five models do best.

Image by author.

To obtain this metric, a total of eighteen hundred documents were individually reviewed and scored. On this measure the Ten topic model fared the worst, presumably because it has too few topics into which to group a large set of documents. While these results, like the topic cohesion/comprehensibility measures, show seemingly small differences between models, the sample sizes here are much larger.

Distributions Can Provide Clues

Another view of this data is a box plot, which gives some sense of the distribution of the averaged scores by topic. This graph shows that the Twenty model's scores are more uniformly distributed, even though its average score is slightly lower than that of the Forty-five model.

Image by author.
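A box plot of this kind takes only a few lines of pandas and matplotlib to produce. The sketch below assumes a hypothetical document_fit_ratings.csv file holding the manual per-document fit scores; the column names are illustrative, not the author's actual data layout.

```python
# Hypothetical sketch: distribution of per-topic average fit scores by model size.
import pandas as pd
import matplotlib.pyplot as plt

# Columns (assumed): model_size, topic_id, doc_id, fit_score (1-3)
doc_scores = pd.read_csv("document_fit_ratings.csv")

# Average the manual fit scores within each topic of each model.
topic_means = (doc_scores
               .groupby(["model_size", "topic_id"])["fit_score"]
               .mean()
               .reset_index())

topic_means.boxplot(column="fit_score", by="model_size")
plt.suptitle("")                                    # drop pandas' automatic super-title
plt.title("Per-topic average document fit by model size")
plt.xlabel("Model size")
plt.ylabel("Average fit score")
plt.show()
```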

While the per-document evaluation is a gold standard in topic model evaluation, and despite the fact that the Forty-five topic model scored better than the Twenty, we must consider that this score may be statistically skewed. This has to do with the original decisions about how to choose samples. The per-document evaluation was designed not only to measure overall model performance, but also to give more insight into how each topic was represented. As a result of sampling ten documents from each topic, the total number of documents reviewed (four hundred and fifty for the Forty-five topic model, two hundred for the Twenty) is not statistically equivalent across models. While 1.5% of the corpus was sampled for the larger model, less than 0.7% was sampled for the smaller. Furthermore, because the topics themselves are not evenly distributed across the documents (some topic clusters in the Forty-five model are very large or very small), the results may suffer further statistically.
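The arithmetic behind that imbalance is easy to see in code. The sketch below shows both the ten-per-topic sampling scheme and the resulting corpus fractions; the dominant_topics_45.csv file and its columns are hypothetical stand-ins for the per-document topic assignments.

```python
# Hypothetical sketch: sample ten documents per dominant topic and report
# what fraction of the 30,000-document corpus that represents.
import pandas as pd

CORPUS_SIZE = 30_000
DOCS_PER_TOPIC = 10

# Columns (assumed): doc_id, dominant_topic -- one row per document for one model.
assignments = pd.read_csv("dominant_topics_45.csv")

# Stratified sample: up to ten documents from each dominant topic.
sample = (assignments
          .groupby("dominant_topic", group_keys=False)
          .apply(lambda g: g.sample(n=min(DOCS_PER_TOPIC, len(g)), random_state=42)))

for num_topics in (20, 45):
    sampled = num_topics * DOCS_PER_TOPIC
    print(f"{num_topics}-topic model: {sampled} documents "
          f"({sampled / CORPUS_SIZE:.1%} of the corpus)")
# 20-topic model: 200 documents (0.7% of the corpus)
# 45-topic model: 450 documents (1.5% of the corpus)
```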

Dominance and Fit

It is possible to recover from the sampling problem by correlating the fit scores with data we know to be much more statistically trustworthy. Since we have the dominant contribution value (the weight of the single most likely topic for a given document) for each of the 30,000 documents, we can average those values per topic and then compare them to the per-topic average document fit ratings:

Image by author.

This graph compares, for each topic, the averaged fit rating to the averaged dominance likelihood, using the dominant topic values from all 30,000 documents in the corpus. The trend line shows a positive correlation between the two.
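A per-topic comparison like this can be assembled by averaging the dominance values over the full corpus and joining them to the per-topic average fit ratings. The sketch below is a hypothetical reconstruction with assumed file and column names, not the author's actual code.

```python
# Hypothetical sketch: correlate per-topic average dominance with per-topic
# average manual fit ratings, for each model size.
import pandas as pd

# Columns (assumed): model_size, topic_id, doc_id, dominance
#   dominance = weight of the document's single most likely topic
dominance = pd.read_csv("dominant_topic_values.csv")
# Columns (assumed): model_size, topic_id, doc_id, fit_score (1-3)
fit = pd.read_csv("document_fit_ratings.csv")

per_topic = (dominance.groupby(["model_size", "topic_id"])["dominance"].mean()
             .reset_index()
             .merge(fit.groupby(["model_size", "topic_id"])["fit_score"].mean()
                       .reset_index(),
                    on=["model_size", "topic_id"]))

# Pearson correlation between average dominance and average fit, per model size.
correlations = (per_topic.groupby("model_size")
                .apply(lambda g: g["dominance"].corr(g["fit_score"])))
print(correlations)
```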

This insight implies that the dominant topic values are significant in determining the semantics of the model. Because we have values for each of the 30,000 documents, sampling issues do not come into play. While this graph shows that the Ten model performs much better than the others, we already know that it is measuring overfit data and can be excluded. By plotting the overall dominance values by topic, we can see that the Twenty topic model has a clear advantage over the larger models:

Image by author

Taking this data into consideration and weighing it with all of the other indications, the Twenty model seems to be the best fit for our purposes.

Conclusion

This exercise has come full circle and returned to the Twenty topic model that was identified by both PMI-based metrics. These metrics have gained popularity because they are recognized as converging on topic sizes that match human judgments of topic quality. Intriguingly, it seems that in this case there was a correlation between the dominant topic values and the semantics of the model. The Ten topic model was an outlier in this respect: it had high dominance values but was not a good topic size. We might surmise that the dominance values are relevant as long as the topics themselves make sense. Having 100% certainty that a nonsensical topic cluster is present in a document isn't a good outcome. On the other hand, if we have a topic cluster that makes good semantic sense, we would expect its dominance values to be correlated with the semantic meaning. Interestingly, the non-PMI-based metrics seemed, at least in this case, to do a poor job of identifying an optimal topic size. Hopefully the process described here has increased your understanding of the issues involved in determining the best topic model size for a particular project and given you ideas you can use to power your own work.

