“That’s Mental!” Using LDA Topic Modeling to Investigate the Discourse on Mental Health over Time

Charissa R.
Towards Data Science
7 min read · May 19, 2018

For this project I set out to investigate the contexts in which ‘mental health’ has been brought up over time. For this purpose, I collected ~30k New York Times articles from the 80s to present to analyze using topic modeling.

Tools

From beginning to end, I used the following tools:

  1. NYT ArticleSearch API to get historical metadata for articles that contain ‘mental health’ (see the sketch after this list), and NYT Archive API to get the total number of articles published per month
  2. BeautifulSoup to scrape body text for the ~30k NYT articles (don’t worry, I also pay for my monthly subscription)
  3. NLTK for preprocessing and cleaning of text
  4. AWS & Jupyter Notebook
  5. Gensim and Pandas for modeling and analysis
  6. Tableau and Google GraphViz Charts for visualizations
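
As an illustration of step 1, here is a minimal sketch of querying the ArticleSearch API for article metadata. The endpoint and parameter names follow the NYT's public v2 API documentation; the API key is a placeholder, and pagination, rate limiting, and the Archive API call are omitted for brevity.

import requests

API_KEY = "YOUR_NYT_API_KEY"  # placeholder -- use your own key
SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def search_page(query, page, begin_date, end_date):
    """Fetch one page (10 docs) of ArticleSearch metadata for a query."""
    params = {
        "q": query,
        "begin_date": begin_date,  # e.g. "19800101"
        "end_date": end_date,      # e.g. "20180519"
        "page": page,
        "api-key": API_KEY,
    }
    resp = requests.get(SEARCH_URL, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# Example: first page of results for articles mentioning "mental health"
docs = search_page('"mental health"', page=0,
                   begin_date="19800101", end_date="20180519")
urls = [d["web_url"] for d in docs]  # these URLs feed the BeautifulSoup scraping step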

The Articles

Looking at the number of articles referencing mental health (the green area / right axis in the chart), we see that there has been an increase from about 750 articles per year in the 80s to around 1000 at present. Strangely, it seems like the total number of articles published by the New York Times has decreased over the years (something deserving of further research, another time).

Preprocessing

After collecting the raw article text for all articles and doing some initial text cleaning (removing tabs, whitespace, etc.), I applied the following preprocessing steps (a minimal sketch follows the list):

  • ✔ Named Entity Removal (manually excluded ~650 of 1000 most frequent named entities, such as cities, countries, institutions)
  • ✔ Tokenization
  • ✔ Removing punctuation
  • ✔ Lowercasing
  • ✔ Stopword removal (using 2 built-in lists, plus ~5 rounds of manual removal while modeling)
  • ✔ Lemmatization
  • ✘ Stemming (decided against, a small test showed significant loss of meaning, e.g. ‘witness’ turned into ‘wit’)
  • ✔ CountVectorizer to slim down the dictionary
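
Here is a minimal sketch of the core cleaning steps with NLTK. The named-entity removal and CountVectorizer steps are omitted, and raw_articles and the custom stopword set are placeholders for the actual data and the words removed over the manual rounds.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: tokenizer models, stopword lists, WordNet for lemmatization
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
CUSTOM_STOPWORDS = {"said", "mr", "ms"}  # illustrative; grown over ~5 manual rounds
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, drop stopwords, lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [
        lemmatizer.lemmatize(t)
        for t in tokens
        if t and t.isalpha() and t not in STOPWORDS and t not in CUSTOM_STOPWORDS
    ]

# raw_articles: list of article body strings collected earlier (assumed available)
clean_docs = [preprocess(article) for article in raw_articles]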

Modeling

Latent Dirichlet Allocation

To analyze the contexts in which mental health is referenced, I chose to perform topic modeling using LDA. Latent Dirichlet Allocation assumes that each document is generated from its own probability distribution over a fixed set of x topics, and that each topic is itself a distribution over words. The method is appropriate when you can assume that each body of text you’re analyzing only covers a few topics. While running, the algorithm adjusts each document’s topic distribution by learning from the collection of text bodies as a whole. In this case, the assumption is that NYT articles containing ‘mental health’ only touch on a couple of topics each, which justifies the use of LDA.
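
Before fitting the model, the cleaned token lists need to be turned into a dictionary and a bag-of-words corpus that gensim’s LDA can consume. The sketch below does this with gensim’s Dictionary; the filter_extremes thresholds are illustrative, not the exact ones used in the project.

from gensim.corpora import Dictionary

# clean_docs: list of token lists from the preprocessing step above
dictionary = Dictionary(clean_docs)

# Slim down the vocabulary: drop very rare and very common tokens
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Bag-of-words representation of every article
corpus = [dictionary.doc2bow(doc) for doc in clean_docs]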

Setting the Hyperparameters

When running the model, you have to set a couple of parameters (see the sketch after this list):

  • num_topics: how many topics you expect to be present in the entire dataset
  • passes: how many times to go through all the data to construct the topics. During these passes, the model then finds the best x groups of words that often occur together and seem separate from other groups of words–these are the topics.
  • alpha: the initial distribution of topics across the dataset from which the model starts learning. I set alpha to ‘auto’, so that it would learn the priors from my dataset rather than assuming a symmetrical distribution. This unfortunately meant that I could not use the Multicore version of the LDA model, which significantly increased the runtime of each of my test models. Alternatively, you can set alpha to ‘symmetric’ and use the Multicore model.
  • eval_every: this is the frequency with which the topics are updated from the data that is passed to the model. I set this to 2000, meaning after every 2000 articles the model would update the topics.
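
Put together, the training call looks roughly like this. The corpus and dictionary come from the earlier sketch; random_state is added here just for reproducibility and was not mentioned in the original setup.

from gensim.models import LdaModel

lda = LdaModel(
    corpus=corpus,        # bag-of-words corpus built above
    id2word=dictionary,   # maps token ids back to words
    num_topics=14,        # topics expected across the whole dataset
    passes=200,           # full sweeps over the corpus
    alpha="auto",         # learn an asymmetric topic prior from the data
    eval_every=2000,      # as set in the text above
    random_state=42,      # illustrative, for reproducibility
)

# With alpha="symmetric" you could instead parallelize with LdaMulticore:
# from gensim.models import LdaMulticore
# lda = LdaMulticore(corpus, id2word=dictionary, num_topics=14,
#                    passes=200, workers=3)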

Picking the Right No. of Topics

I ran exploratory models for up to 25 topics with 20 passes (the algorithm saw all the data 20 times). After repeatedly checking the most frequent couple dozen words associated with each topic, I moved on to models with 12–20 topics while passing the data 50 times, then 12–16 topics with 200 passes (approximately 8 hours on my high-powered AWS instance; expect to run these overnight).

Picking the ‘right’ number of topics is a very manual task. You have to investigate each topic and the associated words manually, and also apply the topics back to the articles to see how well the different topics actually describe the articles. It’s a continuous back and forth until you find the ‘sweet spot’.
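
A rough sketch of what these exploratory runs could look like: fit a model for each candidate topic count and print the top words per topic so they can be inspected by eye. Gensim’s CoherenceModel could complement this, but the final call here was manual.

# Explore a range of topic counts and eyeball the top words for each
for k in range(12, 21, 2):
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=k, passes=20, alpha="auto")
    print(f"--- {k} topics ---")
    for topic_id, words in model.show_topics(num_topics=k, num_words=10,
                                             formatted=False):
        print(topic_id, [w for w, p in words])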

From the resulting models, I eventually chose the model with 14 topics, as this resulted in the most distinct topics. I ran this final model one last time with 500 passes to let it converge. Below is an overview of the 5 most frequently occurring words within each of these 14 topics:

[(0,
[('game', 0.019290676),
('team', 0.014655967),
('player', 0.013197457),
('sport', 0.0096272184),
('play', 0.0086391047)]),
(1,
[('child', 0.071610503),
('school', 0.039327879),
('parent', 0.023915363),
('family', 0.020240301),
('student', 0.01847663)]),
(2,
[('life', 0.011472762),
('people', 0.011282165),
('woman', 0.0072893538),
('family', 0.0068212799),
('home', 0.0065922942)]),
(3,
[('art', 0.0068352227),
('book', 0.0065797614),
('film', 0.0050195237),
('street', 0.0049045617),
('life', 0.0043060104)]),
(4,
[('state', 0.024362968),
('president', 0.016713507),
('governor', 0.013491742),
('republican', 0.012855075),
('budget', 0.011843665)]),
(5,
[('court', 0.027797233),
('state', 0.021660788),
('case', 0.020019565),
('judge', 0.016594727),
('lawyer', 0.015601465)]),
(6,
[('woman', 0.033056237),
('gun', 0.028255416),
('law', 0.018436665),
('abortion', 0.014594568),
('violence', 0.014348775)]),
(7,
[('study', 0.01611951),
('disease', 0.012828876),
('brain', 0.012020445),
('health', 0.0082923695),
('percent', 0.0081615159)]),
(8,
[('patient', 0.038924955),
('mental', 0.020712202),
('treatment', 0.020659719),
('drug', 0.017774586),
('doctor', 0.01549102)]),
(9,
[('city', 0.025483603),
('state', 0.018783275),
('hospital', 0.016500311),
('people', 0.013982687),
('service', 0.012257091)]),
(10,
[('police', 0.031944588),
('officer', 0.017374802),
('man', 0.012399676),
('shot', 0.0080928812),
('death', 0.0073555997)]),
(11,
[('people', 0.011733172),
('life', 0.0056124884),
('question', 0.0054705907),
('work', 0.0054668467),
('social', 0.0050076959)]),
(12,
[('war', 0.015990684),
('veteran', 0.013699779),
('military', 0.011781215),
('government', 0.0091844629),
('country', 0.0084220674)]),
(13,
[('health', 0.024999907),
('care', 0.019101448),
('company', 0.015718631),
('state', 0.011817242),
('percent', 0.011760511)])]

After investigating up to 50 words within each topic group the model defined, I came up with the following crude topics:

Names for the 14 topics the model defined

Results

Topic Shares over Time

After having found the 14 topics present in the collection of text, I applied the model back to the articles to get the share of each topic within the individual articles.

Sample of article IDs with their topic shares

With a threshold of 0.15, the articles contained a maximum of 5 topics. Below is a sample of articles with the corresponding topic shares as determined by the LDA model.

Introductory paragraphs of a sample of articles with their corresponding topic shares (>0.15 threshold) as determined by the LDA model.
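
Here is a minimal sketch of how these per-article shares, and the yearly aggregation described next, might be computed; article_years is assumed to be a list of publication years aligned with the corpus.

import pandas as pd

# Per-article topic shares above the 0.15 threshold
doc_topics = [
    dict(lda.get_document_topics(bow, minimum_probability=0.15))
    for bow in corpus
]

# article_years: publication year of each article, aligned with `corpus` (assumed available)
shares = pd.DataFrame(doc_topics).fillna(0.0)
shares["year"] = article_years

# Aggregate topic shares per year and normalize against each year's total
yearly = shares.groupby("year").sum()
yearly_pct = yearly.div(yearly.sum(axis=1), axis=0)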

Based on these topic shares for each of the articles, I then aggregated the shares of topics per year and compared each topic’s share against the year’s total. The resulting visualization can be seen below:

From here, we can see that over time, family/home/work-life and sociological trends have taken up a larger share in the total number of articles mentioning ‘mental health’. Community programs/services seems to be less of a topic in the discourse over time.

Looking at a separate view of the absolute values for the topics over time, all of them have either increased in presence over time, or stayed the same, except for community programs/services.

Co-occurrence of Topics

I also took a closer look at the topics that occur together in the articles. Below is a clustered dendrogram visualization of the primary and secondary topics of the articles in the dataset.

Clustered Dendrogram representation of primary and secondary article topics resulting from the analysis. (click the linked image to go to the interactive visualization)

From the dendrogram we can see that, for example, police/shootings/murder are often mentioned in articles whose primary topic is the justice/penitentiary system. We can also see that therapy/treatment often gets accompanied by sociological trends, and that medical research is closely tied to therapy/treatment in the articles. These co-occurrences show that some of the topics are still quite closely linked to each other, although they are clearly their own topics as well.
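
The dendrogram itself was built with the visualization tools listed earlier; the sketch below only shows how the underlying (primary, secondary) topic pairs could be tabulated from the per-article shares.

from collections import Counter

# Count (primary, secondary) topic pairs per article
pair_counts = Counter()
for shares_per_doc in doc_topics:  # per-article topic shares from the previous sketch
    ranked = sorted(shares_per_doc, key=shares_per_doc.get, reverse=True)
    if len(ranked) >= 2:
        pair_counts[(ranked[0], ranked[1])] += 1

for (primary, secondary), n in pair_counts.most_common(10):
    print(primary, secondary, n)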

Extensions and Further Research

  • Sentiment analysis
    Now that we know how much of the conversation about mental health has to do with which topics, it would be interesting to take a deeper dive into the tone of the articles within the different topics. An extension I’ll be working on is to perform sentiment analysis within the categories, to see whether the way mental health is discussed in each of them has changed over time.
  • Additional news outlets
    The findings are limited to the extent that only New York Times articles were considered. Introducing other news outlets would help neutralize the effect that the selection of one particular news outlet has on the results of the analysis.

You can find the relevant code for this project in this GitHub repo. Thanks for your interest, more to come shortly. In the meantime, feel free to comment / ask away!
