
Topic Modeling with Political Texts

Part II of NLP and Text Analytics series - utilizing LDA

Image by Author

Welcome to Part II of our NLP and Text Analytics series! If you haven’t already had a chance, please read Part I of this series here. To summarize our progress thus far, we began with a corpus of political writings such as "The Prince" by Machiavelli, "The Federalist Papers" by Hamilton/Madison/Jay, and "The Communist Manifesto" by Marx/Engels. From these works we derived text tables and used the cosine similarity metric to determine the clustering of the respective works, as seen below.

In this next part we will look to build off our earlier result and try to find the most relevant topics for these clusters. The three clusters we identified will be referred to as "Old_West" for older western political philosophy (seen in red), "US" for the U.S. political philosophy (seen in green), and "Communist" for communist political philosophy (seen in orange). We’ll also build off of our Token table created in Part 1 of this series, a sample of which is shown below.

Topic Modeling

First we should define what topic modeling is and the goal we hope to achieve with this analysis. Topic modeling, as defined by Wikipedia, is "a type of statistical model for discovering the abstract ‘topics’ that occur in a collection of documents". Given that we are using political works, the main topics of each work will unsurprisingly relate to political ideals. Through topic modeling, however, we can find the specific words that make up each cluster’s top topics and look for relationships between the clusters.

Our first step is to reshape the Token table into the format needed for modeling. We first add cluster and author names as leading hierarchical levels for each token in the Token table, then aggregate the individual tokens and sentences into full paragraph strings. The level of aggregation matters: if we aggregate too far, say up to full chapters, many of our topics end up nearly identical; if we aggregate too little, down to single sentences or individual terms, the models fail to identify coherent topics. Our new hierarchy should reflect the following:

# Create Topic Model OHCO
OHCO_TM = ['clusters', 'authors', 'book_id', 'chap_num', 'para_num']
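To make the aggregation step concrete, here is a minimal sketch of rolling tokens up to paragraph strings with pandas. The mini Token table below is hypothetical (the real one comes from Part I), but the groupby-and-join pattern is the same:

```python
import pandas as pd

# Hypothetical mini Token table: one row per token, indexed by the OHCO levels.
OHCO_TM = ['clusters', 'authors', 'book_id', 'chap_num', 'para_num']
tokens = pd.DataFrame({
    'clusters': ['Communist'] * 4,
    'authors':  ['Marx'] * 4,
    'book_id':  [1] * 4,
    'chap_num': [1] * 4,
    'para_num': [1, 1, 2, 2],
    'term_str': ['workers', 'unite', 'class', 'struggle'],
}).set_index(OHCO_TM)

# Collapse individual tokens up to the paragraph level of the hierarchy.
PARAS = tokens.groupby(OHCO_TM).term_str.apply(lambda x: ' '.join(x)).to_frame('para_str')
print(PARAS.para_str.tolist())  # ['workers unite', 'class struggle']
```

Grouping by all five OHCO levels and joining the tokens yields one string per paragraph, which is exactly the granularity the vectorizer expects.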

Now that we have our table ready, we can begin applying our topic modeling algorithm.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
import pandas as pd

tfv = CountVectorizer(max_features=4000, stop_words='english')
tf = tfv.fit_transform(PARAS.para_str)
TERMS = tfv.get_feature_names_out()

def model(PARAS=PARAS, tf=tf, TERMS=TERMS):
    lda = LDA(n_components=25, max_iter=10, learning_offset=50)

    # THETA: document-topic distribution
    THETA = pd.DataFrame(lda.fit_transform(tf), index=PARAS.index)
    THETA.columns.name = 'topic_id'

    # PHI: topic-term distribution
    PHI = pd.DataFrame(lda.components_, columns=TERMS)
    PHI.index.name = 'topic_id'
    PHI.columns.name = 'term_str'

    return THETA, PHI

In this code block we begin with the CountVectorizer function from the sklearn module, which runs through our newly created ‘PARAS’ table and builds a sparse matrix of counts for the top 4,000 terms. The number of terms can be adjusted through the ‘max_features’ parameter passed to CountVectorizer. We will ultimately be using a Latent Dirichlet allocation (LDA) model to create our topics. LDA is commonly used in topic modeling and, in our context, works by identifying topics within a corpus of documents based on the terms used, then assigning each document to its most suitable topics.

The LDA function above is found in the sklearn library. The first parameter we pass, ‘n_components’, sets the number of topics to create, in this case 25. The second, ‘max_iter’, sets the number of iterations over the training data, and ‘learning_offset’ downweights the earliest iterations so that the first documents seen do not dominate the fitted model.

As a result of our LDA function we get our outputs of Theta and Phi. The Theta matrix will be used to determine the distribution of topics in a given document, while Phi will be used to represent the distribution of words within a specific topic. Using the following code block we can do just that, meaning we can identify the words comprising each topic, and the most prevalent topics in each cluster.
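The roles of Theta and Phi can be seen on a toy example. The counts and term names below are invented; what matters is the shape and interpretation of the two matrices:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation

# Toy counts: 4 "paragraphs" over a 3-term vocabulary.
tf = np.array([[5, 0, 1], [4, 1, 0], [0, 5, 1], [1, 4, 0]])
lda = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=0)

THETA = pd.DataFrame(lda.fit_transform(tf))                    # docs x topics
PHI = pd.DataFrame(lda.components_, columns=['a', 'b', 'c'])   # topics x terms

# Each THETA row is a distribution over topics; idxmax gives the dominant topic.
dominant = THETA.idxmax(axis=1)
print(dominant.tolist())
```

Each row of THETA sums to 1, so reading across a row tells you how a document splits its weight among topics, while reading across a PHI row tells you which terms a topic favors.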

# Create topics
TOPICS = (PHI.stack().to_frame()
          .rename(columns={0: 'weight'})
          .groupby('topic_id')
          .apply(lambda x: x.weight.sort_values(ascending=False)
                 .head(10)
                 .reset_index()
                 .drop(columns='topic_id')
                 .term_str))

# Add label column to TOPICS table
TOPICS['label'] = TOPICS.apply(lambda x: str(x.name) + ' ' + ' '.join(x), 1)

# Topic weight summed over all documents
TOPICS['doc_weight_sum'] = THETA.sum()

# Mean topic weight per book cluster, given our 25 topics
topic_cols = [t for t in range(25)]
CLUSTERS = THETA.groupby('clusters')[topic_cols].mean().T
CLUSTERS.index.name = 'topic_id'
CLUSTERS['topterms'] = TOPICS[[i for i in range(10)]].apply(lambda x: ' '.join(x), 1)

Now we can sort each of our three clusters to determine which topics are most prevalent in each, and the words comprising those topics. As can be seen below, each cluster identified most strongly with a different set of topics and associated terms. As stated earlier, most of the topics fall under the umbrella of political thought, which isn’t surprising given our domain, but within them we can see some distinction in word choice. For example, topic 17 of the Communist cluster is broadly about laborers and labor conditions, which makes sense given what we know about those works, while the topics most prevalent in the US cluster reflect words foundational to a republican system of government.
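The per-cluster sort amounts to one line of pandas. The weights below are invented for illustration, standing in for the real CLUSTERS table of mean topic weights:

```python
import pandas as pd

# Hypothetical CLUSTERS frame: mean topic weight per cluster (topics x clusters).
CLUSTERS = pd.DataFrame(
    {'Communist': [0.02, 0.40, 0.10],
     'US':        [0.30, 0.05, 0.20],
     'Old_West':  [0.10, 0.05, 0.35]},
    index=pd.Index([0, 17, 21], name='topic_id'))

# Top topics for one cluster, highest mean document weight first.
top_communist = CLUSTERS['Communist'].sort_values(ascending=False).head(2)
print(top_communist.index.tolist())  # [17, 21]
```

Repeating the sort for each cluster column, then joining against the TOPICS label column, produces the ranked topic lists discussed here.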

Conclusion: As seen in the images above, we can use topic modeling to discover which topics are most prevalent in each cluster of texts. Topic modeling can be a powerful tool for distinguishing between different sets of text data while identifying the main topics and words associated with each. There are, however, some drawbacks, such as the potential need for experts to give an overall identity to each topic ID, and the repetition of words across topics, especially topics closely related to each other. The results achieved here depend on the specific parameters we chose for our LDA model, and I encourage you to play around with those parameters to see how tuning alters your results. This article serves as a basic demonstration of LDA and topic modeling and should be treated as such, rather than as a definitive proof of model accuracy or success. As always, thank you for taking the time to read this article, and please follow my channel if you’d like to read more!

The full code for this project can be found on my GitHub page at: https://github.com/nbehe/NLP_texts/blob/main/Beheshti_project_code.ipynb

This project is a small section of a larger project completed for my text analytics course. The code from this project is a mixture of work written by myself, provided by instructors, and provided through course materials.

