Using NLP and Text Analytics to Cluster Political Texts

Using NLTK and SciPy on text from Project Gutenberg

Nima Beheshti
Towards Data Science

--

Under the umbrella of Text Analytics there are many Python packages that can help us analyze current and historical text in ways that yield interesting results. For this project I set out to cluster a corpus of political writings, spanning thousands of years, using cosine similarity. The specific titles I used were:

· “The Prince” by Machiavelli

· “The Federalist Papers” by Hamilton, Madison, & Jay

· “The Condition of the Working Class in England” by Engels

· “Utopia” by More

· “Anarchism and Other Essays” by Goldman

· “Leviathan” by Hobbes

· “The United States Constitution”

· “Considerations on Representative Government” by Mill

· “The Communist Manifesto” by Marx & Engels

· “Politics” by Aristotle

· “Second Treatise of Government” by Locke

The first step was to download the corpus of texts directly from Project Gutenberg’s website. I then needed to import the files and find the start and end lines of each work within each text file. **Project Gutenberg files are downloaded as .txt files and include additional boilerplate lines above and below the actual work.**

# For loop to find the start and end points of the books
for i in book_id:
    ...
    a = df.line_str.str.match(r"\*\*\*\s*START OF (THE|THIS) PROJECT")
    b = df.line_str.str.match(r"\*\*\*\s*END OF (THE|THIS) PROJECT")
    ...

Using a for loop and string operations, I matched the common start and end markers in these .txt files to determine the start and end points of each literary work.

Image by Author
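For readers who want to see this trimming step end to end, here is a minimal, self-contained sketch of the same idea. The trim_gutenberg_text function, the DataFrame layout, and the file path are hypothetical; the actual notebook loops over several book IDs at once rather than handling one file at a time.

import pandas as pd

# Hypothetical example: trim one Project Gutenberg .txt file down to the
# lines between its "*** START OF ..." and "*** END OF ..." markers.
def trim_gutenberg_text(path):
    lines = open(path, encoding='utf-8').read().split('\n')
    df = pd.DataFrame({'line_str': lines})

    # Boolean masks for the start and end marker lines
    a = df.line_str.str.match(r"\*\*\*\s*START OF (THE|THIS) PROJECT")
    b = df.line_str.str.match(r"\*\*\*\s*END OF (THE|THIS) PROJECT")

    start = df.index[a][0] + 1   # first line after the START marker
    end = df.index[b][0]         # line holding the END marker

    # Keep only the body of the work
    return df.iloc[start:end].reset_index(drop=True)

# body = trim_gutenberg_text('1232-0.txt')  # e.g. the file for "The Prince"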

Doc & Library Tables:
The regex portion below searches each book to separate out chapters in order to start creating an OHCO (Ordered Hierarchy of Content Objects) style table.


import re

OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"
chap_pats = {
    5: {
        'start_line': 40,
        'end_line': 680,
        'chapter': re.compile(r"^ARTICLE\s+{}$".format(roman))
    },
    61: {
        'start_line': 19,
        'end_line': 1529,
        'chapter': re.compile(r"^\s*Chapter\s+{}.*$".format(roman))
    },
    1232: {
        'start_line': 1000,  # Found manually through trial and error because the detected start line had issues
        'end_line': 4828,
        'chapter': re.compile(r"^\s*CHAPTER\s+{}\.".format(roman))
    },
    ...
}

Now that we have the breakdown of book_id and chapters, we can fill in the rest of each text’s hierarchy. We use pandas and string operations to break each chapter down into paragraphs. The resulting table is referred to as the Doc table and will be used in further steps. We are also able to separate out the titles and authors to complete the Library table.
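The paragraph-splitting code itself is not reproduced in this article, but a minimal sketch of the idea, assuming a hypothetical chapter-level frame CHAPS with a chap_str column, might look like this (it mirrors the .apply/.stack pattern used in the tokenizer further below):

import re
import pandas as pd

# Hypothetical sketch: CHAPS is a DataFrame indexed by (book_id, chap_num)
# with a 'chap_str' column holding each chapter's full text.
def chapters_to_paragraphs(chaps_df):
    # Split each chapter on blank lines, stacking each paragraph into its own row
    paras = chaps_df.chap_str\
        .apply(lambda x: pd.Series(re.split(r'\n\n+', x)))\
        .stack()\
        .to_frame()\
        .rename(columns={0: 'para_str'})

    # Extend the index to (book_id, chap_num, para_num) and drop empty paragraphs
    paras.index.names = ['book_id', 'chap_num', 'para_num']
    paras['para_str'] = paras.para_str.str.strip()
    return paras[paras.para_str != '']

# DOC = chapters_to_paragraphs(CHAPS)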

In the images below we can see a sample of the Doc table showing the book id, chapter number, paragraph number, and paragraph string, as well as the full Library table.

Image by Author
Image by Author

Token & Vocab Tables:
In order to break the Doc table down further we need to define each individual term. The function below takes in our Doc table and parses out each paragraph’s string into sentences, followed by parsing these sentences into individual terms. This operation is primarily done through the NLTK package using nltk.sent_tokenize, nltk.pos_tag, nltk.WhitespaceTokenizer, and nltk.word_tokenize.

import nltk
import pandas as pd

# Tokenize the Doc table to derive the TOKEN table
def tokenize(doc_df, OHCO=OHCO, remove_pos_tuple=False, ws=False):

    # Paragraphs to sentences
    df = doc_df.para_str\
        .apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame()\
        .rename(columns={0: 'sent_str'})

    # Sentences to tokens
    # Local function to pick the tokenizer
    def word_tokenize(x):
        if ws:
            s = pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))
        else:
            s = pd.Series(nltk.pos_tag(nltk.word_tokenize(x)))
        return s

    df = df.sent_str\
        .apply(word_tokenize)\
        .stack()\
        .to_frame()\
        .rename(columns={0: 'pos_tuple'})

    # Grab the part of speech and token string from each (token, POS) tuple
    df['pos'] = df.pos_tuple.apply(lambda x: x[1])
    df['token_str'] = df.pos_tuple.apply(lambda x: x[0])
    if remove_pos_tuple:
        df = df.drop(columns='pos_tuple')

    # Add the OHCO index names
    df.index.names = OHCO

    return df

TOKEN = tokenize(DOC, ws=True)

# Creating VOCAB table and adding columns to TOKEN data
...
Image by Author

Having our Token table above does not mean we are done. We now need to build on this table and create a Vocab table that indexes each of the terms used throughout our corpus. Using string and matrix operations, as well as the NLTK module’s PorterStemmer() function, we further build out our Token table and create our Vocab table.
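The exact Vocab-building code is not shown here; the sketch below is one plausible way to assemble such a table, with the build_vocab helper and its details being assumptions based on the sample columns described next.

import nltk
import pandas as pd

# Hypothetical sketch: build a VOCAB table from the TOKEN table.
def build_vocab(token_df):
    # Normalize each token string to a lowercase term
    token_df['term_str'] = token_df.token_str.str.lower()\
        .str.replace(r'[\W_]+', '', regex=True)

    # One row per distinct term, keyed by term_id, with its corpus-wide count
    vocab = token_df.term_str.value_counts().to_frame('n')
    vocab = vocab.sort_index().reset_index().rename(columns={'index': 'term_str'})
    vocab.index.name = 'term_id'

    # Stem, stop-word flag, and most common part of speech for each term
    stemmer = nltk.stem.PorterStemmer()
    vocab['stem'] = vocab.term_str.apply(stemmer.stem)
    stops = set(nltk.corpus.stopwords.words('english'))  # assumes the stopwords corpus is downloaded
    vocab['stop'] = vocab.term_str.isin(stops).astype('int')
    max_pos = token_df.groupby('term_str').pos.agg(lambda x: x.value_counts().idxmax())
    vocab['max_pos'] = max_pos.reindex(vocab.term_str).values
    return vocab

# VOCAB = build_vocab(TOKEN)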

Below are samples of the Token and Vocab tables respectively. As can be seen in our Token table, each term now corresponds to a cataloged term_id which can be referenced using our new Vocab table. This table provides the term_id, term, stem of the term, number of occurrences within the overall corpus, stop-word distinction, and most common part of speech.

Image by Author
Image by Author

TFIDF Table:
Next, we need to use these tables to create a TFIDF table. The TFIDF table weights each term by its frequency of use within each text and its frequency of use across documents in the corpus. The goal of this table is to surface the most impactful and useful words for analysis, while reducing the weight of stop-words and other common words that would otherwise dominate, such as ‘and’, ‘of’, ‘the’, etc. Notice that we define different bag organizations at the start of this code block; these can be used to create TFIDF tables at various levels of the hierarchy.

import numpy as np

SENTS = OHCO[:4]
PARAS = OHCO[:3]
CHAPS = OHCO[:2]
BOOKS = OHCO[:1]

def tfidf(TOKEN, bag, count_method, tf_method, idf_method):

    # Create the bag of words and document-term count matrix (DTCM)
    BOW = TOKEN.groupby(bag + ['term_id']).term_id.count()\
        .to_frame().rename(columns={'term_id': 'n'})
    BOW['c'] = BOW.n.astype('bool').astype('int')
    DTCM = BOW[count_method].unstack().fillna(0).astype('int')

    # TF calculations; only the 'sum' method is shown since that is what I use
    if tf_method == 'sum':
        TF = DTCM.T / DTCM.T.sum()

    TF = TF.T
    DF = DTCM[DTCM > 0].count()
    N = DTCM.shape[0]

    # IDF calculations; only the 'standard' method is shown since that is what I use
    if idf_method == 'standard':
        IDF = np.log10(N / DF)

    # Create the TFIDF table
    TFIDF = TF * IDF
    return TFIDF

TFIDF = tfidf(TOKEN, BOOKS, 'n', 'sum', 'standard')

# Add tfidf_sum column to Vocab table
VOCAB['tfidf_sum'] = TFIDF.sum()

As can be seen in the image of our TFIDF table below, we get a very sparse result given that we are currently including every word found in the corpus. We will need to reduce this further before attempting our final clustering analysis.

Image by Author

Additionally, after adding the column sums of this TFIDF table to the Vocab table and sorting the results by the highest tfidf_sum values, we get a table of the most impactful words:

Image by Author

Unsurprisingly, the most impactful words are political terms, which fits what we would expect given that the corpus is composed of political works, while commonly used words like ‘and’ or ‘the’ carry little weight.

To help simplify our TFIDF table, we reduce the number of columns to the 4,000 terms with the highest tfidf_sum values seen in the Vocab table above.
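That reduction step is not shown in the snippets above; assuming the TFIDF columns and the Vocab index share the same term_id values, one simple way to do it is:

# Keep only the 4,000 terms with the highest tfidf_sum as columns (a sketch;
# the notebook's exact selection code may differ)
top_terms = VOCAB.sort_values('tfidf_sum', ascending=False).head(4000).index
TFIDF_cluster = TFIDF[top_terms]

The resulting TFIDF_cluster table is the one passed to the pairing function in the next section.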

Clustering:
Now that we’ve derived our final table, we can begin the clustering process and see our results. This next block of code creates pairs from all the works and computes distance and similarity metrics between each pair, so that documents can be clustered by their closest distances. For the purposes of this article I will only be showing the cosine similarity cluster, but you can run the other tests included in this code block as well (cityblock, euclidean, jaccard, dice, correlation, and jensenshannon). The actual similarity/distance calculations are run with the pdist function from scipy’s spatial distance module.

import pandas as pd
from numpy.linalg import norm
from scipy.spatial.distance import pdist

# Normalize the TFIDF table, create PAIRS, and list the metrics for PAIRS testing
def tab_doc_tab(TFIDF_table, Doc_table):
    # L0: binary presence/absence, L1: rows summing to 1, L2: unit-length rows
    L0 = TFIDF_table.astype('bool').astype('int')
    L1 = TFIDF_table.apply(lambda x: x / x.sum(), 1)
    L2 = TFIDF_table.apply(lambda x: x / norm(x), 1)

    # All document pairs, keeping each unordered pair once
    PAIRS = pd.DataFrame(index=pd.MultiIndex.from_product(
        [Doc_table.index.tolist(), Doc_table.index.tolist()])).reset_index()
    PAIRS = PAIRS[PAIRS.level_0 < PAIRS.level_1].set_index(['level_0', 'level_1'])
    PAIRS.index.names = ['doc_a', 'doc_b']

    # Which table each distance metric is computed on
    tfidf_list = ['cityblock', 'euclidean', 'cosine']
    l0_list = ['jaccard', 'dice', 'correlation']
    l1_list = ['jensenshannon']
    for i in tfidf_list:
        PAIRS[i] = pdist(TFIDF_table, i)
    for i in l0_list:
        PAIRS[i] = pdist(L0, i)
    for i in l1_list:
        PAIRS[i] = pdist(L1, i)
    return PAIRS

PAIRS = tab_doc_tab(TFIDF_cluster, DOC)

Finally, we run the clustering function and display the results of the cosine similarity test. We use scipy’s scipy.cluster.hierarchy module to create the framework of our clustering diagram, and matplotlib to show the actual cluster below.

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Create clusters and draw the dendrogram
def hca(sims, linkage_method='ward', color_thresh=.3, figsize=(10, 10)):
    tree = sch.linkage(sims, method=linkage_method)
    labels = list(DOC.title.values)
    plt.figure()
    fig, axes = plt.subplots(figsize=figsize)
    # Note: scipy does not allow count_sort and distance_sort to both be True,
    # so only distance_sort is used here
    dendrogram = sch.dendrogram(tree,
                                labels=labels,
                                orientation="left",
                                distance_sort=True,
                                above_threshold_color='.75',
                                color_threshold=color_thresh)
    plt.tick_params(axis='both', which='major', labelsize=14)

hca(PAIRS.cosine, color_thresh=1)
Image by Author

The image above shows the resulting clustering of the works and how closely related each work is to the others. Based on our results, we can see that cosine similarity clustering produced fairly accurate groupings. There are three main clusters, which I will refer to as western political philosophy, US political philosophy, and communist political philosophy. Each text is clustered with like texts, and some texts by the same author are clustered closer together, such as those by Engels.

Conclusion:
We began this project with .txt files for the various works and created a series of text tables: the Library, Doc, Token, Vocab, and TFIDF tables. Once we had derived all the tables we needed, we computed cosine similarities and clustered the texts by how similar they are to one another. This project ultimately found three clusters, which is what we would expect given our knowledge of the authors and topics of each political text.

The full code for this project can be found on my GitHub page at: https://github.com/nbehe/NLP_texts/blob/main/Beheshti_project_code.ipynb

*This project is a small section of a larger project completed for my text analytics course. I will write additional articles for other sections of that project in which we will go into other unsupervised models such as principal component analysis, topic modeling, and sentiment analysis. As is the case with text analytics, and especially with this article, most of the code is dedicated to creating the tables needed to run these models. I chose to keep all the code, as opposed to just some, so others could see all my steps. The code from this project is a mixture of work written by myself, provided by instructors, and provided through course materials.*
