A Friendly Introduction to Text Clustering

The vast number of methods used for clustering words and documents can seem overwhelming at first, but let’s take a closer look.

Korbinian Koch
Towards Data Science


The topics covered in this article include k-means, Brown clustering, tf-idf, topic models and latent Dirichlet allocation (also known as LDA).

To cluster, or not to cluster

Clustering is one of the biggest topics in data science, so big that you will easily find tons of books discussing every last bit of it. The subtopic of text clustering is no exception. This article therefore cannot deliver an exhaustive overview, but it covers the main aspects. That being said, let us start by establishing common ground on what clustering is and what it isn’t.

You just scrolled by clusters!

In fact, clusters are nothing more than groups that contain similar objects. Clustering is the process used for separating the objects into these groups.

Objects inside of a cluster should be as similar as possible. Objects in different clusters should be as dissimilar as possible. But who defines what “similar” means? We’ll come back to that at a later point.

Now, you may have heard of classification before. When classifying objects, you also put them into different groups, but there are a few important differences. Classifying means putting new, previously unseen objects into groups based on objects whose group affiliation is already known, so-called training data. This means we have something reliable to compare new objects to; when clustering, we start with a blank canvas: all objects are new! Because of that, we call classification a supervised method and clustering an unsupervised one.

This also means that in classification the correct number of groups is known, whereas in clustering there is no such number. Note that it is not just unknown: it simply does not exist. It is up to us to choose a suitable number of clusters for our purpose. Often, this means trying out a few and then choosing the one that delivers the best results.

Kinds of clustering methods

Before we dive right into concrete clustering algorithms, let us first establish a few ways in which we can describe and distinguish them:

In hard clustering, every object belongs to exactly one cluster. In soft clustering, an object can belong to one or more clusters. The membership can be partial, meaning the objects may belong to certain clusters more than to others.

In hierarchical clustering, clusters are iteratively combined in a hierarchical manner, finally ending up in one root (or super-cluster, if you will). You can also look at a hierarchical clustering as a binary tree. All clustering methods not following this principle can simply be described as flat clustering, but are sometimes also called non-hierarchical or partitional. You can always convert a hierarchical clustering into a flat one by “cutting” the tree horizontally on a level of your choice.

The tree-shaped diagram of a hierarchical clustering is called a dendrogram. Objects connected at a lower level of the hierarchy are more similar than objects connected higher up the tree.

Hierarchical methods can be further divided into two subcategories. Agglomerative (“bottom up”) methods start by putting each object into its own cluster and then keep unifying them. Divisive (“top down”) methods do the opposite: they start from the root and keep dividing it until only single objects are left.
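To make this concrete, here is a minimal sketch of agglomerative clustering with SciPy on a handful of made-up two-dimensional points (the points and the chosen number of clusters are purely illustrative). Building the full hierarchy and then “cutting” it gives a flat clustering, exactly as described above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A few made-up 2D points forming two loose groups.
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# Agglomerative ("bottom up"): start with one cluster per point and keep merging.
tree = linkage(points, method="average")   # encodes the whole dendrogram

# "Cut" the tree horizontally to obtain a flat clustering with two clusters.
flat = fcluster(tree, t=2, criterion="maxclust")
print(flat)   # e.g. [1 1 1 2 2 2]
```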

The clustering process

It should be clear what the clustering process looks like, right? You take some data, apply the clustering algorithm of your choice and ta-da, you are done! While this might theoretically be possible, it is usually not the case. Especially when working with text, there are several steps you have to take prior to and after clustering. In reality, the process of clustering text is often messy and marked by many unsuccessful trials. However, if you tried to draw it in an idealized, linear manner, it might look like this:

Quite a few extra steps, right? Don’t worry — you would probably have intuitively done it right anyway. However, it is helpful to consider each step on its own and keep in mind that alternative options for solving the problem might exist.

Clustering words

Congratulations! You have made it past the introduction. In the next few paragraphs, we will look at clustering methods for words. Consider the following set of them:

To us it immediately becomes apparent which words belong together. There should obviously be one cluster with animals containing the words Aardvark and Zebra and one with prepositions containing on and under. But is it equally obvious for a computer?

When talking about words with similar meaning, you often read about the distributional hypothesis in linguistics. This hypothesis states that words bearing a similar meaning will appear in similar word contexts. You could say “The box is on the shelf.”, but also “The box is under the shelf.”, and still produce a meaningful sentence. On and under are interchangeable to a certain extent.

This hypothesis is utilized when creating word embeddings. Word embeddings map each word of a vocabulary onto an n-dimensional vector space. Words that have similar contexts will appear roughly in the same area of the vector space. One of these embeddings was developed by Weston, Ratle & Collobert in 2008. You can see an interesting segment of the word vectors (reduced to two dimensions with t-SNE) here:

Source: Joseph Turian, see the full picture here

Notice how neatly months, names and locations are grouped together. This will come in handy for clustering them in the next step. To learn more about how exactly word embeddings are created and the interesting properties they have, take a look at this Medium article by Hunter Heidenreich. It also includes information about more advanced word embeddings like word2vec.
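If you want to experiment with pre-trained embeddings yourself, a minimal sketch using the gensim library might look like the following. The model name passed to the downloader is just one of the pre-trained sets gensim distributes; treat it as an assumption and check gensim’s documentation for what is actually available:

```python
import gensim.downloader as api

# Download a small set of pre-trained word vectors (the name is an assumption,
# see the gensim-data documentation for the available models).
vectors = api.load("glove-wiki-gigaword-50")

# Words that appear in similar contexts end up close together in the space.
print(vectors.most_similar("january", topn=5))   # likely other months
print(vectors.most_similar("zebra", topn=5))     # likely other animals
```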

k-means

We will now look at the most famous vector-based clustering algorithm out there: k-means. What k-means does is return, for each object, an assignment to one of k possible clusters. To recapitulate what we learned earlier: it is a hard, flat clustering method. Let’s see what the k-means process looks like:

Source: Chire, via Wikipedia (CC-BY-SA)

K-means assigns k random points in the vector space as initial, virtual means of the k clusters. It then assigns each data point to the nearest cluster mean. Next, the actual mean of each cluster is recalculated. Based on the shift of the means the data points are reassigned. This process repeats itself until the means of the clusters stop moving around.
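The loop just described fits into a few lines of NumPy. The sketch below is not a production implementation (it initializes the means with randomly chosen data points, a common variant, and does not handle empty clusters), but it follows the steps above directly:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial means: k randomly chosen data points.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every data point to the nearest cluster mean.
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculate the actual mean of each cluster.
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the means no longer move.
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Toy usage on four made-up 2D "word vectors", clustered into k = 2 groups.
X = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.1, 0.9]])
labels, means = kmeans(X, k=2)
print(labels)   # e.g. [0 0 1 1]
```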

To get a more intuitive and visual understanding of what k-means does, watch this short video by Josh Starmer.

K-means is not the only vector-based clustering method out there. Other frequently used methods include DBSCAN, a method favoring densely populated clusters, and expectation maximization (EM), a method that assumes an underlying probabilistic distribution for each cluster.
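Both are available in scikit-learn, where expectation maximization for clustering takes the form of a Gaussian mixture model. A minimal sketch (the data and parameters are made up) could look like this:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

X = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.1, 0.9]])

# DBSCAN: clusters are dense regions; points in sparse areas are labeled noise (-1).
print(DBSCAN(eps=0.3, min_samples=2).fit_predict(X))

# Expectation maximization: one Gaussian distribution assumed per cluster.
print(GaussianMixture(n_components=2, random_state=0).fit_predict(X))
```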

Brown clustering

There are also methods for clustering words that do not require the words to already be available as vectors. Probably the most cited such technique is Brown clustering, proposed in 1992 by Brown et al. (not related to the Brown Corpus, which is named after Brown University, Rhode Island).

Brown clustering is a hierarchical clustering method. If cut at the right levels in the tree, it also results in beautiful, flat clusters such as the following:

adapted from: Brown et al. (1992)

You can also look at small sub-trees and find clusters that contain word pairs close to synonymity such as evaluation and assessment or conversation and discussion.

adapted from: Brown et al. (1992)

How is this achieved? Again, this method relies on the distributional hypothesis. It introduces a quality function describing how well the surrounding context words predict the occurrence of the words in the current cluster (so-called mutual information). It then applies the following procedure:

  1. Initialize by assigning every word to its own, unique cluster.
  2. Until only one cluster (the root) is left: merge the two clusters whose union yields the best value of the quality function.

This is the reason why evaluation and assessment are merged so early. Since both appear in extremely similar contexts, the quality function from above still delivers a very good value. The fact that we start out with single-element clusters that are gradually unified means this method is agglomerative.
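To illustrate the agglomerative loop (and only the loop; this is a toy sketch, not an efficient or faithful Brown clustering implementation), the following code greedily merges the pair of clusters whose union gives the highest average mutual information between adjacent cluster labels on a tiny made-up corpus:

```python
import math
from collections import Counter
from itertools import combinations

corpus = ("the evaluation was good . the assessment was good . "
          "we had a conversation . we had a discussion .").split()

def quality(assignment):
    """Average mutual information between adjacent cluster labels."""
    labels = [assignment[w] for w in corpus]
    bigrams, unigrams = Counter(zip(labels, labels[1:])), Counter(labels)
    n_bi, n_uni = sum(bigrams.values()), sum(unigrams.values())
    return sum((n / n_bi) * math.log((n / n_bi) /
               ((unigrams[c1] / n_uni) * (unigrams[c2] / n_uni)))
               for (c1, c2), n in bigrams.items())

# 1. Initialize by assigning every word to its own, unique cluster.
assignment = {w: i for i, w in enumerate(set(corpus))}

# 2. Keep merging the pair of clusters whose union scores best.
while len(set(assignment.values())) > 1:
    clusters = sorted(set(assignment.values()))
    merge_from, merge_into = max(
        combinations(clusters, 2),
        key=lambda pair: quality({w: (pair[1] if c == pair[0] else c)
                                  for w, c in assignment.items()}))
    print("merging:", [w for w, c in assignment.items()
                       if c in (merge_from, merge_into)])
    assignment = {w: (merge_into if c == merge_from else c)
                  for w, c in assignment.items()}
```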

Brown clustering is still used today! In this publication by Owoputi et al. (2013), Brown clustering was used to find new clusters of words in online conversational language deserving their own part-of-speech tag. The results are entertaining, yet accurate:

Source: Owoputi et al. (2013)

If you are interested in learning more about how brown clustering works, I can strongly recommend watching this lecture given by Michael Collins at Columbia University.

While this paragraph concludes our section about clustering words, there are many more approaches out there not discussed in this article. One very promising and efficient family of methods is graph-based clustering, of which spectral clustering is a prominent example. Methods used include minimum spanning tree based clustering, Markov chain clustering and Chinese whispers.

Clustering documents

In general, clustering documents can also be done by looking at each document in vector format. But unlike words, documents rarely come with a surrounding context. You could imagine a book standing next to other books on a tidy shelf, but usually this is not what large collections of digital documents (so-called corpora) look like.

The fastest (and arguably most trivial) way to vectorize a document is to give each word in the dictionary its own vector dimension and then just count the occurrences for each word and each document. This way of looking at documents without considering the word order is called the bag of words approach. The Oxford English Dictionary contains over 300,000 main entries, not counting homographs. That’s a lot of dimensions, and most of them will probably get the value zero (or how often do you read the words lackadaisical, peristeronic and amatorculist?).
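With scikit-learn, the counting itself is a one-liner. The two example documents below are made up; note also that the default tokenizer drops one-letter tokens such as I:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love pizza and I love pasta",      # made-up example documents
        "I love chocolates"]

vectorizer = CountVectorizer()            # one dimension per word, word order ignored
bow = vectorizer.fit_transform(docs)      # sparse document-term count matrix

print(vectorizer.get_feature_names_out()) # the dimensions (i.e. the vocabulary)
print(bow.toarray())                      # the bag of words vectors, one row per document
```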

To a certain extent you can counteract this by removing all word dimensions that are not used in your document collection (corpus), but you will still end up with lots of dimensions. And if you suddenly see a new document containing a word previously not used, you will have to update every single document vector, adding this new dimension and the value zero for it. So, for the sake of simplicity, let’s assume our document collection does not grow.

tf-idf

Look at the following toy example containing only two short documents d1 and d2 and the resulting bag of words vectors:

What you can see is that unspecific words like I and love are rewarded with the same value as the words that actually distinguish the two documents, like pizza and chocolates. A way to counteract this behavior is to use tf-idf, a numerical statistic used as a weighting factor that dampens the effect of less important words.

Tf-idf stands for term frequency and inverse document frequency, the two factors used for weighting. The term frequency is simply the number of occurrences of a word in a specific document. If our document is “I love chocolates and chocolates love me”, the term frequency of the word love would be two. This value is often normalized by dividing it by the highest term frequency in the given document, resulting in term frequency values between 0 (for words not appearing in the document) and 1 (for the most frequent word in the document). The term frequencies are calculated per word and document.

The inverse document frequency, on the other hand, is only calculated per word. It indicates in how many documents of the corpus a word appears. This ratio is inverted (total number of documents divided by the document frequency) and dampened by taking its logarithm. Remember the ubiquitous word I we wanted to get rid of? Since it appears in every document, the inverted ratio is one, and since the logarithm of one is zero, its influence is completely eliminated.
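Translated directly into code, the two factors could be computed as follows. The two documents are guesses reconstructing the toy example from the words it mentions (I, love, pizza, chocolates), and the code follows the description above rather than any particular library’s exact weighting scheme:

```python
import math

docs = {"d1": "i love pizza".split(),          # assumed toy documents
        "d2": "i love chocolates".split()}

vocab = sorted({w for words in docs.values() for w in words})

def tf(word, words):
    # Term frequency, normalized by the highest count in the document.
    counts = {w: words.count(w) for w in set(words)}
    return counts.get(word, 0) / max(counts.values())

def idf(word):
    # log(total number of documents / number of documents containing the word)
    df = sum(1 for words in docs.values() if word in words)
    return math.log(len(docs) / df)

for name, words in docs.items():
    print(name, {w: round(tf(w, words) * idf(w), 3) for w in vocab})
# "i" and "love" appear in every document, so their idf (and tf-idf) is 0.
```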

Based on these formulas, we get the following values for our toy example:

Looking at the last two columns, we see that only the most relevant words receive a high tf-idf value. So-called stop words, meaning words that are ubiquitous in our document collection, get a value of or close to 0.

The resulting tf-idf vectors are still as high-dimensional as the original bag of words vectors. Therefore, dimensionality reduction techniques such as latent semantic indexing (LSI) are often used to make them easier to handle. Algorithms such as k-means, DBSCAN and EM can be used on document vectors, too, just as described earlier for word clustering. Possible distance measures include Euclidean and cosine distance.
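In scikit-learn terms, such a pipeline could look roughly like the sketch below. The documents, the number of LSI dimensions and the number of clusters are placeholders; LSI here corresponds to a truncated SVD of the tf-idf matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "dogs and cats are pets",            # placeholders
        "stocks fell sharply today", "the market rallied after the report"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)        # tf-idf vectors
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)  # LSI-style reduction
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lsi)
print(labels)   # e.g. [0 0 1 1]
```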

Latent Dirichlet allocation (LDA)

Often, just having clusters of documents is not enough.

Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time (Blei, 2012).

All topic models are based on the same basic assumption:

  • each document consists of a distribution over topics, and
  • each topic consists of a distribution over words.

Among topic models, of which probabilistic latent semantic analysis (pLSA) is another example, latent Dirichlet allocation (LDA) is the best known and most widely used one. Just by looking at its name, we can already find out a lot about how it works.

Latent refers to hidden variables, a Dirichlet distribution is a probability distribution over other probability distributions, and allocation means that words and documents are allocated to topics based on these distributions. To better understand how these three aspects come into play, let’s look at what results LDA gives us. The following topics are an excerpt of 100 topics uncovered by fitting an LDA model to 17,000 articles from the journal Science.

adapted from: Blei (2012)

How should you interpret these topics? A topic is a probability distribution over words. Some words are more likely to appear in a topic, some less. What you see above are the 10 most probable words per topic, excluding stop words. It is important to note that the topics don’t actually have the names Genetics or Evolution. These are just terms we humans would use to summarize what the topic is about. Try to look at the topics as word probability distribution 1 or word probability distribution 23 instead.

But this is not all that LDA provides us with. Additionally, it tells us for each document which topics appear in it and to which percentage. For example, an article about a new device that can detect a specific disease in your DNA may consist of the topic mix 48% Disease, 31% Genetics and 21% Computers.
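With scikit-learn, fitting an LDA model and reading off these per-document topic mixtures could look roughly like this (the documents and the number of topics are placeholders, and real corpora are of course much larger):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["genes dna mutation sequencing dna",            # placeholder documents
        "computer software algorithm data computer",
        "dna data algorithm sequencing software"]

counts = CountVectorizer().fit_transform(docs)          # LDA works on raw word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One row per document: its mixture over the 2 topics (rows sum to 1).
print(lda.transform(counts).round(2))

# One row per topic: its (unnormalized) distribution over words.
print(lda.components_.round(2))
```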

To understand how we can make a computer know what good topics look like and how to find them, we will again construct a little toy example. Let’s pretend our corpus vocabulary consists only of a couple of emojis.

Possible topics or word probability distributions on this vocabulary might look like this:

As humans, we might identify these topics as Foods, Smileys and Animals. Let’s assume the topics are given. To understand the underlying assumptions LDA makes, let’s look at the generative process of documents. Even if it doesn’t seem very realistic, an author is assumed to take the following steps (a code sketch follows the list):

  1. Choose how many words you want to write in your text.
  2. Select a mixture of topics your text should cover.
  3. For each word in the document:
  • draw a topic which the word should relate to from the mixture
  • draw a word from the selected topic’s word distribution
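As a minimal sketch of this generative process, with a handful of made-up words standing in for the emoji vocabulary and invented topic-word distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["pizza", "burger", "smiley", "wink", "cat", "dog"]   # stand-ins for the emojis
topics = np.array([[0.45, 0.45, 0.04, 0.02, 0.02, 0.02],      # "Foods"
                   [0.02, 0.02, 0.48, 0.44, 0.02, 0.02],      # "Smileys"
                   [0.02, 0.02, 0.02, 0.02, 0.46, 0.46]])     # "Animals"

n_words = 10                                    # 1. choose how many words to write
mixture = rng.dirichlet(alpha=[0.5, 0.5, 0.5])  # 2. select a mixture of topics
document = []
for _ in range(n_words):                        # 3. for each word:
    topic = rng.choice(len(topics), p=mixture)           # draw a topic from the mixture
    document.append(rng.choice(vocab, p=topics[topic]))  # draw a word from that topic

print(mixture.round(2), document)
```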

Based on these steps, the following document may be the result:

Note that even though a pizza emoji is much less likely to be drawn from topic 3, it can still originate from it. Now that we have the result we want, we only have to find a way to reverse this process. Only? In reality, we are faced with this:

In theory, we could make our computer try out every possible combination of words and topics. Besides the fact that this would probably take an eternity, how would we know in the end which combination makes sense and which one doesn’t? This is where the Dirichlet distribution comes in handy. Instead of drawing the distribution like in the image above, let’s draw our document at the respective position on a topic simplex.

We can also draw a lot of other documents from our corpus next to this one. It could look like this:

Or, if we had chosen other topics beforehand, like this:

While in the first variant the documents are clearly discernible and emphasize different topics, the documents in the second variant are all more or less alike. The topics chosen here were not able to separate the documents in a meaningful way. These two possible document distributions are nothing other than two different Dirichlet distributions! This means we have found a way of describing what “good” distributions over topics look like!

The same principle applies to words in topics. Good topics will have different distributions over words, while bad topics will have about the same words as others. These two appearances of Dirichlet distributions are described in the LDA model by two hyperparameters, alpha and beta. As a general rule, you want to keep your Dirichlet parameters below one. Take a look at how different values change the distribution in this brilliant animation made by David Lettier.

Source: David Lettier
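You can also get a feel for these parameters directly in NumPy. With parameter values below one, most of the probability mass concentrates on a few topics per document, while larger values spread it out evenly (the concrete values here are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha < 1: sparse mixtures -- each sampled document strongly favors few topics.
print(rng.dirichlet(alpha=[0.1, 0.1, 0.1], size=3).round(2))

# alpha > 1: dense mixtures -- every document contains every topic to a similar degree.
print(rng.dirichlet(alpha=[10, 10, 10], size=3).round(2))
```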

Good word and topic distributions are usually approximated by using techniques such as collapsed Gibbs sampling or expectation propagation. Both methods iteratively improve randomly initialized word and topic distributions. However, a “perfect” allocation may never be found.
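To give an idea of what “iteratively improve randomly initialized word and topic distributions” can mean, here is a compact, deliberately naive collapsed Gibbs sampler over a toy corpus of word ids. It is a sketch of the general technique under simplified assumptions, not a reference implementation; real systems add many optimizations:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 0, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]   # toy documents as word ids
V, K, alpha, beta = 5, 2, 0.5, 0.1                  # vocab size, topics, Dirichlet priors

# Randomly initialize one topic per token and build count tables (priors baked in).
z = [[rng.integers(K) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K)) + alpha             # document-topic counts
n_kw = np.zeros((K, V)) + beta                      # topic-word counts
n_k = np.zeros(K) + V * beta                        # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1

for _ in range(200):                                # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1   # remove this token
            p = n_dk[d] * n_kw[:, w] / n_k                  # conditional topic probabilities
            k = rng.choice(K, p=p / p.sum())                # resample the token's topic
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1   # add it back

print((n_kw / n_kw.sum(axis=1, keepdims=True)).round(2))    # approximated topic-word distributions
```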

If you want to try out what LDA does to a data set of your choice interactively, click your way through this great in-browser demo by David Mimno.

Remember the clustering flowchart presented in the introduction? Especially the last step, called final evaluation? When using clustering methods you should always keep in mind that even though a specific model may result in the smallest movement of cluster means or the best probability value, this doesn’t mean that it is “correct”. There are many data sets k-means cannot cluster properly, and even LDA can produce topics that don’t make any sense to humans.

In short: All models are wrong, but some are useful. Have fun clustering your texts and take care!

Resources

Anastasiu, D. C., Tagarelli, A., Karypis, G. (2014). Document Clustering: The Next Frontier. In Aggarwal, C. C. & Reddy, C. K. (Eds.), Data Clustering, Algorithms and Applications (pp. 305–338). Minneapolis: Chapman & Hall.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

Blei, D. M. (2012). Probabilistic topic models: Surveying a suite of algorithms that offer a solution to managing large document archives. Communications of the ACM, 55(4), 77–84.

Brown, P. F., Pietra, V. J., Souza, P. V., Lai, J. C., & Mercer, R. L. (1992). Class-Based n-gram Models of Natural Language. Computational Linguistics, 18, 467–479.

Feldman, R., & Sanger, J. (2007). Clustering, The text mining handbook. Advanced approaches in analyzing unstructured data. Cambridge: Cambridge University Press.

Le, Q. & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(2), 1188–1196.

Manning, C., & Schütze, H. (1999). Clustering, Foundations of statistical natural language processing. Cambridge, Mass.: MIT Press.

Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., & Smith, N. A. (2013). Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9–14, 2013, Atlanta, Georgia, USA (pp. 380–390).
