Photo by Billy Huynh on Unsplash

K Means Clustering Without Libraries

— Using Python

Towards Data Science
7 min read · Aug 28, 2020


K-Means is a widely used clustering tool for analyzing and classifying data. Often, however, I suspect it is not fully understood what is happening under the hood. This isn't necessarily a bad thing if you understand what the end product conveys, but building the algorithm from scratch can certainly lead to a deeper understanding of the reasoning behind it.

I want to start out by emphasizing that the internet is an excellent place for coders and engineers. Answers and resources are widely available and merely a Google search away. To pretend I figured all of this out on my own would be silly. I readily acknowledge that there are times when it takes reading through others' work on an algorithm to understand how to approach it better. The beauty of code is that it can be written in many different ways, each emphasizing a slightly different quality. Use that in your learning.

Now that I’ve touched on that point, let’s dive in!

K Means Clustering is, in its simplest form, an algorithm that finds close relationships in the data and puts the related points into groups for easier classification.

An example of data being clustered

What you see here is an algorithm sorting different points of data into groups or segments based on a specific quality… proximity (or closeness) to a center point.

Most often, Scikit-Learn's implementation of KMeans is used, which looks something like this:

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300,
    random_state=42
)
y_km = km.fit_predict(X)

You may not understand every parameter yet, but the approach is fairly simple: we ask for 3 clusters (n_clusters=3), the 3 center points are initialized at random positions (init='random'), the whole algorithm is run 10 separate times with different initializations and the best result is kept (n_init=10), each run is allowed at most 300 iterations (max_iter=300), and the fixed random state means every time we run it, we get the same result (random_state=42). We then run the prediction. More can be read in the Scikit-Learn documentation on the different parameters that can be used.

So, how do we go about creating this code from scratch… especially if we’re not sure what’s going on? Let’s figure it out!

The first step is to think through and describe what is happening. First of all, this article does an excellent job describing each step. In summary, we plot k points (also called centroids) on a scatter plot, usually at random, and assign each piece of data to the centroid it is closest to. We then recalculate each centroid's position as the average of the data assigned to it, and repeat the assignment and recalculation over and over until clear groups of data form around each of those k centroids.

Have I lost you yet? Hopefully not. Let’s walk through each process to see what is happening:

  • The first step is to decide how many clusters we want to segment the data into. There is a method to this, but for simplicity's sake, we'll say we'll use 3 clusters, or k = 3. The code looks something like this:
k = 3
clusters = {}
for i in range(k):
    clusters[i] = []

That loop you see above just creates 3 empty clusters. It looks like this…

{0: [], 1: [], 2: []}

Simple enough, right?

We then set up the centroids in a similar fashion, but this time we initialize them with actual data. In my case, I'm using the Boston housing dataset, and X is an array built from two columns of data I've chosen from that dataset.

centroids = {}
for i in range(k):
    # Use the first k data points as the starting centroid positions
    centroids[i] = X[i]

Next, we need to find the distance from each point of data to each centroid. The concept is simple enough, but this next chunk is a little confusing to look at initially. I recommend googling and reading through the different parts of it to get a better idea of what is going on. For example, if you were to google "np.linalg.norm", you would find the NumPy documentation page describing what it is and what it does.

import numpy as np

for data in X:
    euc_dist = []
    for j in range(k):
        euc_dist.append(np.linalg.norm(data - centroids[j]))
    # Assign the data point to the cluster with the closest centroid
    clusters[euc_dist.index(min(euc_dist))].append(data)
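As a quick sanity check, np.linalg.norm of the difference between two points is simply the straight-line (Euclidean) distance between them:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# sqrt((4 - 1)**2 + (6 - 2)**2) = sqrt(25) = 5.0
print(np.linalg.norm(a - b))  # 5.0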

After we have initialized the centroids and clusters, we want to recalculate both of those values! Why? Because they were initialized somewhat arbitrarily, so we need to gradually move them toward the way the data is most naturally segmented (if it is naturally segmented at all, but that's another discussion).

I have two functions written that do just that. Let’s take a look:

def recalculate_clusters(X, centroids, k):
    """ Recalculates the clusters """
    # Initiate empty clusters
    clusters = {}
    # Set the range for value of k (number of centroids)
    for i in range(k):
        clusters[i] = []
    for data in X:
        euc_dist = []
        for j in range(k):
            euc_dist.append(np.linalg.norm(data - centroids[j]))
        # Append the cluster of data to the dictionary
        clusters[euc_dist.index(min(euc_dist))].append(data)
    return clusters

def recalculate_centroids(centroids, clusters, k):
    """ Recalculates the centroid position based on the plot """
    for i in range(k):
        centroids[i] = np.average(clusters[i], axis=0)
    return centroids

I hope you recognized some of the same code in those two functions. Pay close attention to the different parts and pieces. One of the best ways to learn something is to dissect what is going on inside. Again, I challenge you to google individual parts of this code. It’s like taking apart the radio and putting it back together. Bring out that inner engineer!

From there we would put together a plotting function that plots each cluster and assigns it a different color. It’s like feeding the data through a sorting machine that color codes the different pieces based on where they go. What comes out looks something like this:

We can see that the data is clearly segmented into different parts, though the clusters aren't distributed particularly well yet. That's because this is merely the first iteration! I should also mention that this data's shape doesn't fully lend itself to clustering, which is a lesson in itself about both the strengths and weaknesses of the algorithm.
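If you'd like to reproduce something similar, here is a minimal sketch of the kind of plotting helper I'm describing, assuming matplotlib, two features per data point, and the clusters and centroids dictionaries built above:

import matplotlib.pyplot as plt
import numpy as np

def plot_clusters(clusters, centroids):
    """ Scatter-plots each cluster in its own color, plus the centroids """
    colors = ['red', 'green', 'blue']
    for i, points in clusters.items():
        points = np.array(points)
        if len(points) > 0:
            plt.scatter(points[:, 0], points[:, 1], s=10,
                        color=colors[i % len(colors)])
    for i in centroids:
        # Mark each centroid with a black 'x'
        plt.scatter(centroids[i][0], centroids[i][1],
                    color='black', marker='x', s=100)
    plt.show()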

So what do we do when we want a better distribution among the clusters? Recalculate, recalculate, recalculate! In this case, if we ran it, say, 10 times, it would look like this:

There's a bit better distribution now, isn't there? What we see are 3 different unsupervised classifications in the data. The algorithm is telling us that these 3 colors may mean something in the data that is at least worth looking into. Note that it doesn't mean anything actually is significant. That's the funny thing about data: computers work hard to augment our ability to find relationships in the data, but ultimately it is up to us to decide what those relationships (if any) mean. Your data science job is likely to stick around for at least a little while. Phew!
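To tie it all together, the driver loop I'm describing can be sketched roughly like this, reusing the two recalculation functions from earlier (the iteration count of 10 and the plot_clusters helper sketched above are just examples):

# Repeatedly re-assign the data and re-position the centroids
for _ in range(10):
    clusters = recalculate_clusters(X, centroids, k)
    centroids = recalculate_centroids(centroids, clusters, k)

# Plot the final assignment
plot_clusters(clusters, centroids)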

One of the more interesting side notes: when deciding how many iterations to run (in other words, how many times you want to recalculate), you can put together what's often called an "elbow plot" to see where additional iterations really start to lose their ability to improve the clustering. Mine looked something like this:

You can see that around 3 to 4 repetitions is where the algorithm started losing the momentum that each iteration gave in adjusting the clusters and centroids. It's a nice check on how many calculations you really want to run. After all, if this is for a company, time is money and resources are generally finite! Of course, this is pretty small-scale stuff, so it's not a big deal here, but it's always a conversation worth having with yourself and/or a team on projects and analyses.
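One way to build that kind of plot (a rough sketch of the idea, assuming the helper functions defined above, rather than the exact code from my notebook) is to track a score such as the total within-cluster sum of squared distances after each recalculation and plot it against the iteration number:

import matplotlib.pyplot as plt
import numpy as np

def within_cluster_ss(clusters, centroids):
    """ Total squared distance from each point to its assigned centroid """
    total = 0.0
    for i, points in clusters.items():
        for point in points:
            total += np.sum((point - centroids[i]) ** 2)
    return total

# Starting from a fresh initialization, record the score after every pass
scores = []
for _ in range(10):
    clusters = recalculate_clusters(X, centroids, k)
    centroids = recalculate_centroids(centroids, clusters, k)
    scores.append(within_cluster_ss(clusters, centroids))

plt.plot(range(1, len(scores) + 1), scores, marker='o')
plt.xlabel('Iteration')
plt.ylabel('Within-cluster sum of squares')
plt.show()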

To view the entire (in progress) notebook, navigate over to my GitHub repository, here! You’ll see some of the details in how I went about not only choosing a dataset but in initially exploring the dataset as well.

I hope you enjoyed this overview of K Means Clustering! Please note, I’m still fairly new to the world of data science, so I am absolutely open to the correction of ideas and methods detailed in this article. After all, I feel learning is continual and I don’t mind at all improving my abilities to explain and utilize these methods. We all learn through mistakes! Please reach out if you do feel there are any errors or points of clarity needed. Thank you!
