K-means Clustering Algorithm

Getting under the hood of k-means, so that a 7th grader can understand

Keino Baird
Towards Data Science

--

The algorithm I choose to implement for this project was K-means clustering. The data generated attempted to model a water distribution scenario based on the distance each point is from a proposed water-well. Essentially, the algorithm is ideal for a customer segmentation scenario that clusters customers around a particular water-well based on their location and k numbers of wells.

In this imaginary scenario, several local municipalities have struggled for years with its aging water infrastructure. Supply, storage, movement of water from one point to another are a challenge. A data scientist was called in to build some models about how customers will be affected and to shed some light on improving water delivery supply.

The goal was to understand the algorithm; however, when variables have real meaning it helps us to get under the hood to see how it works and the math behind it.

Example Problem: How will customers cluster with respect to their location? Create a visual model for the placement of 3 and 7 water-wells, each well serving as the centroid.

The location of customers were fixed equidistance points, generated using the built-in range function. There were a total of 1392 points.

Image by Author

After plotting the fixed location of each customer, I found what I considered to be the geographical center of the dataset. This I assumed would be the ideal place for the first water-well.

Image by Author

For the placement of 3 water-wells with each well as a centroid, the customers will be clustered in the following segments. The feature used to predict the location of the new wells was the distance each point was from the centroid.

Image by Author

For the placement of 7 water-wells with each well a centroid, the customer will be clustered in the following way.

Image by Author

Given that the distance for all the individual points were equidistant from each other, applying the k-means clustering algorithm would make sense in this scenario. The distance formula was applied calculating the Euclidean distance between each point and the chosen centroid.

Next, I decided to add some context to the problem and included some feature engineering to see how the customers would clusters when the k-means algorithm is applied.

I added some random points, each representing a particular water company tasked with distributing water. Clearly, some assumptions were made on this imaginary dataset, but the overall was goal to see the k-means clustering algorithm in operation.

Assuming that each water company has a separate pricing structure for the production and distribution of water to customers based on how far the water has to travel from each well, I wanted to see how the k-means clustering algorithm will operate when other features were considered.

In this scenario, four clusters were created based on the pricing structure for one of the water companies, and the distance the water has to travel to reach each customer.

Image by Author

The k-means clustering algorithm was applied to calculate how customers can be segmented based on the profit margin — the difference between the production price and the sales prices.

Image by Author

The final application of the k-means clustering algorithm that I applied to this fictitious dataset clustered the customers in 3 segments. It used a feature called the line surcharge, a function that was applied to the data set based on the distance.

Image by Author

I spent approximately two weeks trying to get under the hood of the k-means clustering algorithm. Understanding how an algorithm works is best demonstrated when applied to a real-life conceptual scenario.

While this is currently out of the scope of the k-means clustering algorithm, logical next steps would include exploration with some classification algorithms models for each water company and some more assumption to the scenario such as:

Imagine each customer was randomly assigned to a well irrespective of distance after the public water company was suddenly privatized into seven companies. Each company has a different line surcharge based on the distance the water has to travel and a delivery schedule and amount of water usage based on the customer’s choice. Create an application that accurately allocates respective funds to each water company and a model for the effective and efficient water distribution to all customers.

--

--

Keino is a data nerd, a data science student at Lambda School and an educational consultant.