
When using K-means, we can be faced with two issues:
- We end up with clusters of very different sizes, some containing thousands of observations and others with just a few
- Our dataset has too many variables and the K-Means algorithm struggles to identify an optimal set of clusters
Constrained K-Means: controlling group size
The algorithm is based on a paper by Bradley et al. (see References) and has been implemented by Joshua Levy-Kramer (https://github.com/joshlk/k-means-constrained) using the excellent Google OR-Tools library (https://developers.google.com/optimization/flow/mincostflow).
The algorithm uses ideas from Linear Programming, in particular Network Models. Network models are used, among other things, in logistics to optimise the flow of goods across a network of roads.
[Figure: a simple network with 5 nodes, directed arcs, supplies and demands]
We can see in the simple figure above that we have 5 nodes with directed arcs (the arrows) between them. Each node has a demand (negative) or supply (positive) value and the arcs have flow and cost values. For instance, the arc 2–4 has a flow of 4 and a cost of $2. Similarly, node 1 supplies 20 units and node 4 requires 5 units.
Such problems can be solved with optimisation algorithms like the simplex algorithm.
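To make this more concrete, here is a minimal sketch of how such a min-cost flow problem can be set up and solved with OR-Tools (the nodes, capacities and costs below are illustrative and do not correspond to the figure):
import numpy as np
from ortools.graph.python import min_cost_flow

smcf = min_cost_flow.SimpleMinCostFlow()
# Each arc has a start node, an end node, a capacity and a unit cost
start_nodes = np.array([0, 0, 1, 2])
end_nodes = np.array([1, 2, 3, 3])
capacities = np.array([10, 10, 10, 10])
unit_costs = np.array([2, 4, 1, 3])
# Positive values are supplies, negative values are demands
supplies = np.array([5, 0, 0, -5])

smcf.add_arcs_with_capacity_and_unit_cost(start_nodes, end_nodes, capacities, unit_costs)
smcf.set_nodes_supplies(np.arange(len(supplies)), supplies)

if smcf.solve() == smcf.OPTIMAL:
    print("Minimum cost:", smcf.optimal_cost())
    for i in range(smcf.num_arcs()):
        print(f"Arc {smcf.tail(i)} -> {smcf.head(i)}: flow {smcf.flow(i)}")
This is the same machinery that k-means-constrained delegates to under the hood.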
At a very high level, we can represent this constrained K-means problem with a network, where we want to minimise the sum of the distances between each point and the centroid of its cluster, with some constraints on cluster sizes.
We can see what it would look like in the figure below. Every point x(i) is a supply node with a value equal to one and a directed arc to every cluster C. The cost on those arcs is the distance between the point and the centroid of the corresponding cluster. The clusters C1, C2, etc. are demand nodes with values equal to the minimum size required. Finally, we add an artificial demand node to make sure that the sum of the supplies equals the sum of the demands.
[Figure: network representation of the constrained K-Means problem]
A more detailed explanation can be found in the original paper.
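As a sketch of the linear program in the notation of Bradley et al., where $T_{i,h}$ is the fraction of point $x_i$ assigned to cluster $h$ and $\tau_h$ is the minimum size of cluster $h$:

$$\min_{T} \; \sum_{i=1}^{m}\sum_{h=1}^{k} T_{i,h}\,\tfrac{1}{2}\lVert x_i - C_h \rVert^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} T_{i,h} \ge \tau_h \;\forall h, \qquad \sum_{h=1}^{k} T_{i,h} = 1 \;\forall i, \qquad T_{i,h} \ge 0$$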
A Python implementation
A package has been developed by Joshua Levy-Kramer and others; it is available at https://github.com/joshlk/k-means-constrained.
It can be installed with pip:
pip install k-means-constrained
It can then easily be applied to a dataset. We'll reuse the one from the previous articles, but we first need to transform our pandas dataframe into an array:
import numpy as np

# Transpose so that each row of the array is one observation
X = np.array(X).transpose()
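If you don't have the dataset from the previous articles to hand, a synthetic two-feature stand-in can be generated with scikit-learn (purely illustrative; it already has one row per observation, so no transpose is needed):
from sklearn.datasets import make_blobs

# 40 points in 2 dimensions, enough for 4 clusters of 8 to 12 points each
X, _ = make_blobs(n_samples=40, centers=4, n_features=2, random_state=0)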
We can then fit the KMeansConstrained model to the data, specifying the number of clusters we want (n_clusters) and the minimum and maximum sizes of the clusters (size_min and size_max):
from k_means_constrained import KMeansConstrained

clf = KMeansConstrained(
    n_clusters=4,
    size_min=8,
    size_max=12,
    random_state=0
)
clf.fit_predict(X)
print(clf.cluster_centers_)
print(clf.labels_)
We can then access the centroids with clf.cluster_centers_ and the cluster assignments with clf.labels_.
As we can see below, we obtain 4 clusters with 8 elements each.
[Figure: 2D scatterplot of the four clusters of 8 points each]
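We can also check the cluster sizes programmatically by counting the labels:
import numpy as np

# Count how many observations were assigned to each cluster
unique_labels, counts = np.unique(clf.labels_, return_counts=True)
print(dict(zip(unique_labels, counts)))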
For this example we used a dataset with two features, so it can be nicely visualised in a 2D scatterplot. In most cases, though, we'll be dealing with datasets with a large number of features; dealing with those in a clustering context is covered in the next section.
Feature selection for K-Means

In this section, we’ll see how to select variables when we are dealing with datasets containing a large number of variables. Dimension reduction usually helps find better clusters.
When we are dealing with high-dimensional datasets, we can run into issues with clustering methods. Feature selection is a well-known technique for supervised learning, but it is much less common for unsupervised methods such as clustering. Here we'll develop a relatively simple greedy algorithm to perform variable selection on the Europe Datasets on Kaggle.
The algorithm will have the following steps (as a prerequisite, make sure every variable is numeric and scaled, for example using StandardScaler() and its fit_transform() method; a sketch of this step is shown after the list):
1. Choose the maximum number of variables you want to retain (maxvars), the minimum and maximum number of clusters (kmin and kmax) and create an empty list, selected_variables.
2. Loop from kmin to kmax and, for each number of clusters, fit K-Means with every variable in turn, recording the silhouette value for every combination of variable and number of clusters.
3. Choose the variable giving the maximum silhouette value, add it to selected_variables and remove it from the list of variables to test.
4. Repeat steps 2 and 3, this time adding each remaining variable in turn to the selected_variables list, until some stop criterion has been reached (in this case the number of variables to keep, maxvars).
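Here is a minimal sketch of the scaling step, assuming df is a pandas dataframe that contains only numeric columns:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardise every feature to zero mean and unit variance
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)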
So for step 1 we define and initialise some variables.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

maxvars = 3
kmin = 2
kmax = 8
kmeans_kwargs = {"init": "random", "n_init": 20, "max_iter": 1000, "random_state": 1984}
cut_off = 0.5
# We define a cols variable containing a list of all features:
cols = list(df.columns)
'''We set a list and a dictionary to store the silhouette values
for each number of clusters tested so we can choose the k value
maximising the silhouette score, with its corresponding features'''
results_for_each_k = []
vars_for_each_k = {}
We then create three nested loops: the outer one goes through the values of k, the number of clusters; inside it, a while loop checks that the number of retained variables is still below the threshold set by maxvars. The selected_variables list will hold the names of the retained features, and the results list will hold the silhouette values for each candidate variable.
for k in range(kmin, kmax+1):
    print(k)
    # Start from the full list of candidate features for each value of k
    cols = list(df.columns)
    selected_variables = []
    while len(selected_variables) < maxvars:
        results = []
The inner loop goes through all the remaining features one by one, adds each of them to the already selected variables (if there are any) and evaluates the silhouette value. We then choose the variable obtaining the highest value and add it to the selected_variables list.
        for col in cols:
            scols = []
            scols.extend(selected_variables)
            scols.append(col)
            kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
            kmeans.fit(df[scols])
            results.append(silhouette_score(df[scols], kmeans.predict(df[scols])))
        '''We identify the best variable, add it to our list and remove it
        from the list of variables to be tested on the next iteration'''
        selected_var = cols[np.argmax(results)]
        selected_variables.append(selected_var)
        cols.remove(selected_var)
We can then update the list of scores and the selected variables for this particular value of k in our loop.
    # Store the best silhouette score and the selected variables for this k
    results_for_each_k.append(max(results))
    vars_for_each_k[k] = selected_variables
Finally, after the three loops have run, we can identify the best combination of k and variables, fit the model and plot it.
best_k = np.argmax(results_for_each_k) + kmin
selected_variables = vars_for_each_k[best_k]
kmeans = KMeans(n_clusters=best_k, **kmeans_kwargs)
kmeans.fit(df[selected_variables])
clusters = kmeans.predict(df[selected_variables])
[Figure: plot of the clusters for the best combination of k and variables]
If we choose 3 clusters instead, we get a different selection:
[Figure: the selection obtained with 3 clusters]
Some examples of countries in each group:
Cluster 1: Iceland, Switzerland, Belgium, Germany, Luxembourg, Netherlands, Austria and United Kingdom
Cluster 2: Greece, Spain, France, Croatia, Italy, Cyprus, Latvia, Lithuania, Hungary, Malta, Poland, Portugal
Cluster 3: Norway, Denmark, Finland and Sweden
Here is the whole code, including the 3d plot.
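The snippets above cover everything except the plot itself. As a minimal sketch, assuming maxvars is 3 so that exactly three variables were selected, the 3D plot could look like this:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection="3d")
v1, v2, v3 = selected_variables  # assumes exactly three selected variables
ax.scatter(df[v1], df[v2], df[v3], c=clusters, cmap="viridis")
ax.set_xlabel(v1)
ax.set_ylabel(v2)
ax.set_zlabel(v3)
plt.show()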
We can of course combine the variable selection with the constrained K-Means algorithm to force even clusters.
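For instance (the size bounds below are illustrative and must be compatible with the number of countries in the dataset):
from k_means_constrained import KMeansConstrained

# Constrained K-Means on the variables retained by the greedy selection
clf = KMeansConstrained(n_clusters=best_k, size_min=5, size_max=15, random_state=0)
even_clusters = clf.fit_predict(df[selected_variables])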
I hope this tutorial was helpful, feel free to comment if you have questions.
References
P. S. Bradley, K. P. Bennett and A. Demiriz, Constrained K-Means Clustering (2000)
S. P. Bradley, A. C. Hax and T. L. Magnanti, Applied Mathematical Programming, Addison-Wesley (1977)