Simulated Annealing For Clustering Problems: Part 2

Vinay Varma
Towards Data Science
8 min read · Nov 6, 2018


Hey everyone, this is the second and final part of this series. In this post, we will convert this paper into Python code and thereby gain a practical understanding of what Simulated Annealing is and how it can be used for clustering.

Part 1 of this series covers the theoretical explanation of Simulated Annealing (SA) with some examples, and I recommend reading it first. However, if you are already familiar with SA, you should be able to follow most of this post, and you can always go back to Part 1 if anything is unclear. I'm not going to explain the complete paper, only the practical part.

So, let us start with what clustering means. Clustering is a type of unsupervised learning technique where the goal is to attach labels to unlabeled objects based on the similarities between the objects. Let's quickly jump into a real-world example where clustering can be used, so that you have a better picture. Suppose there is a shopping website where you can view clothes and accessories and buy them if you like.

Items in the Shopping Website

Suppose these are the items available on the shopping website you visited. When you repeatedly view a few items like shoes, shirts, jeans, and watches, you start seeing more and more varieties of those items shown to you as recommendations. Have you ever noticed that? How does it happen? The website keeps a count of the number of times you view each product. Based on this count, the items are clustered into "like" and "dislike". When you view some products repeatedly, you are probably interested in buying products of that kind. So the website puts shirts, shoes, jeans, and watches under the "like" cluster (in other words, labels them as Like) and puts the other items, which have fewer views, under the "dislike" cluster. These labels are just examples; they could be named anything, such as "May buy" or "May not buy", depending on the developers of the website.

After Clustering

Sorry for my bad presentation skills, but I hope you got the idea. Now other websites can use this information and show you ads for shoes, shirts, etc. This is a very basic example; in real life there can be more factors affecting the clusters (not just the number of times you viewed an item).

Programmatically it will look something like this:

Example of how the code looks

Based on the number of times you viewed the items, the items are clustered. The factor upon which the clustering happens is called the Internal Clustering Criterion. Here, the Internal Clustering Criterion is the number of times you have viewed an item. Another popular Internal Clustering Criterion is distance, which is used in K-Means Clustering (refer to this post by Mubaris Hassan for a better explanation).
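Since the original code screenshot isn't reproduced here, a minimal sketch of the idea might look like this. The view counts and the cutoff value are made up purely for illustration:

```python
# Hypothetical view counts per item (assumed numbers, not from the post)
view_counts = {
    "Shirts": 25, "Shoes": 18, "Jeans": 30, "Watches": 12,
    "Hats": 1, "Bags": 2, "Sunglasses": 0,
}

THRESHOLD = 10  # assumed cutoff separating "like" from "dislike"

# Cluster each item based on the Internal Clustering Criterion: view count
clusters = {"like": [], "dislike": []}
for item, views in view_counts.items():
    label = "like" if views >= THRESHOLD else "dislike"
    clusters[label].append(item)

print(clusters["like"])     # frequently viewed items
print(clusters["dislike"])  # rarely viewed items
```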

Now that you understand Clustering and the Internal Clustering Criterion, our goal is to do clustering with a predefined Internal Clustering Criterion using Simulated Annealing.

Let us start implementing the code for the algorithm described in the paper using an example scenario. Assume there are 10 objects (1 to 10) and 6 labels (A to F), with the initial state as follows:

initial_state = {1: 'A', 2: 'B', 3: 'A', 4: 'B', 5: 'C', 6: 'B', 7: 'A', 8: 'None', 9: 'None', 10: 'None'}

Here objects 1 to 7 are labeled while 8 to 10 are not.

Let us put the unused labels in a list:

unused_labels = ['D', 'E', 'F']

The basic algorithm steps are:

  1. A variable called the “initial temperature” is set to a high value and keeps decreasing until it reaches a low value called the “final temperature”.
  2. While the temperature keeps decreasing, the “initial state” is taken as the “current state”.
  3. An action is performed on the current state, and its energy is calculated based on the Internal Clustering Criterion we defined. If this “perturbed state” (disturbed state) is different from the current state, it is chosen as the “next state”.
  4. The difference between the energies of the next state and the current state is calculated.
  5. The next state replaces the current state only if the energy difference is less than zero, or if the “probability factor” (which depends on the current temperature) is greater than or equal to a random decimal number between 0 and 1 (or a random integer between 0 and 10).
  6. The process ends depending on the temperature value and a parameter called “Max Iterations”.
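The steps above can be sketched as a generic loop. This is my own minimal skeleton, not the post's gist: the `value` and `action_on` functions are discussed later, and the probability factor here is the standard `exp(-delta / T)` form:

```python
import math
import random

def simulated_annealing(initial_state, value, action_on,
                        initial_temperature, final_temperature,
                        alpha, max_iter):
    """Generic SA skeleton following the six steps above."""
    current_state = initial_state
    current_energy = value(current_state)
    temperature = initial_temperature
    while temperature > final_temperature:          # cool until the final temperature
        for _ in range(max_iter):                   # Max Iterations per temperature level
            next_state = action_on(current_state)   # perturb the current state
            next_energy = value(next_state)
            delta = next_energy - current_energy    # energy difference
            # accept downhill moves always; uphill moves with the probability factor
            if delta < 0 or math.exp(-delta / temperature) >= random.random():
                current_state = next_state
                current_energy = next_energy
        temperature *= alpha                        # geometric cooling by the decay rate
    return current_state, current_energy

# Toy usage (assumed just for illustration): minimize x**2 over the integers
state, energy = simulated_annealing(
    initial_state=10,
    value=lambda x: x * x,
    action_on=lambda x: x + random.choice([-1, 1]),
    initial_temperature=10.0, final_temperature=0.01,
    alpha=0.9, max_iter=50)
print(state, energy)
```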

There are parameters like the “average cost increase” over the perturbations, the “number of cost increases” in Max Iterations of random perturbations, the “acceptance ratio”, the “probability epsilon” (don't confuse it with the probability factor mentioned above), and “beta” (0 < beta < 1), from which the initial and final temperatures are calculated.

initial_temperature = avg_cost_increase / math.log(num_cost_increases / ((num_cost_increases * acc_ratio) - (1 - acc_ratio) * (max_iter - num_cost_increases)))
final_temperature = -beta * avg_cost_increase / math.log(prob_e)

“alpha” is defined as the decay rate at which the temperature keeps on decreasing.

alpha = math.pow(final_temperature / initial_temperature, 1 / num_temp)

The values for the Average Cost Increase, Acceptance Ratio, Probability Epsilon, Beta, and Max Iterations are:

avg_cost_increase = 200
acc_ratio = 0.75
prob_e = 0.00000000001
beta = 0.125
num_objects = 10
max_iter = 4 * num_objects
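Putting the formulae and values together gives a runnable sketch of the temperature schedule. Note that `num_cost_increases` and `num_temp` are not listed in this post, so the values below are assumptions chosen only to make the example run:

```python
import math

# Values from the post
avg_cost_increase = 200
acc_ratio = 0.75
prob_e = 0.00000000001
beta = 0.125
num_objects = 10
max_iter = 4 * num_objects

# Assumed for illustration (not given in the post)
num_cost_increases = 30   # cost increases observed over max_iter random perturbations
num_temp = 100            # number of temperature decrements

initial_temperature = avg_cost_increase / math.log(
    num_cost_increases
    / ((num_cost_increases * acc_ratio)
       - (1 - acc_ratio) * (max_iter - num_cost_increases)))
final_temperature = -beta * avg_cost_increase / math.log(prob_e)
alpha = math.pow(final_temperature / initial_temperature, 1 / num_temp)

print(initial_temperature, final_temperature, alpha)
```

With these assumed values, the initial temperature comes out far above the final temperature and alpha lands between 0 and 1, so the schedule decays geometrically as expected.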

All these values and formulae are taken from the paper itself. I don't want to explain all the parameters here because we would have to dig deeper, and I think that would stray slightly from our actual topic. However, if you are interested, please see the references section of the paper.

That is the code we have written so far. Now let us look at the Simulated Annealing function:

Before explaining it, I need to describe two functions: “value”, which computes the energy of a state, and “action_on”, which disturbs the current state until there is a change and makes the perturbed state the next state.

This is the most important part of the code. We need to calculate the energy of a state, and for that, we need to define when a state has high energy and when it has low energy. Let us define that a state has less energy when more objects are labeled F, and high energy when fewer objects are labeled F. Since the algorithm always strives toward the lowest possible energy, it will try to put the maximum number of objects into the F cluster, in other words, label the maximum number of objects as F.

So the Internal Clustering Criterion here is to have the maximum number of objects labeled F. I took label F here; you can try any other label. You can also try different Internal Clustering Criteria, like requiring the number of objects labeled B to equal the number labeled A, or the number labeled A to be one third of the number labeled C, and many more such mathematical relations involving different labels. Please make sure you understand this paragraph, as it is the crux of implementing this code in various situations.
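One simple way to encode this criterion (an assumed formulation, since the post's gist is not reproduced here) is to count the objects not labeled F, so the energy is minimal, zero, exactly when every object carries the label F:

```python
def value(state):
    """Energy of a state under the criterion above: the fewer objects
    labeled 'F', the higher the energy (assumed encoding, not the
    post's exact gist)."""
    return sum(1 for label in state.values() if label != 'F')

initial_state = {1: 'A', 2: 'B', 3: 'A', 4: 'B', 5: 'C',
                 6: 'B', 7: 'A', 8: 'None', 9: 'None', 10: 'None'}
print(value(initial_state))  # 10: no object is labeled 'F' yet
```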

Now let us look into the part of the code which makes action on the current state to produce the next state:

We initialize two lists, L (used labels) and Lc (L complement, the list of unused labels), and copy the current state into a new variable called the perturbed state. We choose a random object and, after entering the loop, choose a random integer m in [0, |L|]. If m = 0 and there exists an unused cluster label (i.e. |L| < k), then the randomly chosen object gets a random label from the list of unused labels. If the number of used labels equals the total number of labels k, then there are no unused labels, and the chosen object gets a random label from the used-labels list.

Now we need to check whether the perturbed state is different from the current state. There can be a scenario where an object gets the same label it already has. For example: object 3 is randomly chosen, one of the conditions (m > 0, or |L| equals k) is true, and label A happens to be randomly chosen from the list of used labels. In this case, the perturbed state is the same as the current state; there is no change. So we run this loop until the perturbed state differs from the current state, and finally return the perturbed state.
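The description above can be sketched as follows. This is my own reading of the logic, not the post's gist, and it treats the string 'None' as "unlabeled" to match the initial state shown earlier:

```python
import random

k = 6  # total number of labels, A..F
all_labels = ['A', 'B', 'C', 'D', 'E', 'F']

def action_on(current_state):
    """Perturb the current state until it differs, as described above
    (a sketch of the idea, not the post's exact code)."""
    perturbed = dict(current_state)               # copy the current state
    obj = random.choice(list(perturbed))          # choose a random object
    while perturbed == current_state:             # loop until something changes
        used = sorted(set(current_state.values()) - {'None'})   # L, used labels
        unused = [lab for lab in all_labels if lab not in used] # Lc, unused labels
        m = random.randint(0, len(used))          # random integer m in [0, |L|]
        if m == 0 and len(used) < k:
            perturbed[obj] = random.choice(unused)  # start a new cluster
        else:
            perturbed[obj] = random.choice(used)    # relabel within used labels
    return perturbed

state = {1: 'A', 2: 'B', 3: 'A', 4: 'B', 5: 'C',
         6: 'B', 7: 'A', 8: 'None', 9: 'None', 10: 'None'}
next_state = action_on(state)
print(next_state)  # exactly one object carries a new label
```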

So we take the perturbed state as the next state, and then we calculate the energy of the next state through the value function. We replace the current state with the next state if the next state has less energy than the current state, or if the probability factor is greater than or equal to a random integer between 0 and 10.

This process keeps running for a number of iterations determined by Max Iterations and the difference between the initial and final temperatures.
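The acceptance decision can be isolated into a small helper. Here I use the standard `exp(-delta / T)` probability factor compared against a random decimal between 0 and 1; the post's gist may compute the factor slightly differently:

```python
import math
import random

def accept(delta, temperature):
    """Acceptance rule: always take a lower-energy state; otherwise
    take it with probability exp(-delta / temperature)."""
    if delta < 0:
        return True                                   # downhill: always accept
    return math.exp(-delta / temperature) >= random.random()  # uphill: sometimes

print(accept(-5.0, 1.0))  # True: downhill moves are always accepted
```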

Here is the final code:

Here is the output:

You will get different values for the “Energy of final state” on each run because the algorithm makes many choices at random. You can tune the parameters to see how the results are affected: with sufficiently large Max Iterations and sufficiently small (1 - acceptance ratio), beta, and probability epsilon, you can achieve better results.

So that is it, guys. I tried my best to convert a small part of the paper into simple language, and I hope you understood the whole concept. It's OK if you find it difficult to process all of this at once; please re-read it if anything is uncertain. Please leave a response if you did not understand something or if you find any technical or grammatical mistakes. Also, don't forget to run this program on your system and play with the parameters to achieve better results. You can fork it here.

Thank you so much for your time. Have a great day!!

