How Genetic Algorithms and Evolutionary Computing Can Be Used to Find Near-Optimal Solutions for the Travelling Salesman Problem
Last year, I took the ‘Genetic Algorithms and Evolutionary Computing‘ course at KU Leuven. The course is evaluated entirely through a programming project in Python, where the task is to find near-optimal solutions for the travelling salesman problem. Conceptually, several matrices are given that represent the distances between certain cities, after which the algorithm should find the shortest tour. The designed algorithm had to be submitted, after which it would run for 5 minutes on the departmental computers.
1) Design of the evolutionary algorithm
1.1) The three main features:
- Fitness sharing has been used in the elimination step of the algorithm. This diversity promotion scheme is of crucial importance to avoid premature convergence: it ensures that far better solutions can be found, instead of letting all individuals converge to a single local minimum.
- By introducing the 2-opt local search operator, far better solutions were found more quickly. Without the local search operator, many more iterations were required to reach the same fitness values, along with a necessarily larger population. Although this operator is inherently very computationally expensive, it turned out to be pivotal for the algorithm. Especially in this operator, optimizations such as dynamic programming and Numba were decisive in making it computationally feasible.
- One last crucial improvement is the introduction of the greedy and legal initialization schemes. Greedy initialization starts from a random node and chooses the next one according to the smallest distance. The details of this scheme are elaborated in Section 1.4.2, along with a consideration of the biases it introduces. Furthermore, legal initialization simply chooses the next node at random among the neighbors connected by an existing road.
1.2) The main loop:
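Roughly, one iteration of the main loop looks as follows. This is an outline only: each placeholder name stands for one of the operators detailed in the sections below.

```python
import time

def evolve(D, time_limit=300, population_size=15, offspring_size=15):
    population = initialize_population(D, size=population_size)   # Section 1.4
    start = time.time()
    while time.time() - start < time_limit:                       # Section 1.11
        offspring = []
        for _ in range(offspring_size):
            parent_1 = select(population, k=5)                    # Section 1.5
            parent_2 = select(population, k=5)
            offspring.append(recombine(parent_1, parent_2))       # Section 1.7
        for individual in population + offspring:
            mutate(individual)                                    # Section 1.6
            local_search(individual, D)                           # Section 1.9
        population = eliminate(population + offspring,            # Sections 1.8, 1.10
                               k=8, size=population_size)
    return best_of(population)
```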
1.3) Representation
Possible solutions are represented as permutations, written down in cycle notation. For example, the permutation (1 4 2 3) starts at 1, then goes to 4, then 2, then 3, and returns to 1. An advantage of this notation is that no sub-cycles can occur, as long as the representation is initialized as a permutation.
This representation is implemented as a NumPy array whose length equals the number of cities in the problem. Each element in the array represents a city number.
1.4) Initialization
Initially, individuals were generated as random permutations, with their size determined from the distance matrix. However, especially for the larger problem sizes, quite a lot of paths were non-existing or extremely long. Hence, random initialization almost always yielded a population in which not a single individual represented a valid path.
Two new initialization schemes were therefore introduced: legal and greedy initialization. The objective of legal initialization is to avoid non-existing paths when initializing an individual, without introducing biases such as individuals that immediately take over the population. The objective of greedy initialization, on the other hand, is to introduce locally optimal individuals with a high fitness in a computationally inexpensive way. Here, special care has been taken that these individuals don’t introduce strong biases and won’t take over the population immediately.
1.4.1) Legal initialization
When initializing an individual legally, one makes sure that the generated path exists. A city is chosen at random, after which the next city in the tour is chosen randomly among all neighbors with a non-infinite path cost. If no unvisited neighbors with an existing road remain, the whole procedure restarts. The pseudo-algorithm is given in Algorithm 1; a code sketch covering both this scheme and greedy initialization follows at the end of Section 1.4.2.
1.4.2) Greedy initialization
Besides the legal initialization scheme, a second scheme named greedy initialization is used. The algorithm resembles legal initialization, with the only change that the successor city is chosen to be the closest neighbor instead of a random legal one. The pseudo-algorithm is given in Algorithm 2.
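In code, both schemes boil down to the following minimal sketch, assuming non-existing roads are marked with np.inf in the distance matrix D. The names are mine, and the bounded number of restarts stands in for the two-second timeout described in Section 1.4.3.

```python
import numpy as np

def construct_tour(D, rng, greedy=False, max_restarts=100):
    # Legal initialization: start at a random city and repeatedly pick a random
    # unvisited neighbor with a finite road cost. Greedy initialization picks
    # the closest such neighbor instead. Restart when we hit a dead end.
    n = D.shape[0]
    for _ in range(max_restarts):
        tour = np.empty(n, dtype=np.int64)
        tour[0] = rng.integers(n)
        visited = np.zeros(n, dtype=bool)
        visited[tour[0]] = True
        ok = True
        for i in range(1, n):
            costs = D[tour[i - 1]].copy()
            costs[visited] = np.inf          # a city may not be revisited
            if not np.isfinite(costs).any():
                ok = False                   # dead end: restart the procedure
                break
            if greedy:
                nxt = int(np.argmin(costs))
            else:
                nxt = int(rng.choice(np.flatnonzero(np.isfinite(costs))))
            tour[i] = nxt
            visited[nxt] = True
        if ok:
            return tour
    return rng.permutation(n)                # fall back to random initialization
```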
This initialization scheme does introduce certain biases, which could result in some individuals taking over the population immediately. One therefore has to guard against this bias, for example by initializing only a small fraction of all individuals with this scheme.
The introduced bias is that all such individuals are locally optimal, and in theory the number of different greedy solutions is at most the problem size (one deterministic tour per starting city). Given that it may be difficult to escape these local minima, one must weigh the usefulness of this initialization scheme.
However, after experimenting with this initialization scheme, it became apparent that good solutions were found much faster, while still maintaining a lot of diversity, along with a smooth convergence. With only a small portion of the individuals initialized this way, getting totally stuck in a local minimum was not observed, and for me it brought huge advantages, since the given five minutes could now be used to search far more interesting regions of the search space.
1.4.3) General aspects
The distance matrix can be given in such a way that greedy initialization gets stuck in an infinite loop, because always taking the closest neighbor may lead to a dead end from every starting node. To create a legal path in such special cases, the scheme would sometimes have to take a sub-optimal edge to a neighbor in order not to end up in a dead end near the end of the initialization. To prevent the whole algorithm from crashing, a time constraint on the initialization of a single individual has been introduced: once an individual takes longer than two seconds to initialize, simple random initialization of that individual ensues.
Each individual also gets assigned a random α value, which represents the probability that the individual will mutate in the mutation step of the algorithm. This way, a suitable mutation rate is determined through self-adaptivity.
The initial value of α is given as:
α = max(0.04, 0.20 + 0.08 · X), with X ∼ N(0, 1)
After some testing with population sizes, a size of 15 was chosen. Furthermore, as mentioned in Section 1.4.2, only a fraction of the population should be initialized greedily. Given the population size of 15, I found that greedily initializing 20% of the individuals (i.e. 3 individuals) worked out quite well in practice. The remaining 80% of the individuals are initialized legally.
For large problem sizes, initialization could take up to 10 seconds. Since the initialization of an individual is totally independent of the other individuals, multiprocessing has been added to this step, which yielded a factor-of-five speedup on a machine with four physical cores (eight virtual cores).
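A sketch of this parallel initialization step, reusing the hypothetical construct_tour from Section 1.4.2:

```python
from multiprocessing import Pool

import numpy as np

def init_one(args):
    # top-level function, so that multiprocessing can pickle it
    D, seed, greedy = args
    return construct_tour(D, np.random.default_rng(seed), greedy=greedy)

def initialize_population(D, size=15, greedy_fraction=0.2):
    # each individual is initialized independently, so this parallelizes trivially
    n_greedy = round(size * greedy_fraction)
    jobs = [(D, seed, seed < n_greedy) for seed in range(size)]
    with Pool() as pool:
        return pool.map(init_one, jobs)
```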
1.5) Selection operators
The k-tournament selection operator from the group phase of the project has been kept. This selection operator is computationally inexpensive, since only k fitness values have to be computed, whereas fitness-based methods would require µ fitness values. Furthermore, sigma-scaled selection, for example, would not have been an appropriate choice, since the greedy initialization scheme introduces some very good individuals into the population. These individuals would dominate in such a scheme, since their selection probability would be very high.
A k-value of 5 has been chosen after numerous experiments.
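A minimal sketch of this selection operator, assuming fitness is the tour length (so lower is better):

```python
import numpy as np

rng = np.random.default_rng()

def k_tournament(population, fitnesses, k=5):
    # sample k individuals uniformly at random and return the fittest one
    contestants = rng.integers(len(population), size=k)
    winner = min(contestants, key=lambda i: fitnesses[i])
    return population[winner]
```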
1.6) Mutation operators
The mutation operator used in the final implementation is inversion mutation, whereby a random sub-vector is chosen and its order reversed. A swap mutation operator that was used earlier did not scale well to larger problems, since it only swaps two random locations; as a consequence, its relative impact on the solution shrank as the problem size increased.
Inversion mutation does not suffer from this scaling problem, since the cities that delimit the sub-vector are chosen at random. Hence, the relative effect of the mutation operator stays constant for growing problem sizes.
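A minimal sketch of inversion mutation with the self-adaptive rate α (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def inversion_mutation(tour, alpha):
    # each individual mutates with its own self-adaptive probability alpha
    if rng.random() < alpha:
        i, j = np.sort(rng.integers(len(tour), size=2))
        tour[i:j + 1] = tour[i:j + 1][::-1]  # reverse a random sub-vector in place
```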
Self-adaptivity has been used for the mutation rate, which is hence specific to each individual. It is initialized as described earlier, and it changes during crossover according to the following two formulas:
β = 2 · X − 0.5, with X ∼ N(0, 1)
α_child = max(0.04, α_parent_1 + β · (α_parent_2 − α_parent_1))
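In code, the initialization formula from Section 1.4.3 and this crossover rule look roughly as follows:

```python
import numpy as np

rng = np.random.default_rng()

ALPHA_FLOOR = 0.04  # minimum mutation rate

def initial_alpha():
    # alpha = max(0.04, 0.20 + 0.08 * X), X ~ N(0, 1)
    return max(ALPHA_FLOOR, 0.20 + 0.08 * rng.standard_normal())

def recombined_alpha(alpha_1, alpha_2):
    # beta = 2 * X - 0.5, X ~ N(0, 1): blends, and sometimes extrapolates,
    # the parents' mutation rates
    beta = 2.0 * rng.standard_normal() - 0.5
    return max(ALPHA_FLOOR, alpha_1 + beta * (alpha_2 - alpha_1))
```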
As a last note, elitism is used to prevent the best individual from mutating.
1.7) Recombination operators
Initially, a simplified version of the edge crossover operator was used as the recombination operator; the process is described in Algorithm 3 [3]. This recombination results in a new path in which almost all edges of the child were present in at least one of the parents. It does not, however, prioritize edges present in both parents over edges present in a single parent.
This algorithm is very simple and was the weakest part of the genetic algorithm. Still, it has some desirable features despite its simplicity. Edges present in both parents have a relatively high probability of propagating to the child, so important features are mostly preserved. On the other hand, when the parents are very different, the child will look fairly different from both parents. Hence, this operator leans a bit more towards exploration than other operators.
The reason this simplified algorithm was implemented, instead of the proper one from Eiben & Smith [1], was the belief that its computational cost was (much) lower.
Later in the project, a comparison was made between order crossover and the proper edge crossover algorithm of Eiben & Smith. After some research, with a lot of contradictory advice, an arbitrary choice was made to first try out the ‘proper’ edge crossover algorithm (Algorithm 4).
Implementation-wise, quite a lot of effort went into catching all the corner cases of the algorithm and into writing relatively optimized code. The algorithm was kept for a long time thereafter, until it was noticed that for larger problem sizes, crossover took an extremely long time (up to 95% of the total runtime was spent in the edge crossover operator).
Due to this slow execution time, order crossover [1] has been implemented as well (Algorithm 5).
This crossover algorithm is inherently much cheaper to compute and takes only about 5% of the total execution time in the final algorithm, which is exactly why it was eventually adopted.
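A sketch of order crossover in the Numba-friendly style discussed below, assuming tours are NumPy integer arrays with cities labeled 0 to n−1; the cut points are sampled outside the compiled function:

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def order_crossover(p1, p2, start, end):
    # Order crossover (OX): copy p1[start:end] into the child, then fill the
    # remaining positions with the missing cities in the order they appear
    # in p2, starting right after the cut.
    n = p1.shape[0]
    child = np.full(n, -1, dtype=p1.dtype)
    child[start:end] = p1[start:end]
    used = np.zeros(n, dtype=np.bool_)
    for i in range(start, end):
        used[p1[i]] = True
    pos = end % n
    for i in range(n):
        city = p2[(end + i) % n]
        if not used[city]:
            child[pos] = city
            pos = (pos + 1) % n
    return child

# usage: cut points are sampled outside the jitted function, e.g.
# start, end = sorted(np.random.default_rng().integers(n, size=2))
# child = order_crossover(parent_1, parent_2, start, end)
```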
In hindsight, one reason for the slow execution of the edge crossover operator is probably the use of sets. The edge table was essentially a list of sets, where a minus sign denoted an edge present in both parents. Sets were used because it was desirable to check quickly whether an edge was present in the edge table. However, since each element can have at most four edges, lists would probably have sufficed. Given that the sets also required quite some bookkeeping (e.g. deleting a positive entry when it occurred a second time, to re-insert it with a minus sign in front), one more point can be made for using lists.
Another reason for the big performance gap is that the order crossover operator could be compiled with Numba, via the decorator @jit(nopython=True), and thus ran far faster. This is because order crossover only uses operations on NumPy arrays (which Numba handles perfectly well), while Numba threw hundreds of compile errors on the edge crossover implementation: in nopython=True mode it couldn’t create new NumPy arrays, had difficulties working with sets, and was often unable to infer the dtypes.
1.8) Elimination operators
For a long time, the (κ + µ)-elimination operator was used in the algorithm. However, for the smaller problem sizes, it was noticed that the population converged extremely quickly, even with the fitness sharing scheme in place (as further discussed in Section 1.10). After some research, it became apparent that the (κ + µ)-elimination operator exerts quite a lot of selective pressure. A k-tournament operator, in contrast, can mitigate this pressure, so the (κ + µ)-elimination operator was exchanged for the k-tournament operator.
To combine the k-tournament operator with fitness sharing, the work is divided up: Algorithm 6 covers the k-tournament operator (along with some preparatory computations), while Algorithm 8 is invoked each time for the fitness sharing diversity promotion scheme itself, as explained in Section 1.10.
After numerous experiments, a k-value of 8 has been chosen, as further discussed in Section 1.12.
1.9) Local search operators
The 2-opt local search operator has been implemented, which considers every possible swap of two edges in a given cycle. In a first version of this algorithm, the fitness was recalculated from scratch for every possible neighbor of the given individual, which entailed an unacceptably high computational cost, especially for the larger problem sizes. After some investigation, patterns were detected in the computation of the fitness. Hence, instead of recalculating the fitness for every neighbor, a dynamic programming approach was taken. For every individual, there is a preprocessing step in which so-called ‘cumulatives’ are created. These cumulatives capture the path length from the first city up to the corresponding city in the cumulative array. The same process applies to the path length from the last city back to the corresponding city in the array (i.e. in reverse order, whereby the return cost from the last city to the first city is also incorporated). Clearly, these cumulatives can be computed in O(N), where N is the number of cities.
Now, calculating the fitness of a neighbor is simply a matter of bookkeeping. The process is explained in Algorithm 7.
As further illustrated in Figure 2, the first part of the tour is simply the same as in the previous iteration, with one extra cost added from first - 1 to first. The same reasoning applies to the last part of the tour, where the total cost decreases each time by the cost from second - 1 to second. Finally, the middle part can be built up in an analogous way. This way, the total cost of the 2-opt local search algorithm is only O(N²), where N denotes the total number of cities.
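One way to implement this cumulatives idea is sketched below; i and j play the roles of first and second, and the separate reverse cumulative handles a possibly asymmetric distance matrix D. This is a sketch under those assumptions, not the exact code of Algorithm 7.

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def two_opt_pass(tour, D):
    # One full 2-opt pass: evaluate every pair of edge swaps in O(1) each,
    # using cumulative path costs, and apply the best improvement found.
    n = tour.shape[0]
    fwd = np.zeros(n)  # fwd[i]: cost of tour[0] -> ... -> tour[i]
    rev = np.zeros(n)  # rev[i]: cost of tour[i] -> ... -> tour[0] (reverse order)
    for i in range(1, n):
        fwd[i] = fwd[i - 1] + D[tour[i - 1], tour[i]]
        rev[i] = rev[i - 1] + D[tour[i], tour[i - 1]]
    bwd = np.zeros(n)  # bwd[i]: cost of tour[i] -> ... -> tour[n-1] -> tour[0]
    bwd[n - 1] = D[tour[n - 1], tour[0]]
    for i in range(n - 2, -1, -1):
        bwd[i] = D[tour[i], tour[i + 1]] + bwd[i + 1]

    best_cost = fwd[n - 1] + D[tour[n - 1], tour[0]]
    best_i, best_j = -1, -1
    for i in range(1, n - 1):          # 'first' in the text above
        for j in range(i + 1, n):      # 'second' in the text above
            # reversing tour[i..j] glues three precomputed pieces together
            cost = fwd[i - 1] + D[tour[i - 1], tour[j]] + (rev[j] - rev[i])
            if j < n - 1:
                cost += D[tour[i], tour[j + 1]] + bwd[j + 1]
            else:
                cost += D[tour[i], tour[0]]
            if cost < best_cost:
                best_cost, best_i, best_j = cost, i, j
    if best_i != -1:                   # apply the best swap in place
        lo, hi = best_i, best_j
        while lo < hi:
            tour[lo], tour[hi] = tour[hi], tour[lo]
            lo, hi = lo + 1, hi - 1
    return best_cost
```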
It should also be noted that by placing Numba’s @jit(nopython=True) decorator above the method declarations, the local search operator runs 745 times as fast. Numba achieves these huge improvements by compiling the methods, where especially the loops can be exploited.
1.10) Diversity promotion mechanisms
The diversity promotion scheme used is fitness sharing elimination, which adjusts the fitnesses of the individuals in the σ-neighborhood of the already chosen survivors. The fitness sharing elimination operator is explained in Algorithm 8.
The sub-method ‘distance from to’ calculates the distance between two individuals, measured by the number of common edges between them. To do this efficiently, the edges of each individual are computed and stored in a set at the moment of initialization. Measuring the distance between two individuals then amounts to calculating the intersection of their sets.
Calculating the intersection for many pairs of individuals over and over turned out to be quite computationally expensive. As an improvement, all calculated distances are stored in a hashmap, which gave a decent speedup, given that quite some individuals stay in the population for more than one iteration. As a precaution against thrashing, the hashmap is simply emptied when the memory usage of the system exceeds 95%.
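A sketch of the distance computation and the sharing penalty. The edges and fitness attributes on an individual are assumed, the edges are stored as directed pairs here, and the penalty below is the standard sharing function; the exact formula in Algorithm 8 may differ in detail.

```python
def directed_edges(tour):
    # computed once per individual at initialization and stored with it
    n = len(tour)
    return {(int(tour[i]), int(tour[(i + 1) % n])) for i in range(n)}

def distance(a, b):
    # fewer common edges means a larger distance; the real implementation
    # caches these values in a hashmap, as described above
    return len(a.edges) - len(a.edges & b.edges)

def shared_fitness(individual, survivors, sigma, alpha=0.25):
    # penalize individuals inside the sigma-neighborhood of already chosen
    # survivors; for this minimization problem, a larger value is worse
    penalty = 1.0
    for survivor in survivors:
        d = distance(individual, survivor)
        if d < sigma:
            penalty += 1.0 - (d / sigma) ** alpha
    return individual.fitness * penalty
```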
1.11) Stopping criterion
Not a lot of effort was put into a stopping criterion, since all the larger problems were still converging after running for five minutes. Even when the best fitness remained fixed for a very long time, it occurred that a well-chosen mutation or crossover operation suddenly let the algorithm progress further. Hence, the stopping criterion is simply the time limit of five minutes.
1.12) Parameter selection
The population and offspring sizes were largely determined by the computational cost of the algorithm. For large problem sizes, the biggest cost per iteration is the fitness sharing elimination step: the matrix with the number of common edges between all individuals and the survivors grows quadratically, and the computation of a single entry grows linearly as well. Hence, the computational cost grew too large when taking many more than 15 individuals in the population (plus 15 more as offspring). Far fewer than 15 individuals is, naturally, not desirable either, since an evolutionary algorithm depends on having a somewhat diverse population.
The k-tournament parameters were determined by a hyperparameter search over values ranging from 2 to 10. Given that these two parameters are highly correlated, a grid search or random search was required. To keep the search feasible, a random search was undertaken, which yielded a k-value of 5 for selection and 8 for elimination.
When the matrix with the number of common edges was printed, it became apparent that a lot of entries were either the maximal problem size or zero. Hence, after some experimentation, a σ value of 50% of the problem size was taken, with 0.25 as the α value. This low α is in my opinion also better, given that really ‘close’ solutions should be penalized much more than solutions that still differ in quite some edges.
As a summary, the following hyperparameters were used:
- The population size = 15
- The offspring size = 15
- The k-tournament parameter of the selection operator = 5
- The k-tournament parameter of the elimination operator = 8
- The α-value of fitness-sharing = 0.25
- The σ-value of fitness-sharing = half of the problem size
1.13) Other considerations
As discussed in Section 1.10, quite a speedup was attained by storing the distances between individuals in a hashmap. For the same reason, a hashmap storing the fitness values of all individuals has been introduced. Since in each iteration it is known which individuals leave the population (when mutation is applied, when local search yields a new individual, or when individuals are killed in the elimination step), their values can easily be removed from the hashmap as well. This way, the size of the hashmap always stays the same, which is more performant: there is no restart once the memory level exceeds its threshold, and no time is wasted on the garbage collector kicking in.
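A sketch of this fitness hashmap; tour_length and the byte-string keys are my own illustrative choices:

```python
import numpy as np

fitness_cache = {}

def tour_length(tour, D):
    # total cycle length, including the return edge to the first city
    return D[tour, np.roll(tour, -1)].sum()

def fitness(tour, D):
    # NumPy arrays are not hashable, but their raw bytes form a stable key
    key = tour.tobytes()
    if key not in fitness_cache:
        fitness_cache[key] = tour_length(tour, D)
    return fitness_cache[key]

def forget(tour):
    # called whenever an individual leaves the population, so the
    # hashmap's size stays constant
    fitness_cache.pop(tour.tobytes(), None)
```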
2) Numerical experiments
2.1) Metadata
These experiments were conducted on an Intel Core i7-6700HQ CPU, with a clock frequency of 3.60GHz and 8 virtual cores. The system contains 16 GB of main memory, and Python version 3.8 was used for the tests. The hyperparameters used are summarized in Section 1.12.
All convergence graphs were plotted semi-logarithmically, since the changes at the start are much bigger than the changes near the end of the algorithm.
2.2) Tour of 29 cities
The best tour found had a fitness of 27154.5.
As follows from Figure 3, the (probably optimal) fitness value of 27154.5 was found every time. Furthermore, the mean fitness shows some variation due to the diversity promotion scheme. It is never equal to the best fitness value, which means that diversity is always preserved. Note that for this experiment, a time limit of 10 seconds per run was used to make the computation feasible.
Over the thousand runs, the mean of the mean fitnesses equals 33551.6, with a standard deviation of 657.6.
As can be seen in the convergence graph of Figure 4a (see Section 2.6), the best fitness converges extremely quickly. The mean fitness, however, does not converge, which means that the population stays diverse thanks to the fitness sharing elimination scheme.
2.3) Tour of 100 cities
The best tour found had a fitness of 219948.4.
The convergence graph is shown in Figure 4b (see Section 2.6). The algorithm clearly performs well, quickly jumping to the most interesting regions of the search space, after which the best fitness keeps improving until the five-minute time constraint is hit. As can be seen from the mean fitness, the population stays diverse and hence retains the ability to keep exploring for better solutions.
For this tour, as well as for the following ones, it is difficult to estimate how close this solution is to the optimal one, since only a heuristic value was known, which this algorithm clearly outperforms.
2.4) Tour of 500 cities
The best tour found had a fitness of 110814.2.
The convergence graph is shown in Figure 4c (see Section 2.6). The algorithm again performs well, once more jumping quickly to the most interesting search regions, combined with a smooth convergence. Furthermore, the population stays diverse.
2.5) Tour of 1000 cities
The best tour found had a fitness of 194955.2.
The best fitness again keeps improving until the five-minute time limit is exceeded. Once more, the mean fitness shows that the population stays diverse.
2.6) Convergence graphs
3) Critical reflection
3.1) What are the three main strengths of evolutionary algorithms?
- By using a population-based metaheuristic, a tradeoff can be made between exploration and exploitation. This tradeoff turned out to be instrumental in finding good solutions for the travelling salesman problem. Pure random search is computationally far too expensive (it would take practically forever before even a ‘decent’, simple greedy solution would be found). Pure local optimizers, on the other hand, would quickly converge to one suboptimal solution, most likely relatively far from the optimum. Evolutionary algorithms provide a way to find good sub-optimal solutions in a decent amount of time.
- Evolutionary algorithms are in principle relatively easy to parallelize, since the population can explore the search space concurrently in many directions. Although not much parallelization was used in this project (only in the initialization step), it would probably be the most beneficial next task to undertake. An attempt to use multiprocessing in the main loop was made, but the execution time was more than a thousand times slower than simple sequential execution on one CPU core. This was probably due to a lot of interprocess communication (IPC): since, for example, the list of individuals was shared, a lot of locks and semaphores were needed, which clearly dominated the execution time.
- A last thing that struck me about evolutionary algorithms is that they in essence optimize a function without using derivatives (only fitness values, leaving aside the local search operator for the moment). Especially for non-differentiable functions, an evolutionary algorithm may offer salvation.
3.2) What are the three main weak points of evolutionary algorithms?
- Evolutionary algorithms work well, as long as they are carefully designed with reasonable parameters. Take, for example, a population size of 100 instead of 15, and the algorithm struggles to complete even a few iterations. This is of course due to the fitness sharing elimination, which has cubic complexity in terms of the population (and offspring) size. Furthermore, a lot of other parameters have to be chosen well to obtain reasonable performance. If, for example, the k-tournament parameters (for selection and elimination) are not chosen carefully, either the population converges immediately, or the overall fitnesses of the individuals hardly increase. In the end, this comes as no surprise: according to the no-free-lunch theorem, no algorithm for solving all problems can be generally superior to any competitor.
- The representation used in this project clearly yields valuable offspring after crossover. Reasoning about the representation, it is apparent that sub-vectors of the parents are probably good subsequences. Crossover may combine these features in a more optimal way, so the offspring clearly have a relatively high chance of having a low fitness value as well. For many other problems, however, such representations and crossover operators are not that obvious (i.e. features can’t be extracted from the parents), and hence evolutionary algorithms can’t offer much for those problems.
- Evolutionary algorithms are computationally very expensive. Since a lot of exploration is undertaken, inherently many meaningless individuals are constructed during the search.
3.3) Main lessons learned from this project. Are evolutionary algorithms appropriate for the travelling salesman problem?
I think that evolutionary algorithms fit the travelling salesman problem perfectly. Finding an exact solution is of course infeasible, hence the usage of metaheuristics becomes necessary, and they prove to work extremely well. Considering that a lot of real-world applications require good approximations of the optimal solution, it is obvious that these metaheuristics (and evolutionary algorithms in particular) are an essential tool to possess.
Furthermore, I learned to appreciate the value of libraries such as Numba, which can speed up Python code massively by compiling it. Surely, not all operations are supported, but given that the Numba community is evolving rapidly, I will keep an eye on it.
One drawback of evolutionary algorithms is the way you have to test certain improvements. Due to their inherent randomness, quite a lot of variation can be observed between separate runs. Certainly when a hyperparameter change only yields a relatively small difference, the experiments have to be repeated several times before the best value becomes apparent.
References
[1] A. E. Eiben and J. E. Smith. Introduction to Evolutionary Computing. Springer International Publishing, 2nd edition.
[2] Nick Vannieuwenhoven. Local Search Operators. Technical report, 2021.
[3] Darrell Whitley, Timothy Starkweather, and Daniel Shaner. The Traveling Salesman and Sequence Scheduling: Quality Solutions Using Genetic Edge Recombination. page 18.
The code of this project can be found here.
Unless otherwise noted, all images were created by the author.
If you have any remaining questions, or if you want to stay in touch, you can contact me via GitHub or LinkedIn.