
The Travelling Salesman Problem (TSP) is one of the classic challenges of combinatorial optimization. Although it has been studied for a long time, no known method can find the optimal solution in polynomial time. With the advancements in the field of artificial intelligence, new approaches to tackle the TSP have emerged. In this article, we leverage a transformer neural network to find good solutions to this problem.
Code repository: GitHub
The aforementioned codebase is a refactoring of a previous project I completed to pass the Automated Decision Making course exam during my MCS. I want to thank my friend and colleague Alessandro, who collaborated on that project.
Background – TSP
Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? (source: Wikipedia)
The above question defines the Travelling Salesman Problem (TSP), an NP-hard problem, meaning that no currently known method finds the optimal solution in polynomial time. A little more formally, given a graph G(V, E), where V is the set of nodes (the cities) and E is the set of edges connecting any pair of nodes in V, we are asked to find the shortest possible Hamiltonian tour (i.e. a "route that visits each city exactly once and returns to the origin city") in G.
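As a quick formalization (a sketch, assuming the complete graph with Euclidean edge weights d(u, v) described below), the problem amounts to searching for a permutation π of the n nodes that minimizes the length of the closed tour:

```latex
% Tour length to minimize over permutations \pi of the n nodes;
% the last term wraps around so that the tour returns to the start.
\min_{\pi \in S_n} \; \sum_{i=1}^{n} d\big(v_{\pi(i)},\, v_{\pi(i+1)}\big),
\qquad \pi(n+1) := \pi(1)
```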
In this project, we consider a node as a simple pair of features, namely its x and y coordinates in a 2D Euclidean space. Consequently, there exists an edge between any pair of nodes (u, v), whose weight is given by the Euclidean distance between u and v.
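As a minimal sketch of this setting (function names and shapes are my own choices, not the repository's API), here is how such a random instance, its distance matrix, and a tour length can be computed with NumPy:

```python
import numpy as np

def random_tsp_instance(n_nodes: int = 50, seed: int = 0) -> np.ndarray:
    """Sample node coordinates uniformly in the unit square, shape (n_nodes, 2)."""
    rng = np.random.default_rng(seed)
    return rng.random((n_nodes, 2))

def distance_matrix(nodes: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances, shape (n_nodes, n_nodes)."""
    diff = nodes[:, None, :] - nodes[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def tour_length(nodes: np.ndarray, tour: np.ndarray) -> float:
    """Length of the closed tour defined by the permutation `tour`."""
    ordered = nodes[tour]
    return float(np.linalg.norm(ordered - np.roll(ordered, -1, axis=0), axis=1).sum())

nodes = random_tsp_instance()
D = distance_matrix(nodes)
print(tour_length(nodes, np.arange(len(nodes))))  # length of the identity tour
```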
Background – Transformer for TSP
We take inspiration from the work of Bresson et al. [1], who leverage a transformer neural network to address the TSP, obtaining very promising results. Briefly, their model is split into two components. First, the transformer encoder takes as input a set of tokens, i.e. vectors representing the nodes of an input graph G, and maps them into a new latent space thanks to its self-attention layers. We’ll refer to the output of the encoder as graph embeddings. Subsequently, the transformer decoder is fed with a fictitious token z, which signals the model to start building the tour, plus the graph embeddings from the encoder, and generates a probability distribution modeling the likelihood of each node in the graph to be selected as the next candidate in the tour. We draw the index t of the chosen node from that distribution. Then, we concatenate z with the t-th graph embedding and feed the decoder again with the resulting sequence, which represents the partial tour. This process goes on until all the available nodes have been selected. Once finished, we remove z from the obtained sequence and append the first selected node to it, thus closing the tour. Such a decoder is called auto-regressive, since at each selection step it works on its own output from the previous step. The model is trained with the REINFORCE [2] algorithm to minimize the average length of the tours generated from random graphs with 50 nodes each.
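To make the auto-regressive procedure concrete, here is a rough Python sketch of the decoding loop described above; `encoder`, `decoder`, and the start token `z` are hypothetical placeholders following the description in the text, not the actual API of [1]:

```python
import torch

def autoregressive_decode(encoder, decoder, nodes, z):
    """Greedy sketch of the auto-regressive decoding loop described in the text.

    nodes: (n, 2) tensor of city coordinates; z: (1, d) learned start token.
    `encoder` and `decoder` are hypothetical modules, not the actual API of [1].
    """
    graph_emb = encoder(nodes)                        # (n, d) graph embeddings
    sequence = z                                      # partial tour, starts with the fictitious token
    mask = torch.zeros(len(nodes), dtype=torch.bool)  # already-selected nodes
    tour = []
    for _ in range(len(nodes)):                       # one decoder pass per selected node
        probs = decoder(sequence, graph_emb)          # (n,) distribution over next candidates
        probs = probs.masked_fill(mask, 0.0)          # forbid repetitions
        t = int(torch.argmax(probs))                  # greedy choice (sampling is also possible)
        mask[t] = True
        tour.append(t)
        sequence = torch.cat([sequence, graph_emb[t : t + 1]], dim=0)
    return tour + tour[:1]                            # close the tour on the first node
```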
Proposed method

In this project, we too employ a transformer architecture, since we aim to exploit attention to learn valuable global dependencies among the nodes of a graph. At the same time, we want to design a model that avoids any auto-regressive component, i.e. a neural network that directly outputs a feasible TSP solution, without requiring a separate forward pass for each node.
We start by observing that the TSP can be framed as a set-to-sequence problem. We have a bunch of nodes for which no order is defined, and we want to figure out how to sort them so that the resulting tour is as short as possible. In neural networks, the concept of order is usually represented with some sort of positional encoding [3]. That is, we add to each token of the input set a particular vector, whose components are unique for a specific position. So, if we change the order of the tokens, the positional vector changes, resulting in a different token as well. Naturally, there are plenty of possible implementations of positional encoding, but to keep things simple, we stick with the original proposal of [3].
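For reference, here is a minimal sketch of the sinusoidal positional encoding from [3] (the function name and shapes are my own choices):

```python
import math
import torch

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    """(n_positions, d_model) sinusoidal encoding from [3]; d_model is assumed to be even."""
    position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)      # (n, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even feature dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd feature dimensions
    return pe
```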
The core of our method is the attention operator, specifically cross-attention, so here we provide a brief recap of it.
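As a reminder, the standard scaled dot-product attention of [3] can be summarized in a few steps (this is the textbook formulation, written out here so that the step numbers referenced below are explicit): step 1 projects the input tokens into queries, keys, and values; step 2 computes the attention matrix; step 3 normalizes it with a softmax; step 4 applies it to the values.

```latex
% Step 1: linear projections of the n input tokens X (one row per token)
Q = X W_Q, \quad K = X W_K, \quad V = X W_V
% Step 2: pairwise similarities between queries and keys
A = Q K^{\top} \in \mathbb{R}^{n \times n}
% Step 3: row-wise softmax of the scaled similarities
\bar{A} = \mathrm{softmax}\!\left(\frac{A}{\sqrt{d_k}}\right)
% Step 4: weighted combination of the values
\mathrm{Attention}(Q, K, V) = \bar{A}\, V
```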

The critical step is the second. A is an [n, n] matrix, whose (i, j) entry is a real number proportional to the cosine similarity between the i-th token and the j-th token. To understand why, recall from step 2 that (i, j) is the outcome of the dot-product between query i and key j, both being vectors of size d_k. But the dot-product is nothing more than the cosine of the angle between two vectors, apart from a scaling factor given by the product of the norms of those same vectors.

After applying the softmax (step 3) to A (actually, to a scaled version of A, but that’s a detail), we obtain a matrix whose i-th row is a probability distribution, modeling the likelihood of query i being similar to each key j = 1, …, n. Finally, multiplying A by V applies a linear transformation to the values, weighting their features by the similarity between queries and keys. For instance, let’s say that x_0, x_1, and x_2 are tokens that stand for the cities of New York (NY), Washington DC (WDC), and Los Angeles (LA) respectively, and that their features encode the geographical locations of the three metropolises. We extract the three matrices Q, K, and V from them and compute the attention matrix A. Since the distance NY-WDC is way shorter than NY-LA, the first row of A will have a very high value in its first column (because the distance NY-NY is naturally the shortest), a mid-high value in the second column, and a low value in the third one. NY is also represented by the first row of the matrix V, i.e. v_0. After the AV matrix multiplication, the new representation of NY will be a weighted combination of the value vectors: mostly v_0 (NY), reasonably v_1 (WDC), and poorly v_2 (LA), with the weights taken from the first row of A. The same process holds for the other two cities. Our original set of tokens has been transformed into a new set of vectors, whose features are weighted by (spatial, in this case) similarity.
This type of attention is called self-attention, since Q, K, and V all come from the same input. On the other hand, in a transformer decoder there is also a second type of attention: cross-attention. This time, the query Q comes from the decoder itself. For instance, in the model of Bresson et al., Q represents the partial tour, while K and V are linear projections of the output of the encoder.
In this work, we propose to leverage cross-attention differently. Indeed, we use the positional encoding as the query, while keys and values are extracted from a previous self-attention layer, which is free to operate over all the nodes of G. This way, when we compute the attention matrix, we end up with a matrix where each row is a probability distribution modeling the likelihood of each node to occupy a given position in the tour. The sketch below and the picture right after it may help in understanding this claim.
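Here is a minimal PyTorch sketch of this idea (single head, no extra tricks; module and parameter names are my own assumptions, not the repository's code):

```python
import torch
import torch.nn as nn

class PositionalCrossAttention(nn.Module):
    """Cross-attention where the query is the positional encoding of the tour positions,
    while keys and values come from the node embeddings of a previous self-attention layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, pos_enc: torch.Tensor, node_emb: torch.Tensor):
        # pos_enc: (n, d) positional encodings (one per tour position)
        # node_emb: (n, d) node embeddings from the previous self-attention layer
        q = self.w_q(pos_enc)                              # queries = positions
        k, v = self.w_k(node_emb), self.w_v(node_emb)      # keys / values = nodes
        scores = q @ k.transpose(0, 1) / k.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)               # row i: distribution over nodes for position i
        return attn @ v, attn                              # new features and the (n, n) attention matrix
```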

As per the above attention matrix, node #1 will likely be the first in the proposed tour. Node #46 is probably the second. Nodes #39 and #36 will be placed in the third and fourth positions respectively. Node #23 is a good candidate for both the fifth and sixth positions, and so on… In practice, the element (i, j) of such a matrix tells how likely node j is to be placed at position i of the proposed tour. Our model is essentially a stack of blocks, each one consisting of a self-attention layer, the cross-attention presented above, and finally a feed-forward network, as in any transformer. The attention matrix from the last cross-attention layer is used to generate the tour. A full attention matrix produced by our model looks something like this one:

As you can see, it is quite chaotic. After optimizing the network with the REINFORCE algorithm to minimize the average tour length, we end up with matrices like the following.

Starting from the above matrix, we can pick a node for each row, and the resulting sequence will be our tour (after appending the first node to close it). For the sake of simplicity, in all our experiments we use greedy decoding to select nodes, i.e. we always take the node with the highest value in the current row. We leave the exploration of other strategies, such as beam search, to future developments.
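As a sketch (assuming the attention matrix is available as a NumPy array), greedy decoding boils down to a row-wise argmax, which by itself does not yet guarantee a feasible tour, as discussed next:

```python
import numpy as np

def greedy_decode(attention: np.ndarray) -> np.ndarray:
    """Pick, for each position (row), the node (column) with the highest score.

    Note: nothing prevents the same node from being chosen for two different
    positions, which is exactly the feasibility issue addressed in the next section.
    """
    tour = attention.argmax(axis=1)          # one node index per tour position
    return np.append(tour, tour[0])          # close the tour on the first node
```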
Feasible solutions
You’ve probably already figured out the main problem of our method. To build a feasible tour, we need to choose each node exactly once. However, if you carefully observe the previous picture, you’ll notice that there are some repetitions. Namely, node #47 is the most likely one for two consecutive positions, as is node #32. As it is, our network is not able to provide feasible TSP solutions. What we would like to obtain from the model is a permutation matrix, i.e. a matrix with a single entry of value 1 in each row and column. The issue is that neural networks are not good at predicting such sparse objects.
To overcome this hitch, we can formulate the TSP as a linear sum assignment (LSA) problem, whose (inverse) cost matrix is the attention matrix computed by our transformer. In this work, we leverage the implementation from SciPy to find the minimum-cost matching between the set of nodes (the workers, in the typical LSA formulation) and the set of positions (the jobs) in the tour.
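Since the attention scores act as rewards rather than costs, the assignment can be solved by maximizing them directly (or, equivalently, by negating them). A minimal sketch with SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def attention_to_tour(attention: np.ndarray) -> np.ndarray:
    """Turn an (n, n) attention matrix (positions x nodes) into a feasible tour.

    Each position (row) is matched to exactly one node (column), maximizing the
    total attention score, so no node can be selected twice.
    """
    row_ind, col_ind = linear_sum_assignment(attention, maximize=True)
    # row_ind is returned sorted, so col_ind[i] is the node assigned to position i
    return np.append(col_ind, col_ind[0])    # close the tour on the first node
```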
Sinkhorn operator
Naturally, solving an LSA instance whose cost matrix is a permutation matrix is straightforward: you just have to select the 1 in each row of the matrix. The objective of our training is thus to make the final attention matrix as close as possible to a permutation matrix, so that the LSA is easy to solve and leads to a matching representing a tour with a short total length.
In our project, we leverage the Sinkhorn operator [4] to turn the dense attention matrices of our model into soft permutation matrices: they don’t truly represent a permutation, but they get very close to one. Below, a minimal sketch of the operator, followed by the training curve of a typical run: the average tour length of the solutions proposed by our model is expected to decrease during training, and this does not happen if we drop the Sinkhorn step before solving the LSA problem.
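In the sketch below (the number of iterations and the temperature are hypothetical hyperparameters, not the ones used in the repository), rows and columns of the exponentiated scores are alternately normalized, so that the matrix approaches a doubly stochastic, permutation-like matrix:

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20, tau: float = 0.1) -> torch.Tensor:
    """Approximate a doubly stochastic (soft permutation) matrix from raw scores [4].

    scores: (n, n) attention logits; a lower tau sharpens the result towards a hard permutation.
    """
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # normalize rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # normalize columns
    return log_p.exp()
```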

Results and comparisons
We compare our proposal against the algorithm of Christofides [5], a popular heuristic for the TSP, as well as the auto-regressive transformer of Bresson et al., which we also take as a reference for the training setup. The evaluation is carried out over 10000 graphs with 50 nodes each, whose coordinates are randomly generated.
Here we provide a box plot depicting the length of the solutions proposed by the three contenders. Undoubtedly, our transformer is the worst of the three. Producing good TSP solutions with a neural network that does not exploit auto-regression is a hard task.

However, solving optimization problems is not just about solution quality: computational time plays a fundamental role as well. That is the main feature of our proposal: since our transformer requires a single forward pass to provide a solution, we are faster than the method of Bresson et al., as they have to loop as many times as there are nodes in the graph to get a complete tour. We measure the computational time with the Python profiler (a minimal sketch of such a measurement is shown below, followed by the plot of the measured times). For Christofides, we use the implementation from NetworkX. Note that, for a fair comparison, the two neural networks run on the CPU, since Christofides is not designed to run on a GPU.
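A minimal sketch of how a single solver can be profiled with the standard-library cProfile module (`solve_instance` is a hypothetical callable wrapping any of the three methods, not the repository's API):

```python
import cProfile
import pstats

def profile_solver(solve_instance, instances):
    """Run `solve_instance` on every instance and report the cumulative CPU time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for nodes in instances:            # e.g. 10000 random graphs with 50 nodes each
        solve_instance(nodes)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top-10 entries
```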

Finally, we show some qualitative results of the best tours produced by our network. In these cases, our model was able to improve over the solutions generated by Christofides.










Conclusions and future work
In this work, we’ve presented an original neural network to tackle the TSP. While we managed to design a suitable architecture that avoids auto-regression, the quality of the proposed tours is questionable. The main drawback of this approach is that it is not an end-to-end neural method, since we need to solve an LSA instance to ensure a feasible solution. We tried to force the model to avoid node repetitions by adding penalty terms to the loss function, but we didn’t get any good results. Learning a permutation from a set is a hard task on its own.
Probably, just switching from mere greedy decoding to a beam search strategy would be sufficient to slightly improve the results. Furthermore, there exist more advanced reinforcement learning algorithms to explore, such as PPO. We hope that others may be inspired by this project and will continue to explore this pathway.
References
[1] Bresson, Xavier, and Thomas Laurent. "The transformer network for the traveling salesman problem." arXiv preprint arXiv:2103.03012 (2021)
[2] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Reinforcement learning (1992): 5–32
[3] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017)
[4] Adams, Ryan Prescott, and Richard S. Zemel. "Ranking via sinkhorn propagation." arXiv preprint arXiv:1106.1925 (2011)
[5] Christofides, Nicos. Worst-case analysis of a new heuristic for the Travelling Salesman problem. Carnegie-Mellon Univ Pittsburgh Pa Management Sciences Research Group, 1976