The world’s leading publication for data science, AI, and ML professionals.

Eulerian Path’s and DNA sequencing

How a trivial curiosity like the bridges of Konigsberg helped modern scientists sequence DNA!

Königsberg Cathedral, the bridge on the right is one of the two surviving bridges from Euler's time. | Image by Gumerov Ildar
Königsberg Cathedral, the bridge on the right is one of the two surviving bridges from Euler’s time. | Image by Gumerov Ildar

Sequencing DNA is a massive part of modern research. It enables a multitude of different areas to progress, including genetics, meta-genetics and phylogenetics. Without the ability to sequence and assemble DNA into genomes, the modern world would have a much looser grasp on disease, its evolution and adaptations, and even our history/phylogeny.


Before discussing how Euler optimising his morning jog helped us make considerable advances in this space, it’s essential to lay some foundations on how sequencing works and the optimisation problem.

For this story, I will be using the word DNA to describe the genetic code and genes as the target of the sequencing, but keep in mind, similar processes are applied to sequencing other things like RNA. This description of how sequencing works is not meant for bio-informaticians or biologists doing this on a day to day basis. This description is intended for my fellow data scientists who want to understand the process and its issues, and as such, will be a very high-level overview. If you are a Biologist/Bioinformatician, you may still be interested in reading this to get a data scientist’s point of view, but it may be worth skipping over the next section.


DNA Sequencing involves extracting and reading the strands of DNA in a sample. I mean, it sounds simple enough? But how does one go about performing this task? The following is a high-level overview of how an Illumina sequencing machine works.

DNA Bases - image by author
DNA Bases – image by author

DNA is a double-stranded string of nucleic bases. These bases are labelled ACGT due to their names starting with those letters. In the case of RNA, the T is uracil and is denoted with the letter U. Each of these bases exclusively bind to only one other base, the A binds with T, and the G binds with C. These bindings of two bases are called base pairs. The long twisted strings of these base pairs that we refer to as "DNA", and as such, the goal of sequencing is to read what these bases are. Since they exclusively bind to only one other pair, we don’t have to read both strands of DNA to sequence it; we only need to read one strand as the other is deterministic.

The first step of sequencing is taking the extracted DNA (this part is done using chemical/physical processes) and break the DNA into smaller pieces using ultrasonic waves. These smaller pieces are then cleaved into single strands and have small proteins attached to their heads and tails. These heads and tails are specially designed proteins that bind with similar matching proteins on a glass slide.

There is then a process of replication in which specific enzymes/proteins build the matching base pairs back onto every single strand making dual strands again. These double strands are then split apart back into single strands and to process is repeated a few times, each time increasing the number of single strands.

At this point, we are ready to start sequencing the actual DNA! The physical read of the DNA strand is facilitated by unique bases which have a fluorescent compound added to their chemical structure. These are added one at a time to the clusters from the previous step to bind to the next available base on the DNA strands. A laser is used to excite the fluorescent component, and there is a detectable flash of light at the given wavelength for that base from the cluster. This process is repeated until all bases are paired. At this point, the reading of the DNA is completed.

The sequencer can get an idea about the quality of the reads using these flashes and the intensity/wavelengths of the emitted light since any inconsistencies in the DNA strands will result in different coloured flashes. Ideally, at this point, we want to see intense emissions at a single wavelength to show that all the strands in the cluster are binding to the same pair so we can be sure that this read was accurate.

This process uses a unique glass slide called a cytometer and a detector inside the sequencing machine. A cytometer for this article can be thought of as a glass slide with a grid layout. The detector can be considered as a particular type of digital camera. The grid layout corresponds to the detector’s pixels, which results in each cluster of DNA being sequenced getting its own set of pixels to be extracted from the image of the detector. These images are processed to extract each base pair in the sequence and the quality of that base pair.

At this point, we have vast amounts of data in the format of smaller individual reads. These reads are of small subsections of the DNA strands; they must be assembled back into larger chunks called contigs (short for contiguous); the contigs are then assembled back into the target genes.


So how do we assemble these reads back into the more extensive sequences?

Sequence Reads - image by author
Sequence Reads – image by author

It’s a simple enough concept. We just overlay the short reads and look for overlap, where they overlap for a significant enough length, then we assume that they are contiguous lengths of DNA. This process is done repetitively, assuming there are enough reads and those reads are of high enough quality; eventually, the process results in correctly reconstructing the genome in question. This brute force way would take weeks to assemble even a simple single genome. There must be a better way.


Putting together these short reads into a target sequence is not as difficult as it sounds! There have been a few crucial developments over the years in this space. The first significant development in solving this problem was treating the construction of the reads into contiguous sequences as a travelling salesman problem. To accomplish this, a graph is constructed from the individual reads. These represent each read as a node, and each edge represents an overlap between the connected nodes (reads). Then, all we have to solve is a walk over the given graph that visits all nodes exactly once! This process of visiting all nodes exactly once is formally referred to as a Hamiltonian walk. There are many methods to solve a Hamiltonian walk, and many are quite fun to implement.


I will take this opportunity to go on a quick tangent here and talk about a unique instance of the Hamiltonian walk problem. It is referred to as the travelling salesman problem.

Consider a graph network where nodes represent cities, and the edges between cities represent the distances. The idea is to visit each city only once and travel the least amount of distance. My favourite way to solve the travelling salesman problem. Genetic Algorithms are a fun option for solving Hamiltonian walks! Genetic algorithms are a class of machine learning model that simulate a population of possible solutions to the problem, and optimises them by "breeding" and "mutating" them over many generations. The notebook on Kaggle that I have linked below was from an old competition whereby participants were asked to optimise Santa’s route to distribute presents. This question essentially boiled down to an aptly themed for the time of the year travelling salesman problem. To summarise the code in this notebook, a population of solutions are constructed. This is an ordered list of nodes chosen at random without replacement. The distance between each node is summed by stepping through this list to produce a final fitness score for that individual. The best individuals are then copied to the next generation, along with mutation versions of their solutions. Some of the individual nodes visited are swapped around to produce a new list of nodes visited. Through this selective promotion to the next generation and mutation, the algorithm eventually converges on a solution.

Genetic Reindeer ftw


Returning to the central theme of this article, the unfortunate truth of Hamiltonian walks is that all methods to solve are computationally intensive. For example, a brute force approach would need to evaluate at worse N! possible Hamiltonian paths where N represents the number of nodes in the graph network. You can see how this quickly gets out of hand as the number of nodes in the graph increases.

Is there a better way to reconstruct this graph to give us a more simple problem to solve? Yes! There is a way of reframing this problem that results in a problem with a proven and much less expensive solution. What if we took the overlaps and represented them as the nodes and took the reads and represented them as the edges? This beautiful and simple rearrangement of the graph network gives us a new problem to solve. Instead of visiting each node only once, we now want to visit each edge only once. The task of arranging the short DNA reads into longer contiguous sections is now represented not as solving a Hamiltonian walk but as solving an Eulerian path!

Seven Bridges of Königsberg – Wikipedia

Euler classically defined an Eulerian path in 1736 as they proved the seven bridges of Königsberg problem was unsolvable. The problem is stated as: Is it possible to walk all seven bridges of Königsberg only once starting from anywhere? Euler struggled to solve this, and try as he might, he couldn’t find a solution, eventually deciding it was impossible. However, being the rigorous mathematician they were, simply declaring a problem unsolvable was not good enough! Euler devised a mathematical proof by expressing the situation as a graph network. This proof essentially boiled down to the following statement (when talking about an undirected graph):

An Eulerian path is only solvable if the graph is Eulerian, meaning that it has either zero or two nodes with an odd number of edges.

Intuitively, the above statement can be thought of as the following. If you enter a node via an edge and leave via another edge, all nodes need an even number of edges. Extending upon this line of thought, there are two special cases when a node can have an odd number of edges, the node you start the walk on and the node you finish the walk on!

Eulerian paths can be solved in linear time using Hierholzer’s algorithm! This is a vast improvement over the Hamiltonian walk, and implementation of the algorithm is much simpler! As this article is already getting very long, I won’t go over Hierholzer’s algorithm’s individual steps. I would, however, like to leave this as a small game for the user!

Can you work out what this algorithm might be without looking it up (I have provided a link to a description of the algorithm below)? I drew similar comparisons from when learning about this algorithm was a particular solution to finding a path through a maze.

Hint: Assuming there is a solution to a given maze, holding your hand on the left (or right) wall and walking continuously will eventually result in you finding a solution the maze.

Eulerian path – Wikipedia

I had a lot of fun learning about this, and I hope this article has piqued your interest to look further into this fascinating area of Data Science! Let me know in the comments about other exciting applications of graph networks!


Related Articles