Genome Assembly using de Bruijn Graphs

Uditha Maduranga
Towards Data Science
5 min readJul 6, 2020

--

Source

After genome sequencing, we can figure out the order of DNA nucleotides, or bases, in a genome — the order of As, Cs, Gs, and Ts that make up an organism’s DNA. The whole genome can’t be sequenced all at once because available methods of DNA sequencing can only handle short stretches of DNA at a time. So instead, scientists must break the genome into small pieces (short reads), sequence the pieces, and then reassemble them in the proper order to arrive at the sequence of the whole genome.

Genome assembly can be described as a computational process of drawing together numerous short sequences called reads which were mentioned earlier, derived from different portions of the target DNA within the cell of an organism. This is an algorithm-driven automated process. DNA-sequence-assembly programs have utilized sequence overlaps for sequence assembly in the correct order.

figure 1- https://www.yourgenome.org/

There are many challenges in implementing genome assembly. One of the major challenges is dealing with erroneous reads. Whatever the sequencing technology that is been used, the reads will not be 100% accurate. To overcome this sequencing the same section of DNA a number of times over can be used. The number of times a section got sequenced is called coverage. More the coverage, more the confidence level would be. So having higher coverage results in lots and lots of short reads of the DNA sequence. These short reads are mixed up like small pieces of a puzzle. They should be organized together and put them in the correct order to assemble the genome sequence. So assembling the short reads to get to the original genome is an uphill task.

De novo Genome Assembly

These are a type of programs that assembles short nucleotide sequences into longer ones without the use of a reference genome. In this method of carrying out the assembly, it is assumed that there is no prior knowledge of the source DNA sequence length, layout or composition. The end goal of a sequence assembler is to produce long contiguous pieces of sequence (contigs) from the short reads. The contigs are sometimes then ordered and oriented in relation to one another to form scaffolds. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers. In greedy approach, it aims for local optima while in graph methods it aims for global optima. We will be talking about the latter one in this article.

The general de nevo assembly includes scaffolding and gap filling steps as well. Scaffolding is linking together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

figure 2- General workflow of de nevo assembly

De Bruijn graph Assemblers

Earlier an effective assembly method applied to first-generation sequencing data is the overlap–layout–consensus method which involves comparisons of all pairs of reads to identify overlaps. But this was not practical neither formidable to do the computations required to assemble these reads using first-generation methods.

In the 1940s, a Dutch mathematician called Nicolaas de Bruijn became interested in finding the shortest circular string of characters that contains all possible substrings, each of same length, in a given alphabet. The solution he came up with involved constructing a graph with all possible (k − 1)-mers as nodes. Each k-mer was an edge directed from node A to node B if the (k − 1)-mer in node A is a prefix, and that in node B, a suffix of the k-mer.

The answer to the stated problem now was to find a path through the graph that traverses each edge exactly once, or in other words Eulerian trail. Here is an example in which the sequence “ATGCTAGCAC” of the length 10, is assembled from five reads, each of length 6.

figure 3- Process of using de Bruijn assembler

Reads are broken into smaller fragments of a specified size k. In the above example, k corresponds to 3. k-mers are identified and a de Bruijn graph with (k–1)-mers as nodes and k-mers as edges drawn as described in the text. A Eulerian path is traced through this network resulting in the reconstruction of the original genome sequence.

The following steps can be specified in the process of reconstructing the original sequence from the given short reads.

Steps

  1. Take all (k-1)-mers from the set of k-mers, e.g. ATG, TGC-> AT, TG, GC. We should have gone past the size of k-mer reads.
  2. Construct a multi-graph with nodes being k-1-mers; draw an edge between two k-1 mers only if the two k-1 mers are taken from the same read. Ex: GCT & CTA
multigraph with GCT & CTA

3. Graph constructed this way is guaranteed to have a Eulerian trail, follow the trail and connect the nodes to form our original sequence. The graph similar to this will appear.

multigraph to reconstruct sequence ATGCTAGCAC

Now this algorithm can be used to assemble k-mer reads. We can loosen the condition to accept reads above a given length, and to break each k-mer to (k-n)-mers to account for reads that have k-n overlap, instead of k-1 overlap. Further, it is convenient to reconstruct the parts that are easier to assemble (contigs) and leave out parts that are ambiguous.

The following Jupyter Notebook can be used for the purpose of implementing the de bruijn assembler.

Jupyter Notebook for de bruijn assembler

It is quite true that assembly methods based on de Bruijn graphs begin, somewhat counter-intuitively, by replacing each read with the set of all-overlapping sequences of a shorter, fixed-length, but it is a popular method for genome assembly.

Stable and robust software packages, which are free to download, use and modify, have been developed to assemble genomes from short reads using de Bruijn graphs. A popular choice of software called Velvet appears to perform very well in assembling genomes (mostly on bacteria).

Although de bruijn assembler is a popular mean to implement assembling, there still exist some challenges for de bruijn genome assembly. Sequence error, uneven sequencing depth, repetitive sections and computational cost are few of the leading challenges.

References :

--

--

Bioinformatics Analyst | Computer Science and Engineering Graduate | AWS Community Builder | Former Software Engineer at Sysco LABS | maduranga95@gmail.com