Chaos Game Representation of Genetic Sequences

Building GUI for Chaos Game Representation for Genomic Sequence and vectorizing C.G.R for measuring closeness and application of clustering algorithms.

Rohith Ramakrishnan
Towards Data Science

--

Photo by National Cancer Institute on Unsplash

In this article, we will be dwelling upon the application of C.G.R in Bio-informatics. The Chaos Matrix will be then converted into an N-Dimensional Vector for measuring closeness and application of clustering algorithms. A GUI will also be built for representing a C.G.R plot for a family of sequences.

Chaos game representation was a novel method introduced in 1990 by H J Jeffery. It is an iterative methodology for conversion on One Dimensional Text Sequence into a visual Two Dimensional matrix, essentially the application of non-random input to an iterated function system. Initially introduced for Genomic sequences, now it is being used for any arbitrary symbol or alphabet containing ASCII characters from code 32 through 126, which consists of the alphabet in upper and lower case and punctuation.

Genomic Sequences are often Gigabytes worth permutations of text represented via A, T, G, and C. Numerous methodologies over the years have been developed for comparing sequences, which are complex to the human eye. C.G.R provides a visual 2D plot that can be instantly acknowledged and compared with other Sequence’s C.G.R plots. C.G.R exhibits striking pattern characteristics of the genome. Since the patterns are unique to genomes this can be used to identify genome fragments or to detect horizontal gene transfers. In addition, quantitative analysis of C.G.R patterns can be used for sequence comparison both alignment-free and alignment-based sequence comparisons are possible with C.G.R.

Algorithm Implementation

In this section, we will be diving into how we could create a C.G.R plot for any given sequence from scratch in MATLAB and Python. I had come across two different methods for the iterative algorithm and both of these will be implemented in either one of the languages. The base for both algorithms remains the same,

  • Start from the origin of the two-dimensional coordinate systems, usually (0,0)
  • Each Nucleotide has its predefined location:

A: (-1, 1) — T: (1, -1) — G: (1, 1) — C: (-1, -1)

  • For each nucleotide (in reverse order to match k-mer table; i.e., match forward Markov chain), move and mark the new location which is halfway between the current location and the nucleotide. For example, if the last letter is T, the position is moved from (0,0) to midpoint between (1, -1) and (0,0), which is (0.5, -0.5)
  • Repeat this procedure for all nucleotides

The above-mentioned implementation depends on the k-mers generated and will vary with the length of k-mers chosen.

Chaos Game Representation

The above code tested with Human betacoronavirus fasta files with a k value of 9 (k value depicts the length of each k-mer) yielded these plots.

C.G.R Plots for the betacoronavirus

The second implementation done in MATLAB is independent of the k-mers and solely depends on each nucleotide.

Hoang,T., Yin, C., & Yau, S. S. T. (2016). Numerical encoding of DNA sequences by Chaos Game Representation with application in similarity comparison. Genomics, Vol 107, 2016, Elsevier Inc. Journal of Theoretical Biology

The above code tested with Human betacoronavirusfasta’ files yielded these plots.

Chaos Game Plots of two coronavirus sequences

GUI Implementation

Photo by Sigmund on Unsplash

The implemented python file will be the skeleton for the GUI. Streamlit is an open-source app framework for Machine Learning and Data Science teams. Create beautiful data apps in hours, not weeks. All in pure Python.

The link to the datasets used can be found here.

The given dataset consists of numerous families of sequences in ‘xlsx’ format, each of these families consist of numerous sequences that belong to the respective family.

Peek into one of the datasets.

The GUI allows the user to choose any two families from the dataset and choose a sequence among these families which will be converted into C.G.R plots.

Two different methods were adopted for finding similarities between two sequences, Vectorial Correlation and Image Similarity methods. For vectorial correlation, the given 2D chaos matrix was converted into a 1D vector via Linear Combinations of rows. These vectors were correlated via Spearman, Kendal, and Pearson's methods.

SSIM and PSNR indices were used for computing similarities between Images. The computed C.G.R 2D plots will be fed into these methods. SSIM values range from 0 to 1, higher the score similar the images are. PSNR value depicts the signal to noise ratio hence lesser the value closer the images. These methodologies were a part of series of experiments performed but from observations, such methods did not yield a very useful result.

The packages used in python for SSIM are: sckit-ssim & SSIM-PIL

The source code for the implementation can be found here.

Link to the GUI: https://share.streamlit.io/rohith-2/chaos-game-representation_bioseq/stream.py

The Final result of the proposed GUI

Vectorising C.G.R matrices into N-Dimensional Tuples

In an attempt to extend the application of C.G.R to cutting edge ML and Deep Learning models, a methodology published in Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison can be utilized. The objective is to cluster N-Dimensional tuples belonging to the same family to find the ‘distance’ between any two sequences and to identify the family of an unknown genomic sequence.

Vectoriser.m

The proposed idea was attempted by taking 3 genomic families and 20 Sequences in each of these families. For visualization purposes, the tuple dimension was chosen as 3.

Different Views of 3D Plots

Future Work:

This is a basic implementation of the C.G.R, this isn’t optimized for speed or better depiction of higher k-values. The Vectorial Depiction can be utilized at higher dimensions better features which can improve numerous ML model classifications and pattern matching of Genomic Sequences.

Github: https://github.com/Rohith-2/Chaos-Game-Representation_BioSeq

The datasets used in the article were obtained from NCBI.

Co-Authors : anirudh bhaskar and srikanth ankam

References:

  • Hoang,T., Yin, C., & Yau, S. S. T. (2016). Numerical encoding of DNA sequences by Chaos Game Representation with application in similarity comparison. Genomics, Vol 107, 2016, Elsevier Inc. Journal of Theoretical Biology.
  • H. Joel Jeffrey, “Chaos Game Representation of gene structure”, in collection Nucleic Acid Research, Vol. 18, № 8, 1990, pages 2163–2170.

--

--