The world’s leading publication for data science, AI, and ML professionals.

The Art of Science

Creating Insightful and Beautiful Bioinformatics Visualizations

Data Science

There’s a misconception that science is dry, humorless, and overly technical, but that doesn’t need to be the case! Effective visualizations can transform raw DNA into stunning figures that relay a lot of information with clarity.

Biopython is the go-to, open-source Bioinformatics, or computational molecular biology, package for Python. It provides custom classes and methods, parsing for standard biological file formats, and interfacing to other bioinformatics programs, such as BLAST, EMBOSS, or Clustalw. And for the purposes of this article, it can create amazing visuals using biological data.

We’ll start by diving in to Biopython’s methods for dealing with DNA, RNA, and protein sequences. Then we can use these methods for stepping up in complexity to whole genomes, and how those genomes change over time through phylogenetic tree visualizations. This process will allow you to explore any genomes you’re interested in and start drawing insights from raw data!


Sequences

Biopython has a Seq (Sequence) base class which resembles the base Python string class, but with additional methods to facilitate relevant biological use cases. For example, we can load in a DNA sequence.

>>> my_sequence = Seq('AGTCC')

We can use traditional string methods with this object, like iterating over each letter in the string or using len() to get the length of the string.

# Print length
>>> print(len(my_sequence))
5
# Print each basepair in the sequence
>>> for basepair in my_sequence:
>>>     print(basepair)
A
G
T
C
C

We can also select basepairs by their index and slice the Sequence as normal. Slicing will return a new Seq() object, which can be stored in a new variable or overwrite the original sequence.

# Select basepair by index
>>> my_sequence[2]
'T'
# Slice sequence (returns a new Seq() object)
>>> my_sequence[1:4]
Seq('GTC')

Now we will look at some more interesting and biologically useful methods in the Seq() class. We can easily generate the complement and reverse complement of the DNA sequence, and calculate the proportion of G and C basepairs in the sequence.

Complement

When DNA is replicated, each strand is separated and is used to build the other strand. Adenine (A) and thymine (T) are complementary, and guanine (G) and cytosine (C) are complementary. So a sequence of ‘ACTG’ will have a complementary sequence ‘TGAC’. In RNA, the base uracil (U) is used instead of thymine (T), so a DNA sequence of ‘ACTG’ read during transcription will yield an RNA sequence of ‘UGAC’.

>>> my_sequence = Seq('AGTCC')
# Generate complement sequence 
>>> my_sequence.complement()
Seq('TCAGG')

Reverse Complement

During transcription, the DNA is actually read from the template strand running 3′ →5′ and the reverse complement gives the mRNA sequence. However, we generally analyze DNA from the coding strand running 5′ →3′.

While the biological process is actually two-steps, first take the reverse complement of the template strand then transcribe into RNA. We can streamline the process by working from the coding strand and substituting the T basepairs for U. Either solution results in an mRNA sequence.

# Generate reverse complement sequence
>>> my_sequence.reverse_complement()
Seq('GGACT')
# Two step solution from template strand
>>> coding_strand = Seq('ATGTAG')
>>> template_strand = coding_strand.reverse_complement()
>>> template_strand.reverse_complement().transcribe()
Seq('AUGUAG')
# From coding strand
>>> coding_strand = Seq('ATGTAG')
>>> coding_strand.transcribe()
Seq('AUGUAG')

GC-content

Biopython is also able to calculate the ratio of GC bases to AT (or U) bases in a DNA or RNA sequence. This is biologically relevant because the GC-content varies among species, therefore it can be used to identify unknown genomes against known species’ GC-content.

Within a genome, coding regions (often genes) generally have a higher GC-content than non-coding regions. Thus, the GC-content can be used to identify genes in an unknown genome.

# Calculate GC proportion
>>> from Bio.SeqUtils import GC
>>> GC(my_sequence)
60.0

Translation

Now that we’ve generated an mRNA from our DNA, we need to translate to an actual protein sequence! Much like transcription, Biopython offers several ways to achieve this, either in the biologically correct method or more efficiently from the coding strand DNA.

# In the biologically correct way after creating the mRNA sequence
>>> mRNA = Seq('AUGGCCAUUGUA')
>>> mRNA.translate()
Seq('MAIV')
# From the coding strand directly
>>> coding_strand = Seq('ATGGCCATTGTA')
>>> coding_strand.translate()
Seq('MAIV')

It’s important to note that different taxonomic groups require different tables for correctly synthesizing protein sequences from mRNA. See here for NCBI resources on genetic codes. A keyword argument can be specified in the .translate() method – .translate(table = "Vertebrate Mitochondrial"), for example.


Genomic Data

Now that we have an understanding of how sequences work in Biopython, let’s move to a higher level and try to visualize whole genomes. There’s a module called GenomeDiagram, specifically designed for visualizing genomes in a linear or circular diagram (for most prokaryotes).

Biopython has several example genomes that can be downloaded from GitHub. Genomes of interest can also be downloaded from GenBank, the NIH genetic sequence database, in a variety of file formats (such as .gbk or .fasta).

# Import Libraries
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
# Read in our genome
record = SeqIO.read("NC_005816.gb", "genbank")
gd_diagram = GenomeDiagram.Diagram(record.id)
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()
# We can color code every other gene for clarity
for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(
        feature, sigil="ARROW", color=color, label=True, label_size=14, label_angle=0)
# We can add in restriction sites for some popular enzymes 
for site, name, color in [
    ("GAATTC", "EcoRI", colors.green),
    ("CCCGGG", "SmaI", colors.orange),
    ("AAGCTT", "HindIII", colors.red),
    ("GGATCC", "BamHI", colors.purple)]:
    index = 0
    while True:
        index = record.seq.find(site, start=index)
        if index == -1:
            break
        feature = SeqFeature(FeatureLocation(index, index + len(site)))
        gd_feature_set.add_feature(
            feature,
            color=color,
            name=name,
            label=True,
            label_size=10,
            label_color=color)
        index += len(site)
# Create our diagram!
gd_diagram.draw(
    format="circular",
    circular=True,
    pagesize=(20 * cm, 20 * cm),
    start=0,
    end=len(record),
    circle_core=0.5)
# Can save to any standard photo format
gd_diagram.write("plasmid_circular_nice1.png", "PNG")

Now we can start making some beautiful visualizations! Here we see the genome for Yersinia pestis biovar Microtus, an avirulent strain of the bubonic plague, developed as a vaccine. Genes are shown in blue and light blue, and restriction enzyme sites are noted by the colored lines.

Image by Author
Image by Author

Visualizing a more complex genome, a thale cress (Arabidopsis thaliana) chloroplast, shows many more genes and dozens more EcoRI restriction sites. This genome is 154,478 basepairs, compared to just 9609 basepairs above.

Image Courtesy of Brona on Wikipedia under a CC BY-SA 3.0 license.
Image Courtesy of Brona on Wikipedia under a CC BY-SA 3.0 license.

This is a thale cress. It is a model organism in genetics. So we can specify the genome from GenBank using the same base code as above to generate a circular DNA sequence.

# Specify the thale cress chloroplast genome
record = SeqIO.read("NC_000932.gb", "genbank")
...
# Only including the EcoRI sites (green)
for site, name, color in [("GAATTC", "EcoRI", colors.green)]:

    ...
Image by Author
Image by Author

This is about as complex as a genome can be to present information with any sort of clarity. But, it’s also quite visually striking. Indeed, if we go back and add in more restriction enzyme sites, like we did in the first genome, we get an ever more beautiful visual.

...
for site, name, color in [
    ("GAATTC", "EcoRI", colors.green),
    ("CCCGGG", "SmaI", colors.orange),
    ("AAGCTT", "HindIII", colors.red),
    ("GGATCC", "BamHI", colors.purple)]:
...
Image by Author
Image by Author

Phylogeny

Now that we’ve work from raw DNA sequences to whole genomes, we know that genomes aren’t as stable as we’d like. Mutations are frequent, especially in viruses. A major difficulty in developing vaccines for viruses is their frequently changing genome.

We can track how variants arise over time by comparing novel sequences to a base sequence. For the SARS-CoV-2 (Covid-19) virus, the base sequence is the initial strain from Wuhan, China.

Using another tool for visualizing phylogeny, Nextstrain has several packages that allow public health officials and researchers to analyze new strains of Covid-19 in the local community. You can then build animated visuals for a variety of features, for example the number of mutations, country of origin, genotype, clades, etc.

You can follow this guide for generating your own Covid-19 phylogeny. You’ll want to create a new conda environment for Nextstrain (covered in the guide) because you’ll be downloading a lot of dependencies.

SARS-cov-2 phylogeny from January to May. Generated with Nextstrain's Auspice visualization tool. GIF by Author.
SARS-cov-2 phylogeny from January to May. Generated with Nextstrain’s Auspice visualization tool. GIF by Author.

Conclusion

Biopython is a fundamental package for bioinformatics in Python. I hope this post has given a taste of what it can do to analyze sequences, transcribe and translate DNA → mRNA → protein sequence, visualize whole linear and circular genomes, and visualizing chromosomes. I’ll likely be posting additional articles utilizing Biopython in the future.

Biopython has extensive documentation and is able to take advantage of NCBI resources like whole genomes, variant strain genomes, and chromosomes through FASTA files. So play around with genomes of interest and see what you can make! You might just end up with an effective and stunning visualization.


Connect

I’m always looking to connect and explore other projects! You can follow me on GitHub or LinkedIn, and check out my other stories on Medium. I also have a Twitter!


Related Articles