The world’s leading publication for data science, AI, and ML professionals.

Writing Useful Functions to Explore Human Genetics with Python

Simple Python Functions can help give an insight into Human Disease

Image Courtesy of National Cancer Institute Via Unsplash
Image Courtesy of National Cancer Institute Via Unsplash

A Python and Sequence Data Example

Many human hereditary neuro-degenerative disorders such as Huntington’s disease (HD) are correlated with the expansion in the number of tri-nucleotide repeats in particular genes. In Genetics, codons, a type of tri-nucleotide repeat, code for amino acids, the building blocks of proteins. Specifically, in Huntington’s disease, the pathological severity is associated with the number of CAA or CAG codon repeats. These codons specify the amino acid glutamine, one of 20 of the small building blocks. As such, HD belongs to a group of diseases collectively referred to as polyglutamine (polyQ) diseases.

More than 35 tandem repeats virtually assures the disease, while the healthy variant of this gene have between ~6–35 tandem repeats.

Fortunately, Python can be used to write simple functions to investigate the mRNA transcripts, the instruction template needed to assemble proteins, and determine the number of tandem repeats of either the CAA/CAG codons.

Aim:

We are fortunate in biology that a lot of the files we may be interested in, from databases such as NCBI are text-based files. This means we can treat them as ordinary Python strings.

For the function that I will demonstrate here, I would like the output to be as informative as possible.

Specifically, I would like to analyze a typical FASTA file of the mRNA transcript which encodes the huntingtin protein. In genetics, a FASTA file is a simple file with a header that contains a unique identifier for the sequence, and the sequence itself below this header, as shown below:

FASTAheader

AGCTCGATACGAGA

I would like to retrieve and work out the Accession or identifier for that file, the DNA length of the sequence. Importantly, I would like the user to choose their own tandem repeat number for the CAA/CAG codons to search for in the file, and have the output inform the user of the number of consecutive runs of CAA/CAG codons above the input tandem number, what those tandem sequences are, and how many repeats were found in each run.

Getting started

To begin lets practice on a small DNA sequence. I have deliberately intercepted this DNA sequence with an ‘X’ to break up the runs of consecutive CAG/CAA codon runs.

Firstly, I convert the DNA sequence to uppercase using the .upper() method. I then use the regex module, where I search for the CAG or CAA codons that occur three or more times. The pipe character, | represents or, and the curly braces, give the limit for how many times these sequences need to be found (here, the upper limit is left off, after the comma).

Therefore, in this example, I want to find runs of CAA/CAG occurring more than three times in the DNA string, meaning if this requirement is satisfied a length of DNA of at least 9 should be found. I then iterate through the matches found, saved in the variable pattern_match, and print the sequences that fulfill this criteria. For clarity, I also append these sequences to a new list and print their lengths.

2 runs are found and both are above the minimum threshold of at least 9.

Working with Files

To open the file I use the open function, using the with statement. I then slice the file, because I do not want the Accession name (header) of the FASTA file, contributing to my DNA count. This means I need to slice the file at element 0, which removes the header, and I then read the sliced file f, being careful to remove any new line characters.

-An example illustrating the impact of not removing new-line characters is shown in the image below. These characters would break the consecutive runs of CAG/CAA, and add extra length to the DNA which would not be representative of the sequence data.

The impact of forgetting to remove new-line characters; the sequence is artificially longer and not representative of the sequence data.
The impact of forgetting to remove new-line characters; the sequence is artificially longer and not representative of the sequence data.

I then use the re.finditer method as described above. For this code snippet to work, it is necessary to import islice and re.

from itertools import islice import re

Shown below is the script, and the console output. As indicated, the accession name is shown, followed by the length of the sequence, how many runs of CAA/CAG greater than three are found, the sequences of those runs and their respective lengths and tandem repeat numbers.

This code however, could be much more informative and easier to interpret. For example, a printed message before each section could inform the user what the output represents. In addition, this code is a little inflexible, as it only works with ‘test.fasta’ and checks for 3 or more consecutive runs of CAA/CAG. To improve upon this script, the most convenient option is to transform it into a function.

The output in the console is hard to interpret, and could include informative text beside.
The output in the console is hard to interpret, and could include informative text beside.

Function Creation

Now that the main body of the code has been written, creating a function will not require too much additional work. As specified, the two parameters that the user should be able to choose for themselves should be the DNA file to analyse and the limit of the tandem repeats.

I define the function huntington_tandem_repeat_dna, and give the function two parameters inside the parentheses. The dna parameter is the file that is opened, and the tri_nucleotide_repeat_num has a default value of 3, in cases where the user decides not to specify the tandem repeat number.

To try the function, lets test two cases. One is a dummy test file I created, and saved with the extension .fasta. The contents of this file is shown below. I have included 2 consecutive runs of CAG/CAA codons in this file. One is 24 base pairs long and the other is 12. I have included this file to validate the function is working as expected.

Boxed: The boxed section refers to a a 24 base pair segment, comprised of 8 tandem runs of CAA/CAG codons.
Boxed: The boxed section refers to a a 24 base pair segment, comprised of 8 tandem runs of CAA/CAG codons.

The files I am running are in my local directory, and as such I have only specified their relative path.

The other DNA sample I run, is the mRNA sequence. Here, I wanted to make sure that the number of base pairs match up when the file is analysed. The NCBI database informs me that the sequence length should be 13,498bp and indeed this is confirmed when the huntington_tandem_repeat_dna function is called with this DNA file. Further to this the accession names match up in both cases confirming that all is working as intended.

The output in the console is explicit.
The output in the console is explicit.

Summary

To go full circle, the original aim has been met. The function reads the file and gives useful output which is clear to understand.

To pick on the second example shown in the image above, we can clearly see, that the DNA length is 13,498 bps, the user has searched for 23 or more consecutive runs of CAG/CAA, but only one run that matches that criteria is found. The sequence of that match is printed, along with its length, and the number of tandem repeats that comprise that run.

The function is now perfectly amendable for scaling up, such as for processing multiple files and writing the output to any particular destination should the user want to add this extra functionality.

I hope this read showcased how writing functions in Python can deliver insight into human genetic diseases. For more code examples, follow me on linkedIn and Medium.


Related Articles