Affimer Proteins: Next Generation Sequencing Data Analysis (Part 2)

Amino acid translation and finding Affimer loop regions

Rob Harrand

Published in

Towards Data Science

7 min read · Jul 20, 2021

In part 1 we looked at what Affimer molecules are and started to look at basic data analysis in R relating to DNA sequences. Next we’ll go deeper into the data, searching for specific Affimer molecule loops, and looking at how different ‘reading frames’ can affect the search for specific sequences.

DNA -> Amino Acids

Ultimately, we’re not actually interested in the DNA sequence of the Affimer binders. Instead, it’s the amino acid (AA) sequence that’s of interest. The reason for this is that it’s the amino acids that form the protein, which in turn gives rise to the physical characteristics (and binding properties) of the Affimer molecules (specifically, the loop regions).

The physical process of DNA to AA translation starts with a process called transcription. First, an enzyme called RNA polymerase unwinds the double-helix structure of the DNA molecule, exposing the DNA bases. Next, the enzyme creates a complementary single-stranded molecule, called messenger RNA (mRNA), meaning that each C on the DNA strand becomes a G in the mRNA strand, each G becomes a C, each T becomes an A, and each A becomes a U (note that the ‘U’ replaces the ‘T’ that’s present in DNA).
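As a rough illustration of the base pairing just described, the mapping can be written as a simple character substitution in base R. The function name `transcribe_template` is hypothetical, and the sketch deliberately ignores strand direction and any later processing of the mRNA.

```r
# Sketch only: map each base on the DNA template strand to its mRNA partner.
# C -> G, G -> C, T -> A, A -> U (uracil takes the place of thymine in RNA).
transcribe_template <- function(template_dna) {
  chartr("CGTA", "GCAU", toupper(template_dna))
}

mrna <- transcribe_template("TACGGATTC")
print(mrna)
```

Note that `chartr` swaps characters position by position, so the output has the same length as the input.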

Further steps occur to mature this ‘pre-mRNA’ into the final form of the mRNA, and then the mRNA sequence is read by structures called ribosomes. This reading process works by looking at sets of 3 bases at a time (known as codons), each of which either encodes a single amino acid or tells the ribosome to stop (known as a stop codon). The string of amino acids continues to build until a stop codon is reached, with the resulting chain of amino acids being the final generated protein.
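The codon-by-codon reading described above can be sketched in base R. This is a minimal illustration, not how the Biostrings `translate` function is implemented: the names `codon_table` and `translate_until_stop` are hypothetical, the table is the standard genetic code, and the input is written as DNA (T rather than U) to match the rest of this post.

```r
# Build the standard genetic code as a named vector (codon -> amino acid).
# Bases are ordered T, C, A, G and the third base of the codon varies fastest.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codon_table <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(codon_table) <- paste0(grid$b1, grid$b2, grid$b3)

# Read codons from the first base and stop at the first stop codon ("*").
translate_until_stop <- function(dna) {
  n_codons <- nchar(dna) %/% 3
  starts <- seq(1, by = 3, length.out = n_codons)
  codons <- substring(dna, starts, starts + 2)
  protein <- character(0)
  for (codon in codons) {
    aa <- codon_table[[codon]]
    if (aa == "*") break          # stop codon: the chain ends here
    protein <- c(protein, aa)
  }
  paste(protein, collapse = "")
}

print(translate_until_stop("ATGGCTTGGTAAGGG"))
```

The example input contains the codons ATG, GCT, TGG and then the stop codon TAA, so the final GGG is never read.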

Incredibly, depending on whether the amino acid conversion starts on the first, second or third base, you can end up with a completely different collection of 3-base sets, and therefore completely different amino acids and, consequently, a completely different protein. These are called reading frames and we’ll return to them later.
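To make the idea of different 3-base sets concrete, here is a small base R sketch that splits the same DNA string into codons from each of the three possible starting positions. The function name `split_codons` and the short example sequence are made up for illustration.

```r
# Split a DNA string into complete codons, starting at base 1, 2 or 3.
# Any bases left over at the end (fewer than three) are dropped.
split_codons <- function(dna, frame = 1) {
  n_codons <- (nchar(dna) - frame + 1) %/% 3
  starts <- seq(frame, by = 3, length.out = n_codons)
  substring(dna, starts, starts + 2)
}

dna <- "ATGGCTTGGTAA"
for (frame in 1:3) {
  cat("Frame", frame, ":", split_codons(dna, frame), "\n")
}
```

The three frames share no codon boundaries, which is why they translate to unrelated amino acid sequences.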

Below is a diagram of the codon to amino acid conversion (start from the centre and move outwards, selecting 3 bases). Note that different codons can encode the same amino acid. As a result, DNA -> AA translation is straightforward, but AA -> DNA back-translation is ambiguous (there is no way to know which DNA sequence was behind a given AA sequence). This has implications for data management, i.e. AA sequences and their underlying DNA sequences must always be retained and linked in some way.
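The ambiguity of going from amino acids back to DNA can be quantified with a short base R sketch. It assumes the standard genetic code; the names `codon_table`, `codons_by_aa` and `count_encodings` are hypothetical helpers, not part of any package used in this post.

```r
# Standard genetic code as a named vector (codon -> amino acid).
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codon_table <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(codon_table) <- paste0(grid$b1, grid$b2, grid$b3)

# Group the 64 codons by the amino acid (or stop signal) they encode.
codons_by_aa <- split(names(codon_table), codon_table)
print(codons_by_aa[["L"]])   # all codons that encode leucine

# Number of distinct DNA sequences that could encode a given peptide:
# the product of the codon counts for each residue.
count_encodings <- function(peptide) {
  residues <- strsplit(peptide, "")[[1]]
  prod(lengths(codons_by_aa)[residues])
}

print(count_encodings("NINE"))
```

Even a four-residue peptide has many possible underlying DNA sequences, which is why the original DNA has to be stored alongside the translated sequence.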

Now let’s translate from DNA to AA using the translate function. This function is applied to each DNA sequence, and then the data is combined into a dataframe (a 2D data-structure commonly used in R),

#AA translation
library(Biostrings) #Provides translate(), matchPattern() and DNAString()

#dna_1 and dna_2 are the forward and reverse reads loaded in part 1
aa_1 = translate(dna_1) #Translate
aa_1_dt = as.data.frame(aa_1) #Add to a dataframe

aa_2 = translate(dna_2) #Translate
aa_2_dt = as.data.frame(aa_2) #Add to a dataframe

#Create a label for the direction,
aa_1_dt$read_direction = 'F'
aa_2_dt$read_direction = 'R'

#Combine the two dataframes,
aa_all = rbind(aa_1_dt, aa_2_dt)
colnames(aa_all) = c('amino_acid_seq', 'read_direction')

Here is what we now have (the first 6 items),

head(aa_all)
## amino_acid_seq                                      read_direction
## 1 YGKLEAVQYKTQVLANINETYNINESTNYYIKVRAGDNKYMHLKVFNGPFI F
## 2 YGKLEAVQYKTQVLANINETYNINESTNYYIKVRAGDNKYMHLKVFNGPFI F
## 3 YGKLEAVQYKTQVLAEIGHTYTHREESTNYYIKVRAGDNKYMHLKVFNGPT F
## 4 YGKLEAVQYKTQVLATWENTYSTNYYIKVRAGDNKYMHLKVFNGPFIFTYN F
## 5 YGKLEAVQYKTQVLANINETYNINESTNYYIKVRAGDNKYMHLKVFNGPTH F
## 6 YGKLEAVQYKTQVLATWENTYSTNYYIKVRAGDNKYMHLKVFNGPNINETY F

We can also see some of the reverse sequences (the last 6 items),

tail(aa_all)
## amino_acid_seq                                         read_direction
## 995 VESTNYYIKVRAGDNKYMHLKVFNGPTHIRTYFIVEADRVLTGYQVDKNKD R
## 996 ESTNYYIKVRAGDNKYMHLKVFNGPNINETYSEVENADRVLTGYQVDKNKD R
## 997 ESTNYYIKVRAGDNKYMHLKVFNGPNINETYSEVENADRVLTGYQVDKNKD R
## 998 ESTNYYIKVRAGDNKYMHLKVFNGPNINETYSEVENADRVLTGYQVDKNKD R
## 999 VESTNYYIKVRAGDNKYMHLKVFNGPTHIRTYFIVEADRVLTGYQVDKNKD R
## 1000 ESTNYYIKVRAGDNKYMHLKVFNGPNINETYSEVENADRVLTGYQVDKNKD R

Our DNA sequences are now AA sequences, and we’ve labeled them either ‘forwards’ or ‘reverse’ (which we’ll need later).

Loop Hunting

The next question is, how do you find the two loop regions amongst those long lines of amino acids? The answer is that each Affimer molecule has encoded in it (from the starting DNA that goes into the phages) specific, short sequences that act like way-markers for the loops. There is one before and one after each of the two loops, i.e. 4 in total. Let’s define these ‘loop-pads’ in a dataframe,

loop_pads = data.frame(type = 'demo',
                       l2_before = 'KTQVLA',
                       l2_after = 'STNYYI',
                       l4_before = 'KVFNGP',
                       l4_after = 'ADRVLT')

loop_pads
## type l2_before l2_after l4_before l4_after
## 1 demo KTQVLA STNYYI KVFNGP ADRVLT

(Note that I’ve called these ‘demo’, as we’re looking at a demonstration dataset).

Let’s do a search to see if we can find the first loop-pad. Below we’re setting the search pattern to the ‘before loop 2’ loop-pad, and the subject to be the first sequence in the dataframe,

matchPattern(pattern = loop_pads$l2_before,
             subject = aa_all$amino_acid_seq[1])
## Views on a 51-letter BString subject
## subject: YGKLEAVQYKTQVLANINETYNINESTNYYIKVRAGDNKYMHLKVFNGPFI
## views:
## start end width
## [1] 10 15 6 [KTQVLA]

It’s found it, starting at position 10 and ending at position 15. What about the ‘after loop 2’ loop-pad?

matchPattern(pattern = loop_pads$l2_after,
             subject = aa_all$amino_acid_seq[1])
## Views on a 51-letter BString subject
## subject: YGKLEAVQYKTQVLANINETYNINESTNYYIKVRAGDNKYMHLKVFNGPFI
## views:
## start end width
## [1] 26 31 6 [STNYYI]

Again, it’s found.

We now know that the ‘loop 2’ region lies between position 16 (one after the end of the first loop-pad) and position 25 (one before the start of the second loop-pad).

It’s then trivial to extract this loop, using the substr function and specifying the start and end positions,

substr(aa_all$amino_acid_seq[1], 16, 25)
## [1] "NINETYNINE"

As you can see, we’ve found the word ‘NINETYNINE’. This is clearly an artifact of our demo data, but in a real dataset, this would be our ‘loop 2’ amino acid sequence.
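The same position arithmetic can be applied to many reads at once. The sketch below uses only base R (`regexpr` and `substr`) rather than `matchPattern`; the function name `extract_loop` is hypothetical and the three reads are copied from the output shown earlier. It is a simplified illustration that uses the first occurrence of each loop-pad and returns `NA` when either pad is missing.

```r
# Extract the region between two loop-pads for a vector of AA sequences.
# Returns NA where either pad is absent or the pads are in the wrong order.
extract_loop <- function(seqs, before, after) {
  before_start <- as.integer(regexpr(before, seqs, fixed = TRUE))
  after_start  <- as.integer(regexpr(after, seqs, fixed = TRUE))
  loop_start <- before_start + nchar(before)  # one after the end of the first pad
  loop_end   <- after_start - 1               # one before the start of the second pad
  found <- before_start > 0 & after_start > 0 & loop_end >= loop_start
  ifelse(found, substr(seqs, loop_start, loop_end), NA_character_)
}

reads <- c(
  "YGKLEAVQYKTQVLANINETYNINESTNYYIKVRAGDNKYMHLKVFNGPFI",
  "YGKLEAVQYKTQVLATWENTYSTNYYIKVRAGDNKYMHLKVFNGPFIFTYN",
  "ESTNYYIKVRAGDNKYMHLKVFNGPNINETYSEVENADRVLTGYQVDKNKD"
)

loop2 <- extract_loop(reads, "KTQVLA", "STNYYI")
loop4 <- extract_loop(reads, "KVFNGP", "ADRVLT")
print(data.frame(loop2, loop4))
```

In this small sample the forward reads contain both ‘loop 2’ pads but end before the ‘after loop 4’ pad, while the reverse read contains both ‘loop 4’ pads but not the ‘before loop 2’ pad.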

Reading frames

As discussed above, DNA uses sets of three nucleotides (codons) to encode amino acids. The consequence of this is that translation can start from either the first, second or third nucleotide, leading to three so-called reading frames. Crucially, each reading frame will lead to completely different amino acids being translated.

Start codons, which tell the ribosome where to begin translation, dictate which individual codon to begin at. This sets what’s known as the open reading frame, i.e. the reading frame that is translated into amino acids and then a protein. Sometimes, insertions or deletions of bases shift this frame, causing what’s called a ‘frameshift mutation’; such mutations are responsible for a number of severe diseases.
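A start codon fixing the reading frame can be sketched in base R as follows. This is a deliberately simplified illustration: it takes the first ATG in the string as the start codon and translates until the first stop codon using the standard genetic code. The names `codon_table` and `first_orf` and the short example sequence are hypothetical, and real reads may not contain the start codon at all.

```r
# Standard genetic code as a named vector (codon -> amino acid).
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codon_table <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(codon_table) <- paste0(grid$b1, grid$b2, grid$b3)

# Find the first ATG, then translate codon by codon until a stop codon.
first_orf <- function(dna) {
  start <- as.integer(regexpr("ATG", dna, fixed = TRUE))
  if (start < 0) return(list(frame = NA_integer_, protein = NA_character_))
  n_codons <- (nchar(dna) - start + 1) %/% 3
  starts <- seq(start, by = 3, length.out = n_codons)
  aas <- codon_table[substring(dna, starts, starts + 2)]
  stop_at <- match("*", aas)
  if (!is.na(stop_at)) aas <- aas[seq_len(stop_at - 1)]
  # The frame (1, 2 or 3) follows from the position of the start codon.
  list(frame = as.integer((start - 1) %% 3 + 1),
       protein = paste(aas, collapse = ""))
}

orf <- first_orf("CCATGGCTTGGTAAGG")
print(orf)
```

Here the ATG begins at the third base, so the start codon selects the third reading frame. Inserting or deleting a single base after the ATG would shift every downstream codon boundary.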

In our data, it’s important to ascertain which reading frame encodes the required Affimer protein sequence. Let’s see this in action. First, we’ll create a random, example DNA sequence,

dna_eg = DNAString('TGATATACGGATCGATGCATTCAGGACGCTCTGCTGGATAAGAACACCCTGTGGAAAACCATGTACTACCTGACC')
dna_eg
## 75-letter DNAString object
## seq: TGATATACGGATCGATGCATTCAGGACGCTCTGCTGGATAAGAACACCCTGTGGAAAACCATGTACTACCTGACC

Next, we’ll translate it into amino acids as before,

aa_eg_frame1 = translate(dna_eg) #Translate, frame 1
aa_eg_frame1
## 25-letter AAString object
## seq: *YTDRCIQDALLDKNTLWKTMYYLT

We go from a 75-letter DNA sequence to a 25-letter AA sequence (as expected). Now, let’s say our loop-pad is the sequence ‘ENHV’. Below we’ll search for it in our AA sequence,

matchPattern(pattern = 'ENHV',
             subject = as.character(aa_eg_frame1))
## Views on a 25-letter BString subject
## subject: *YTDRCIQDALLDKNTLWKTMYYLT
## views: NONE

Nothing is found. Next, let’s remove the first nucleotide and translate again,

#Trim the first base from the sequence
aa_eg_frame2 <- DNAStringSet(dna_eg, start=2)
#Translate, frame 2
aa_eg_frame2 = translate(aa_eg_frame2)
## Warning in .Call2("DNAStringSet_translate", x, skip_code,
## dna_codes[codon_alphabet], : last 2 bases were ignored
aa_eg_frame2
## AAStringSet object of length 1:
## width seq
## [1] 24 DIRIDAFRTLCWIRTPCGKPCTT*

We now get a completely different AA sequence. Notice that the translate function also gives us a warning, stating that the last two bases were ignored. This makes sense, because we’ve dropped a base at the start and no-longer have a DNA sequence length that’s a multiple of 3. Next we’ll search again,

matchPattern(pattern = 'ENHV',
             subject = as.character(aa_eg_frame2))
## Views on a 24-letter BString subject
## subject: DIRIDAFRTLCWIRTPCGKPCTT*
## views: NONE

Still nothing. Finally, we’ll drop the first 2 bases from the original and redo everything,

#Trim the first 2 bases from the sequence
aa_eg_frame3 <- DNAStringSet(dna_eg, start=3)
#Translate, frame 3
aa_eg_frame3 = translate(aa_eg_frame3)
## Warning in .Call2("DNAStringSet_translate", x, skip_code,
## dna_codes[codon_alphabet], : last base was ignored
aa_eg_frame3
## AAStringSet object of length 1:
## width seq
## [1] 24 IYGSMHSGRSAG*EHPVENHVLPD

Now we see a warning about a single trailing base. Finally, we search again,

matchPattern(pattern = 'ENHV',
             subject = as.character(aa_eg_frame3))
## Views on a 24-letter BString subject
## subject: IYGSMHSGRSAG*EHPVENHVLPD
## views:
## start end width
## [1] 18 21 4 [ENHV]

This time the sequence is found. In other words, the loop-pad sequence was encoded in the 3rd reading frame. This highlights the fact that choosing the correct reading frame is absolutely crucial when searching for amino acid sequences starting from a DNA sequence.
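The three manual searches above can be wrapped into a single helper. The sketch below is written in base R with the standard genetic code, so it does not rely on `translate` or `matchPattern`; the names `codon_table`, `translate_frame` and `find_in_frames` are hypothetical. It only covers the three frames of the given strand, not the reverse complement.

```r
# Standard genetic code as a named vector (codon -> amino acid).
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codon_table <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(codon_table) <- paste0(grid$b1, grid$b2, grid$b3)

# Translate a DNA string starting at base 1, 2 or 3 (trailing bases are dropped).
translate_frame <- function(dna, frame = 1) {
  n_codons <- (nchar(dna) - frame + 1) %/% 3
  starts <- seq(frame, by = 3, length.out = n_codons)
  paste(codon_table[substring(dna, starts, starts + 2)], collapse = "")
}

# Position of an AA pattern in each reading frame (-1 means not found).
find_in_frames <- function(dna, pattern) {
  sapply(1:3, function(frame) {
    as.integer(regexpr(pattern, translate_frame(dna, frame), fixed = TRUE))
  })
}

dna_eg <- "TGATATACGGATCGATGCATTCAGGACGCTCTGCTGGATAAGAACACCCTGTGGAAAACCATGTACTACCTGACC"
hits <- find_in_frames(dna_eg, "ENHV")
names(hits) <- paste0("frame", 1:3)
print(hits)
```

For the example sequence, the pattern is only found in the third frame, at amino acid position 18, in agreement with the `matchPattern` output above.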

Summary

In this post we’ve taken the basic data analysis further, searched for loops, and have seen the importance of using the correct reading frame. In the third and final post we’ll see how ‘unique molecular identifiers’ can help eliminate NGS read errors, how Affimer loop frequencies change over several ‘panning rounds’, and wrap up with a look at Affimer protein applications.