Affimer Proteins

Affimer Proteins: Next Generation Sequencing Data Analysis (Part 3)

Unique molecular identifiers and panning for Affimer Binders

Rob Harrand

Published in

Towards Data Science

12 min readJul 20, 2021

In part 2 we took the basic data analysis further, searched for loops, and saw the importance of using the correct reading frame. In this third and final post we’ll see how ‘unique molecular identifiers’ can help eliminate NGS read errors, how Affimer loop frequencies change over several ‘panning rounds’, and wrap up with a look at Affimer protein applications.

Unique molecular identifiers

Unique molecular identifiers, or UMIs, are ‘molecular barcodes’ (short DNA sequences) that are used to identify sequences from a mix of several different sequences. The idea is simple. First, tag your starting sequences with a UMI. Next, carry these UMIs through the stages of DNA amplification. Finally, in your data, search for these barcodes in order to group sequence ‘families’ together.

At first glance, this may seem like a waste of time. Why search for a UMI to identify a sequence, when you can just look for the sequence itself? The reason is that the NGS reading processes is not perfect, and errors can be introduced along the way. Because the UMIs are relatively short, they’re less prone to these errors. Then, by identifying what are meant to be groups of the same sequence, these sequences can then be compared and something called the consensus sequence determined, by looking for the most common nucleotide at each position. This then reduces the errors introduced by the reading stages. The diagram below gives an overview, showing the resulting sequence families that share UMI sequences,

NGS works by first fragmenting the DNA sequence of interest, attaching two different kinds of molecular binders to end each, and then attaching one end of these fragments to a surface (a flow cell). Polymerase chain reaction (PCR) amplification steps then follow, leading to many copies of each fragment on the surface, including both original (forward) and reverse copies. Finally, individual nucleotides that are fluorescently labeled are added to the flow cell, which then bind and can be consequently imaged and read. In standard DNA applications, the fragment reads are then compared to a reference sequence, and any overlapping regions aligned with an appropriate algorithm. Admittedly, this is a very basic description of a complex process. To learn more, see this link.

During this NGS process, there are both PCR errors and read errors. The read errors are caused by the imaging system’s inability to perfectly resolve and distinguish the different colours from the different nucleotides. As mentioned above, this is where UMIs can help.

UMIs are described using something called IUPAC (International Union of Pure and Applied Chemistry) ambiguity codes. These are codes that either designate a specific nucleotide, or one of a selection of nucleotides. We can see that these by calling the following function,

IUPAC_CODE_MAP## A C G T M R W S Y K V 
## “A” “C” “G” “T” “AC” “AG” “AT” “CG” “CT” “GT” “ACG” 
## H D B N 
## “ACT” “AGT” “CGT” “ACGT”

Here you can see that ‘A’ represents ‘A’, but a character like ‘Y’ can mean ‘C’ or ‘T’. The point is to allow different sequences to exist for a given ambiguous code, but still to specify some order to the UMIs. For example, take a IUPAC ambiguity code of ‘NNABNNABNNABNNABNNABNNAB’, which is the sub-sequence ‘NNAB’ repeated 6 times. By having the ‘N’ characters in there, you’re giving the freedom for many possible UMIs, but at the same time, you’re limiting how many neighbouring nucleotides of the same type there can be. The highest number of repeating nucleotides you could have with something like ‘NNAB’ would be 3, i.e. ‘AAAB’ (where ‘B’ can be ‘C’, ‘G’, or ‘T’). Crucially, this reduces the potential read errors within the UMIs itself.

Let’s create an example sequence, which we’ll call ‘seq’, and two possible UMIs (where the second UMI is the same as the first, but with an extra ‘A’ at the start),

seq = “AATGTAGACAGGGATCAAGTTACTACGGATCGATGCATTCAGGACGCTCTGCTGGAATTCGTTCGTGTTGTTAAAGCGAAAGAACA”umi1 = “NABNNABNNABNNABNNABNNABN”
umi2 = “ANABNNABNNABNNABNNABNNABN”

The matchPattern function can work with IUPAC ambiguity codes, as long as we supply the parameter ‘fixed = False’ (i.e. the sequence is not fixed),

matchPattern(pattern = umi1, 
             subject = DNAString(seq), 
             fixed = False)## Views on a 86-letter DNAString subject
## subject: AATGTAGACAGGGATCAAGTTACTACGGATCGAT…TGGAATTCGTTCGTGTTGTTAAAGCGAAAGAACA
## views:
## start end width
## [1] 1 24 24 [AATGTAGACAGGGATCAAGTTACT]

Here we’ve found the first UMI in our example sequence, and it starts in position 1, as expected.

Now imagine that you had sequenced many sequences, which included sequences attached to these two different UMIs. Below is an example of 13 sequences with UMIs attached to the start,

sequences = c(“AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGCT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGCT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AAATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AAATGTAGACAGGGATCAAGTTACTACGGAT”,
              “AAATGTAGACAGGGATCAAGTTACTACGGTT”)

Below is a simple for-loop that iterates through this vector of sequences, and for each one, it checks to see which UMI sequence matches. This loop is specific to only 2 UMIs, but extending it would not be difficult. The basic concept is to check which UMI is at the very start of the sequence,

umi = vector()
 
for(s in sequences) { #Extracting the start position of umi1,
 umi1_start = start(matchPattern(pattern = umi1, 
                    subject = DNAString(s), 
                    fixed = False)) #Extracting the start position of umi2
 umi2_start = start(matchPattern(pattern = umi2, 
                    subject = DNAString(s), 
                    fixed = False)) #Is UMI1 is at the start or not? If not, it must be UMI2,
 umi_num = if(umi1_start == 1) {umi_num = 1} else {umi_num = 2} print(paste0(‘UMI number: ‘, umi_num))
 umi = c(umi, umi_num)
 
}## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 1”
## [1] “UMI number: 2”
## [1] “UMI number: 2”
## [1] “UMI number: 2”

We can see that we have ten sequences using UMI 1 and three using UMI 2. Now let’s put the sequences into a dataframe, along with the UMI number, and then split out the two different UMI families,

sequences = data.frame(sequences, umi)
umi_family_1 = sequences$sequences[sequences$umi == 1]
umi_family_2 = sequences$sequences[sequences$umi == 2]

Now we can call a function called consensusMatrix on these sequences (specifically the first UMI family),

consensusMatrix(umi_family_1)## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
 ## A 10 10 0  0  0  10 0  10 0  10 0   0  0  10
 ## C 0  0  0  0  0  0  0  0  10 0  0   0  0  0
 ## G 0  0  0  10 0  0  10 0  0  0  10  10 10 0
 ## T 0  0  10 0  10 0  0  0  0  0  0   0  0  0## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27]
 ## A 0  0  10 10 0  0  0  10 0  0  10 0  0
 ## C 0  10 0  0  0  0  0  0  10 0  0  10 0
 ## G 0  0  0  0  10 0  0  0  0  0  0  0  10
 ## T 10 0  0  0  0  10 10 0  0  10 0  0  0## [,28] [,29] [,30]
 ## A 0  8 0
 ## C 0  2 0
 ## G 10 0 0
 ## T 0  0 10

The function has tallied the total number of each nucleotide in each position. For example, for the first position, you can see that there are 10 ‘A’ bases, and no ‘C’, ‘G’ or ‘T’ bases. This consensus occurs for all but one position. Can you spot it?

We can also call the function to display the data as a probability,

consensusMatrix(umi_family_1, 
                as.prob = TRUE)## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
 ## A 1 1 0 0 0 1 0 1 0 1 0 0 0 1
 ## C 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 ## G 0 0 0 1 0 0 1 0 0 0 1 1 1 0
 ## T 0 0 1 0 1 0 0 0 0 0 0 0 0 0
 ## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27]
 ## A 0 0 1 1 0 0 0 1 0 0 1 0 0
 ## C 0 1 0 0 0 0 0 0 1 0 0 1 0
 ## G 0 0 0 0 1 0 0 0 0 0 0 0 1
 ## T 1 0 0 0 0 1 1 0 0 1 0 0 0
 ## [,28] [,29] [,30]
 ## A 0 0.8 0
 ## C 0 0.2 0
 ## G 1 0.0 0
 ## T 0 0.0 1

The difference is in position 29. Here, 8 of the 10 sequences has an ‘A’ in this location. In the remaining 2, they have a ‘C’. We can collapse these 10 sequences into one, using the most common base in each position using the consensusString function,

consensusString(umi_family_1)## [1] “AATGTAGACAGGGATCAAGTTACTACGGAT”

As you can see, we have an ‘A’ in the penultimate position, as it was the most common nucleotide across the 10 sequences. This way, amplification errors or read errors are ‘ironed out’.

Counting Affimer Molecules

Now that we’ve seen all the individual pieces, let’s put them together and count the different loop combinations (loops 2 and 4) that we have in our demo data. To keep things simple, we won’t worry about reading frames or UMIs.

Below is a large function that accepts the amino acid dataframe and loop-pad dataframe. It then goes through the dataframe, searches for all 4 loop-pads, extracts the loops, combines them into a new dataframe, and returns it. Read the comments within the function to see how it works. Note that the code below is inefficient and slow, and wouldn’t be used in a real-life NGS situation where datasets are very large. However, using loops in this way aids understanding and demonstrates the general principles at work,

#Function for amino acid sub-sequence matching,
 match_function = function(data, loop_pad, out_max=1, out_min=0) {
 
  #Extract the forward and reverse reads,
  data1 = data[data$read_direction == ‘F’,]
  data2 = data[data$read_direction == ‘R’,]  #remove the read directions as no longer needed,
  data1$read_direction = NULL 
  data2$read_direction = NULL
 
  #Create empty loop position columns,
  data1$Loop2_start = NA_integer_
  data1$Loop2_end = NA_integer_
  data2$Loop4_start = NA_integer_
  data2$Loop4_end = NA_integer_
 
  #Iterate through each forward read, search for loop-pads 1 and 2  (either side of loop 2), and if found, extract the start and end  positions,  i=1
  while(i<=length(data1$amino_acid_seq)) {
 
   matches_l2_left = matchPattern(pattern = loop_pad$l2_before, 
   subject = data1$amino_acid_seq[i], 
   max.mismatch = out_max, 
   min.mismatch = out_min)
 
   matches_l2_right = matchPattern(pattern = loop_pad$l2_after, 
   subject = data1$amino_acid_seq[i], 
   max.mismatch = out_max, 
   min.mismatch = out_min)
 
   if(length(matches_l2_left) != 0) {
   data1$Loop2_start[i] = end_pos_l2_left = end(matches_l2_left)+1}      else {data1$Loop2_start[i] = NA}
 
   if(length(matches_l2_right) != 0) {
   data1$Loop2_end[i] = start(matches_l2_right)-1} else {
   data1$Loop2_end[i] = NA
   }
 
   i=i+1
 
 }
 
 #Iterate through each reverse read, search for loop-pads 3 and 4 (either side of loop 4), and if found, extract the start and end positions,
  i=1
  while(i<=length(data2$amino_acid_seq)) {
 
    matches_l4_left = matchPattern(pattern = loop_pad$l4_before, 
    subject = data2$amino_acid_seq[i], 
    max.mismatch = out_max, 
    min.mismatch = out_min)
 
    matches_l4_right = matchPattern(pattern = loop_pad$l4_after, 
    subject = data2$amino_acid_seq[i], 
    max.mismatch = out_max, 
    min.mismatch = out_min)
 
    if(length(matches_l4_left) != 0) {
    data2$Loop4_start[i] = end_pos_l4_left = end(matches_l4_left)+1}    else {data2$Loop4_start[i] = NA}
 
    if(length(matches_l4_right) != 0) {
    data2$Loop4_end[i] = start(matches_l4_right)-1} else {
    data2$Loop4_end[i] = NA}
    i=i+1
 
 }
 
  #Extract the loop sequences from the full sequences, using the  start and end positions,
  data1$loop2 = str_sub(data1$amino_acid_seq, 
  start = data1$Loop2_start, end = data1$Loop2_end)
 
  data2$loop4 = str_sub(data2$amino_acid_seq, 
  start = data2$Loop4_start, end = data2$Loop4_end)
 
  #Combine the loops,
  affimer_binder = paste0(data1$loop2, ‘_’, data2$loop4)
  affimer_binder_df = as.data.frame(affimer_binder)
 
  #Return the combined loop data,
  return(affimer_binder_df)
 
 }

Now, run the function,

results_df = match_function(aa_all, loop_pads)

Let’s take a look at the first 6 loop sequences,

head(results_df)## affimer_binder
## 1 NINETYNINE_FIFTYNINE
## 2 NINETYNINE_FIFTY
## 3 EIGHTYTHREE_THIRTYFIVE
## 4 TWENTY_FIFTYNINE
## 5 NINETYNINE_THIRTYFIVE
## 6 TWENTY_NINETYSEVEN

Here we can see actual words spelt out. This is simply a deliberate aspect of the demo data, and done to provide a quick visual check on the success of any loop-hunting process.

Finally, we can use functions from the tidyverse R package to group by the Affimer protein sequences and count them,

#Group by the Affimer protein loop column and count,
affimer_families = results_df %>%
group_by(affimer_binder) %>%
tally()
 
#Order by this new count column,
affimer_families = affimer_families[order(affimer_families$n, decreasing = T),]

Then plot them using ggplot as a bar-plot,

ggplot(data = affimer_families, 
       aes(x=reorder(affimer_binder, X=n), y=n)) +
       geom_bar(stat=”identity”) + 
       coord_flip() + 
       xlab(‘Affimer’) + 
       ylab(‘Count’) + 
       theme_bw()

Affimer Molecule Panning

In real applications, the selection of an Affimer binder is done via a number of ‘panning rounds’ (or biopanning). This is where the phage display process is repeated a number of times, with candidate binders being left behind after washing steps, to be amplified and presented to the target molecule again. The consequence of this is a steady reduction in the variety of binding candidates with each round.

Below we’ll take a look at some example demo data. We’ll create 3 different plots, one from each mock panning round,

plot1 <- ggplot(data = affimer_families_p1,         aes(x=reorder(affimer_binder, X=n), y=n)) +
geom_bar(stat=”identity”, fill = ‘blue’) + 
coord_flip() + 
xlab(‘Affimer’) + 
ylab(‘Count’) + 
ylim(0,200) +
ggtitle(‘Panning round 1’) +
theme_bw()
 
plot2 <- ggplot(data = affimer_families_p2, aes(x=reorder(affimer_binder, X=n), y=n)) +
geom_bar(stat=”identity”, fill = ‘green’) + 
coord_flip() + 
xlab(‘Affimer’) + 
ylab(‘Count’) + 
ylim(0,200) +
ggtitle(‘Panning round 2’) +
theme_bw()
 
plot3 <- ggplot(data = affimer_families_p3, aes(x=reorder(affimer_binder, X=n), y=n)) +
geom_bar(stat=”identity”, fill = ‘red’) + 
coord_flip() + 
xlab(‘Affimer’) + 
ylab(‘Count’) + 
ylim(0,200) +
ggtitle(‘Panning round 3’) +
theme_bw()
 
grid.arrange(plot1, plot2, plot3, nrow=3)

As you can see, the frequency of the different binders changes with each panning round, with many becoming less frequent, and a handful becoming much more frequent. This is how the best candidates for a given target are found from a vast starting library. In this demo example, we would take ‘NINTYNINE_NINTYSEVEN’ forwards as our primary candidate.

In Closing

We’ve seen how nature has evolved various in-vivo mechanisms to create specific antibody binders for any encountered antigen. We’ve also seen a similar in-vitro process for creating Affimer binders. Both approaches, although not identical, rely on the principle of vast numbers of random DNA sequences, leading to random amino acid sequences and consequent randomly varied protein regions. These Affimer libraries, starting with 10^10 potential sequences, are then whittled down to the best binding candidates against a specific target via panning rounds.

From a data analysis perspective, the challenge is fundamentally about ‘loop hunting’, i.e. seeking conserved patterns of amino acids that pin-point the locations of the variable regions (with possible mutations along the way). Then, to see the variety of these regions diminish as panning rounds progress. In real-life experiments, reading-frames, UMIs, poor quality scores and point mutations add extra complexity along the way.

To give an idea of how effective Affimer binders are, a great example of an Affimer molecule application is the recent Affidx COVID-19 diagnostic test. This is a lateral flow device (LFD), where a sample from a nasal swab is mixed with buffer and applied to a paper strip. The liquid travels across the strip, where a line of COVID-19 virus-binding Affimer proteins await. If the virus is present, the Affimer molecules bind and immobilise the particles. A detection antibody then binds, creating a visual line for the user. These highly-specific Affimer molecules, which led to one of the most sensitive LFDs in the world being developed, were found via the panning method described above. From start-to-finish, the optimum binders were found in just 4 weeks.

This ability of Affimer proteins to be rapidly developed, coupled with their list of antibody-beating properties, make them ideal tools for a huge range of diagnostic applications. However, work is on-going to also exploit their properties for therapeutic applications. For example, a type of anticancer drug known as a ‘checkpoint inhibitor’ is currently in development, as-well-as a recent Nature publication on ‘druggable pockets’ that are notoriously difficult to identify.

Whatever the final application of a given Affimer binder is, the overall process is that described above, presenting an interesting and unique data analysis challenge.