The world’s leading publication for data science, AI, and ML professionals.

Optimising PCR protocol with a primer recommendation system using neural network

Automating a routine wet-lab task with a simple neural network

While I was an intern at a pathological lab four years ago, my work consisted primarily of sequencing DNA samples. While this was a fairly simple technique to master, for a second-year undergraduate, it gave me a roller-coaster experience of wet-lab research. Nevertheless, aside from the routine pipetting and centrifuging and mixing solutions, I noticed one could methodically automate and optimise the initial step of the protocol, namely the primer selection step.

Sequencing in the nutshell

DNA is double-stranded structure composed of 4 nucleic acid bases: adenine (A), cytosine (C), guanine (G), and thymine (T); such that a base on one strand will generally bind in a complementary (i.e. exclusive) manner to another base on the other strand (i.e. A will bind to T, whereas C will bind to G). When mutation arises, this exclusivity is broken, and can result in changes in the amino acid sequence and affect subsequent protein translation process. We can interrogate the genetic composition by a process of sequencing (also known as genotyping).

To perform DNA sequencing, we need to first amplify the region of interest through a technique called polymerase chain reaction (Pcr), whereby changing the temperature, the double stranded structure is separated and annealed. It is through this cycles of "hot-again-cold-again", the DNA replication is facilitated via polymerase enzymes, and repeated until a sufficient number of DNA segments is acquired.

Similar to PCR, sequencing also relies on the rising and dropping the temperature and of the polymerase enzyme. The main difference between the two is in their use of nucleic bases. In the PCR solution, the nucleic bases used are composed of exclusively deoxynucleoside triphosphate (dNTP), whereas in the sequencing solution, the nucleic bases used are composed of both dNTP and di-dNTP (ddNTP). The only difference between the ddNTP and dNTP is that the ddNTP does not allow further addition of bases beyond it during the replication process, whereas dNTP does. You can then imagine after the sequencing process, you get multiple DNA segments of different sizes, some will be of only 10 bases, whereas some will be of 20 or 30. Additionally, if you put some sort of signal on the ddNTP, so that they can give off colour when are put under a light, you can know what nucleotide is at the 10th, or 20th or 30th position. Doing this enough times, you can get the full nucleic composition of the gene of interest.

Primer selection

Primer is a short DNA segments of several base pairs. Primer generally has complimentary bases to the DNA segment where one wants to initiate the DNA replication process and without it, the polymerase cannot bind to the DNA strand.

As a rule of thumb, one can choose the primer based on its length and melting temperature (Tm). However, doing so by hand doesn’t allow one to easily visualise the secondary structure stability and possible occurrence of the primer hairpins (these are cases when the primers do not bind to the region of interest, but rather bind to themselves). Therefore, one can resort to publicly available softwares such as this one by Thermofischer. However, as often the case in academia, one software does not contain all the features, and different softwares can give differing results. As such the purpose of this project is to amalgamate these softwares and potentially create an automated primer recommendation system.

Method and game plan

Melting temperature calculation

Quick search shows that research into thermodynamics of nucleotide hybridisation is fairly vast, with different groups giving different ways of calculating Tm. Some formulae are fairly basic, which require only the number of bases for each nucleotide, while other such as the one in equation 1 are more advanced, which require the experimental value of nucleotide pairing enthalpy and entropy. But it is not exactly rocket science to figure out how to plug different values into a formula, and with enough literature review you can get the table 1.

Equation 1. Provided by SantaLucia et al. (1996)
Equation 1. Provided by SantaLucia et al. (1996)
Table 1. Unified oligonucleotide dH and dS parameters in 1 M NaCl at 37C
Table 1. Unified oligonucleotide dH and dS parameters in 1 M NaCl at 37C

Visualisation

One of the missing features in many existing software is the visualisation of the hybridisation between the primer and the DNA region. One way we could hardcode it into the program is by assigning a "shift-by" value to each of the nucleotides on one strand, that corresponds to the position of its complementary base on the other strand.

Figure 2. Visual representation of the algorithm to calculate stability and visualise secondary structures
Figure 2. Visual representation of the algorithm to calculate stability and visualise secondary structures

In the figure 2A, for each nucleotide in primer 5′- CCAAGGGTAGTAATGTATATGTGAG-3′, the computer code assigns a shift-by value, which corresponds to the position of its complementary bases within its reverse sequence (or in the case of primer dimer, the other primer’s reverse sequence) in 3′-5′ direction. For instance, the last base, G, in 5′-3′ direction, has shift-by values of -1 and 0, because counting from that position in 3′-5′ direction, there are two complementary bases, one at the same position (0), and another at the previous position (-1). In the figure 2B, by extracting the same shift-by value across all nucleotides, one can subsequently visualise all hybridisation events occurring in a secondary structure. For example, by positioning the 5′-3′ primer 6 positions to the right against its reverse sequence, one can construct a self-dimer structure with 6 bonds. In figure 2C, using the same shift-by value, one can also delineate a hairpin structure, where the size of its loop, formed between the innermost two bonds, must always be greater than or equal to three nucleotides. Moreover, thermodynamic values for any secondary structure are calculated as the sum of energy required for each bond formation.

In python, the shift-by value list can be hardcoded with a few lines of list comprehension as such:

and from then we can draw boxes and sticks to visualise the hybridisation events:

Primer selection recommendation system:

Another quick literature review gives us the following criteria for primer selection:

Table 2. Criteria for good primers
Table 2. Criteria for good primers

While we can just hardcode the criteria to our programme, and have it outputs whether our selected primer satisfies all the criteria, we can go a step further and create a recommendation system, that scans across all possible primer in a given gene of interest, and rank them according to their suitability.

To do this, we can first build a database of primers, and then train a model to classify a model as good or bad, or assign them with a suitability score. In this case, I have opted for the classification model. Here, I have generated a set of 78 206 unique primer pairs of varying lengths between 8 and 41 bases. They were randomly generated by the computer from three different genes. From that dataset, 12 600 random combinations were labelled as "Good" or "Bad". The "Bad" primers are defined as those that have abnormal value in one of the 9 features in Table 2. To achieve total homogenous distribution across all reference points, the "Bad" primers set was consisted of 9 uniform classes of features, each with 700 pairs.

Model training

The labelled dataset was divided into training set (70%), cross-validation set (20%) and test set (10%). Here, I utilised R neural-net package to optimise calculation and visualisation of ANN models, and adopted the resilient back-propagation with weight backtracking (rprop+) method, where the learning rate variated between 0.5(-) and 1.2(+) according to the sign of the partial derivative.

Figure 3. The trained ANN model
Figure 3. The trained ANN model

My first GUI

One problem I noticed with existing software is their disjointed features, and lack of transparency when displaying the results. Utilising the tkinter library in Python I have created an interface with the following features: 1) Customisable input, 2) Live result visualisation and calcuation, and 3) Organised output.

Figure 4. The interface with help
Figure 4. The interface with help
Figure 6. Organised output with transparent results
Figure 6. Organised output with transparent results

Relative importance

Once the weights have been assigned to each of the 9 features after the training process. We can see which of the features are the most determining of the suitability of the primer selection.

Figure 7. Ranking of the weights by the ANN model
Figure 7. Ranking of the weights by the ANN model

As suspected, differences in the Tm and Length play the most important role, followed by GC content and other secondary structures. Finally, we can hardcode such weights into our program and can get label for each primer as either good or bad.

Conclusion

This was an interesting exercise, where I learnt how to create a GUI from scratch and implement a simple ML model to everyday, routine wet-lab work. In reality, this will not likely make a whole lot difference, as in most cases, even a slightly bad primer will likely to work fairly well in a PCR experiment. Secondary structure hybridisation do occur, however, that will only be a problem when one is using a very long primer of more than 50 bp, and when that does occur, there are other optimising chemical element in the PCR solution to counter that problem. Nevertheless, if you would like to try out this GUI, you can find it below.

lehai-ml/primer-design

References

  1. SantaLucia Jr. Allawi, H.T, Seneviratne, P.A., (1996). Improved Nearest-Neighbor Parameters for Predicting DNA Duplex Stability. Biochemistry 35, 3555–3562.

Related Articles