Deep Learning for Biology
Despite the scarcity of adequate data, several AI-based tools have recently been developed and used to address the many challenges posed by the current pandemic. In the last 8 months, AI models have been used for, among other things, biomedical image classification for clinical diagnosis using CT-scan and X-ray images, and for predicting hotspots and outbreaks by tracking local news and social media accounts. This article, however, deals with recent advances in the application of AI to biomedical research for the expeditious and inexpensive discovery of drugs and vaccines to treat the novel coronavirus infection.
No prior understanding of virology or drug discovery is assumed in this article.
For those without an adequate background in biology, I begin by discussing the structure of a virus and its mode of action in the host cell, before describing in detail three specific applications of AI to SARS-CoV-2 research that I personally find most interesting and promising for expediting our quest for a cure.
Technology poses significant challenges for wet-lab researchers. The unprecedented exigencies imposed by the current pandemic are driving new research collaborations between biomedical researchers and computer scientists. As a consequence, the application of AI to biomedical research is a fast-evolving field, and my aim here is to provide a brief overview rather than an extensive survey. Those new to deep learning may refer to my earlier article on the inner workings of convolutional neural networks.
How viruses colonize a host cell
Viruses are made of two essential components: proteins and nucleic acids (RNA or DNA). Proteins form protective shells (often more than one) that help safely transfer the encapsulated nucleic acids, i.e. the viral genome, from one cell to another. Once the viral genome (RNA in the case of SARS-CoV-2) is successfully transferred to the host cell, it replicates swiftly and colonizes the host by using the resources available to the host cell. This deprives the host of essential resources and ultimately results in cell death and the release of viral particles, which then go on to infect and colonize adjacent cells.

![Fig. 1 (right panel): SARS-CoV-2 structure. A diagrammatic representation showing the key structural proteins and the viral genome: the spike protein S, the envelope protein E, the membrane glycoprotein M, the nucleocapsid protein N and the viral genome, RNA (source).](https://towardsdatascience.com/wp-content/uploads/2020/10/1Gr7G6flVZlyfwVjZQR85fw.jpeg)
While AI has been used in conjunction with experimental research in a few domains of drug discovery & vaccine development in the past, the exigency imposed by the current pandemic has made AI indispensable in accelerating the search for ideal SARS-CoV-2 vaccine candidates. This article describes the following three applications of AI pertaining to drug discovery:
- THE PROTEIN FOLDING PROBLEM: How a deep learning algorithm can be used to computationally predict the complex three-dimensional structure of proteins associated with SARS-CoV-2, with just the amino acid sequence as input.
- THE DRUG SCREENING PROBLEM: How a neural-network-based approach can help screen over a billion chemical compounds to identify those that can successfully target a particular protein associated with SARS-CoV-2 pathogenesis.
- THE VACCINE COVERAGE PROBLEM: How machine-learning-based models can be used to computationally design vaccine candidates with optimal coverage across different populations.
I must clarify at this point that the computational tools described in this article are meant to accelerate existing drug discovery pipelines by predicting the outcomes of some key experimental techniques, and are in no way an alternative to conventional experimental research. The output of these computational tools still needs to be verified using traditional wet-lab techniques. And while many computational biologists and bioinformaticians are convinced that robotics and deep learning will render wet-lab skills obsolete within a decade, we still have a long way to go before that becomes reality.
Protein Structure Prediction
Proteins are essentially linear sequences built from 20 different types of amino acids that fold up into unique three-dimensional structures. The specific sequence of amino acids determines the structure, which in turn determines the protein's function. Apart from the four structural proteins shown in Fig. 1 above, there are several non-structural proteins that play key roles in viral pathogenesis, along with human receptor proteins that help the virus enter human cells; all of these are suitable targets for vaccine development. Determining their exact atomic structure is therefore an essential prerequisite for developing vaccines that target one of these proteins. While experimental techniques for determining protein structures can be very expensive and time-consuming, the very complex interactions that give rise to protein folding patterns and three-dimensional structures can now be characterized using very deep learning models. The protein folding problem represents the holy grail of molecular biology research. The problem is simple to state: given the amino acid sequence of a protein, can you predict its three-dimensional structure?
While scientists have struggled with it for over 70 years, computer scientists have recently made very significant advances in providing a meaningful solution to the protein folding problem.
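To make the input side of this problem concrete, here is a minimal sketch of how an amino acid sequence can be turned into numbers that a neural network can consume. The one-hot scheme, the alphabet ordering and the function name `one_hot_encode` are illustrative assumptions on my part, not the encoding used by any particular model.

```python
# The 20 standard amino acids in one-letter code (the ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return one 20-element 0/1 vector per residue of the sequence."""
    encoded = []
    for residue in sequence.upper():
        if residue not in INDEX:
            raise ValueError(f"Unknown amino acid: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        encoded.append(row)
    return encoded

# A made-up five-residue fragment, used only for illustration.
features = one_hot_encode("MKVLA")
print(len(features), len(features[0]))  # prints: 5 20
```

Real structure predictors typically feed the network much richer features than this, so the block should be read only as the simplest possible starting point.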
In January 2020, DeepMind launched AlphaFold, a neural-network-based framework for predicting protein structures using only the amino acid sequence as input. AlphaFold has been used to predict the structures of the SARS-CoV-2 spike protein (known to mediate cell entry) and several other associated proteins. The predicted structures have been found to be remarkably close to structures that were subsequently determined using experimental techniques. Shown below is a comparison of the experimentally determined structure of protein ORF3a (a SARS-CoV-2 associated protein that has been implicated in inducing cell death) with the structure predicted computationally using AlphaFold.

In the de novo modelling workflow shown below, probability distributions of the distances and marginal torsion angles between amino acid residues are predicted using a ResNet deep learning model trained on ~30,000 experimentally determined protein structures available in the Protein Data Bank. The final structure is then obtained by optimizing, through gradient descent or simulated annealing, for a structure that agrees with the ResNet predictions.
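The second step, turning predicted distances into coordinates, can be illustrated with a toy example. The sketch below is a deliberate simplification and not AlphaFold's implementation: it assumes a single target distance per residue pair instead of a full probability distribution, uses a squared-error potential, and moves Cartesian coordinates directly with plain gradient descent. The helper names and the helix-like reference points are hypothetical.

```python
import math
import random

def pair_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def potential(coords, target):
    """Sum of squared differences between current and target pair distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += (pair_distance(coords[i], coords[j]) - target[i][j]) ** 2
    return total

def gradient(coords, target):
    """Analytic gradient of the potential with respect to every coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = pair_distance(coords[i], coords[j]) + 1e-9  # avoid division by zero
            scale = 2.0 * (d - target[i][j]) / d
            for k in range(3):
                g = scale * (coords[i][k] - coords[j][k])
                grad[i][k] += g
                grad[j][k] -= g
    return grad

def fold(target, steps=2000, lr=0.02, seed=0):
    """Start from random points and descend the potential."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1, 1) for _ in range(3)] for _ in target]
    start = potential(coords, target)
    for _ in range(steps):
        grad = gradient(coords, target)
        for i in range(len(coords)):
            for k in range(3):
                coords[i][k] -= lr * grad[i][k]
    return coords, start, potential(coords, target)

# Hypothetical "predicted" distances, taken from six points on a helix.
reference = [[math.cos(t), math.sin(t), 0.5 * t] for t in range(6)]
target = [[pair_distance(a, b) for b in reference] for a in reference]

coords, start_energy, final_energy = fold(target)
print(f"potential before: {start_energy:.3f}, after: {final_energy:.3f}")
```

Distances alone cannot distinguish a structure from its mirror image or from a rotated copy, so the recovered coordinates should be compared with the reference through their pairwise distances rather than point by point.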

The AlphaFold neural network consists of 220 residual blocks like the one shown below, with each block comprising a sequence of convolution & batchnorm layers; two 1 × 1 projection layers; a 3 × 3 dilated convolution layer and three exponential linear unit (ELU) nonlinearities. More details can be found in the original publication. The trained network and user instructions can be found here.
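For readers who prefer code to diagrams, here is a rough NumPy sketch of one residual block built from the ingredients listed above. It is my own illustration, not the published network: the ordering (batchnorm, ELU, then convolution, repeated three times), the tiny channel counts, the random weights and the omission of learned batchnorm parameters are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def elu(x):
    """Exponential linear unit: identity for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def batchnorm(x, eps=1e-5):
    """Normalize each channel of an (H, W, C) map; learned scale/shift omitted."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1x1(x, weights):
    """1 x 1 projection: mixes channels at every position. weights: (C_in, C_out)."""
    return x @ weights

def dilated_conv3x3(x, weights, dilation):
    """3 x 3 convolution with gaps of `dilation` between taps, zero padded.

    weights has shape (3, 3, C_in, C_out); the output keeps the input H and W.
    """
    height, width, _ = x.shape
    padded = np.pad(x, ((dilation, dilation), (dilation, dilation), (0, 0)))
    out = np.zeros((height, width, weights.shape[3]))
    for i in range(3):
        for j in range(3):
            rows = slice(i * dilation, i * dilation + height)
            cols = slice(j * dilation, j * dilation + width)
            out += padded[rows, cols, :] @ weights[i, j]
    return out

def residual_block(x, w_down, w_conv, w_up, dilation):
    y = conv1x1(elu(batchnorm(x)), w_down)                    # project down
    y = dilated_conv3x3(elu(batchnorm(y)), w_conv, dilation)  # dilated 3 x 3
    y = conv1x1(elu(batchnorm(y)), w_up)                      # project up
    return x + y                                              # skip connection

rng = np.random.default_rng(0)
channels, bottleneck = 16, 8  # small, arbitrary sizes for the demo
x = rng.normal(size=(12, 12, channels))
w_down = rng.normal(size=(channels, bottleneck)) * 0.1
w_conv = rng.normal(size=(3, 3, bottleneck, bottleneck)) * 0.1
w_up = rng.normal(size=(bottleneck, channels)) * 0.1
out = residual_block(x, w_down, w_conv, w_up, dilation=2)
print(out.shape)  # (12, 12, 16)
```

Because the block only adds a correction to its input, the output has the same shape as the input, which is what allows hundreds of such blocks to be stacked.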

This ability to predict the structures of key proteins associated with the virus lifecycle will accelerate the design and screening of novel vaccine candidates that target one of these proteins. Apart from predicting the structures of viral proteins, AlphaFold will also help in visualizing the interactions of many potential drugs and vaccines with the target proteins.
Deep docking for drug screening
Conventional drug discovery pipelines are extremely time and money intensive as they entail manual screening of hundreds if not thousands of potential drug molecules to arrive at one that exhibits both efficacy and safety. Deep Docking is a recently proposed neural network based platform for accelerated virtual screening of potential drug molecules by predicting the extent of interaction of a particular drug molecule in a database with a potential drug target of choice.

This model has been applied to estimate the interaction of over a billion potential drug compounds from the ZINC15 chemical database with the SARS-CoV-2 main protease (MPro). Shown below is the interaction of the top-ranked drug compound ZINC000541677852 (in magenta) with the MPro protein (in grey density map).

Apart from the drug compound visualized above in Fig. 6, 1000 other potential drug candidates were identified in this study. This ability to computationally screen millions of potential drugs against a specific protein has the potential to truly transform the drug discovery pipeline. That said, the results from such computational studies must be treated with immense caution, and a great deal of further experimental research is needed to establish the efficacy of the proposed drugs.
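The core idea of this section, using a cheap learned model to decide which molecules deserve an expensive docking run, can be sketched on synthetic data. Everything in the block below is an assumption made for illustration: the random bit-vector "fingerprints", the `toy_docking_score` function standing in for a real docking program, and a plain logistic regression standing in for the neural network. It is not the Deep Docking code.

```python
import math
import random

rng = random.Random(0)
N_BITS = 16
# Hidden per-bit contributions to binding (synthetic ground truth).
TRUE_WEIGHTS = [rng.uniform(-1.0, 1.0) for _ in range(N_BITS)]

def toy_docking_score(fingerprint):
    """Stand-in for an expensive docking run; lower scores mean better binding."""
    signal = sum(w * bit for w, bit in zip(TRUE_WEIGHTS, fingerprint))
    return -signal + rng.gauss(0.0, 0.1)

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_classifier(fingerprints, labels, epochs=100, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * N_BITS, 0.0
    for _ in range(epochs):
        for fp, label in zip(fingerprints, labels):
            error = sigmoid(bias + sum(w * b for w, b in zip(weights, fp))) - label
            weights = [w - lr * error * b for w, b in zip(weights, fp)]
            bias -= lr * error
    return weights, bias

library = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(2000)]
docked, undocked = library[:200], library[200:]

# Step 1: dock a small sample and label its best-scoring 20% as hits.
sample_scores = [toy_docking_score(fp) for fp in docked]
cutoff = sorted(sample_scores)[len(sample_scores) // 5]
labels = [1 if score <= cutoff else 0 for score in sample_scores]

# Step 2: train on the sample, then rank the molecules that were never docked.
weights, bias = train_classifier(docked, labels)

def hit_probability(fp):
    return sigmoid(bias + sum(w * b for w, b in zip(weights, fp)))

ranked = sorted(undocked, key=hit_probability, reverse=True)

# Step 3: only the top 10% of predictions go on to (toy) docking.
shortlist = ranked[: len(ranked) // 10]
mean_shortlist = sum(toy_docking_score(fp) for fp in shortlist) / len(shortlist)
mean_all = sum(toy_docking_score(fp) for fp in undocked) / len(undocked)
print(f"mean score, shortlist: {mean_shortlist:.2f}; all undocked: {mean_all:.2f}")
```

In the toy setting the whole library can be scored to check the shortlist; with a billion real compounds that check is exactly what is unaffordable, which is why the predicted ranking matters.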
Vaccine design with optimal coverage
Vaccines work by activating the natural immune response of the host. To help elicit this response, vaccines are often made up of viral proteins themselves, or of fragments of them (peptides). These peptides then bind the major histocompatibility complex (MHC) class I and class II proteins, which help activate the T-cell immune response when this complex (MHC + peptide) is displayed on the cell surface. This process is known as antigen presentation. The immune response generated by a vaccine depends on the sequence of the peptide displayed in antigen presentation, the sequence of the MHC proteins and the interaction between the two. This video provides a clear explanation of this process.
Of the 41 vaccine candidates currently undergoing clinical trials (as of 30 September 2020), at most a few are likely to be both safe and effective against the novel coronavirus. The ones that succeed will almost certainly not exhibit the same efficacy in all population groups, because the genes that are responsible for MHC proteins show significant variation across populations. This introduces another variable in the vaccine design problem. Apart from being safe and effective, the vaccine must also have broad coverage.
Recently, researchers at MIT have used a machine-learning-based approach to design vaccines with optimum efficacy in different population groups (of Black, Asian and white ancestry).
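The peptides that end up in antigen presentation are short fragments of a longer viral protein, so a computational pipeline first has to list the candidate fragments. A minimal sketch of that enumeration follows. The 8 to 10 residue window is a commonly quoted length range for MHC class I peptides and is used here as an assumption, the function name is hypothetical, and the input sequence is made up rather than taken from a real viral protein.

```python
def candidate_peptides(protein, min_len=8, max_len=10):
    """Return every contiguous fragment whose length lies in [min_len, max_len]."""
    peptides = []
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            peptides.append(protein[start:start + length])
    return peptides

toy_protein = "MKVLAAGIVRTS"  # made-up 12-residue sequence
peptides = candidate_peptides(toy_protein)
print(len(peptides))  # 5 + 4 + 3 = 12 fragments
```

Each fragment would then need a predicted binding strength against each MHC variant, which is where the learned models come in.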

The two evaluation metrics (EvalVax-Unlinked and EvalVax-Robust) are used as objective functions for the combinatorial optimization problem, which is solved using OptiVax-Unlinked and OptiVax-Robust. Summarised in Fig. 7 above is the three-step process. First, peptides (protein fragments) with the desired properties are screened. These are then scored on immunogenicity by a two-step OptiVax optimization over the peptide fragments derived from viral proteins in step one that could potentially lead to a vaccine. In the final step, the vaccine candidates are optimized for HLA allele frequencies, which account for geographical variation in MHC protein sequences. This system is designed to deliver a vaccine that is effective in providing immunity to populations from different geographical areas and ethnicities. The proposed vaccine was shown to provide 93.21% population coverage for the SARS-CoV-2 MHC class I formulation and 97.21% coverage for the MHC class II formulation. More details on the optimization algorithms can be found at the end of the original publication.
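To give a feel for what optimizing for population coverage means, here is a toy greedy selection. It is a sketch under strong assumptions and not the EvalVax or OptiVax algorithms: a single HLA locus, made-up allele frequencies and peptide-binding sets, two alleles per person drawn independently, and the mean coverage across populations as the objective. All names in the block are hypothetical.

```python
# Made-up allele frequencies for two populations (a single HLA locus).
ALLELE_FREQ = {
    "population_1": {"A1": 0.50, "A2": 0.30, "A3": 0.10},
    "population_2": {"A1": 0.05, "A2": 0.30, "A3": 0.60},
}
# Made-up predictions of which alleles present each candidate peptide.
BINDS = {"P1": {"A1"}, "P2": {"A3"}, "P3": {"A2"}}

def coverage(peptides, freqs):
    """Probability that a person carries at least one allele presenting a chosen peptide."""
    covered = set().union(*(BINDS[p] for p in peptides))
    covered_freq = sum(freqs.get(allele, 0.0) for allele in covered)
    # Two independently inherited alleles: a person is uncovered only if both miss.
    return 1.0 - (1.0 - covered_freq) ** 2

def mean_coverage(peptides):
    return sum(coverage(peptides, f) for f in ALLELE_FREQ.values()) / len(ALLELE_FREQ)

def greedy_design(budget):
    """Repeatedly add the peptide that raises mean coverage the most."""
    chosen = []
    for _ in range(budget):
        remaining = [p for p in BINDS if p not in chosen]
        best = max(remaining, key=lambda p: mean_coverage(chosen + [p]))
        chosen.append(best)
    return chosen

design = greedy_design(budget=2)
print(design)
for name, freqs in ALLELE_FREQ.items():
    print(name, round(coverage(design, freqs), 4))
```

Note how the peptide that looks best for one population alone is not necessarily the one that helps most once both populations are considered together; that tension is what makes the design problem combinatorial.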
Conclusion
In this article I have tried to explain what, in my opinion, are the three most important and promising applications of AI in SARS-CoV-2 biomedical research. First, we discussed how AI can predict protein structures with very high accuracy. Then we discussed how inhibitors for any protein of interest can be screened from a very large dataset using a simple neural network. Finally, we discussed how vaccines with broader population coverage can be designed using machine learning. Below is a list of key publications, apart from those referenced in the article. Thanks for reading.
A Reading List
- Artificial intelligence–enabled rapid diagnosis of patients with COVID-19.
- Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein.
- A Visual Guide to the SARS-CoV-2 Coronavirus.
- AI could help with the next pandemic – but not with this one.
- Improved protein structure prediction using potentials from Deep Learning.
- Opportunities and obstacles for deep learning in biology and medicine.
- A watershed moment for protein structure prediction.