Thoughts and Theory, Applications, limits, and future
The three months since the release of the AlphaFold 2 paper and code have seen many new articles and preprints that analyze its potential and limitations, build on it to produce new discoveries, and harness its power to develop new structural databases that attempt to fill the gaps in experimental structural data.
The Nature 2021 paper on AlphaFold 2 has over 200 citations (peer-reviewed papers plus unreviewed preprints) as of early October 2021. Most of these works simply reference it in situations where other, broader-scope citations would suit much better. In this story, I summarize the papers and preprints that I consider most relevant to the fields directly related to AlphaFold, focusing on its evaluations, its direct applications, and its use to build new structural databases.
Index
Introduction
First things first: the original evaluations of AlphaFold 2 in CASP14
- Need for new ways to carry out assessments
Extending the coverage of protein structure databases with models from AlphaFold 2 and RoseTTAFold
- Covering "all" structures in the protein universe
- A database of models of protein complexes
- Protein complex prediction with AlphaFold-Multimer
Assessment of AlphaFold 2’s predictions on what it was and it was not designed to predict
- Experimental structural biologists joined efforts to assess the utility of AlphaFold in their fields of research
- Prediction of protein-peptide complexes
- Predicting folded structures is not the same as predicting folding pathways
From a special issue of the Journal of Molecular Biology
Overview of other evaluations and potential applications
Further notes and reads
Introduction
Late last year, Deepmind made big news when the Critical Assessment of Structure Prediction (CASP) revealed that their program AlphaFold 2 had beaten all other participants by far. Then in July 2021, just 3 months ago, Deepmind and AlphaFold 2 made the news again when its code was released. As I covered in this story, Deepmind further released an adaptation of the program to run as a Google Colab notebook. A further shock came when independent researchers released a whole set of Colab notebooks coupled to a special sequence search program, MMseqs2, thus making the full power of this technology, which depends strongly on protein sequence alignments, widely available at zero cost.
Such prompt and wide access to AlphaFold 2 in its different forms quickly allowed researchers all around the world to experiment with it. Suddenly, many scientists were testing its limits and its applicability to real-world problems in the domain of structural biology. Tweets from all corners of the world showed how AlphaFold 2’s models helped researchers solve structures from experimental data more easily, in some cases even correcting errors that humans had introduced when solving the structures "manually". Sometimes they showed inconsistencies and problems in the predictions, too.
Other groups dedicated time to testing whether, given the quality of its predictions, AlphaFold 2 could also predict other features of proteins. Their rationale was that, even though it was specifically trained to predict protein structures, if it had really learned the full protein biophysics (spoiler: it didn’t!) then it would likely be able to predict their dynamics, interactions, folding pathways, etc. And strikingly, some of these tests did find that AlphaFold 2 could make some progress in the prediction of some of these features.
Yet other groups, and Deepmind itself, applied AlphaFold 2 and other programs, sometimes even integrating them, to develop new structural databases from their predictions, attempting first to cover the single protein structures of all proteomes, and then to model multiprotein complexes at large.
Here is a non-technical summary of the works I found most relevant and interesting among all the "spin-offs" of AlphaFold 2 in the last 3 months.
First things first: the original evaluations of AlphaFold 2 in CASP14
Before I start, let me step back a bit to highlight a paper that is very important but has been largely overlooked in the media: the original CASP14 evaluation that determined that AlphaFold 2 was the best modeling program for hard targets (i.e. proteins that had no homologs in the Protein Data Bank, hence couldn’t be modeled easily) as of 2020/2021:
Topology evaluation of models for difficult targets in the 14th round of the critical assessment of…
This paper describes the official CASP14 analysis carried out by the designated assessors (the Grishin group at HHMI in the US, long associated with CASP) on all the models predicted by all CASP14 participants, considering specifically the problem of predicting structures for difficult targets, i.e. protein structures that are hard to model by classical techniques like homology modeling. Using metrics very similar to those used in previous editions of CASP, the assessors found that the top group (yes, AlphaFold 2) outperformed the rest of the prediction community by far. But not on all the difficult targets! For two targets, other groups of predictors performed better (yet AlphaFold 2 did get the global topology right). The assessors found that AlphaFold 2 provided highly accurate models for most of the targets, including some quite large proteins (larger proteins are usually harder to model). Moreover, some of its predictions were outstanding, reaching accuracies comparable to the experimental uncertainties in atom positions, in many cases even for the amino acid sidechains (which are normally not even assessed in this track of CASP). Like the rest of the predictors, AlphaFold 2 still struggled to model flexible regions and oligomeric assemblies (however, it did make progress and helped to improve their predictions; see the next sections).
In a separate paper, the same set of assessors showed that the targets available for CASP14 were not particularly easy; in fact, they were among the hardest ever:
Target classification in the 14th round of the critical assessment of protein structure prediction…
Need for new ways to carry out assessments
The high accuracy of AlphaFold predictions prompted a question that we had already raised in CASP13. At that time, when the first edition of AlphaFold came into the game (winning it, but only marginally and without implying a revolution like AlphaFold 2 did), we concluded that as new milestones were achieved in protein tertiary structure prediction, new directions needed to be devised for future CASPs:
A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for…
Specifically, we proposed to stop splitting the targets into domains, to rethink difficulty metrics, to assess not only backbones but also amino acid side chains, and to assess residue-wise and possibly residue-residue quality estimates. Accordingly, for CASP14, given the success of AlphaFold 2 (and also of other methods), the assessors performed a separate analysis and discussion of the new challenges for evaluation:
Assessment of protein model structure accuracy estimation in CASP14: Old and new challenges
One such further analysis entailed the evaluation of interactions predicted between the different domains that compose the protein targets:
Assessment of domain interactions in the fourteenth round of the Critical Assessment of Structure…
When assessing the models for whole protein targets (as opposed to evaluations of domains only, which is the main type of assessment typically carried out) through complementary scores that measure the quality of interactions between 3D units, and considering 10 specific targets, the highest-ranked predictor was again AlphaFold 2. In fact, it predicted highly accurate models for 8 out of 10 full proteins, with scores on 3 of them being far higher than those of the models by all other groups.
One more highlight from CASP14’s papers is the finding that AlphaFold 2’s models are useful for "molecular replacement", a technique used in X-ray-based structure determination of proteins in which a model serves as the basis to solve a structure from experimental X-ray diffraction data. Although strictly speaking this had already been possible for some targets in previous CASP editions, the high quality of AlphaFold 2’s models enabled it on more targets. In fact, for some targets the experimental data was available but not "phased" (phasing is a step in experimental structure determination that is facilitated by the use of models through "molecular replacement"); in some of these cases the AlphaFold-provided models were useful to solve the phase problem and complete the experimental structure determination (against which all models could then be compared):
Assessing the utility of CASP14 models for molecular replacement
Extending the coverage of protein structure databases with AlphaFold 2 and RoseTTAFold
Covering "all" structures in the protein universe
I already discussed this paper in a previous story at TDS: the European Bioinformatics Institute partnered with Deepmind to develop the most complete database of structures of proteins, open for free to the whole community. EBI-Deepmind’s goal is to cover the full protein universe, that is, to model structures for all proteins from all genomes ever sequenced. At the moment of writing this article, the database contains models for the full proteomes of 21 species, among them humans and species relevant to biotechnology and/or used as model organisms for research. Thus, besides the human proteome, the database covers the proteomes of a couple of pathogenic organisms, baker’s yeast, rice, and deeply studied models like Arabidopsis thaliana (a model plant) and Caenorhabditis elegans (a model worm).
The models of all proteins in these 21 organisms are available for free download at https://alphafold.ebi.ac.uk/download. The EBI website further allows users to search protein sequences against the sequences of all the proteins with models; with this, users may find models for proteins closely related to the one of their interest, and then use such a model as a surrogate or as a template for homology-modeling their own protein.
I will not spend more on this here because I covered the Deepmind-EBI effort in a dedicated section of the following article, highlighting important points such as the limitations of the models in this database among other aspects:
AlphaFold-based databases and fully-fledged, easy-to-use, online AlphaFold interfaces poised to…
But I will add here this preprint presenting a careful analysis which shows that from a current baseline of structural coverage of 47% for foldable protein regions (considering experimentally derived plus template-based homology models) the models currently available in the EBI-AlphaFold database bring the coverage up to 75%. In particular, 50% of the human proteome is covered with experimental structural data or high-quality structural models, which is even more impressive considering that around 20% of the proteome is just long disordered segments or fully disordered proteins.
The structural coverage of the human proteome before and after AlphaFold
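To put the numbers quoted above side by side, here is a small back-of-the-envelope sketch in Python. It only uses the figures cited in the text (47% and 75% coverage of foldable regions, 50% of the human proteome covered, around 20% disordered). The function names are mine, and the last calculation assumes, for illustration only, that all covered residues lie outside the disordered segments, which may not match how the preprint defines its categories.

```python
def coverage_gain(before_pct, after_pct):
    """Return the absolute gain (percentage points) and the fold increase."""
    return after_pct - before_pct, after_pct / before_pct


def coverage_of_ordered_part(covered_pct, disordered_pct):
    """Coverage of the non-disordered part of a proteome, in percent.

    Assumption (not from the preprint): every covered residue lies outside
    the disordered segments, so this is an optimistic estimate.
    """
    return 100.0 * covered_pct / (100.0 - disordered_pct)


# Figures quoted in the text above.
points, fold = coverage_gain(47, 75)
print(f"Foldable regions: +{points} percentage points ({fold:.2f}x)")

ordered = coverage_of_ordered_part(50, 20)
print(f"Human proteome: about {ordered:.1f}% of the non-disordered part is covered")
```

Under that assumption, covering 50% of the whole proteome would correspond to roughly 62.5% of the part that can be expected to fold.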
A database of models of protein complexes
The EBI-Deepmind database keeps growing, and that’s great for the community of experimental and computational biologists, but it is focused only on the structures of individual proteins in monomeric forms. However, the CASP assessments found that AlphaFold 2 seemed to have improved, relative to the state of the art at that time, in predicting the structures of complexes between proteins (later works addressed this further, see next section). Actually, not only the CASP assessment reached that conclusion, but also the CAPRI contest, which is dedicated specifically to evaluating and tracking the state of the art of protein-protein complex modeling:
Prediction of protein assemblies, the next frontier: The CASP14‐CAPRI experiment
A new work that just came out as a preprint exploited AlphaFold 2 and RoseTTAFold (a protein modeling method of accuracy approaching that of AlphaFold 2, that was published in Science at the same time as AlphaFold 2) to predict the structures not of single proteins but of their complexes. More specifically, the work presents a series of systematically identified, probably accurate models of core eukaryotic protein complexes of the Saccharomyces cerevisiae (yeast) proteome. RoseTTAFold and AlphaFold served to scan through paired multiple sequence alignments for 8.3 million pairs of yeast proteins to detect interacting pairs, filter cases for which accurate predictions should be achievable, and then build 3D models of their assemblies. The work resulted in a large new dataset of predicted protein complexes containing from two to five protein components each, that span almost all key processes in eukaryotic cells. Over a hundred of the modeled protein assemblies had not been previously identified, and over 600 were known to probably form complexes but their structures had not been characterized.
This preprint will for sure end up as a prominent paper in a tier-1 journal with an impact similar to that of the Deepmind-EBI collaboration for modeling all single proteins. You can read it now here (unfortunately the full list of modeled complexes has not been released yet):
A related preprint, actually sharing some author names with the preprint above, used a similar combination of AlphaFold 2 and RoseTTAFold to model complexes of human mitochondrial proteins:
Human mitochondrial protein complexes revealed by large-scale coevolution analysis and deep…
Protein complex prediction with AlphaFold-Multimer
Right when I was about to submit this article, a new preprint from Deepmind landed on bioRxiv. This preprint introduces a new AlphaFold model trained specifically for multimeric inputs of known stoichiometry ("AlphaFold-Multimer"). The developers report that it significantly increases the accuracy of predicted multimeric interfaces over the regular AlphaFold simply adapted for multiple proteins, as the whole community has been doing so far. The new program achieves medium to high accuracy on complex-specific benchmark datasets, even for cases without templates available:
Assessment of AlphaFold 2’s predictions on what it was and it was not designed to predict
Strictly speaking, AlphaFold 2 was devised to predict protein structures, i.e. the relative positions in space of all the atoms that make up a protein. But proteins are much more than just 3D atomic positions. As we saw above, they can interact with each other to form complexes; they can also adopt multiple conformations; and they can even lack a defined structure and be disordered, or have folded domains with disordered segments in between. Even a well-folded protein can have some flexible loops.
Experimental structural biologists joined efforts to assess the utility of AlphaFold in their fields of research
This great preprint, out just last week, gathers evaluations carried out by experts on the potential of AlphaFold for predicting several features of proteins other than the structures themselves:
A structural biology community assessment of AlphaFold 2 applications
The work evaluated the use of AlphaFold 2 predictions in the study of characteristic structural elements of proteins, as a tool to assess the impact of mutations, for predicting functions and binding sites for small molecules (key because most drugs are in fact small molecules that bind to proteins), modeling of interactions (related to the section above and also to the one below), and modeling of experimental structural data. A few highlights from this preprint:
- Compared to what fractions of proteomes can be modeled by simple homology between protein sequences, the authors found that modeling with AlphaFold 2 increases the reach of confident models by a substantial amount, although this varies from proteome to proteome. Interestingly, they report that in some cases AlphaFold’s models identify structural features rarely seen in the PDB. I myself witnessed this on some examples of my own work applying AlphaFold 2.
- Predictions of protein disorder and protein complexes surpass state-of-the-art tools. Yes, AlphaFold 2 was not designed to predict disorder, but one of its self-quality metrics, which measures how confident the prediction is for each amino acid, turned out to be an excellent predictor of protein disorder.
- As many had anticipated and even the CASP14 evaluation had shown for a couple of targets, this work documents more cases where AlphaFold 2 models can be used to facilitate experimental structure determination.
- And for many proteins, the authors show that AlphaFold 2’s models are of such high quality that they serve for various applications as well as the experimentally determined structures if the confidence metrics are critically considered.
In most cases, the role of the confidence metrics is as important as the models themselves. And as the authors judge, it is clear from the assessments presented that the advances brought about by AlphaFold will have a transformative impact on structural biology and broader life science research. Nothing new, because we are already seeing this transformation in research.
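As a rough illustration of how a per-residue confidence metric can be turned into a disorder guess, the sketch below reads the per-residue confidence score (pLDDT) that AlphaFold writes into the B-factor column of its PDB files and reports stretches of consecutive low-confidence residues. The function names, the cutoff of 50, and the minimum segment length of 3 are my own assumptions for this example, not the procedure used in the preprint.

```python
def plddt_per_residue(pdb_text):
    """Map (chain, residue number) to the pLDDT stored in the B-factor column.

    Only C-alpha atoms are read, using the fixed-column PDB ATOM layout.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            scores[(chain, resnum)] = float(line[60:66])
    return scores


def low_confidence_segments(scores, cutoff=50.0, min_length=3):
    """Return (first, last) residue keys of runs with pLDDT below the cutoff.

    The cutoff and minimum length are illustrative assumptions.
    """
    segments = []
    current = []
    for (chain, resnum), value in scores.items():
        contiguous = bool(current) and current[-1] == (chain, resnum - 1)
        if value < cutoff and (not current or contiguous):
            current.append((chain, resnum))
            continue
        # The run ended: keep it if long enough, then start over.
        if len(current) >= min_length:
            segments.append((current[0], current[-1]))
        current = [(chain, resnum)] if value < cutoff else []
    if len(current) >= min_length:
        segments.append((current[0], current[-1]))
    return segments
```

Given the text of a model file, `low_confidence_segments(plddt_per_residue(pdb_text))` lists candidate disordered stretches; these are only hints and should be checked against other evidence before being interpreted.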
Prediction of protein-peptide complexes
The preprints cited above show that AlphaFold 2 does work for modeling complexes between proteins. This preprint here assessed the more specific question of whether it could also model complexes between well-folded, "regular" proteins and short, largely disordered peptides. For this, the authors took entries from the PepBDB database of protein-peptide complexes, obtained their amino acid sequences, and ran AlphaFold 2 with no templates (critical because the reference structures are all available at the PDB).
The work found that overall AlphaFold 2 predicted the bound structures quite well: more than half of the cases with good accuracy, and a substantial portion of them with very high accuracy. There is one caveat, though, that I find here: although the authors dropped the use of templates for this, AlphaFold 2 is known to kind of "know" the Protein Data Bank "by heart". This means that if you feed it a sequence whose structure is already available in the PDB, it might be able to recover it very precisely even without any use of template information. In fact, some informal reports and Deepmind’s paper itself show that with large numbers of sequences in the input alignments, template information becomes less important and can in many cases be disregarded without compromising prediction quality.
Can AlphaFold2 predict protein-peptide complex structures accurately?
Important groups dedicated to developing methods for docking pairs of proteins are also exploiting AlphaFold to assist their own methods. This preprint, for example, uses the authors’ own method, ClusPro, to redock AlphaFold’s models of protein-protein complexes and then uses AlphaFold to refine the models:
Improved Docking of Protein Models by a Combination of Alphafold2 and ClusPro
This other work explored how to tune the input sequence alignment and model identification to improve AlphaFold 2’s predictions of protein-protein complexes:
Improved prediction of protein-protein interactions using AlphaFold2 and extended multiple-sequence…
This other preprint also explored the use of AlphaFold to model protein-peptide complexes:
Harnessing protein folding neural networks for peptide-protein docking
Predicting folded structures is not the same as predicting folding pathways
When AlphaFold 2 came out, a lot was said about it having "solved the protein folding problem". For us scientists working with proteins, it obviously had not. What structures proteins achieve once they are folded is one thing; the mechanism by which proteins achieve their folded structures is a different thing. Recall that proteins adopt their 3D structure by folding onto themselves (sometimes onto other proteins too) from an extended, flexible state.
This preprint compared the pathways generated by state-of-the-art protein structure prediction methods, including AlphaFold 2, to experimental data about the actual folding pathways. The comparison is tricky, because folding data rarely gives the positions of atoms over time; rather, most data is about how long the amino acids of the protein remain exposed to the solvent as the protein folds (amino acids get protected from the solvent as the protein folds because they form contacts and get buried inside the protein). The data is however complemented with physics-based simulations of how proteins fold, constrained by the known structures. In short, the work finds, not surprisingly to me at least, that current protein structure prediction methods do not produce correct folding pathways, not even when their final folded structures are highly accurate:
Current protein structure predictors do not produce meaningful folding pathways
From a special issue of the Journal of Molecular Biology
Articles invited for a special issue of the renowned Journal of Molecular Biology discuss the impact of the major advances promised by AlphaFold 2 and related AI-inspired technologies applied to biology, and how scientists can build upon them. Scientists in different areas of biology discuss various questions, reaching the consensus that AlphaFold will impact (I’d say it is already impacting) how research in the life sciences is done. Models of protein structures were always useful when confident, so increasing the confidence of models as much as AlphaFold did cannot but have an impact.
Many of the articles in the issue are not about AlphaFold at all, yet provide interesting views into how major researchers in the fields of structural bioinformatics, protein design, protein dynamics, protein folding, and protein biophysics in general think about the revolution that these methods, and machine learning in general, are bringing about:
AlphaFold: A Special Issue and A Special Time for Protein Science
I comment here only on articles that I found of special interest and have more to do with Machine Learning, but I recommend that you visit the link above for more.
In his article, multi-hat protein scientist Alan Fersht gives a personal perspective on how machine learning has impacted his area of research, broadly speaking that of protein biophysics. It’s especially interesting to read his view, as he started his work when computers were a luxury and things such as the Protein Data Bank didn’t even exist. Plus, he knows the story of protein biophysics and structural biology very well, starting his article with the year 1968, when modeling of 3D protein structures based on a known structure as a template ("homology modeling") began. At that time, computer science was growing fast and the field of machine learning was starting to develop, aiming at "simpler" problems such as playing games against humans, without anyone even imagining that computers would be able to predict protein structures without templates half a century later. And also at roughly the same time, the problem of protein folding pathways was posed, which remains essentially unresolved today (as presented in one of the preprints above, even when machine learning methods can predict the folded structures right, they cannot account for how the proteins fold into such structures). In fact, the main topic around which Fersht’s paper revolves is whether computer programs will ever be capable of predicting protein folding pathways and help to clarify which principles govern how proteins fold.
Another article, by the [Dill group](https://www.sciencedirect.com/science/article/pii/S0022283621003508#s0075), reviews the protein folding problem in more detail: how it was set by the first experimental structures of globular proteins, and how early biophysicists approached its study, first with polymer theories to model transitions between disordered and ordered structures and then moving to theories on collapse and on folding funnels. The review does not touch on machine learning for structure determination, but it’s a must for any newbie in protein biophysics, as it also discusses the nature of disordered protein states, why folding is fast and directed, multiprotein associations, native-like and fibril-based aggregates, protein assembly, and phase separation (probably the fanciest topic in protein biophysics today), among other topics, all with an interesting theoretical and historical perspective.
Related to the article discussed above, other papers in the issue are dedicated specifically to the problems of protein disorder and aggregation, one of them specifically touching on how machine learning methods can help to unveil the relationships between protein sequence, structure, dynamics, and function for disordered proteins. Among other things, the article reviews methods to infer protein features directly from sequences, such as interaction motifs, functions, and disorder; it also reviews how machine learning can help to predict protein-protein interactions (much of which is presented above in this article), parametrize force fields for molecular simulations, interpret experimental data, etc.
Two interesting papers in the issue discuss (membrane) protein design ([here](https://www.sciencedirect.com/science/article/pii/S0022283621003892) and here), which today relies heavily on experimental work and on computer-assisted design but makes little use of machine learning methods; yet the field is expected to be impacted by them, especially by generative networks, with some examples presented.
Another interesting article touches on a very important problem that protein prediction methods can help with, especially now that they are gaining accuracy: predicting the functional effects of genetic variation. In their article, Diwan et al. discuss AlphaFold’s predictions and the role of structures in assisting the prediction of the effects of genetic variations on phenotypes. At the extreme of utility, the kind of thing one would want to predict is which mutations (the genetic variations) cause a disease (the phenotype), why they do so, and how we can compensate for these effects. One important highlight of the paper is that a structure, either predicted or experimental, of the protein where the genetic variation falls is usually insufficient to understand how this variation relates to an observed phenotype. For a complete prediction of phenotype from genotype we need to understand structure, dynamics, folding pathways, complexes, interactions with other molecules, and possibly other aspects we don’t yet suspect.
Overview of other evaluations and potential applications
- This preprint shows that none of AlphaFold’s predictions or parameters can be used as a proxy for protein stability change upon mutation. Although this was expected, it was worth exploring: https://www.biorxiv.org/content/10.1101/2021.09.19.460937v1
- The work in this other preprint developed an approach for protein design from fixed backbones, capitalizing on the predictive power of AlphaFold. For several designs the authors demonstrated that the AlphaFold-predicted structures were in agreement with the desired backbones, suggesting that AlphaFold and similar methods can facilitate the development of a new range of novel and accurate protein design methodologies -i.e. where a program is used to predict the amino acid sequences that will acquire a desired fold: https://www.biorxiv.org/content/10.1101/2021.08.24.457549v1
- This preprint evaluated how well AlphaFold 2 could predict the structures of membrane proteins specifically, an important point because the program was not given any set of rules specific to this type of protein. Their finding was that the predictions were outstanding: https://www.biorxiv.org/content/10.1101/2021.08.21.457196v1 It is important, though, to consider that there were already some membrane protein targets in CASP14.
Further notes and reads
(New papers relative to my previous stories are marked in bold)
Example use of AlphaFold in a very specific application, modeling certain egg coat proteins: https://onlinelibrary.wiley.com/doi/full/10.1002/mrd.23538
Example use of AlphaFold complemented with ad hoc alignments to model proteins of Trypanosoma and Leishmania parasites: https://www.biorxiv.org/content/10.1101/2021.09.02.458674v1
The original AlphaFold 2 paper in case you haven’t seen it: https://www.nature.com/articles/s41586-021-03819-2
And the original EBI-Deepmind paper presenting a database of models for 21 proteomes: https://www.nature.com/articles/s41586-021-03828-1
A new paper by Deepmind, explaining how they exactly employed AlphaFold 2 during CASP14: https://onlinelibrary.wiley.com/doi/10.1002/prot.26257 Notice that the team participated as a human group, as they performed expert-driven interventions. Based on what they learned in the process, they could polish the final fully automated program that we all know today.
RoseTTAFold, presumably almost as good as AlphaFold 2 though not yet benchmarked in any CASP: https://www.science.org/doi/full/10.1126/science.abj8754
EMBL/EBI release about modeling the protein universe: https://www.embl.org/news/science/alphafold-potential-impacts/
Main preprint about ColabFold, which enables you to easily use AlphaFold 2 and RoseTTAFold: https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1.abstract