The world’s leading publication for data science, AI, and ML professionals.

What’s Up after AlphaFold on ML for Protein Structure Prediction?

Will the AI-powered revolution in biology keep going? Can we expect a new breakthrough? What's going on right now in the field of protein…

Automated Rolling Evaluations by CAMEO and the Next CASP

If you are specifically interested in the rolling evaluation by CAMEO and knowing the current state of protein structure prediction, click here.

If you want to know about the upcoming CASP15, how AlphaFold changed its focus, and its new challenges, click here.


Introduction

AlphaFold 2, the AI-based program developed by Google’s DeepMind to crack the problem of predicting protein structures, made a first strike in late 2020 when the 14th edition of CASP (Critical Assessment of Structure Prediction), a biennial blind "contest" on protein structure prediction, presented its results and declared AlphaFold 2 the clear "winner". It made a second strike half a year later, when DeepMind published a peer-reviewed article in the journal Nature describing how AlphaFold 2 works, and openly released its code on GitHub and as a Google Colab notebook that everybody could use. The hype kept growing as scientists developed even better notebooks from it, and as they found the many applications AlphaFold had, even beyond its original aim. It grew further when DeepMind released a new version of AlphaFold better suited to modeling the complexes that multiple proteins form when they interact; then again when DeepMind joined forces with the European Bioinformatics Institute to release a database of 3D models for all known proteins; and even further when scientists hacked AlphaFold to run it backwards and thus design proteins that would fold as required to achieve certain functions. And probably through more milestones that I’m missing.

Here are a couple of the articles where I covered AlphaFold 2, the notebooks available to run it, and the many positive roads it opened up in biology research:

Google colab notebooks are already running Deepmind’s AlphaFold v. 2

The hype on AlphaFold keeps growing with this new preprint

AlphaFold 2 spin-offs three months after its official release

And now, what’s up?

There were two very strong peaks of interest in AlphaFold, in late 2020 and mid-2021, and the second peak didn’t drop to zero even after months. Rather, we can see a stable baseline, likely sustained by the many biologists still researching AlphaFold and how to apply it to their problems:

Interest in the keyword "AlphaFold" retrieved from Google Trends. This and other pictures by author Luciano Abriata.

But…

Will there be a new peak in interest, reflecting a new breakthrough?

What’s going on right now in the field of structure prediction?

A new round of the competition on protein structure prediction

Regarding the first question, I think we can expect a new peak in interest, although not necessarily a breakthrough, because of the higher complexity of the new open problems and because there is less data in the Protein Data Bank for AI methods to exploit. But let’s see what CASP15 says!

CASP just announced its next edition, the 15th, thus approaching 30 years of evaluation of protein structure prediction (CASP has taken place biennially, uninterrupted, since CASP1 in 1994). In the latest round, CASP14, nearly 100 groups from around the world submitted more than 67,000 models on 84 modeling targets. As in every CASP, independent assessors compared the models with the experimentally determined structures along various tracks, of which determining the tertiary structure of hard targets (i.e. those for which not much structural information is available from similar proteins) is the most important one. (Or should we say "was", because AlphaFold 2 kind of solved that problem?) While you can learn more about CASP14 and AlphaFold 2 in my previous articles (I was an assessor for CASP12 and for CASP13, when AlphaFold came in, so I know all this first-hand!), I want to stress here what the CASP organizers expect for CASP15.

As I just hinted, it is very likely that the focus will shift to predictions of quaternary structure, i.e. of how multiple proteins arrange in 3D space as they interact with each other. The official CASP website (https://predictioncenter.org/casp15/index.cgi) says clearly that the core of CASP remains the same: blind testing of methods with independent assessment against experimental structures to establish the state of the art in modeling proteins and protein complexes. But going into more detail, the website also reveals some changes in the evaluation tracks.

First, tertiary structure prediction will no longer be split into easy and hard targets, which makes sense given that suddenly all tertiary structure predictions became relatively easy. And not only due to AlphaFold, but also to other new tools that exploited AlphaFold-like and other novel methods, such as RoseTTAFold from one of the classic CASP leaders.

In contrast, CASP15 will emphasize the fine-grained accuracy of models, and will still pay close attention to predictions of quaternary structure, i.e. that of complexes formed by multiple proteins together. Although AlphaFold-Multimer improved this substantially, it isn’t yet as reliable as tertiary structure prediction. There will also be increased emphasis on the assessment of accuracy estimates, a key feature of AlphaFold’s predictions that we anticipated in our CASP13 assessment.

CASP15 will drop some categories that don’t make much sense anymore. But it will keep that of accuracy estimation for protein complexes, in which predictors must rank models of protein-protein complexes modeled by others.

For the first time, CASP15 plans to experiment with 3 interesting cases that pose the next frontier in structure prediction, now that tertiary structure predictions have been nailed and quaternary structure predictions have also advanced much: assessing the modeling of RNA molecules and protein-RNA complexes, in collaboration with RNA experts; modeling complexes between proteins and small molecules, which is at the heart of pharma because most clinically relevant molecules exert their action by binding to proteins; and predicting conformational ensembles, i.e. multiple models that explain how proteins move in solution, critical because so far CASP has focused on rather static snapshots of proteins, while they are actually very dynamic.

What’s happened since CASP14 and AlphaFold? Insights from CAMEO, a rolling evaluation on protein structure prediction

Although much less popular than CASP, this is also a very interesting competition. It runs automatically, so there isn’t much expert curation and analysis. But it is always online and open for everybody to explore the most up-to-date information about methods for protein structure prediction.

Its name, CAMEO, stands for Continuous Automated Model EvaluatiOn. You can visit its main page here:

CAMEO – Continuous Automated Model EvaluatiOn – Welcome

CAMEO is a community project sustained by the Computational Structural Biology Group at the Swiss Institute of Bioinformatics and the Biozentrum of the University of Basel, funded by these institutions and by the European Union. CAMEO continuously applies quality-assessment criteria established by the protein structure prediction community to the 3D models produced by a set of participating servers. It offers a variety of scores assessing different aspects of a prediction, such as coverage of the query sequence, local accuracy, completeness, etc.
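To make the "local accuracy" idea concrete, here is a minimal sketch of the local Distance Difference Test (lDDT), one of the headline scores CAMEO reports. This toy version uses only one point per residue (think C-alpha atoms) and a hypothetical coordinate layout; the real lDDT operates on all atoms and includes stereochemical checks, so treat this strictly as an illustration of the principle:

```python
from itertools import combinations
import math

def lddt_ca(ref_coords, model_coords, radius=15.0,
            thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified, single-point-per-residue lDDT sketch.

    ref_coords / model_coords: lists of (x, y, z) tuples, one per
    residue, aligned by index. Returns a global score in [0, 1]:
    the fraction of local reference distances preserved in the model
    within the standard tolerance thresholds (0.5, 1, 2, 4 angstroms).
    """
    preserved = total = 0
    for i, j in combinations(range(len(ref_coords)), 2):
        d_ref = math.dist(ref_coords[i], ref_coords[j])
        if d_ref >= radius:
            continue  # only "local" contacts defined in the reference count
        d_model = math.dist(model_coords[i], model_coords[j])
        diff = abs(d_model - d_ref)
        # each contact is checked against every tolerance threshold
        preserved += sum(diff < t for t in thresholds)
        total += len(thresholds)
    return preserved / total if total else 0.0
```

A model identical to the reference scores 1.0, and the score degrades as local distances are distorted, which is what makes lDDT robust to global domain movements compared to superposition-based scores.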

To date, most groups participating in CASP also have their servers connected to CAMEO. Besides, CAMEO produces naive AlphaFold 2 predictions for all targets. And you guessed right: it is almost always at the top! Even today, almost a year after the formal release of its code and paper, meaning that even methods inspired by it and developed after it have not been able to surpass it. Although you could think this is because there is actually a limit to how good predictions can be, and AlphaFold’s are already as good as they can get, CAMEO’s data shows that in fact many tools, even new ones, are not quite at its level.

You can inspect the assessment data yourself on a dedicated page with interactive plots, where you can choose to see results for only easy, medium, or hard targets, by all servers or by specific ones, over different periods of time. Here it is for hard targets by all groups in the last 3 months, as of April 16th 2022:

The last 3 months of CAMEO for hard target modeling by all servers. Direct link as of April 16th 2022: https://www.cameo3d.org/modeling/3-months/difficulty/hard/?to_date=2022-04-16

As you see on the right, the analysis over 3 months includes 9 hard targets. The graph on the left plots the average LDDT over all models submitted by each group vs. the fraction of targets for which they actually submitted models.
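The two axes of that scatter plot are simple aggregates, and computing them from per-target scores can be sketched as follows; the dictionary layout (server name mapped to the lDDT of each target it actually modeled) is a hypothetical data format for illustration, not CAMEO's actual export:

```python
def cameo_plot_axes(scores, n_targets):
    """Compute, per server, the two axes of a CAMEO-style scatter plot:
    average lDDT over the models it submitted, and the fraction of all
    targets it covered.

    scores: {server_name: {target_id: lddt_score}} for the targets each
    server actually modeled (hypothetical layout).
    n_targets: total number of targets in the evaluation window.
    """
    axes = {}
    for server, per_target in scores.items():
        avg_lddt = round(sum(per_target.values()) / len(per_target), 3)
        coverage = len(per_target) / n_targets
        axes[server] = (avg_lddt, coverage)
    return axes
```

Note that averaging only over submitted models is what lets a server look good on the vertical axis while sitting far left on the horizontal one, which is why both axes matter when reading the plot.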

Let me now zoom in and add some annotations, to discuss it in further detail:

In the plot I labeled some key names and performers, and also highlighted: various servers that contributed predictions for all targets but performed poorly (bottom right); the naive predictions one would get by simply BLASTing the PDB for the structure with the best sequence match ("Naive BLAST"); the best structural template actually available, even if it might not be retrievable through a BLAST search; and the models retrieved directly from the AlphaFold-EBI database (which only covers certain proteomes, hence the low fraction of targets covered).

Among the key names and performers you’ll find pure AF2 predictions in the top right, which is where you want to be. Notice that having a template or not makes no substantial difference when running AF2, as already documented. Notice also that its deviation bar is quite large, which means there are some targets for which AF2 could not predict very good structures (and also that for some targets the predictions were excellent). Notice also how RoseTTAFold, which is assumed to be AF2’s closest competitor, has a lower average score and didn’t even model all targets (reasons for this are not given, but this does not necessarily mean that the program cannot treat them). Finally, there are 3 other methods that performed quite well, even better than RoseTTAFold although not for all targets: PaFold, ZlxFold, and HelixOnAI, all apparently new.
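For intuition on what the "Naive BLAST" baseline above does, here is a deliberately crude stand-in: pick the PDB entry whose sequence best matches the query. The real baseline uses an actual BLAST search with proper alignment and statistics; the ungapped position-by-position identity and the {pdb_id: sequence} map below are assumptions made purely to illustrate the idea:

```python
def best_template(query, pdb_seqs):
    """Toy stand-in for a BLAST-based template search: return the ID of
    the sequence with the highest naive (ungapped, position-by-position)
    identity to the query.

    pdb_seqs: hypothetical {pdb_id: sequence} map of candidate templates.
    """
    def identity(a, b):
        n = min(len(a), len(b))
        if n == 0:
            return 0.0
        # fraction of matching characters over the shared length
        return sum(x == y for x, y in zip(a, b)) / n

    return max(pdb_seqs, key=lambda pid: identity(query, pdb_seqs[pid]))
```

The gap between this kind of baseline and AF2 in the CAMEO plot is exactly the value added by learned structure prediction over plain template lookup.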

These data from CAMEO suggest rather fierce competition, but only among the runners-up, as AF2 still seems to be the best. Surprises can always show up, though, like when AlphaFold itself made its debut in CASP13. Moreover, as explained above, CASP15 will focus on fine details, quaternary structures, ligands, and dynamics, so there might be new surprises there too. Who knows from whom.


Here’s a summary of all my articles and peer-reviewed papers on AlphaFold, CASP, and protein modeling

Here are all my peer-reviewed and blog articles on protein modeling, CASP, and AlphaFold 2


Have a job for me about protein modeling, bioinformatics, protein design, molecular modeling, or protein biotechnology? Contact me here!


www.lucianoabriata.com I write and photoshoot about everything that lies in my broad sphere of interests: nature, Science, technology, programming, etc. Become a Medium member to access all its stories (affiliate links of the platform for which I get small revenues without cost to you) and subscribe to get my new stories by email. To consult about small jobs check my services page here. You can contact me here.

