
See also my two follow-up stories that touch on different aspects:
- AlphaFold-based databases and fully-fledged, easy-to-use, online AlphaFold interfaces poised to revolutionize biology
- The hype on AlphaFold keeps growing with this new preprint
Last week (July 2021) DeepMind’s peer-reviewed academic paper came out in Nature describing all the details of its CASP-winning AlphaFold v.2 program for predicting protein structures. At the same time, they released all its code as open source on GitHub. Within hours, some scientists created Google Colab notebooks where anybody with a free Google account can run AF2 on their favorite protein, without downloading a single bit of its 2.2 TB of data and without requiring any special hardware. A story by a CASP assessor who uses these programs for research.
In December 2020 an organization called CASP, which runs a "contest" on predicting the 3D shapes of proteins, revealed that version 2 of DeepMind’s AlphaFold (AF2) had "won" by a wide margin over the runners-up. Many media outlets said AF2 had solved a 50-year-old problem. Although that was an exaggeration, it is true that AF2’s contribution to biology is enormous, promising to alleviate the burden of experimental protein structure determination and to speed up research in molecular biology. Despite this good news, scientists around the globe voiced concerns about the purported inaccessibility of the technology, the lack of details needed for reproducibility, and whether running AF2 would be out of reach for academic budgets. But things couldn’t have turned out more differently, and for the better.
Last week (July 2021) DeepMind’s peer-reviewed academic paper came out in Nature describing all the details of its CASP-winning AlphaFold v.2 program for predicting protein structures. At the same time, they released all its code as open source on GitHub. Some scientists still complained that the data files were too big (2.2 TB indeed). However, within hours some dedicated scientists created Google Colab notebooks where anybody with a free Google account can run AF2 on their favorite protein, without even downloading the data and without needing any special hardware. In fact, all calculations happen in the cloud yet within a free Colab space, which lets users fine-tune the runs. All this adds up to one of the best ways to democratize access to the technology, after so much negative sentiment. Thanks Google!
Table of contents
- Some context: protein structures, experimental determination, and computer-based prediction of protein structures
- The Critical Assessment of Structure Prediction (CASP)
- Machine learning methods for protein modeling, DeepMind and its AlphaFold 1 program
- AlphaFold version 2
- Democratized access to AF2 through collaborative notebooks -entering a new era of molecular biology
Some context: protein structures, experimental determination, and computer-based prediction of protein structures
Proteins are biological "nanomachines" that consist of long, essentially linear chains of amino acids that fold in 3D, adopting what’s called a structure. There are 20 canonical amino acids, which in each protein are arranged in a unique order as dictated by its protein-coding gene. Getting to know the sequence of amino acids is easy, as it can be inferred directly from the genome. But getting to know how they arrange in 3D space is hard. Moreover, each amino acid is like a small molecule made of several atoms; so, getting to know the 3D shape or "3D structure" of a protein entails determining the relative positions of all atoms of all its amino acids. For example, given the 20 canonical amino acids referred to by the letters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y, the protein called ubiquitin consists of this sequence:
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
…that folds in 3D into this shape, where each atom is shown as a sphere connected to other atoms by sticks, and where each amino acid (i.e. each letter in the sequence above) is shown in a different color:

However, notice that we usually simplify these graphics by showing only the trace that approximates the 3D shape:

These are the kinds of drawings you may have seen every time AlphaFold is under the spotlight.
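To make the idea of a protein sequence as a string over a 20-letter alphabet concrete, here is a minimal Python sketch that checks the ubiquitin sequence given above against the 20 canonical letters and counts its composition. The function names `check_sequence` and `composition` are illustrative choices made for this example, not part of any AlphaFold tool.

```python
from collections import Counter

# The 20 canonical amino acids, in the one-letter code listed in the text.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

# Ubiquitin sequence exactly as given in the text.
UBIQUITIN = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)


def check_sequence(seq):
    """Return the set of characters in seq that are not canonical amino acids."""
    return set(seq.upper()) - CANONICAL


def composition(seq):
    """Count how many times each amino acid letter occurs in seq."""
    return Counter(seq.upper())


if __name__ == "__main__":
    print("Length:", len(UBIQUITIN))
    print("Non-canonical letters:", check_sequence(UBIQUITIN) or "none")
    for letter, count in composition(UBIQUITIN).most_common(5):
        print(letter, count)
```

A check like this is a sensible first step before pasting a sequence into any prediction notebook, since stray characters are a common source of failed runs.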
Knowing the 3D shapes of proteins is essential in biology and any discipline related to it. To cite just 2 examples out of hundreds, enzymes of biotechnological use are themselves a specific class of proteins. And the vast majority of compounds like antibiotics, chemotherapy agents and other drugs are small molecules whose action relies on recognizing and binding 3D features of their protein targets.
Structural biology is the discipline that attempts to explain biological systems at the atomic level. This depends critically on the availability of structures of the molecules involved, the most important of which are usually proteins. While many structures can be determined by experimental techniques such as X-ray or neutron diffraction, nuclear magnetic resonance (NMR), or electron cryo-microscopy (also called cryo-electron microscopy, cryo-EM), there is also the alternative of predicting, or "modeling", these structures using computational methods. In fact, the long-standing dream of structural biologists is to feed a protein sequence to a computer program and receive from it the 3D structure it adopts. This is not easy and isn’t fully solved yet, but after decades of work on the problem, the arrival of Machine Learning models (and of course the availability of large amounts of data) resulted in AF2 (and a couple of other programs too, we must clarify), which can very accurately predict the 3D structures of many proteins for which is available.
Naturally, such predictions are essential for a large number of biological molecules that cannot be produced in the quantities and conditions needed for the various experiments. But predicting structures can also be useful for cases of molecules that may not be difficult to produce and manipulate during the experiments required to solve structures, but for which the amount of information provided by the structure does not justify the costs and time. Indeed, if we could predict structures of biomolecules with sufficient confidence, we could focus experiments only on particularly difficult systems (this is what structural genomics efforts are in fact pursuing) or on studying effects of perturbations in the structure, such as the effect of ligand binding to the protein under study. At the extreme, if we could predict all the physical chemistry of a given system at the atomic level, we could dispense with structure determination experiments altogether, and could concentrate efforts directly on understanding mechanisms and everything that this knowledge allows us to do: developing new pharmaceuticals, designing new enzymes, understanding evolution, …
The Critical Assessment of Structure Prediction (CASP)
Given the impact that structure predictions can then have on structural biology, generations of researchers have been working on the problem since the middle of the last century, especially for proteins since they possess a much greater structural value and variety than other biological macromolecules. Many methods have been developed, which can be classified into two main groups. On the one hand, those that use already known structures to try to predict protein structures of similar sequence or folding, which is known as "homology modeling". On the other hand, those methods that attempt to "fold" sequences independently of any homology with other proteins of known structure, for example by using simulations based on basic physicochemical principles or by using information about the structure of small peptide fragments and/or residue contacts.
The big problem that arose along with methods, programs and experts in structure modeling was how to evaluate the quality of these predictions. In the early 1990s the Critical Assessment of Structure Prediction, or CASP, was born as an organization whose goal is to provide constant monitoring and evaluation of the available methods for predicting protein structures. The contest (actually nobody wants to call it a contest, but for many this is what it is!) takes place every 2 years. For each edition the organizers collect experimental structures that have not yet been published in the Protein Data Bank. The organizers provide the amino acid sequences of these proteins to the predictor groups, who after some time send their predictions back. Then a group of assessors, independent of the organization and not participating as predictors, evaluates the models provided by the predictors in comparison with the experimental structures, to which only the assessors have access. Each competition ends with a series of papers that describe the difficulty presented by the targets, describe the quality of the models provided by the predictors, generate an "official" ranking of the predictors, and discuss the "state of the art" of modeling, especially which methods worked, which structural issues were especially difficult to predict, etc. Between 2016 and 2019 CASP12 and CASP13 took place, for which this author was an assessor in the main track of the competition, which focuses on the prediction of difficult targets. These two CASPs started to reveal the disruptive power of machine learning methods applied together with coevolution techniques, thanks not only to AlphaFold 1 (the "winner" of CASP13) but also to several academic groups that had been laying the foundations for these technologies in the preceding years.
To know more about CASP see this other story. To see the CASP12 and CASP13 papers (technical peer-reviewed articles) see the publications about protein modeling that I list here under Molecular modeling and sequence analyses.
Machine learning methods for protein modeling, DeepMind and its AlphaFold 1 program
In December 2018 CASP revealed that, for the first time ever, a private player had "won". This was DeepMind, from the Alphabet (Google) group. The first version of DeepMind’s program AlphaFold ranked first, substantially above the runners-up though not by a stellar margin. Moreover, although it had invented and implemented some new ideas, at its core it was mainly pushing to the limit what academics had been doing in the preceding years. One of the key elements was analyzing not only the input sequence but also a series of related sequences (a "sequence alignment") that correspond to proteins that achieve similar 3D folds through slightly different sequences. The point is that during evolution amino acids change, but pairs of amino acids that are close in space must retain some affinity, which introduces couplings in the evolutionary patterns. What academics had already been studying for a decade was how to extract these couplings from sequence alignments. By CASP12 they were starting to get this right, bringing about a small improvement in structure prediction. By CASP13 many groups were doing it, but a few academic groups and AF1 also used this information to predict not only which pairs of amino acids are in contact but also the distances between them, and even their relative orientations. These intermediate predictions were then used by the different programs, including AF1, to model the 3D structure of the protein.
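To illustrate what "couplings in the evolutionary patterns" means, here is a minimal Python sketch that scores every pair of columns in a toy alignment by their mutual information. This is a deliberately simple stand-in chosen for illustration: it is not the method used by AF1 or by the academic groups, whose approaches are considerably more elaborate. The alignment `TOY_ALIGNMENT` and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


def column_couplings(alignment):
    """Score every pair of columns (0-based indices) in an alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = ["".join(seq[i] for seq in alignment) for i in range(length)]
    return {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(length), 2)
    }


# Hypothetical toy alignment: column 0 is conserved, columns 1 and 3 change
# together (K pairs with E, E pairs with K), column 2 varies independently.
TOY_ALIGNMENT = ["AKLE", "AELK", "AKVE", "AEVK"]

if __name__ == "__main__":
    scores = column_couplings(TOY_ALIGNMENT)
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

In this toy case the pair of columns that change together stands out with the highest score, which is the kind of signal that hints at two positions being close in space. Real alignments are much noisier, which is why a raw measure like this one was not enough in practice.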
AF1 was the best for most proteins evaluated in CASP13, and you probably saw the hype in the media. But it wasn’t really disruptive the way AF2 is.
AlphaFold version 2
In December of 2020 the results of CASP14 were revealed (online due to the pandemic, so I could be there as a former assessor!) showing that DeepMind had won again, but this time by a wide margin and practically "solving" some (not all like you see in the media, just some) of the key problems in protein structure prediction.
Like other programs, AF1 worked as a set of rather disconnected modules. One main module analyzed the input sequence and alignment to predict distances and orientations between pairs of amino acids, and then another module used these distances and orientations as restraints to "fold" the linear sequence of amino acids into a predicted 3D structure. AF2 is not a tuning of AF1 but a complete redesign where everything from input to output runs through a single model connecting all the way from sequences to predicted 3D structures. That means the network, so to speak, "knows" the physics connecting sequence to structure. The details are in their peer-reviewed article, especially in its supporting information, and an intermediate-level explanation is in Carlos Outeiral’s post.
The following plot summarizes how much progress has happened in CASP since it started, over a quarter century ago (I got to know about CASP around 2002, when I was just learning about protein structure!). After years of rather poor prediction capability (note this plot is only for the hard targets!) you can see a clear improvement from CASP11 to CASP12 (no AlphaFold yet), then another jump in CASP13, due not only to AF1 but also to the whole community, and then the latest improvement in CASP14, where academics also improved but not as much as DeepMind, which broke through the quality barrier. Moreover, its predictions are consistently good, as you see in the rather low dispersion.

AF2’s predictions are so good that it not only gets the overall shape right but also the precise positions of most atoms. In the example protein graphics I showed above, this means not only guessing correctly where the trace goes but actually the positions of all the atoms that make each amino acid. It got right some very large proteins (usually bigger proteins are harder to model). It even got right some protein-protein complexes, which is a whole different track in CASP, still lagging behind.
AF2’s predictions for CASP14 were so good that some of its models were used to complete the experimental structure determination of a few CASP14 targets. This had happened only a few other times in previous CASPs.
Last, AF2 is not only a robust predictor of protein structures, but also a good predictor of the quality of its own models. This is key in the field of protein structure prediction, and often overlooked by both developers and users. A model with associated quality estimates is much richer than a presumably good model for which you lack any formal quality estimate. All the serious groups provide such estimates, and in the case of AF2 the DeepMind people made sure that their quality indicators were good. For each amino acid, AF2 predicts a quality score called LDDT, the same score CASP uses to compare models to experimental structures amino acid by amino acid. This way AF2 tells you how confident it is about the different regions of the models it produces.
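To give a feeling for what an LDDT-style score measures, here is a minimal Python sketch of a simplified, per-residue version computed on one point per amino acid (think of the alpha carbons). It assumes the usual definition of the score: for each residue, look at its neighbors within 15 Å in the reference structure and ask what fraction of those distances the model preserves within tolerances of 0.5, 1, 2 and 4 Å. The full score works on all atoms and includes further checks, so treat this as an illustration only. The coordinates are made up for the example.

```python
import math

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # distance tolerances, in angstroms
INCLUSION_RADIUS = 15.0            # neighbors are defined in the reference


def per_residue_lddt(reference, model):
    """Simplified LDDT per residue, using one (x, y, z) point per residue."""
    if len(reference) != len(model):
        raise ValueError("reference and model must have the same length")
    n = len(reference)
    scores = []
    for i in range(n):
        preserved = 0
        considered = 0
        for j in range(n):
            if i == j:
                continue
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= INCLUSION_RADIUS:
                continue  # not a local neighbor in the reference
            diff = abs(d_ref - math.dist(model[i], model[j]))
            considered += len(THRESHOLDS)
            preserved += sum(diff < t for t in THRESHOLDS)
        # Residues without any neighbor get an undefined score.
        scores.append(preserved / considered if considered else float("nan"))
    return scores


# Made-up reference: five residues on a straight line, 3.8 angstroms apart.
REFERENCE = [(3.8 * k, 0.0, 0.0) for k in range(5)]
# Made-up model: identical, except the last residue is displaced by 6 angstroms.
MODEL = REFERENCE[:4] + [(15.2, 6.0, 0.0)]

if __name__ == "__main__":
    for index, score in enumerate(per_residue_lddt(REFERENCE, MODEL)):
        print(index, round(score, 3))
```

Note that the score only compares distances within each structure, so it does not require superimposing the model onto the reference: shifting the whole model in space leaves every value unchanged.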
Democratized access to AF2 through collaborative notebooks -entering a new era of molecular biology
Criticism and skepticism followed instantly after CASP14 revealed AF2’s success: "academia cannot compete with such giants", "nice that they made it but we won’t be able to use it", and "for sure they won’t open it up for use" were some of the common reactions. But things couldn’t have turned out more differently, and for the better.
Last week the academic paper describing the full results of this huge machine learning model came out, together with all the code, free and open source, hosted on GitHub. Moreover, engaged researchers (to whom I’m very thankful) built Google Colab pipelines where you can use AF2 even from your phone.
The AF2 downloads from GitHub amount to 2.2 TB, and the model is meant to be run on GPUs. But by using Google Colab you don’t need to download the software, and you don’t need any powerful GPUs. You just run everything in the cloud. And you don’t even need to know about cloud computing!
The Colab notebooks put together by these researchers handle everything from loading libraries and inputting your protein’s sequence to building its alignment, finding homologs of known structure ("templates", which of course help a lot to model related proteins), running AF2 and displaying the results: 5 models that you can see in 3D right in the browser, and plots of estimated LDDT against sequence. Furthermore, you can in principle fork any of these notebooks and make your own edits to adapt the runs to more specific tasks.

I have already run several tests using these notebooks, and can already draw some conclusions. The most important one is that sequence alignments and templates both help a lot to get better models. I have seen many tweets from people using the tool, but many of them overlook the LDDT estimation plots. These plots are essential!
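If you download a model and want to inspect its estimated LDDT yourself, here is a minimal Python sketch. It rests on one assumption that you should verify for the notebook you used: that the model is saved as a PDB file with the per-residue estimate stored in the B-factor column, as AlphaFold output files commonly do. The helper `atom_line`, the toy model and the cutoff of 70 are illustrative choices made for this example, and the parser ignores chain identifiers, so it only suits single-chain models.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, bfactor):
    """Build one fixed-width PDB ATOM record (occupancy fixed at 1.00)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfactor:6.2f}"
    )


def plddt_per_residue(pdb_text):
    """Map residue number to the B-factor value of its CA atom.

    Assumes the B-factor column (columns 61-66) holds the estimated LDDT.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_residues(scores, cutoff=70.0):
    """Residue numbers whose estimate falls below cutoff (illustrative value)."""
    return [res for res, value in sorted(scores.items()) if value < cutoff]


# Hypothetical three-residue model with made-up coordinates and estimates.
TOY_MODEL = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0, 45.2),
    atom_line(2, "CA", "MET", "A", 1, 1.5, 0.0, 0.0, 45.2),
    atom_line(3, "CA", "GLN", "A", 2, 5.3, 0.0, 0.0, 88.1),
    atom_line(4, "CA", "ILE", "A", 3, 9.1, 0.0, 0.0, 96.7),
])

if __name__ == "__main__":
    scores = plddt_per_residue(TOY_MODEL)
    print(scores)
    print("Low confidence:", low_confidence_residues(scores))
```

Listing the low-confidence residues this way is a quick complement to the plots, and a reminder not to interpret those regions of a model at face value.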
The hype about the great quality of AF2’s predictions is, to me, greatly amplified by the possibility of actually using it, and so easily. This opens up opportunities for researchers all over the world, paving the way for extensive benchmarks of the limitations and possibilities of AF2. Modeling protein structures is essential for biologists working with proteins that are not amenable to experimentation. As explained in the introduction, good protein models are useful even when you do have some data but cannot make proper use of it. Without going into details, one example is the ability to phase X-ray data by molecular replacement. Another big area is using protein models to fit atomic coordinates into experimental volumes, for example from mid-resolution cryo-EM. And the future is even brighter, as DeepMind itself gets into other problems of biology (no plans revealed, but we can guess protein-protein interactions first, and then small-molecule design) and as academics can profit hands-on from AF2 applications and from all the knowledge now made public.
Work in biology has relied on computers and classical software for a long time. Now it is the time of AI.
More links and reads
See my follow-up story in TDS Editors: https://towardsdatascience.com/alphafold-based-databases-and-fully-fledged-easy-to-use-alphafold-interfaces-poised-to-baf865c6d75e
And also this follow-up story in TDS Editors on the combination of AlphaFold2 with calls to a superb sequence alignment builder, MMSeqs2, to give users the full power of this technology on simple Colab notebooks: https://towardsdatascience.com/the-hype-on-alphafold-keeps-growing-with-this-new-preprint-a8c1f21d15c8
Google Colab notebooks can also be used to run molecular dynamics simulations; see this story in TDS Editors: https://towardsdatascience.com/new-preprint-describes-google-colab-notebook-to-efficiently-run-molecular-dynamics-simulations-of-9b317f0e428c
Blog post by Carlos Outeiral at Oxford Protein Informatics Group: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics. Very interesting first thoughts and recap of what happened in CASP14.
Blog post by Carlos Outeiral at Oxford Protein Informatics Group: what’s behind the structure prediction miracle. With more detail than what I gave you here, yet simpler than the Nature paper.
I am a nature, science, technology, programming, and DIY enthusiast. Biotechnologist and chemist, in the wet lab and in computers. I write about everything that lies within my broad sphere of interests.