The world’s leading publication for data science, AI, and ML professionals.

Beyond AlphaFold: The Future Of LLM in Medicine

AlphaFold leaves a complex legacy: What will be the future of LLM in biology and medicine?

|ALPHAFOLD3|LLMs|LLMs & MEDICINE|

Image created by the author using AI

In real open source, you have the right to control your own destiny. – Linus Torvalds

AlphaFold3 has arrived, but its reception was not as triumphant as DeepMind may have expected. Still, it is another episode in the revolution biology is undergoing: LLMs are reshaping both medicine and the pharmaceutical field. Three years after the release of AlphaFold2, computational biology has changed, and it is time to take stock.

  • Why does AlphaFold3 mark a turning point, for better and for worse?
  • Why was the research community so disappointed?
  • How is the wind changing? How is the community responding?
  • What does the future hold for LLMs in biology?

We discuss this in this article.


The arrival of AlphaFold3

When AlphaFold2 was announced, it seemed like the dawn of a revolution. For almost a century, predicting the structure of a protein from its sequence had seemed a task beyond our technical capabilities. For nearly fifty years, computational biology had been trying to understand the elusive rules that decide which shape a protein takes on.

In 2021, AlphaFold2 seemed to have solved the problem. Indeed, the model achieved accuracy in predictions that had never been seen before. The community was excited by its potential and future impact on both research and industry.

Speaking the Language of Life: How AlphaFold2 and Co. Are Changing Biology

Predictions with AlphaFold2. image source: here, license: here

A few days ago, the scientific community got excited again: DeepMind published AlphaFold3.

The first question one might ask is: if the protein-folding problem has been deemed solved, why do we need a new AlphaFold?

Actually, protein-folding prediction is just the beginning.

AlphaFold2 was revolutionary because it demonstrated great accuracy in predicting the structure of single-chain proteins. However, this is a special case of both the protein-folding problem and the scientific interest in protein structure.

Since its publication, several groups have been interested in extending the capabilities of AlphaFold2. For example, a new version called AlphaFold-Multimer was presented later to conduct predictions on protein complexes.

Structure examples predicted with the AlphaFold-Multimer. Image source: here

These successes lead to the question of whether it is possible to accurately predict the structure of complexes containing a much wider range of biomolecules, including ligands, ions, nucleic acids, and modified residues, within a deep learning framework – source

One of the limitations of AlphaFold2 was its inability to predict interactions. This is a major limitation, since proteins are the engine of a vibrant ecosystem of continuous interactions (with each other and with many other macromolecules). Moreover, developing drugs requires understanding how a drug interacts with biological molecules. Designing drugs with high affinity for their target protein is one of the goals of computational chemistry (while unwanted interactions are often the cause of side effects). It was therefore natural to imagine that AlphaFold3 would be geared toward this goal.

AlphaFold2 Year 1: Did It Change the World?

Certainly, AlphaFold3 is a technical masterpiece. It reduces the use of multiple sequence alignment (the need for similar examples to guide prediction) and introduces a new Diffusion Module for structure prediction. In other words, DeepMind simplified AlphaFold2 while improving overall performance. Since a generative diffusion approach is prone to hallucination, this was no easy task.
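The denoising-diffusion idea behind such a module can be illustrated with a toy, stdlib-only sketch: noise a set of 1D "coordinates" with the standard DDPM-style forward process, then invert a step with an oracle denoiser that stands in for the trained network (which, in AlphaFold3, predicts denoised coordinates). Everything here, the linear schedule, the oracle, the tiny data, is an illustrative assumption, not AlphaFold3's actual implementation.

```python
import math
import random

random.seed(0)

# Toy 1D "coordinates": a few atom positions we pretend are the true structure.
x0 = [0.0, 1.5, 3.0, 4.5]

# A simple variance schedule (assumption: linear betas, as in basic DDPM setups).
T = 50
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= (1.0 - b)
    alpha_bars.append(prod)

def noise(x, t):
    """Forward process: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * xi + math.sqrt(1 - a) * random.gauss(0, 1) for xi in x]

def denoise_step(xt, t):
    """Oracle denoiser: it knows x0, standing in for the trained network."""
    a = alpha_bars[t]
    # Noise the model would have to predict if it were perfect:
    eps_hat = [(xti - math.sqrt(a) * x0i) / math.sqrt(1 - a)
               for xti, x0i in zip(xt, x0)]
    # Recover the x0 estimate from the predicted noise.
    return [(xti - math.sqrt(1 - a) * e) / math.sqrt(a)
            for xti, e in zip(xt, eps_hat)]

xT = noise(x0, T - 1)              # heavily noised coordinates
x0_hat = denoise_step(xT, T - 1)   # oracle denoising jump back toward x0

err = max(abs(a - b) for a, b in zip(x0, x0_hat))
print(f"max reconstruction error: {err:.2e}")
```

The real Diffusion Module replaces the oracle with a learned network conditioned on the model's internal representations; the sketch only shows the forward/inverse algebra.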

AlphaFold3 integrates elements from models that have been developed to predict specific interactions (so you can see it as a generalization or a single model that predicts different types of interactions). In addition, the resolution of the model is much higher.

AlphaFold3, the researchers found, substantially outperforms existing software tools at predicting the structure of proteins and their partners. – source

To summarize, AlphaFold3 performs better than a wide range of software and models used by researchers to find new drugs, all thanks to a single model.

Despite the sophisticated technique used to develop it, one question remains:

Will AlphaFold3 have the same disruptive impact that its previous version had?

The disappointment of the community

"We have to strike a balance between making sure that this is accessible and has the impact in the scientific community as well as not compromising Isomorphic’s ability to pursue commercial drug discovery." – Pushmeet Kohli, DeepMind’s head of AI science, source

AlphaFold3 can be accessed through a dedicated server: upload a sequence and about ten minutes later you get the prediction. Usage is limited to 10 predictions per day, you cannot obtain structures of proteins bound to possible drugs, and only non-commercial applications are allowed.

When AlphaFold2 was published, the entire code was made available to researchers. AlphaFold3, by contrast, comes only with "pseudocode"; however detailed, it does not make reproducing the model easy.

Traditionally, the most prestigious research journals only accept articles whose authors agree to publish the code (or at least promise to release it on request). There has always been some ambiguity, but in this case it was clear from the beginning that the code would not be released.

This has sparked outrage from the scientific community, culminating in an open letter of criticism:

We were disappointed with the lack of code, or even executables accompanying the publication of AlphaFold3 in Nature. Although AlphaFold3 expands AlphaFold2’s capacities to include small molecules, nucleic acids, and chemical modifications, it was released without the means to test and use the software in a high-throughput manner. – source

Unexpectedly, Nature responded to the criticism with an editorial explaining its reasons for agreeing to publish AlphaFold3 without either the code or the model being made available.

Despite justifications from both Google and Nature, the scientific community has not been satisfied. After all, the restrictions Google imposed, and the decision not to release the model, seriously limit both its potential impact and its usefulness. It is difficult to assess how good the model is, and its non-release prevents it from being used for new applications, as its predecessor was.

Perhaps the criticism had an impact: DeepMind has since committed to releasing the model for academic use. But by then the damage had been done.

So now we have a betrayed community, the same community that celebrated AlphaFold2 and its release, and that was anxiously awaiting the new version.

How will this community respond? What will happen in the future?


The Future of Biology LLMs

AlphaFold2 started a revolution, but it is not the only hero. Soon after AlphaFold2, other models capable of accurately predicting the structure of a protein were published, as other large companies became interested in the topic (e.g., Meta with ESMFold and Salesforce with ProGen). These models were released as open source and quickly adopted by the scientific community.

Highlighted ESMFold structure predictions, comparison to AlphaFold2, and comparison to closest PDB structure. image source: here

Although they are slightly less accurate than AlphaFold2, their impact has been remarkable, mainly for two reasons. First, they are easier to use: they are plain transformers rather than AlphaFold2's complicated architecture, and they are available on Hugging Face. Second, they are much faster, and ESMFold in particular is much less expensive to run in inference than AlphaFold2.

ESMFold vs AlphaFold2. image source: here

AlphaFold2 is a technical marvel, but it is a real pain for any application beyond predicting the structure of a protein (as anyone who has tried to use it outside its server knows). Fine-tuning it requires resources and expertise. The other two models, on the other hand, are much lighter and easier to fine-tune.

And researchers have noticed that.

Especially when it comes to creating additional applications where you have to modify the model or conduct some fine-tuning. In biomedical science, something is truly successful only if it is adopted by the community: copied, used, modified, and adapted.

How LLMs Can Fuel Gene Editing Revolution

This shows that researchers are eager to use these models and are interested in building innovative applications on top of them. If researchers do not find a model useful or easy to use, it is abandoned and forgotten without regret (GitHub is a huge graveyard of bioinformatics libraries that are no longer maintained).

But then why, so far, have open-source models come only from large companies?

Training these models is quite expensive. Training a model like AlphaFold3 can cost up to a million dollars in cloud computing. Having the code is only the easy part: you also have to download and process complex datasets, not to mention the technical challenges of the training itself.
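The million-dollar figure is easy to sanity-check with a back-of-envelope calculation. Every number below (GPU count, wall-clock time, hourly price) is an illustrative assumption of mine, not something reported by DeepMind:

```python
# Back-of-envelope training-cost estimate. All numbers are illustrative
# assumptions, not figures from the AlphaFold3 paper.
n_gpus = 128            # accelerators running in parallel (assumed)
days = 21               # wall-clock training time (assumed)
usd_per_gpu_hour = 3.0  # typical on-demand cloud price for a datacenter GPU (assumed)

gpu_hours = n_gpus * days * 24
compute_cost = gpu_hours * usd_per_gpu_hour

# Storage and egress for the training data (PDB plus derived datasets),
# assumed small next to compute.
data_cost = 5_000

total = compute_cost + data_cost
print(f"{gpu_hours:,} GPU-hours -> ~${total:,.0f}")
```

One such run lands around $200k, so a handful of full runs plus the usual ablations plausibly reaches the quoted million-dollar scale.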

The cost of these models and of their training, however, is ultimately a manageable aspect. Some institutions can cover it, and the open-source community has succeeded in even more complex projects. But there are two other aspects to consider. The first is that AlphaFold2 left the community bewildered; few thought a computational model could achieve those results in protein folding. The second is that protein folding and computational chemistry have traditionally been fields closely tied to experimental work.

Today, however, academic research is much better prepared, and many more researchers have the computational skills to train an open-source model. The race to recreate AlphaFold3 in an open-source manner has therefore already begun. The OpenFold consortium is already working to develop an OpenFold3 model (the open-source version of the DeepMind model). They are not the only ones: the University of Washington is working on the same idea, and there are independent projects such as lucidrains'.

There is more than just recreating AlphaFold3

Several open-source models oriented toward health domains have come out in recent months. Researchers have fine-tuned these models to obtain LLMs able to generate new sequences, and some represent original solutions, such as protein-language models (models that combine large language models and protein models).

Examples of protein-to-text generation tasks. Proteins are represented by sequences of amino acids. image source: here

Another interesting application of LLMs in biology is scGPT, a model that learns a representation of single-cell data. This can be used for target discovery (one of the steps in the drug-design process).

image source: here

There are also other examples of interesting applications or trained models for specialized biological/medical tasks, such as models for histology, gene regulation, gene expression, gene pathways, and much more. Researchers are even experimenting with other types of LLMs, such as state-space or RNN-based models.

These models point to some trends for the future. Open-source models are being used by researchers for applications in medicine and biology. At the beginning, models such as AlphaFold2 and ESMFold were used in inference to predict the structure of proteins. These structures were then used by biologists in their publications or for hypothesis generation. In a second phase, researchers fine-tuned these models to adapt them to a specific task, or combined models when they needed capabilities the available models lacked.
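A concrete example of that first, inference-only phase: AlphaFold2 and ESMFold write their per-residue confidence score (pLDDT) into the B-factor column of the output PDB file, and biologists routinely filter low-confidence regions before building hypotheses on a predicted structure. A minimal sketch, using a made-up PDB fragment and a simplified whitespace-based parser (real PDB parsing is fixed-width):

```python
# Made-up PDB fragment in the style of an AlphaFold/ESMFold output; the
# B-factor column (second-to-last number) carries the pLDDT score.
PDB = """\
ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00 92.50           N
ATOM      2  CA  MET A   1      12.560   6.351  -6.601  1.00 95.10           C
ATOM      3  CA  ALA A   2      13.200   7.900  -3.200  1.00 41.30           C
ATOM      4  CA  GLY A   3      15.910   9.120  -1.080  1.00 88.70           C
"""

def plddt_per_residue(pdb_text):
    """Read pLDDT (stored in the B-factor column) for each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            fields = line.split()
            if fields[2] == "CA":  # one score per residue, taken from the CA atom
                scores[int(fields[5])] = float(fields[10])
    return scores

scores = plddt_per_residue(PDB)
confident = {r: s for r, s in scores.items() if s >= 70.0}
print(scores)     # {1: 95.1, 2: 41.3, 3: 88.7}
print(confident)  # residues 1 and 3 pass the common pLDDT >= 70 cutoff
```

Residues with pLDDT below roughly 70 are commonly treated as low confidence and excluded from downstream analysis.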

Now we are entering a new phase. Models such as scGPT show that the community is no longer limited to using models in inference or to modifying them (fine-tuning, combining). Rather, it develops new models from scratch when existing models cannot solve a given task.

These are the preconditions for a revolution. Researchers no longer have to wait for some big company to develop new models (and hope it releases them as open source); they can develop them themselves. There are many applications in the biomedical field for which this is useful, so we will see an explosion of specialized models in research applications, many trained from scratch. Large companies will continue to focus on foundation models, but researchers now have the expertise to create models even for applications outside the interest of large companies.

In summary, companies like Google and Meta will continue to train models that predict structure (we will probably see an AlphaFold4 in time). A few well-funded groups will try to create open-source alternatives (such as RoseTTAFold). And there are already, and will increasingly be, researchers and biotechs who either adapt published models (fine-tuning, modification, feature extraction, ensembling) or create small models from scratch.

This shows that AlphaFold2 was the spark that ignited the scientific community, but the community is ready to continue the revolution.


Parting thoughts

AlphaFold3 will serve as a "cautionary tale" to academics about the perils of relying on technology companies such as DeepMind to develop and distribute tools such as AlphaFold. "It’s good they did it, but we shouldn’t depend on it," AlQuraishi says. "We need to create a public-sector infrastructure to be able to do that in academia." – source

Predicting the structure of a protein has enormous application implications in the biomedical field. It can be used for almost endless applications: understanding the interaction between pathogen and host, developing new and more efficient drugs, devising new vaccines, and understanding molecular changes caused by cancer. This revolution also extends to other fields, such as creating enzymes to reduce environmental pollution or producing crops capable of resisting global warming. If this revolution remains in the hands of a few companies, we will see only a fraction of its potential benefits.

AlphaFold2 demonstrated the feasibility of predicting a structure from its sequence. It is not the end point of research, because it has several limitations. AlphaFold3 addresses some of them, but its lack of accessibility limits its usefulness.

This revolution consists of three phases:

  • In the first phase, researchers used LLMs as they were: only in inference and, in most cases, through dedicated servers, mainly to generate new scientific hypotheses.
  • In the second phase, the models were modified or adapted for special cases. Researchers conducted fine-tuning of the models with their proprietary data or extracted the representation to use for other applications. With more models available, the community began to combine them into increasingly complex pipelines.
  • In the third phase, several groups began to create models from scratch. There are many applications for which it is cheaper to train a model from scratch than to modify a model trained for something else.
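The third phase does not necessarily mean a giant transformer. As a deliberately tiny stand-in, here is a from-scratch generative sequence model, a bigram model over amino acids, trained on a few made-up peptide sequences (not real proteins) and already able to sample new ones:

```python
import random
from collections import defaultdict

random.seed(7)

# Toy training set: short made-up peptide sequences (not real proteins).
train = ["MKTAYIAK", "MKLVTAYA", "MKTVIAKL", "MKLAYTAK"]

# "Training": count amino-acid bigrams, with '^'/'$' as start/end tokens.
counts = defaultdict(lambda: defaultdict(int))
for seq in train:
    tokens = ["^"] + list(seq) + ["$"]
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1

def sample(max_len=20):
    """Sample a new sequence from the learned bigram distribution."""
    out, cur = [], "^"
    while len(out) < max_len:
        nxt = list(counts[cur].keys())
        weights = list(counts[cur].values())
        cur = random.choices(nxt, weights=weights)[0]
        if cur == "$":  # end-of-sequence token
            break
        out.append(cur)
    return "".join(out)

generated = [sample() for _ in range(3)]
print(generated)  # all samples start with 'MK', like every training sequence
```

A real project would swap the bigram counts for a small transformer and the toy strings for curated sequences, but the workflow, collect data, fit a generative model, sample candidates, is the same.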

Each of these phases means an increase in both the expertise and the resources devoted to using LLMs in biology. If the community was initially caught off guard by the impact of AlphaFold2, it quickly internalized the expertise and began to invest in LLMs. The result is that today many researchers create ad hoc models for their needs.

This new expertise enables researchers to create LLMs dedicated to their own specialty. For example, researchers who specialize in monoclonal antibodies will be able to build LLMs that generate new antibody sequences. They can conduct fine-tuning of an existing model (as [done](https://arxiv.org/abs/2405.12564v1) for CRISPR), combine it with a language model for conditional generation (a similar approach to this one), or create one from scratch. Having obtained the model, they can quickly test the results in the lab and, if necessary, refine the training. This model diversification will have experimental spin-offs, and the newly generated data will be used to train future models.

Certainly, big companies like Google and Meta are investing in biology LLMs. However, there is a whole ecosystem of biotechs and academic labs that can make the full potential of language modeling in biology a reality. This also requires open-source models that the community can use freely. Today the scientific community is much more ready for the challenge, and in the future open-source models will dominate the scene, allowing for models diverse in both structure and function, because the expertise is much more widespread today.


TL;DR

  • AlphaFold2 has marked a turning point in showing how LLMs can be impactful in biology and medicine.
  • AlphaFold2 made such an impact also because it was released as open source.
  • AlphaFold3 will not have the same impact on the community because it is closed source and therefore cannot be used freely by researchers.
  • Open-source models allow the community to create new applications, combine models, or use them to generate new hypotheses.
  • The research community had been mainly passive (it used published models without modifying them). Today, however, more and more models are being published by independent groups, small biotechs, and universities, showing that a strong open-source community is active.
  • In the future, only open-source LLMs will be truly adopted by researchers: not only used for research in medical and biological fields but also modified and adapted for endless applications. The others will be abandoned.

What do you think? What do you see as the possible contribution of LLMs in biology? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, connect with or reach me on LinkedIn, and check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Clear Waters: What an LLM Thinks Under the Surface

HippoRAG: Endowing Large Language Models with Human Memory Dynamics

Graph ML: A Gentle Introduction to Graphs

Can a LLM Really Learn New Things

References

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Jumper, 2021, Highly accurate protein structure prediction with AlphaFold, link
  2. Abramson, 2024, Accurate structure prediction of biomolecular interactions with AlphaFold 3, link
  3. AlQuraishi, 2021, Protein-structure prediction revolutionized, link
  4. Evans, 2021, Protein complex prediction with AlphaFold-Multimer, link
  5. Lin, 2022, Language models of protein sequences at the scale of evolution enable accurate structure prediction, link
  6. Nature, 2024, AlphaFold3 – why did Nature publish it without its code? link
  7. Callaway, 2024, Major AlphaFold upgrade offers boost for drug discovery, link
  8. Yang, 2023, AlphaFold2 and its applications in the fields of biology and medicine, link
  9. Bertoline, 2023, Before and after AlphaFold2: An overview of protein structure prediction, link
  10. Xu, 2023, Toward the appropriate interpretation of Alphafold2, link
  11. Madani, 2023, Large language models generate functional protein sequences across diverse families, link
  12. Ruffolo, 2024, Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences, link
  13. Liu, 2023, ProtT3: Protein-to-Text Generation for Text-based Protein Understanding, link
  14. Nechaev, 2024, Hibou: A Family of Foundational Vision Transformers for Pathology, link
  15. Cui, 2023, scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI, link
  16. Della Torre, 2023, The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics, link
  17. Avsec, 2021, Effective gene expression prediction from sequence by integrating long-range interactions, link
  18. Schiff, 2024, Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling, link
