| ARTIFICIAL INTELLIGENCE | LLM | GENE EDITING | AI IN MEDICINE |

Genes are like the story, and DNA is the language that the story is written in. – Sam Kean
Generative AI can create poems, code, blog posts, and more, all from being trained on text. We often forget that text is just a sequence of characters, yet those characters can be assembled in complex ways to express virtually infinite meanings. Similarly, life is written with just a few basic characters (only 4 for DNA and 20 for proteins), and their nearly infinite combinations have produced the incredible biodiversity we see today.
If life is written in sequences, and language models are capable of analyzing sequences, why not apply language models to DNA and protein sequences?
This idea is the basis of the revolution of the last two years, a revolution that began with AlphaFold2: by using a model trained on protein sequences, researchers solved a problem that had resisted solution for half a century. Today, thanks to AlphaFold2, we can reconstruct the structure of a protein from nothing but a sequence of characters.
Speaking the Language of Life: How AlphaFold2 and Co. Are Changing Biology
The secret is that the model learns a representation of the data on its own (self-supervised learning), which it can then use to perform downstream tasks. In the case of proteins, the model learns a representation of the sequences and the patterns within them (like text, these sequences are not random; they carry functional meaning and their own semantics). This representation then allows us to predict a protein's structure, its function, and other properties.
Here, we report that large protein language models learn sufficient information to enable accurate, atomic-level predictions of protein structure. (source)
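To give a feel for what this looks like in code, here is a minimal sketch of a masked protein language model filling in a hidden residue. I use the small public ESM-2 checkpoint through the Hugging Face transformers library; the checkpoint and the sequence fragment are my own illustrative choices, not specifics from the paper:

```python
# Minimal sketch: ask a masked protein language model to fill in a hidden residue.
# Assumes the Hugging Face `transformers` library and the public ESM-2 checkpoint.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)

# A protein fragment with one residue masked out
sequence = "MKTAYIAKQR<mask>DLGLPAAF"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and list the model's top guesses for it
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top = logits[0, mask_idx].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top.tolist()))  # plausible amino acids
```

Because the sequences it trained on are functional rather than random, the amino acids the model proposes tend to be the ones that are biochemically plausible in that context.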

One of the strengths of generative AI is the ability to leverage an LLM for tasks such as text generation. Moreover, we can generate text conditioned on specific requirements. For example, when we ask a model to generate a minimal Python function to rotate an image, the generated text must satisfy several constraints (a concrete example follows the list):
- Functionality: the generated text (code) must perform the requested operation accurately.
- Efficiency: the function should not be needlessly complex; it should use the minimum number of steps.
- Syntactic correctness: the model must respect the rules of the language (here, Python).
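To make this concrete, here is the kind of minimal function we would hope the model generates for that prompt (a sketch assuming the Pillow library; the library choice is mine, not part of the prompt):

```python
# The kind of minimal, correct output we expect from the model for that prompt.
# Uses Pillow (PIL); the library choice here is an illustrative assumption.
from PIL import Image

def rotate_image(path: str, degrees: float) -> Image.Image:
    """Rotate the image at `path` by `degrees` counterclockwise."""
    with Image.open(path) as img:
        # expand=True grows the canvas so the rotated image is not cropped
        return img.rotate(degrees, expand=True)
```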
All this is possible because the model builds an increasingly sophisticated internal representation. The first layers learn simple relationships between parts of the text (syntactic structures, parts of speech, and so on), while the deeper layers learn complex patterns (irony, rhetorical figures, and so on). The model can then exploit these patterns at inference time, when we ask it to perform a task.

We can imagine a similar process for protein generation. If the model understands which parts of a protein's sequence play a particular functional role or are responsible for a behavior, it can exploit that knowledge at inference time. For example, we could ask the model to generate a protein capable of cutting aromatic rings with a sequence of fewer than a hundred amino acids. Such an enzyme could be produced artificially and used to clean up oil-contaminated water.
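As a rough sketch of what sampling novel proteins from a language model looks like, here is how one can generate sequences with ProtGPT2, a public GPT-2-style protein model on Hugging Face. The sampling settings are illustrative, and conditioning on a precise function (as in the aromatic-ring example) would require a model trained with such labels, as in conditional transformers:

```python
# Sketch: sampling novel protein sequences from an autoregressive protein LM.
# ProtGPT2 is a real public checkpoint; the sampling settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="nferruz/ProtGPT2")

# ProtGPT2 generates FASTA-like sequence bodies from its start token
outputs = generator(
    "<|endoftext|>",
    max_length=100,          # roughly bounds the sequence length
    do_sample=True,          # sample instead of greedy decoding
    top_k=950,
    num_return_sequences=3,
)
for out in outputs:
    print(out["generated_text"].replace("\n", ""))
```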

This may sound like science fiction, but by exploiting a large language model, researchers have created functional proteins with sequences that do not exist in nature.
AI enables designing new proteins from scratch
Beyond structural features, masked protein language models capture biophysical properties, evolutionary context and alignment within families. (source)
This means the model has captured this functional information and can exploit it to generate proteins conditioned on the desired biophysical properties.
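In practice, a common way to exploit this captured information is to extract the model's internal representation (embedding) of a sequence and feed it to a small downstream predictor. A minimal sketch, again assuming the small public ESM-2 checkpoint:

```python
# Sketch: turning a protein sequence into a fixed-size embedding that a
# downstream model (e.g. a property classifier) can consume.
import torch
from transformers import AutoTokenizer, EsmModel

model_name = "facebook/esm2_t6_8M_UR50D"  # assumption: small public ESM-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmModel.from_pretrained(model_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)

# Mean-pool over residues to get one vector per protein; this vector can serve
# as features for predicting stability, solubility, and so on.
embedding = hidden.mean(dim=1).squeeze(0)
print(embedding.shape)
```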

DNA and proteins are not immutable; they are the product of random mutations and the pressure of natural selection. Every day, every living thing undergoes mutations, some beneficial and some deleterious. These mutations can be passed on to offspring, and this is how species evolve. The process, however, is random and cannot be controlled. Moreover, many of these mutations cause disease (genetic disorders, cancer, and more).
So far we have seen the possibility of exploiting a language model for computational tasks such as predicting the structure of a protein or generating proteins for artificial applications. Although the potential applications are almost endless, this alone does not allow us to cure diseases.
Is it possible to mutate DNA to our advantage, and how can artificial intelligence help us?
Gene editing (the process of modifying DNA) has actually been studied for decades. The fact that it is only now about to have an impact in the clinic and on patients shows how complex it is: editing DNA in a patient is technically difficult (low yield) and risks being nonspecific (introducing mutations where we do not want them, which can itself cause disease).
Recently, though, progress has been made. At present, cells (usually hematopoietic cells) are extracted from the patient, modified ex vivo in the laboratory, and then reinfused. This has brought hope for blood diseases such as thalassemia and sickle cell anemia.
These successes were achieved through a new methodology that has revolutionized our ability to modify the DNA of human cells: CRISPR-Cas9, which has simplified researchers' work by enabling simple, robust, and compact editing.
The problem is that, so far, success has been limited to extracting blood cells, modifying them, and re-injecting them. We cannot yet edit other organs in vivo, nor reach solid tumors (to modify tumor genes specifically and treat them). Although we have several CRISPR-Cas proteins, they are often not optimal at body temperature, lack the desired biochemical properties, are not selective enough, and so on.
CRISPR-based technologies are anticipated to contribute substantially to improve sustainable production, pathogen detection, curing of certain heritable genetic diseases, and food security. However, before the full potential of CRISPR-Cas can be exploited, there are still some hurdles to overcome: technical, commercial, and societal. (source)
Some researchers have tried to design CRISPR-Cas proteins manually or with the help of software, but the results have not been satisfactory, given the rugged and non-convex nature of the protein sequence landscape. The possible sequences are almost infinite, but only a few are functional and have the desired properties.
As we said before, we could take a language model, train it on protein sequences, and then use it to generate proteins with the desired properties. Through self-attention, the transformer learns which components of a sequence are important for a given function. Using in-context learning, the model can recall what it needs to design a sequence for a particular function.
A Requiem for the Transformer?
Can we use LLMs to obtain a desired CRISPR-Cas? Can we generate a CRISPR-Cas that allows gene editing of any organ, for any disease, in the human body?
We have generalist LLMs capable of generating protein sequences of various types and functions that mirror natural proteins. Here, though, we want a model specialized for a particular type of application and protein. Just as textual LLMs can be fine-tuned, so can protein models: in this work, the authors took a generalist protein model and adapted it to CRISPR-Cas by fine-tuning it on a dedicated CRISPR-Cas dataset.
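The recipe mirrors fine-tuning for textual LLMs. Below is a deliberately simplified sketch: the base model, the dataset file, and the hyperparameters are all illustrative assumptions of mine, not the actual setup of the paper:

```python
# Highly simplified sketch of adapting a generalist protein LM to one family
# (here, CRISPR-Cas sequences). Base model, file name, and hyperparameters are
# illustrative assumptions, not the paper's actual setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "nferruz/ProtGPT2"  # assumption: any generalist autoregressive protein LM
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical text file with one CRISPR-Cas protein sequence per line
dataset = load_dataset("text", data_files={"train": "cas_sequences.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cas-lm", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # afterwards, sampling from the model yields Cas-like sequences
```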

The authors then generated protein sequences different from any that exist in nature. These sequences fold into structures similar to known ones, yet show both functional and structural differences. In other words, the model has, in a sense, taken "inspiration" from natural proteins to create new ones. The beauty of this approach is that these proteins can then be synthesized in the laboratory and tested for practical applications. After testing, the resulting proteins exhibited a range of functional properties:
the model is capable of generating proteins with a variety of functional properties, including PAM specificity, temperature-dependent activity, DNA cleavage patterns, or high activity in human cells. (source)
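To make "different from nature, yet similar" concrete: one quick sanity check is to measure how close a generated sequence is to its nearest natural counterpart. A rough, standard-library-only sketch (real analyses would use alignment tools such as BLAST; all sequences below are made-up placeholders):

```python
# Rough sketch: how "novel" is a generated sequence relative to known ones?
# difflib gives a quick similarity ratio; real work would use alignment tools
# such as BLAST. All sequences here are made-up placeholders.
from difflib import SequenceMatcher

generated = "MKVLAYITGGRHSDNPLLQEWFKRGAD"
natural_db = {
    "cas_fragment_A": "MKVLAYISGGRHSDNPILQDWFQRGAE",
    "cas_fragment_B": "MDKKYSIGLDIGTNSVGWAVITDEYKV",
}

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two sequences."""
    return SequenceMatcher(None, a, b).ratio()

# Closest natural relative of the generated sequence
best = max(natural_db, key=lambda name: similarity(generated, natural_db[name]))
print(best, round(similarity(generated, natural_db[best]), 2))
```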
In closing, biotechnology and medicine are on the verge of a revolution. Artificial intelligence will have profound practical effects on healthcare, but while many of its applications are widely discussed, the impact of LLMs on DNA and protein sequences receives less attention. AlphaFold2 and similar models help researchers understand the structure of a protein and thus design drugs. Since, during training, these models learn general rules about protein structure and function, they can also be used generatively to create new proteins.
On the one hand, these new proteins can serve new applications (such as environmental cleanup), although it is so far less clear how they can be used to treat diseases. On the other hand, the combination of AI and CRISPR-Cas lets us envision a future in which gene editing can be used to treat almost any disease.
In the future, a doctor will probably sequence a patient's genome at diagnosis, identify the mutations underlying the disease, and use gene editing to treat the patient. At present, the first clinical trials of gene editing are beginning, and LLMs will allow new mutations and new gene-editing tools to be identified.
Moreover, these LLMs are extremely flexible, and we can imagine applying to them the techniques we use for classical text-based LLMs. In the future, these models will have an arsenal of techniques (prompt design, fine-tuning, and so on) for generating proteins with desired functions that do not exist in nature.
What do you think? Do you think LLMs will revolutionize healthcare? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn (I am open to collaborations and projects). Check this repository, which contains weekly updated ML & AI news, and subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
Tabula Rasa: Large Language Models for Tabular Data
Welcome Back 80s: Transformers Could Be Blown Away by Convolution
References
Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.
- Lin, 2022, Language models of protein sequences at the scale of evolution enable accurate structure prediction, link
- Voita, 2019, The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives, link
- Quintana, 2022, Gene Editing for Inherited Red Blood Cell Diseases, link
- Van der Oost, 2023, The genome editing revolution, link
- Ruffolo, 2024, Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences, link
- Ruffolo, 2024, Designing proteins with language models, link
- Ferruz, 2022, Towards Controllable Protein Design with Conditional Transformers, link
- Verkuil, 2022, Language models generalize beyond natural proteins, link
- Jumper, 2021, Highly accurate protein structure prediction with AlphaFold, link
- Ghorbani, 2021, A short overview of CRISPR-Cas technology and its application in viral disease control, link
- Bhokisham, 2021, CRISPR-Cas System: The Current and Emerging Translational Landscape, link