This new model could help expand the applicability of ML models for engineering proteins with desired functions by tuning their specific interactions with other molecules of any kind, with potential impact on biotechnology and clinical applications.

After the revolution that DeepMind’s AlphaFold started in structural biology, the closely related field of protein design has more recently entered a new era of advancements through the power of deep learning. However, existing machine learning (ML) models for protein design handle only protein components and cannot incorporate non-protein entities into the design process. In our latest preprint, we introduce a deep learning model, "CARBonAra", that considers any kind of molecular environment surrounding the protein, and as such can design proteins that bind any kind of molecule: drug-like ligands, cofactors, substrates, nucleic acids, or even other proteins. By leveraging a geometric transformer architecture from our previous ML model, CARBonAra predicts protein sequences from backbone scaffolds while being aware of the restraints imposed by molecules of any nature. This approach could help expand the versatility of ML models for engineering proteins with desired functions by tuning specific interactions with other cellular components of any kind.
Introduction
As data scientists, we are constantly striving to push the boundaries of what is possible. Protein design, that is, the creation of new proteins with desired functions and properties, is one such area, with profound implications across disciplines ranging from biology and medicine to biotechnology and materials science. While physics-based methods have made progress in finding amino acid sequences that fold to a given protein structure, deep learning techniques have emerged as game-changers, significantly enhancing design success rates and versatility.
I recently discussed four modern ML models for protein design and engineering here:
The Era of Machine Learning for Protein Design, Summarized in Four Key Methods
While these models have found success in many protein design tasks, they cannot consider non-protein entities during the design process (they just can’t handle them at all), a limitation that reduces their versatility and narrows their scope of application.
To overcome this challenge, we present in our latest preprint a new model called CARBonAra, which revolutionizes protein sequence design by accepting as inputs target protein scaffolds accompanied by any kind of interacting molecules. Here’s the preprint:
Context-aware geometric deep learning for protein sequence design
CARBonAra builds upon our Protein Structure Transformer (PeSTo), a geometric transformer architecture that operates on atom point clouds. It treats molecules agnostically in terms of atom types, representing atoms directly by their element names. I described PeSTo in more detail earlier:
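To make the idea of an "atom point cloud described by element names" more concrete, here is a minimal sketch in plain Python. It is not code from PeSTo or CARBonAra: the element vocabulary, the one-hot encoding, and the names Atom, one_hot and to_point_cloud are assumptions made only for illustration.

```python
from dataclasses import dataclass

# Hypothetical element vocabulary (an assumption; the real model's list may differ).
ELEMENTS = ["C", "N", "O", "S", "P", "Zn", "Mg", "Fe", "Ca"]


@dataclass
class Atom:
    element: str                      # element name only, no residue or atom-type label
    xyz: tuple[float, float, float]   # Cartesian coordinates


def one_hot(element: str) -> list[int]:
    """Encode an element name as a one-hot vector with a final 'other' slot."""
    name = element.capitalize()       # "ZN" -> "Zn", "c" -> "C"
    vec = [0] * (len(ELEMENTS) + 1)
    idx = ELEMENTS.index(name) if name in ELEMENTS else len(ELEMENTS)
    vec[idx] = 1
    return vec


def to_point_cloud(atoms: list[Atom]):
    """Return parallel lists of coordinates and element features."""
    coords = [a.xyz for a in atoms]
    features = [one_hot(a.element) for a in atoms]
    return coords, features


# A protein nitrogen and a zinc ion are described in exactly the same way.
atoms = [Atom("N", (0.0, 0.0, 0.0)), Atom("ZN", (1.0, 2.0, 3.0))]
coords, features = to_point_cloud(atoms)
```

The point of the sketch is that nothing in the representation says whether an atom belongs to a protein, an ion, or a ligand, which is what lets the same network see all of them.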
New preprint describes a novel parameter-free geometric transformer of atomic coordinates to…
Because CARBonAra’s core is based on the PeSTo model, it can incorporate any kind of non-protein molecule (nucleic acids, lipids, ions, small ligands, cofactors) as well as other proteins into the process of designing a new protein. CARBonAra takes as input a backbone scaffold accompanied by one or more molecules within interaction distance, and predicts residue-wise amino acid confidences from whose maxima one can reconstruct protein sequences. The result is a space of potential sequences that can be further constrained by specific functional or structural requirements, such as fixing certain amino acids that are known to be essential for a given function. By considering the molecular context surrounding the protein of interest, CARBonAra offers a level of flexibility that protein-only design models lack: it can craft regions specialized for binding ions, substrates, nucleic acids, lipids, other proteins, etc.
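As a rough illustration of reconstructing a sequence from the maxima of residue-wise confidences while fixing selected positions, here is a small sketch in plain Python. It is not CARBonAra's actual code or API: the function design_sequence, the helper peaked, the amino acid ordering, and the toy confidence values are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the confidence table


def design_sequence(confidences, fixed=None):
    """Pick the highest-confidence amino acid at each position.

    confidences: one row per residue, each row holding 20 scores in AMINO_ACIDS order.
    fixed: optional {position: letter} for residues that must not be redesigned.
    """
    fixed = fixed or {}
    sequence = []
    for position, row in enumerate(confidences):
        if position in fixed:
            sequence.append(fixed[position])          # keep the imposed residue
        else:
            best = max(range(len(AMINO_ACIDS)), key=lambda k: row[k])
            sequence.append(AMINO_ACIDS[best])        # argmax over the 20 options
    return "".join(sequence)


def peaked(letter, score=0.9):
    """Toy confidence row that strongly favors one amino acid (illustration only)."""
    row = [0.01] * len(AMINO_ACIDS)
    row[AMINO_ACIDS.index(letter)] = score
    return row


toy = [peaked("M"), peaked("K"), peaked("H"), peaked("L")]
print(design_sequence(toy))                  # MKHL
print(design_sequence(toy, fixed={2: "C"}))  # MKCL
```

Fixing a position simply overrides the prediction there; every other residue still follows the model's confidences.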
In our evaluations, CARBonAra performs on par with state-of-the-art methods like ProteinMPNN and ESM-IF1, with similar computational efficiency (all three are quite fast). It achieves sequence recovery rates similar to those of ProteinMPNN and ESM-IF1 for the design of protein monomers and protein complexes, and on top of that it can handle designs that involve non-protein molecules, which the other two methods cannot.
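For readers unfamiliar with the metric, sequence recovery is commonly computed as the fraction of positions at which the designed sequence matches the native one. The sketch below shows that common definition in plain Python; the function name is mine, and the exact evaluation protocol used in the preprint may differ in its details.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


print(sequence_recovery("MKHL", "MKCL"))  # 0.75
```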
One of the remarkable features of CARBonAra is its ability to tailor sequences to meet specific objectives by incorporating various constraints. For example, it can maximize sequence identity, minimize sequence identity, or achieve low sequence similarity. Moreover, by applying CARBonAra to structural trajectories from molecular dynamics simulations, we observed that we can improve sequence recovery rates, especially in cases where previous methods showed lower success rates.
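One simple way to picture an objective such as "minimize sequence identity" is to choose, at each position, the most confident amino acid that differs from a reference sequence, as long as its confidence clears some threshold. The sketch below does exactly that in plain Python. It is an assumption-laden illustration, not the sampling procedure from the preprint: the function name, the 0.5 threshold, the fallback rule and the toy values are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the confidence table


def low_identity_sequence(confidences, reference, threshold=0.5):
    """Prefer confident amino acids that differ from the reference sequence.

    Falls back to the overall highest-confidence amino acid when no
    alternative reaches the threshold (hypothetical rule for this sketch).
    """
    sequence = []
    for row, ref in zip(confidences, reference):
        alternatives = [
            (score, aa)
            for score, aa in zip(row, AMINO_ACIDS)
            if aa != ref and score >= threshold
        ]
        if alternatives:
            sequence.append(max(alternatives)[1])   # best option that is not the reference
        else:
            sequence.append(max(zip(row, AMINO_ACIDS))[1])  # plain argmax
    return "".join(sequence)


# Toy rows: position 0 tolerates L almost as well as M; position 1 only tolerates K.
row0 = [0.01] * 20
row0[AMINO_ACIDS.index("M")] = 0.9
row0[AMINO_ACIDS.index("L")] = 0.6
row1 = [0.01] * 20
row1[AMINO_ACIDS.index("K")] = 0.9

print(low_identity_sequence([row0, row1], reference="MK"))  # LK
```

In this toy case the first residue is swapped for a confident alternative, while the second stays as it is because nothing else is plausible there.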
To learn more about the method, in particular the details of the ML architecture, check out our preprint on bioRxiv:
Context-aware geometric deep learning for protein sequence design
Some related articles on AI for structural biology
Over a year of AlphaFold 2 free to use and of the revolution it triggered in biology
A web app to design stable proteins via the consensus method, created with JavaScript, ESMFold…
"ML-Everything"? Balancing Quantity and Quality in Machine Learning Methods for Science
How Huge Protein Language Models Could Disrupt Structural Biology
www.lucianoabriata.com I write about and photograph everything that lies in my broad sphere of interests: nature, science, technology, programming, etc.