
Breaking boundaries in protein design with a new AI model that understands interactions with any…


Scheme outlining what this new deep-learned model can do: compute amino acid probabilities for protein design starting from a target protein backbone surrounded by other molecule(s) within binding distance (here exemplified with the green molecule on top). Picture produced by the author.

This new model could help expand the applicability of ML models for engineering proteins with desired functions by tuning their specific interactions with other molecules of any kind, thus effectively impacting biotechnology and clinical applications.

Concept art on "protein engineering" created by the author by editing Dall-E-2 generations (originally used here).

After the revolution that DeepMind’s AlphaFold started in structural biology, the closely related field of protein design has more recently entered a new era of advances powered by deep learning. However, existing machine learning (ML) models for protein design handle only protein components and cannot incorporate non-protein entities into the design process. In our new preprint, we introduce a deep learning model, "CARBonAra", that considers any kind of molecular environment surrounding the protein and can therefore design proteins that bind any kind of molecule: drug-like ligands, cofactors, substrates, nucleic acids, or even other proteins. By leveraging the geometric transformer architecture of our previous ML model, CARBonAra predicts protein sequences from backbone scaffolds while being aware of the restraints imposed by molecules of any nature. This approach could help expand the versatility of ML models for engineering proteins with desired functions by tuning specific interactions with other cellular components of any kind.

Introduction

As data scientists, we are constantly striving to push the boundaries of what is possible. Protein design, that is, the creation of new proteins with desired functions and properties, is one such area, with profound implications across disciplines ranging from biology and medicine to biotechnology and materials science. While physics-based methods have made progress in finding amino acid sequences that fold into a given protein structure, deep learning techniques have emerged as game-changers, significantly enhancing design success rates and versatility.

I recently discussed four modern ML models for protein design and engineering here:

The Era of Machine Learning for Protein Design, Summarized in Four Key Methods

While these models have found success in many protein design tasks, they cannot consider non-protein entities during the design process at all, a limitation that reduces their versatility and narrows their scope of application.

To overcome this challenge, we present in our latest preprint a new model called CARBonAra, which extends protein sequence design by accepting as inputs target protein scaffolds accompanied by any kind of interacting molecules. Here’s the preprint:

Context-aware geometric deep learning for protein sequence design

CARBonAra builds upon our Protein Structure Transformer (PeSTo), a geometric transformer architecture that operates on atom point clouds treating molecules agnostically in terms of atom types and representing them directly by elemental names. I described PeSTo in more detail earlier:

New preprint describes a novel parameter-free geometric transformer of atomic coordinates to…
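To make the point-cloud idea above concrete, here is a minimal sketch of how atoms might be represented purely by element name and coordinates, with no molecule-type labels. The element vocabulary, the tuple layout, the 5 Å cutoff, and the function names are all assumptions made for illustration; they are not taken from the PeSTo or CARBonAra code.

```python
import math

# Hypothetical element vocabulary for illustration; the real model's list may differ.
ELEMENTS = ["C", "N", "O", "S", "P", "ZN"]


def one_hot(element):
    """Encode an element name as a one-hot vector (all zeros if unknown)."""
    return [1.0 if element == known else 0.0 for known in ELEMENTS]


def context_within(protein_atoms, other_atoms, cutoff):
    """Keep the atoms of other_atoms lying within cutoff (angstroms) of any protein atom."""
    return [
        atom
        for atom in other_atoms
        if any(math.dist(atom[1], p[1]) <= cutoff for p in protein_atoms)
    ]


# Each atom is just (element name, (x, y, z)); nothing says "protein" or "ligand".
backbone = [
    ("N", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("C", (2.0, 1.4, 0.0)),
    ("O", (3.2, 1.6, 0.0)),
]
others = [("ZN", (2.0, 0.0, 2.0)), ("O", (30.0, 30.0, 30.0))]

# Assumed cutoff of 5 angstroms, chosen only for this toy example.
nearby = context_within(backbone, others, cutoff=5.0)

# One feature vector per atom, protein and context treated identically.
features = [one_hot(element) for element, _ in backbone + nearby]
```

In this toy case only the zinc ion ends up in the context, and it is encoded in exactly the same way as the backbone atoms, which is the sense in which such a representation is agnostic to the kind of molecule.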

Because CARBonAra’s core is based on the PeSTo model, it can incorporate any kind of interacting molecules, including nucleic acids, lipids, ions, small ligands, cofactors, or other proteins, into the process of designing a new protein. Given an input protein structure with one or more ligands within interaction distance, CARBonAra predicts residue-wise amino acid confidences, from whose maxima one can reconstruct protein sequences. To do this, CARBonAra takes backbone scaffolds accompanied by non-protein molecules as inputs and generates a space of potential sequences that can be further constrained by specific functional or structural requirements, for example by fixing certain amino acids known to be essential for a given function. By considering the molecular context surrounding the protein of interest, CARBonAra offers great flexibility in protein design: it can craft regions specialized for binding ions, substrates, nucleic acids, lipids, other proteins, etc.
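The step from residue-wise confidences to a sequence can be sketched as follows. This is only an illustration of taking the per-position maximum while keeping some residues fixed; the confidence values are made up, and the function names, the amino acid ordering, and the `fixed` argument are hypothetical rather than part of CARBonAra’s actual interface.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 confidence columns


def design_sequence(confidences, fixed=None):
    """Pick the highest-confidence amino acid at each position.

    confidences: one list of 20 scores per residue, ordered as AMINO_ACIDS.
    fixed: optional dict {0-based position: one-letter code} of residues to keep.
    """
    fixed = fixed or {}
    sequence = []
    for position, scores in enumerate(confidences):
        if position in fixed:
            # A residue known to be essential is kept regardless of the prediction.
            sequence.append(fixed[position])
        else:
            best = max(range(len(AMINO_ACIDS)), key=lambda i: scores[i])
            sequence.append(AMINO_ACIDS[best])
    return "".join(sequence)


def peaked(amino_acid, top=0.9):
    """Toy confidence vector: `top` for one amino acid, the rest spread evenly."""
    rest = (1.0 - top) / (len(AMINO_ACIDS) - 1)
    return [top if aa == amino_acid else rest for aa in AMINO_ACIDS]


# Made-up confidences for a three-residue scaffold.
confidences = [peaked("M"), peaked("K"), peaked("H")]

print(design_sequence(confidences))                  # MKH
print(design_sequence(confidences, fixed={2: "C"}))  # MKC
```

Taking the maximum gives a single sequence; sampling from the confidences instead of always taking the top choice would be one way to explore the wider space of potential sequences mentioned above.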

In our evaluations, CARBonAra performs on par with state-of-the-art methods like ProteinMPNN and ESM-IF1, with similar computational efficiency (all three are quite fast). It achieves sequence recovery rates similar to those of ProteinMPNN and ESM-IF1 for the design of protein monomers and protein complexes, and in addition it can handle designs that involve non-protein molecules, which the other methods cannot.
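For readers unfamiliar with the metric, sequence recovery is the fraction of positions at which the designed sequence carries the same amino acid as the native one. A minimal sketch, with a hypothetical function name and made-up sequences:

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Made-up sequences: four of the six positions match.
recovery = sequence_recovery("MKVLAT", "MKILAS")
print(f"{recovery:.2f}")  # 0.67
```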

One of the remarkable features of CARBonAra is its ability to tailor sequences to meet specific objectives by incorporating various constraints. For example, it can optimize sequence identity or, conversely, minimize sequence similarity. Moreover, by running CARBonAra on structural trajectories from molecular dynamics simulations, we observed improved sequence recovery rates, especially in cases where previous methods showed lower success rates.
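The text above does not spell out how predictions from the different frames of a trajectory are combined. One simple possibility, assumed here purely for illustration and not necessarily what the preprint does, is to average the per-position confidences over all frames before taking the maximum. All names and numbers below are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 confidence columns


def average_confidences(frames):
    """Average per-position confidence vectors over trajectory frames.

    frames: list of frames; each frame is a list (one entry per residue)
    of 20 scores ordered as AMINO_ACIDS.
    """
    n_frames = len(frames)
    n_positions = len(frames[0])
    return [
        [
            sum(frame[pos][aa] for frame in frames) / n_frames
            for aa in range(len(AMINO_ACIDS))
        ]
        for pos in range(n_positions)
    ]


def consensus_sequence(frames):
    """Take the top amino acid per position after averaging over frames."""
    averaged = average_confidences(frames)
    return "".join(
        AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)] for row in averaged
    )


def scores(**values):
    """Toy 20-entry confidence vector with the given values, zero elsewhere."""
    return [values.get(aa, 0.0) for aa in AMINO_ACIDS]


# Made-up confidences for a two-residue scaffold over three frames.
frames = [
    [scores(A=0.6, G=0.4), scores(L=0.9)],
    [scores(A=0.3, G=0.7), scores(L=0.8)],
    [scores(A=0.2, G=0.8), scores(L=0.7)],
]

print(consensus_sequence(frames[:1]))  # AL (first frame alone)
print(consensus_sequence(frames))      # GL (averaged over all frames)
```

In this toy case the first frame alone favors alanine at the first position, while the average over all frames favors glycine, showing how conformational sampling could shift a prediction.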

To learn more about the method, in particular the details of the ML architecture, check out our preprint on bioRxiv:

Context-aware geometric deep learning for protein sequence design

Some related articles on AI for structural biology

Over a year of AlphaFold 2 free to use and of the revolution it triggered in biology

A web app to design stable proteins via the consensus method, created with JavaScript, ESMFold…

"ML-Everything"? Balancing Quantity and Quality in Machine Learning Methods for Science

How Huge Protein Language Models Could Disrupt Structural Biology


www.lucianoabriata.com I write and photoshoot about everything that lies in my broad sphere of interests: nature, science, technology, programming, etc.


