Graph Convolutional Neural Networks to Analyze Complex Carbohydrates

Using PyTorch Geometric to Work With Biological Data

Daniel Bojar
Towards Data Science


A glimpse at the diversity of complex carbohydrates. Source: Daniel Bojar

Graph convolutional neural networks (GCNs) have attracted increasing attention over the last couple of years, with more and more disciplines finding uses for them. This extends to the life sciences, where GCNs have been used to analyze proteins, drugs, and of course biological networks. One key advantage of GCNs that has enabled this expansion is their ability to natively work with nonlinear data formats, in contrast to linear data structures such as natural language. Because of this feature, we also implemented GCNs for our own topic of interest: the study of complex carbohydrates, or glycans.

Glycans are ubiquitous in biology, decorating every cell and playing key roles in processes such as viral infection or tumor immune evasion. They are also extraordinarily diverse biological sequences, consisting of hundreds of unique building blocks (compared to twenty for proteins and four for DNA/RNA) that can additionally be combined in several different configurations in a growing glycan chain. Lastly, glycans are the only nonlinear biological sequence, naturally forming extensive branches, which can themselves further branch. They are therefore part of a subclass of graphs, namely trees. This makes glycans prime candidates for an appropriate application of GCNs in biology.

Previously, we developed techniques to analyze glycan sequences by treating them as a kind of biological language. We used a recurrent neural network setup to work around the nonlinearity of glycan sequences and predict their immunogenicity, contribution to pathogenicity, and taxonomic origin. This worked well to an extent, surpassing baselines such as a random forest based on motif frequencies. Yet we believed that more powerful algorithms, capable of natively accommodating the tree structure of glycans, would improve existing applications and enable new approaches in the study of glycans. This is why we turned to GCNs to set a new state of the art for analyzing glycans.

An overview of using graph convolutional neural networks in the analysis of glycans. Source: https://www.cell.com/cell-reports/fulltext/S2211-1247(21)00616-1

GCNs learn relations in graphs (or trees) by characterizing nodes via their neighbors in the graph, or more precisely via the features of neighboring nodes. In our case, we view monosaccharides (the glycan building blocks, such as glucose or galactose) as well as their connecting linkages as nodes. While it may seem more natural to view monosaccharides as nodes and linkages as edges, we decided against this to accommodate short but important glycans that consist of only one monosaccharide and one linkage. To let our GCN learn the features of node neighborhoods, we first implemented a node embedding, so that each monosaccharide and linkage type was represented via embedding features that could be learned by our model and used in characterizing node neighborhoods. To best express the rich diversity of glycans, we used a 128-dimensional embedding for this purpose.
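As a minimal sketch of this representation (with a made-up mini-vocabulary, not our actual one), here is how a short glycan such as Neu5Ac(a2-3)Gal(b1-4)GlcNAc could be turned into a node-embedded graph in PyTorch, with monosaccharides and linkages alternating as nodes:

```python
import torch
import torch.nn as nn

# Hypothetical mini-vocabulary: both monosaccharides and linkages are node types.
vocab = {"Gal": 0, "GlcNAc": 1, "Neu5Ac": 2, "b1-4": 3, "a2-3": 4}

# The glycan Neu5Ac(a2-3)Gal(b1-4)GlcNAc as an alternating node sequence.
nodes = torch.tensor([vocab[n] for n in
                      ["Neu5Ac", "a2-3", "Gal", "b1-4", "GlcNAc"]])

# Undirected edges between consecutive nodes (stored in both directions).
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 4],
                           [1, 0, 2, 1, 3, 2, 4, 3]])

# A 128-dimensional learnable embedding for every node type.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=128)
x = embedding(nodes)  # node feature matrix, shape (5, 128)
```

A branched glycan would simply add further edges at the branching node; the embedding lookup stays identical.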

Next, we had to choose the graph kernel with which to perform the graph convolutions. This is the abovementioned procedure of learning a node via its neighbors and their features (in our case, the embedding features of the node types). After testing various graph kernels, we settled on k-dimensional graph neural network operators, which are inspired by the Weisfeiler-Leman graph isomorphism test and showed the best performance on our datasets. Now, the great thing about GCNs is that you can have multiple graph convolutional layers within a single model. This allows you to analyze the graphs (glycans) at different levels of granularity. While the first layer may only consider directly connected nodes for its analysis, subsequent layers extend this so-called receptive field and consider the relation of a node with nodes that are further removed in the graph. In our case, a model with three of these layers performed best.

SweetNet Model architecture. Source: https://www.cell.com/cell-reports/fulltext/S2211-1247(21)00616-1

This approach of consecutive graph convolutional layers allows the model to learn graph neighborhoods and even characteristic motifs that may be predictive for downstream classification tasks in a supervised setting. To summarize the learned features from these steps, we used pooling layers that condense the salient information from the graph convolutional layers for subsequent layers. After each graph convolutional layer, we first use a top-k pooling layer, which projects the graph to a smaller graph based on a learned projection score. Then, we concatenate the results of global max pooling and global mean pooling operations.

This final graph representation, combined across the three graph convolutional layers, is then routed through a fully connected neural network to arrive at a final prediction for the respective task. Next to the standard (leaky) ReLU, dropout, and batch normalization set-up, we also include a so-called boom layer in this part. Normally, the dimensionality of the representation after the convolutions is slowly decreased in this final part toward the low-dimensional model output. A boom layer, in contrast, transiently increases dimensionality (the opposite of a bottleneck) to allow the model to escape local minima and improve performance. We gave this final model the name SweetNet, as a homage to the more conventionally known and loved type of carbohydrates. And now we can have a look at the fun things you can do with a GCN for glycans!
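A boom layer itself is simple to express; a minimal PyTorch sketch, with purely illustrative sizes, might be:

```python
import torch
import torch.nn as nn

# A minimal "boom" layer: transiently blow up dimensionality
# (the opposite of a bottleneck), then project back down.
boom = nn.Sequential(
    nn.Linear(128, 512),   # expand
    nn.LeakyReLU(),
    nn.Linear(512, 128),   # contract back to the working dimension
)
```

Input and output dimensions match, so the layer can be dropped into the fully connected head between any two existing layers.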

While constructing SweetNet, we ensured that its predictive performance was superior to previously reported architectures, such as the recurrent neural network mentioned above, on all reported tasks. One of these tasks was predicting whether a glycan sequence would be recognized by the human immune system. This is relevant as glycans can be very immunogenic, such as in the case of allergens or mismatched blood groups, but also immunosuppressive, such as in the case of tumor immune evasion. On our dataset, SweetNet achieved a test set accuracy of ~95%, purely based on glycan sequences. We then extracted the graph representations of these sequences learned by a SweetNet model trained on predicting glycan immunogenicity, taken immediately after the graph convolutional layers. When visualizing these representations, it becomes immediately obvious that the model has learned to separate the two classes of immunogenic and nonimmunogenic glycans. What’s more, within the nonimmunogenic glycans, a fine structure is visible that is reminiscent of the different categories of human glycans (which of course share sequence similarity within a category). Both glycolipids and O-glycans partly overlap with immunogenic glycans, as these glycans are present on our mucosal surfaces and are mimicked by microbes that can be immunogenic.
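Such a visualization can be produced by projecting the extracted graph representations down to two dimensions, for instance with t-SNE. A sketch, with random placeholder vectors standing in for the real per-glycan model outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: `graph_reprs` stands in for the per-glycan vectors taken
# right after the graph convolution/pooling layers of a trained model.
rng = np.random.default_rng(0)
graph_reprs = rng.normal(size=(50, 256))  # 50 glycans, 256-dim representations
labels = rng.integers(0, 2, size=50)      # immunogenic yes/no

# Project to 2D for plotting; perplexity kept small for this tiny example.
coords = TSNE(n_components=2, perplexity=5,
              random_state=0).fit_transform(graph_reprs)
# coords has shape (50, 2); scatter-plot it colored by `labels`
```

Coloring the resulting scatter plot by class or glycan category is what reveals the separation and fine structure described above.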

Glycan embeddings learned by a SweetNet model predicting glycan immunogenicity. Source: https://www.cell.com/cell-reports/fulltext/S2211-1247(21)00616-1

Next to other applications, we also combined this GCN for glycans with a recurrent neural network for analyzing protein sequences in order to predict interactions between viruses and glycans. Most viruses, from influenza virus to SARS-CoV-2, require specific glycans on their host cells to infect them. In fact, the match between virus and host glycans can determine the host range of a virus. In the case of influenza virus, a specific protein, hemagglutinin, is responsible for binding to the glycans of a cell prior to cell entry and infection. Different strains of influenza virus have different hemagglutinin sequences, and this can affect their glycan binding specificity. One example of this is the difference between avian and mammalian influenza virus. While both types of influenza virus primarily recognize a specific monosaccharide called Neu5Ac, a type of sialic acid, avian influenza virus typically only binds Neu5Ac in the α2–3 configuration, while mammalian influenza virus prefers Neu5Ac in the α2–6 configuration. A subtle structural shift, yet a key barrier preventing avian influenza virus from “jumping over” to infecting humans. Mutate avian hemagglutinin to bind Neu5Ac in the α2–6 configuration and suddenly this mutated avian influenza virus becomes capable of infecting humans.

This clear relation between hemagglutinin sequence and glycan binding specificity led us to hypothesize that we could use a model to learn these associations and predict viral glycan-based receptors for influenza virus and other viruses. We therefore built a type of matchmaking model that, given a hemagglutinin sequence and a glycan, predicts the strength of their binding in a regression set-up. We were fortunate in that we had a large dataset of experimentally observed interactions between hemagglutinins from various influenza virus strains and a set of glycans that we could use for training and evaluating our model.
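A heavily simplified two-tower sketch of such a matchmaking model is shown below. Here, the glycan is assumed to already be condensed into a single vector by a SweetNet-style encoder, and all names, sizes, and layer choices are purely illustrative:

```python
import torch
import torch.nn as nn

class VirusGlycanModel(nn.Module):
    """Toy matchmaking model: an LSTM encodes the hemagglutinin sequence,
    a pre-pooled glycan vector represents the glycan, and a small head
    regresses a binding strength from the joint representation."""
    def __init__(self, n_amino_acids=25, dim=128):
        super().__init__()
        self.aa_embed = nn.Embedding(n_amino_acids, dim)
        self.protein_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(dim + dim, dim),
            nn.LeakyReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, protein_tokens, glycan_vec):
        # protein_tokens: (batch, seq_len) amino-acid indices
        # glycan_vec:     (batch, dim) pooled glycan representation
        _, (h, _) = self.protein_rnn(self.aa_embed(protein_tokens))
        joint = torch.cat([h[-1], glycan_vec], dim=1)
        return self.head(joint).squeeze(-1)  # predicted binding strength
```

Trained on pairs of hemagglutinin sequences and glycans with measured binding values, such a model can then score unseen protein-glycan combinations.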

Viral receptor predictions from a SweetNet model trained on predicting virus-glycan interactions. Source: https://www.cell.com/cell-reports/fulltext/S2211-1247(21)00616-1

After training, we could indeed show that a trained model recapitulated motifs with α2–3-linked Neu5Ac for avian influenza viruses and with α2–6-linked Neu5Ac for mammalian influenza viruses. In addition, the model also predicted other motifs that could be relevant for binding to influenza viruses, such as sulfated glycan motifs, which have been suggested as possible influenza receptors in the past. We then showed that this approach can be extended to other viruses as well, such as rotaviruses, a common cause of infections in infants. Here, a trained model predicted highly complex breast milk glycans, which have been independently shown to bind and neutralize rotaviruses, as binding to rotavirus proteins, demonstrating the protective effect of breast milk. This neutralizing effect of glycans, which tightly bind viruses and prevent their attachment to cells, is used by our body in various instances. It could also present an opportunity to use our model to design new glycans with improved binding properties that, in the future, could serve as a novel kind of antiviral drug.

And that’s all there is to learn about the current state of GCNs in the analysis of glycans! Well, most of it anyway. Head to the paper for more details, or to the press releases for more easily digestible information about the implications of our research. Of course, the code for SweetNet is available, and all the used data can be found either in the GitHub repo or in the supplementary tables of the paper. Don’t hesitate to reach out if you’re interested in this area & watch this space for future exciting developments in the application of machine learning and data science to glycobiology!


Machine Learning, Glycobiology, Synthetic Biology. Strong opinions, weakly held. Fascinated & Inspired by Counterintuitives. @daniel_bojar & dbojar.com