Graph Database, GraphQL and Machine Learning for Carbohydrate-Active Enzymes

Turbo-charge your carbohydrate genome analyses with Neo4j

Sixing Huang
Towards Data Science

--

This article shows how to:

1. Convert the CAZy database into Neo4j

2. Analyze the CAZynome of Formosa agariphila KMM 3901 to gain new insights

3. Build a GraphQL API for language-agnostic data access

4. Perform graph embedding and node classification for cellulose degradation prediction

What is the most abundant polymer in the world? The answer may surprise many: cellulose. It is a polysaccharide found in the plant cell walls. And we make it into paper, T-shirt and cellophane. And the second place is also taken by a polysaccharide: chitin. It is inside the cell walls of fungi, the scales of fish, the shells of crabs, shrimps and lobsters. In fact, all cellular organisms cannot live without polysaccharides.

Figure 1. Plants contain cellulose and fungi contain chitin, two of the most abundant biopolymers on earth. Photo by Sangga Rima Roman Selia on Unsplash.

Polysaccharide is a form of carbohydrate. As its name suggests, a polysaccharide is made up of multiple monosaccharides, a.k.a simple sugars. Monosaccharides are chained together via glycosidic bonds. There are several types of monosaccharides (glucose, xylose and so on) and several types of glycosidic bonds (alpha-1,4 and beta-1,3 and so on). The various combinations of monosaccharides and glycosidic bonds create the diversity of polysaccharides. For example, glycogen, starch and laminarin are also polysaccharides. But they are storage compounds and act as an energy reservoir in the cells. The cells can break down these molecules into monosaccharides easily in a process called hydrolysis and then oxidize the monosaccharides to gain fuel. In contrast, both cellulose and chitin are structural components. They are tough to break down so that they can support and protect their host cells. Humans consume both glycogen and starch, but can digest neither cellulose nor chitin.

That is where fungi and bacteria come in. Whenever you see rotten woods, wood-decay fungi are at work. And herbivores can break down cellulose because of a community of cellulolytic microbes in their gastrointestinal tracts. Chitin is mainly degraded by fungi and bacteria as well. Therefore, thanks to their work, our planet has not yet been blanketed by these two polysaccharides.

Figure 2. Home page of the CAZy database. Screenshot of cazy.org.

Those fungi and bacteria can accomplish such great feats because they produce certain kinds of enzymes for breaking the glycosidic bonds in cellulose or chitin. Enzymes that synthesize or degrade polysaccharides are called carbohydrate-active enzymes (CAZymes). The research into CAZymes is not just about manufacturing and nutrition. It also brings us new knowledge in ecology. Ultimately, it holds the key to the next-generation biofuels, too. To better understand the biorecycling of polysaccharides, a group of researchers led by Prof. Henrissat have set up a database called CAZy, short for Carbohydrate-Active enZYmes. They have collected over a million CAZyme sequences and put them into six categories based on their interactions with the carbohydrate substrates:

  1. Glycoside hydrolases (GHs) are enzymes that hydrolyze or rearrange glycosidic bonds.
  2. Glycosyltransferases (GTs) are the ones that form glycosidic bonds.
  3. Polysaccharide lyases (PLs) are enzymes that perform non-hydrolytic cleavage of glycosidic bonds.
  4. Carbohydrate esterases (CEs) hydrolyze carbohydrate esters.
  5. Auxiliary activities (AAs) are redox enzymes that act in conjunction with CAZymes.
  6. Carbohydrate-binding modules (CBMs) are the non-catalytic domains of CAZymes. CBMs bind the CAZyme and the carbohydrates together so that the catalytic reaction can take place.

Inside each category, the researchers have grouped the sequences into different families and even subfamilies based on sequence similarities. As of this writing, there are 171 GH families and 114 GT families.

Figure 3. GH13 in detail. Screenshot of cazy.org.

Because CAZy is based on sequence similarity and similar sequences do not necessarily have the same functional activities, CAZyme families often contain different members that act on different carbohydrates or at different glycosidic bonds. Sometimes, the substrates are similar within a family. For example, sequences of pullulanase (EC 3.2.1.41) and alpha-amylase (EC 3.2.1.1) are found in GH13 (Figure 3.), both of which are polysaccharides with alpha glycosidic bonds.

The CAZy team also annotated over 10,000 genomes from Archaea, Bacteria, Eukaryota and viruses and put the results on the website. All the CAZymes within an organism constitute its CAZynome. The CAZynome decides which carbohydrates that the organism can metabolize — i.e. synthesize or degrade.

The CAZy database is an extremely valuable resource for genome analyses. However, the CAZy website is more like a text website other than a data portal. Descriptive information is displayed but no download or API is provided. It does not offer sequence annotation service (they only accept offline academic collaboration). And its access can sometimes be erratic.

To make CAZy more accessible, I have previously written an article about CAZy annotation with AWS Batch. In it, I implemented in the cloud one of the most sought-after feature requests for CAZy. My next goal is to build a CAZy website mirror for my own data analyses. I have previously shown how to use Neo4j to host genome annotations because a graph database is an excellent choice to store data with complicated relationships. It is also very easy to construct GraphQL APIs for data access. Finally, Neo4j can perform graph embedding and node classification for machine learning. This capability will be very useful for metabolism prediction in bioinformatics.

My previous article explains how to start a Neo4j project. In this article, I created a project called cazy in Neo4j Desktop and enabled the APOC and Graph Data Science Library plugins. Later, I set up GraphQL for the database via Apollo and neo4j-graphql.js. Finally, I predicted which genomes are capable of degrading cellulose with the Neo4j Graph Data Science library.

The code for this project is hosted on my Github repository here:

1. Download and process the data

For this project, I collected the taxonomies and CAZy annotations of all the bacterial and archaeal genomes and the family descriptions from CAZy. The descriptions of all their four-digit EC numbers, i.e. EC without wildcard, were retrieved from the KEGG database. In addition, I annotated the cellulose degradation capabilities of 186 genomes from primary strain description papers and from the bacterial metadata database BacDive. The cellulose degraders have a cellulose value of 1, while the non-degraders have 0. 2 stands for unknown. The data are structured in Neo4j as follows:

Figure 4. The data model of the project. Image by the author.

All the scripts for download and post-processing are available in my repository, as well as the raw and formated data. Put all the CSV files from “data_for_neo4j” into the “import” folder of cazy. Afterwards, import and index the data with the following commnads in the Neo4j Browser:

After the import, we can calculate some basic statistics about the database. For example, the query below calculates the average amount of GH families among all the genomes. The result is 16 GH families.

Next, I took a new look at the GH families within the genome of Formosa agariphila KMM 3901, which was analyzed by my colleague Alexander Mann in 2013 but its CAZy annotations have since been updated.

Figure 5. The GH families in KMM 3901. Image by the author.

This bacterium was isolated from the green alga Acrosiphonia sonderi and can degrade a wide range of algal polysaccharides. The graph above shows that KMM 3901 possesses 37 GH families, more than twice the average of 16. Compared to the annotations done by Mann et al., GH130, GH149, GH158, GH168 and GH171 are new additions in the GH category.

2. CAZynome analyses in Neo4j

In this section, I can do some analyses similar to those in Analyzing Genomes in a Graph Database. Again I took the genome of Formosa agariphila KMM 3901 as my example. As of this writing, there were five annotated genomes under the genus Formosa.

Figure 6. The five Formosa genomes in the CAZy database. Image by the author.

I can quickly display the unique CAZy families of KMM 3901 by issuing:

The results show that KMM 3901 has eight unique GH and two unique PL but nothing new in the GT, CE or CBM categories. Both PL28 and PL37 contain ulvan lyase. It is an enzyme that degrades ulvan, a complex sulfated polysaccharide present in the cell walls of the green algae Ulva (Chlorophyta). In the GH category, the list contains GH86. This family harbors beta-agarase and beta-porphyranase. Both attack the sulfated carbohydrates in red algae. Furthermore, GH168 contains endo-alpha-(1,3)-L-fucanase (EC3.2.1.211) that breaks the (1→3)-α-L-fucoside linkages in fucan, a sulfated polysaccharide found mainly in different species of seaweed. Finally, the unique GH28, GH78 and GH105 are involved in pectin degradation. Pectins or pectin-like carbohydrates are found in green algae. In summary, these new results revealed the unique agarolytic life strategy of KMM 3901 and its ecological niche away from its sisters. They also added some important details to the study by Mann et al..

Next, I compared the combination of GH families in KMM 3901 with all the bacterial and archaeal genomes in the database. Among the most similar genomes, the four sisters were not even in the top one hundred.

Members from the phylum Bacteroidetes have completed the top ten list. Taxonomically, Mucilaginibacter, Chitinophaga and KMM 3901 differ on the class level. Surprisingly, unlike KMM 3901, all these top ten bacteria were isolated from land. Unfortunately, information about their carbohydrate degradation capability was insufficient and only some information about their close taxonomic neighbors was available. For example, Mucilaginibacter contains members such as M. paludis and M. gracilis that can degrade pectin, xylan and laminarin. Unfortunately, neither of their genomes have been yet annotated in CAZy.

3. Serve CAZy with GraphQL

Although Cypher is an easy query language to learn, it is not the only way we can fetch data from Neo4j. GraphQL is the alternative. On the one hand, its execution is language agnostic so that developers without mastering Cypher can also access the data. On the other hand, modern microservice architecture frequently requires each component to communicate through HTTP APIs. And GraphQL is a good way to expose data from Neo4j in HTTP. Compared to REST, users can fine-tune their requests to avoid excessive data delivery. To learn more, please read my previous article “Build Your Own GraphQL GenBank in AWS”.

It is surprisingly easy to set up an Apollo server for GraphQL in Neo4j. The Neo4j team have developed a Javascript library called neo4j-graphql.js. This library auto-generates the query and the resolver codes. In fact, the only thing that a user needs to do is to define the typeDefs schema. The client can then query the data without even a single line of Cypher. The following 58 lines of code in the index.js can power up a GraphQL server for this entire project (please first fill in the password of the database in line 48):

Lines 5–44 define the schema for accessing the taxon, genome, cazy and ec data. The DataPoint type helps us with retrieving the amount of each CAZy family.

Run this script in the Terminal:

nodemon index.js

To test the API, direct your browser to http://localhost:4000. Enter the following query into the left panel and press the round button:

Figure 7. Retrieving the taxonomic information of Formosa and the CAZynomes of its five children genomes. Image by the author.

With just one single query, GraphQL returned the taxonomic details of Formosa and the CAZynomes of all its five species. The data were in the JSON format. A Python client is as simple as:

It is also quite easy to mutate data in GraphQL for Neo4j. For more details please refer to the “DOCS” and “SCHEMA” tabs at the right edge of the page (Figure 7.).

4. Predict cellulose degradation with node classification

One of the goals in CAZy annotation is to predict what kind of carbohydrates that an organism can metabolize. Although the exact percentage has been under constant debate, it is quite clear that the majority of the microorganisms have not yet been cultivated in the lab. Fortunately, we can still obtain their genomes through metagenomics or single-cell genomics without cultivating them. If we can predict their carbohydrate substrates, we can hypothesize the ecological roles that they play in their natural habitats and their industrial potentials. And these predictions may perhaps help us with finding the suitable media for their cultivation.

It is logical to use machine learning for this task. But we need two good sources of data: the features (the Xs) and the targets (the Ys). Since CAZy is all about carbohydrate metabolism and over 10,000 genomes have been annotated in the database, they serve as good features. The glaring problem however is the scarcity of the target biochemical data. On the one hand, many annotated species in CAZy have not yet been studied biochemically. On the other hand, many bacteria that have been studied biochemically have not yet been sequenced. The overlaps of the two are very small.

For this project, I manually annotated the cellulose degradation capabilities for 186 genomes as model training data. My sources were primary strain descriptions and BacDive. Among them, 43 were positive and the other 143 were negative for cellulose degradation. All the other genomes served as holdout samples.

After the target data were somehow managed, I moved on to the feature data. Similar to natural language processing (NLP), I had to represent all the genomes in fixed-length vectors before machine learning. But our genomes were nodes in a graph and each of them had various combinations of CAZy connections. Fortunately, the Graph Data Science Library in Neo4j provides us with graph embedding. Analog to the word embedding in NLP, graph embedding generates a fixed-length vector for each node based on its graph topology. We can then use only these vectors as features for the machine learning.

The gds.graph.create part creates an in-memory graph projection to hold all the necessary data for the machine learning. The gds.fastRP.mutate part calculates the graph embedding and writes the results back into the memory graph.

Next, I called the nodeClassification.train method to start the training. Under the hood, Neo4j uses multiclass node logistic regression. I also set up a five-fold cross-validation and configure “F1 weighted” as the metric.

Figure 8. The training and testing scores. Image by the author.

The training score of 67% and testing score of 65% were both low (Figure 8.). I tried with other parameters and they did not improve the results.

Nevertheless, I still applied the model to predict the cellulose degradation in the holdout data with:

This model predicted eleven genomes positive for cellulose degradation:

Among them, firstly, Micromonospora carbonacea is a known cellulase producer and it degrades the cellulose cell wall of Phytophthora. However, Fu et al. reported that its sister HM134, the second one in the list, is negative for cellulose degradation. Their result is a bit surprising because HM134 shares 58 out of the 61 CAZy families in M. carbonacea and the three exceptions (GH154, GT75 and CBM22) have no obvious connections with cellulose metabolism. Thirdly, Streptomyces sp. SirexAA-E is known to be capable of deconstructing lignocellulosic biomass. Fourthly, Xanthomonas citri pv. citri UI7 is a cellulolytic plant pathogen responsible for type A citrus canker. Fifthly, whether Streptomyces prasinus ATCC 13879 can degrade cellulose is undetermined according to BacDive. Finally, little is known about the remaining six in the list.

Conclusion

Once again, this article has shown that Neo4j is a powerful analytic platform for bioinformatic research. The connection-rich data in CAZy are better modeled in a graph database than in a relational database like MySQL. The Cypher language is easy to learn and brings a new way of thinking to the users. The CAZy annotations are now in a graph, in which genomes are connected via their shared CAZymes and CAZy families are connected via shared EC numbers. This can bring forth new discoveries from the data because it is more natural for us humans to analyze complex things in context. All in all the insights have already been in the data, Neo4j just makes it more visible.

GraphQL can deliver data from different data sources as long as they conform to its specification. It allows developers without Cypher knowledge to extract data efficiently. And it can surely convert Neo4j into a microservice for a larger application. However, it does not replace Cypher. GraphQL cannot provide us with the connection-driven query experience, let alone the aggregation and machine learning functionalities.

One of the holy grails in bioinformatics is to predict metabolism based on genomic data. The exercise in cellulose degradation prediction highlighted the challenges. The small training data set certainly contributed to the low training and testing scores. But we also need to ask ourselves, whether the sequence-similarity-based CAZy is really suitable for metabolism prediction with machine learning. So far, there is no published study with any success in this avenue. Perhaps the function-similarity-based EC numbers are better features than CAZy?

Genome sequencing is getting cheaper and easier. Metagenomics and single-cell genomics are also getting better and better. As a result, genomic data are no longer the bottleneck. In contrast, the metabolic data are hard to generate. In fact, the gap between the genomic and the metabolic data can only get larger and larger in the future. Biochemical tests are labor intensive and error prone. Bacterial strain description papers often neglect to mention carbohydrate metabolism. Also, the digitalization of the results is time consuming. Finally, there have been many dedicated databases for genomic sequences, but none could be found for the metabolic data until the 2012 release of the BacDive database. Since then, more metabolic data have been made available but they are still not sufficient for any large-scale machine learning project.

To close the data gap, the BacDive team on the one hand are working hard to collect more biochemical metadata from primary research papers. On the other hand, they are planning to test many sequenced bacterial strains within their DiASPora project. Also, strain descriptions should be standardized and mandatory biochemical tests should be required for their publications.

Finally, although the article has used Neo4j Desktop, it is actually quite easy to move it into the cloud serverlessly. For more information, please read the excellent article “Working With Spatial Data In Neo4j GraphQL In The Cloud” by William Lyon. Once migrated to the cloud, the Neo4j CAZy can be open to the public or a defined group of users with high availability. The serverless option also reduces costs.

In the end, I thank the CAZy team sincerely for all their hard work. And you can read more about the database in their publication here: https://pubmed.ncbi.nlm.nih.gov/24270786/.

--

--

A Neo4j Ninja, German bioinformatician in Gemini Data. I like to try things: Cloud, ML, satellite imagery, Japanese, plants, and travel the world.