Neo4j for Antibiotic Resistance

An alternative view of the CARD database

Sixing Huang
Towards Data Science


The introduction of antibiotics was a milestone in our public health history. They are medicines used to prevent and treat bacterial infections such as pneumonia and tuberculosis. Antibiotics have saved literally millions of lives.

However, their overuse and misuse have led to the emergence of antibiotic-resistant bacteria. These bacteria can survive the antibiotics because they possess resistomes: sets of genes that confer resistance to antibiotics. Some of these genes encode proteins that decrease the import of the drugs, increase their export or deactivate them. Others encode mutated drug targets that evade the attack of antibiotics. The vertical (between mother and daughter cells) and horizontal (among different bacteria) transfers of antibiotic resistance genes have led to the rapid spread of antibiotic-resistant bacteria around the globe.

In his book The Plague Cycle: The Unending War Between Humanity and Infectious Disease, Charles Kenny describes the worrying current state of antibiotic resistance:

But the best estimates are that methicillin-resistant Staphylococcus aureus (MRSA) kills eighteen thousand people a year in the United States alone. The toll in Europe and the US from antimicrobial resistance as a whole is around fifty thousand a year. Worldwide, resistant bacteria already kill as many as seven hundred thousand people a year — seven times the mortality burden of cholera and six times that of measles. And resistance is a problem that has the risk of exploding.

And a post-antibiotic future is bleak:

A review of the antimicrobial threat, sponsored by the British government and chaired by economist Jim O’Neill, predicts that, by 2050, 10 million people could be dying a year from increased antimicrobial resistance worldwide if we fail to act. That’s more than die worldwide each year from cancer. It compares to a World Health Organization estimate of 250,000 additional deaths each year from the effects of climate change between 2030 and 2050. Global deaths from terrorism run at about one one-thousandth of the potential toll of antibiotic resistance.

So we need to act fast and we need scientific data. The Comprehensive Antibiotic Resistance Database (CARD) is one prominent data source.

Figure 1. The Comprehensive Antibiotic Resistance Database. Screenshot by author.

Since 2013, the CARD database has collected over 3,300 gene sequences and their associated antibiotics. The staff carefully curate the collection and organize the data into the Antibiotic Resistance Ontology (ARO) and Antimicrobial Resistance (AMR) gene detection models. The database also provides bioinformatic tools for data analysis. CARD has become an important data source for both research and industry. According to Andrew G. McArthur, the lead investigator of CARD, a run through CARD is a requirement for probiotic product approval in Europe.

Each drug, antibiotic resistance and gene has its own page. For example, the page of acrD efflux pump looks like this:

Figure 2. Page of acrD in CARD. Screenshot by author.

The page shows the details of the gene, its prevalence, the antibiotics that it can act against and so on. However, there is currently no page for the class “resistome”, or “pathogen” to be precise. So frequent questions such as “which antibiotics is this bacterium resistant to” cannot be answered directly in CARD.

In my previous articles “Analyzing Genomes in a Graph Database” and “Neo4j for Diseases”, I demonstrated that we can transform biological data from relational databases into graphs and discover new insights. Although there are many graph databases out there, Neo4j is without doubt the leader in this space. It is easy to use and to scale. Its Cypher language is straightforward. Last but not least, the Graph Data Science Library offers many powerful graph manipulation and machine learning capabilities. In this article, I am going to show you how to convert CARD into Neo4j and get some fast facts from the data. The code for this project is hosted in my GitHub repository here:

1. Import the data

From the CARD download page, download and unzip both Ontology files and CARD Prevalence, Resistome, & Variants data. In fact, we only need aro.obo and card_prevalence.txt from the two. Run all the cells in my analysis.ipynb. It converts the two files into five CSV files ready for import. You can find all these files in my repository.
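The notebook itself is not reproduced here. As a rough illustration of the kind of extraction involved, the sketch below pulls term names and confers_resistance_to_drug_class links out of OBO-formatted text. It assumes the usual OBO stanza layout ([Term] blocks with id:, name: and relationship: tags). The function name and the sample stanzas, including their ARO numbers, are made up for illustration; they are neither the author's code nor real CARD entries.

```python
from collections import defaultdict

# Made-up sample in OBO layout; the ARO numbers below are placeholders.
SAMPLE_OBO = """format-version: 1.2

[Term]
id: ARO:9999901
name: example efflux pump
relationship: confers_resistance_to_drug_class ARO:9999903 ! example drug class
relationship: part_of ARO:9999904 ! example parent

[Term]
id: ARO:9999902
name: example gene without drug link

[Typedef]
id: confers_resistance_to_drug_class
name: confers_resistance_to_drug_class
"""


def parse_obo_relationships(text, rel_type="confers_resistance_to_drug_class"):
    """Return ({term_id: name}, {term_id: [target_id, ...]}) for one relationship type."""
    names = {}
    edges = defaultdict(list)
    term_id = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = None
            continue
        if not in_term or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        # Drop the trailing "! human readable comment" part, if any.
        value = value.split(" ! ")[0].strip()
        if tag == "id":
            term_id = value
        elif tag == "name" and term_id:
            names[term_id] = value
        elif tag == "relationship" and term_id:
            parts = value.split()
            if len(parts) >= 2 and parts[0] == rel_type:
                edges[term_id].append(parts[1])
    return names, dict(edges)


if __name__ == "__main__":
    names, edges = parse_obo_relationships(SAMPLE_OBO)
    for source, targets in edges.items():
        print(names[source], "->", targets)
```

Terms without the relationship, like the second sample stanza, simply end up with no outgoing edge, which is the situation described for some resistance nodes later in the article.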

Launch Neo4j Desktop, create a project called card and put all these CSV files into its Import folder (read the paragraph “2. Import data into Neo4j” if you need help). Run the following commands in Neo4j Browser to import and index the data in Neo4j:

These commands will create three types (labels) of nodes (drug, pathogen and resistance). They are connected by two types of relationships (confers_resistance_to_drug_class and has_resistance).
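The original import commands are not reproduced here. A minimal sketch of an import that would produce this schema, wrapped in Python for use with a session from the official neo4j driver, might look as follows. Everything specific in it is an assumption: the five CSV file names, their column names, the id and name properties, and the relationship directions are hypothetical stand-ins for whatever analysis.ipynb actually writes. The index syntax assumes Neo4j 4.x or later.

```python
# Hypothetical file names, columns and properties; adapt them to the real CSV files.
IMPORT_STATEMENTS = [
    # Indexes on the lookup properties keep the MATCH steps below fast.
    "CREATE INDEX drug_id IF NOT EXISTS FOR (n:drug) ON (n.id)",
    "CREATE INDEX resistance_id IF NOT EXISTS FOR (n:resistance) ON (n.id)",
    "CREATE INDEX pathogen_name IF NOT EXISTS FOR (n:pathogen) ON (n.name)",
    # Node files.
    "LOAD CSV WITH HEADERS FROM 'file:///drugs.csv' AS row "
    "MERGE (d:drug {id: row.id}) SET d.name = row.name",
    "LOAD CSV WITH HEADERS FROM 'file:///resistances.csv' AS row "
    "MERGE (r:resistance {id: row.id}) SET r.name = row.name",
    "LOAD CSV WITH HEADERS FROM 'file:///pathogens.csv' AS row "
    "MERGE (p:pathogen {name: row.name})",
    # Relationship files; the directions are an assumption.
    "LOAD CSV WITH HEADERS FROM 'file:///resistance_drug.csv' AS row "
    "MATCH (r:resistance {id: row.resistance_id}) "
    "MATCH (d:drug {id: row.drug_id}) "
    "MERGE (r)-[:confers_resistance_to_drug_class]->(d)",
    "LOAD CSV WITH HEADERS FROM 'file:///pathogen_resistance.csv' AS row "
    "MATCH (p:pathogen {name: row.pathogen}) "
    "MATCH (r:resistance {id: row.resistance_id}) "
    "MERGE (p)-[:has_resistance]->(r)",
]


def run_import(session):
    """Run every statement in order on an open neo4j session."""
    for statement in IMPORT_STATEMENTS:
        # consume() waits for the statement to finish before the next one starts.
        session.run(statement).consume()
```

Each string can also be pasted into Neo4j Browser as a separate command. MERGE is used instead of CREATE so that rerunning the import does not duplicate nodes or relationships.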

2. Overview

After the import, we can get some quick overviews about the data. Run the following three queries to get the total counts of our three node types:
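The original queries are not reproduced here. A minimal sketch of such counts, assuming only the three node labels named above and a session from the official neo4j Python driver, could be written as follows. The function names are hypothetical.

```python
NODE_LABELS = ("drug", "pathogen", "resistance")


def count_query(label):
    """Build the Cypher that counts all nodes carrying one label."""
    return f"MATCH (n:{label}) RETURN count(n) AS total"


def count_nodes(session):
    """Return {label: node count} using an open neo4j session."""
    return {
        label: session.run(count_query(label)).single()["total"]
        for label in NODE_LABELS
    }
```

The string returned by count_query("drug"), for example, can be pasted into Neo4j Browser unchanged.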

There are 43 drugs, 263 pathogenic bacteria and 2,640 resistance mechanisms. It is noteworthy that there are discrepancies between the downloaded data and the web data. For example, SHV-52 (ARO:3001109) has no confers_resistance_to_drug_class relationship according to aro.obo. But its webpage indicates it can help the host to resist penam, carbapenem and cephalosporin (Figure 3). In order to avoid confusion, this article is based on the downloaded CARD data only. The second caveat of this project is that the taxonomic resolution goes down only to the species level, not the strain level, which would be more precise.

Figure 3. The data of SHV-52 are inconsistent between web and download. Screenshot by author.

We can also quickly create a topological overview of the CARD data with Neo4j Bloom (read my article here for instructions).

Figure 4. The topological overview of the CARD data. The green dots represent drug nodes, the purple ones are pathogens and the orange ones are resistance nodes. Image by author.

In Figure 4, we can see that the data are largely clustered around the purple pathogen nodes. In Neo4j Bloom, we can identify some of those large “hubs”. They are Escherichia coli, Klebsiella pneumoniae, Pseudomonas aeruginosa and Acinetobacter baumannii. These are also the most studied antibiotic-resistant bacteria in scientific reports.

3. Investigate the superbugs

Superbugs are multidrug-resistant bacteria, that is, they are resistant to more than one antibiotic. They can carry multiple genes, each of which provides resistance to one antibiotic, or a single gene that acts against many antibiotics (read this article for more details). Carbapenem-resistant Enterobacteriaceae (CRE) have become resistant to “all or nearly all” antibiotics, including carbapenems, the “treatment of last resort”. In 2015, bacteria that are resistant to colistin, another drug of last resort, were observed in patients and livestock in China. According to the CDC, antibiotic-resistant bacteria infect two million people and kill 23,000 each year in the United States. So there is an urgent need to learn more about these organisms.

We can first check the top 10 superbugs with the most resistance connections. Note that the OPTIONAL MATCH clause is necessary because in CARD many resistance nodes are not connected to any drug.
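The original query is not reproduced here. A minimal sketch of a ranking along these lines, wrapped in Python for a session from the official neo4j driver, might look like this. It assumes that pathogen nodes carry a name property, and it matches relationships without a direction so that it does not depend on how they were created. The function name is hypothetical.

```python
TOP_SUPERBUGS = """
MATCH (p:pathogen)-[:has_resistance]-(r:resistance)
OPTIONAL MATCH (r)-[:confers_resistance_to_drug_class]-(d:drug)
RETURN p.name AS pathogen,
       count(DISTINCT r) AS resistances,
       count(DISTINCT d) AS drugs
ORDER BY resistances DESC
LIMIT $limit
"""


def top_superbugs(session, limit=10):
    """Return the pathogens with the most resistance nodes as a list of dicts."""
    return [record.data() for record in session.run(TOP_SUPERBUGS, limit=limit)]
```

With a plain MATCH in the second line, every resistance node without a drug would drop out of the result and the resistances count would shrink. OPTIONAL MATCH keeps those rows with d set to null, and count(DISTINCT d) ignores the nulls.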

And it returns:

Klebsiella pneumoniae is normally found on the mucosal surfaces of the gastrointestinal (GI) tract. But once it reaches other parts of the human body, it can be quite virulent and antibiotic-resistant. As its name suggests, Klebsiella pneumoniae can cause pneumonia, and it is the most common cause of hospital-acquired pneumonia in the United States. But this bacterium can also cause other ailments such as bloodstream infections, wound and surgical site infections, and meningitis. Our analysis shows that K. pneumoniae is connected to over one thousand resistance nodes. These resistance mechanisms provide resistance to 13 antibiotics, also the most in our dataset.

With the following query, we can get more details:
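The original query is not reproduced here. As a hedged sketch, a query of this kind could list each resistance mechanism of one pathogen together with the drugs it acts against. It assumes a name property on all three node types and uses undirected relationship patterns; the function name is hypothetical.

```python
PATHOGEN_DETAILS = """
MATCH (p:pathogen {name: $name})-[:has_resistance]-(r:resistance)-[:confers_resistance_to_drug_class]-(d:drug)
RETURN r.name AS resistance,
       collect(DISTINCT d.name) AS drugs,
       count(DISTINCT d) AS drug_count
ORDER BY drug_count DESC
"""


def pathogen_details(session, name="Klebsiella pneumoniae"):
    """Return one dict per resistance mechanism of the named pathogen."""
    return [record.data() for record in session.run(PATHOGEN_DETAILS, name=name)]
```

In Neo4j Browser, replacing the RETURN clause with RETURN p, r, d gives a graph view instead of a table.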

It returns a graph:

Figure 5. Details of antibiotic resistance in Klebsiella pneumoniae. Image by author.

The graph shows that K. pneumoniae can resist cephalosporins with KpnEFGH. These are all parts of a major facilitator superfamily (MFS) antibiotic efflux pump that pumps antibiotics out of the cell. According to Ashurst et al., K. pneumoniae can also degrade cephalosporins with its extended-spectrum beta-lactamase (ESBL). So another antibiotic, carbapenem, has become a treatment option. But our results show that this option can be ineffective too, because the LptD mechanism alone can confer resistance to carbapenem and three other antibiotics. In fact, approximately 80% of the 9,000 carbapenem-resistant Enterobacteriaceae infections reported to the CDC in 2013 were caused by K. pneumoniae. In addition, the qacE, qacEdelta1 and qacL genes can protect the bacterium from disinfecting agents and intercalating dyes.

4. The most resisted antibiotics

With Neo4j, we can also investigate which antibiotics are most likely the targets of antibiotic resistance:
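The original query is not reproduced here, and the text does not state the ranking criterion. The sketch below therefore returns both the number of resistance mechanisms and the number of pathogens per drug, and sorts by pathogens first; that ordering is an assumption, as are the name property and the function name.

```python
MOST_RESISTED = """
MATCH (d:drug)-[:confers_resistance_to_drug_class]-(r:resistance)
OPTIONAL MATCH (r)-[:has_resistance]-(p:pathogen)
RETURN d.name AS drug,
       count(DISTINCT r) AS mechanisms,
       count(DISTINCT p) AS pathogens
ORDER BY pathogens DESC, mechanisms DESC
LIMIT $limit
"""


def most_resisted(session, limit=10):
    """Return the drugs that the most pathogens can resist, as a list of dicts."""
    return [record.data() for record in session.run(MOST_RESISTED, limit=limit)]
```

Swapping the two ORDER BY keys ranks the drugs by the number of mechanisms instead.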

And the results are:

Fluoroquinolones interfere with DNA replication in bacteria and cause cell death in both Gram-negative and Gram-positive bacteria. But the CARD data indicate that both types of bacteria have developed many strategies to resist their effect (Figure 6):

Figure 6. Resistance against fluoroquinolones among bacteria. Image by author.

On the one hand, Gram-positive bacteria such as Bacillus and Staphylococcus use MFS antibiotic efflux pump systems such as blt and qacAB to pump out fluoroquinolones. On the other hand, Enterococcus encodes PmpM, a drug antiporter that belongs to the Multidrug And Toxic Compound Extrusion (MATE) family. However, PmpM is also found in Gram-negative bacteria such as Stenotrophomonas maltophilia and Serratia marcescens. Pseudomonas, another group of Gram-negative bacteria, contains gene sequences similar to abaQ, a gene that encodes “an MFS transporter responsible for the extrusion of quinolone-type drugs in A. baumannii”.

Cephalosporin, the second item in the list, shows a very different topology.

Figure 7. Resistance against cephalosporin among bacteria. Image by author.

In contrast to the many resistance mechanisms against fluoroquinolones, there are only four mechanisms against cephalosporin, and they are used by a total of 52 pathogens. Some pathogens, such as Klebsiella oxytoca and Raoultella planticola, have all four resistance mechanisms.

Conclusion

This article shows how Neo4j can deliver quick insights into CARD. In my opinion, CARD contains so many relations that a graph database is a more natural fit for it than a relational one. In its relational form, the answer to the question “which antibiotics is this bacterium resistant to” is not immediately obvious because CARD connects bacteria and antibiotics through resistance mechanisms. This is a non-issue for Neo4j. We can formulate the question simply in Cypher and get the answer right away.

But the downloaded CARD data are not always consistent with the online data. Perhaps I have missed some links. Also, some connections are rather speculative. For example, according to the description, cdeA confers resistance to fluoroquinolones in E. coli and acriflavine in Clostridioides difficile. But our data connects C. difficile with fluoroquinolones through the cdeA mechanism too.

In this project, I have only analyzed the bacterial species based on card_prevalence.txt. But it is possible to reach strain resolution by making use of the NCBI accession numbers delivered by CARD. The results would then be more specific and precise. Of course, it would be easier if CARD compiled the pathogens at the strain level in the first place.

This study has also shed some light on the antibiotic resistance situation. The counts above show that resistance mechanisms outnumber pathogens about 10:1 (2,640 versus 263), so the dataset holds roughly ten resistance mechanisms for every pathogen. It is also clear that there are far more resistance mechanisms than antibiotics (2,640 versus 43). As the development of new antibiotics slows down, we are running out of antibiotics.

Please try this project and tell me about your experience.


A Neo4j Ninja, German bioinformatician in Gemini Data. I like to try things: Cloud, ML, satellite imagery, Japanese, plants, and travel the world.