
Let’s Talk about Graph Machine Learning in Biomedical Networks

A quick overview of applying machine learning techniques to biomedical graphs

Photo by Sangharsh Lohakare on Unsplash

Graph machine learning is already popular, especially in the field of social networks, but it is less well known in biomedicine or, more specifically, in bioinformatics. I gained hands-on experience in this interesting field a couple of years ago, and in this article I will explain the nitty-gritty of applying machine learning algorithms to it.

First, bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, and more, with the aim of analyzing and interpreting biological data. In other words, it extracts valuable information from biological data.

Representing real-life data in a form that both a human and a computer can understand is not easy. One data type that does it well is a graph. Graphs can model and store complex real-life interactions. This is also true for certain types of biological data, such as protein-protein interaction networks, which are represented as graphs.
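To make this concrete, here is a minimal sketch of how a small protein-protein network could be stored as a graph in Python. The protein names (P1 to P4) and the helper `build_graph` are made up for illustration; a real network would come from a curated interaction database.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a dict mapping each node to its set of neighbours."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


# Hypothetical interactions, purely for illustration.
ppi_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
ppi_graph = build_graph(ppi_edges)

# Proteins that interact with P3 in this toy network.
print(sorted(ppi_graph["P3"]))
```

An adjacency structure like this is easy for a person to read and is also a convenient starting point for the graph algorithms discussed below.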

The following is a list of biological topics that can be studied in the form of a graph:

  1. Drug-Disease Association (DDA) – Different drugs have different effects on diseases. For instance, drug A can treat a disease X, while another drug B can cause a disease Y. Hence, drugs can be either therapeutic or harmful, and with DDA our objective is to find out which drug is associated with which behavior.
DDA Illustration (Source)
  2. Drug-Drug Interaction (DDI) – This occurs when two or more drugs react with each other. When someone takes multiple medications at once, it is important to understand how those drugs behave in combination, in order to avoid side effects or unintended reactions.
  3. Protein-Protein Interaction (PPI) – Proteins help with most biological processes happening inside a cell, such as gene expression and cell growth. Proteins rarely function alone; they tend to form associations (physical contacts) with other protein molecules to carry out different molecular processes inside a cell. Studying such protein-protein interactions improves our understanding of the human body at a molecular level.
  4. Protein Function Prediction – As I mentioned before, proteins perform several functions inside a living cell. Protein function prediction means assigning a biological role to a protein, i.e., identifying which type of biochemical reaction is related to a specific protein.
  5. Semantic Classification of Biomedical Concepts – In the digital era, we have access to huge quantities of biomedical information online. With such large amounts of data, it is necessary to index and manage it. Privacy concerns also mean that companies distributing data may not reveal all the information relating to a topic. Using medical knowledge bases (like UMLS), we can perform semantic classification with the help of various machine learning techniques (NLP, etc.).

Each of the above-mentioned topics can be represented as a graph, and downstream tasks such as link prediction and node classification can be performed on it.

DDA, DDI, and PPI, when formulated as link prediction tasks, have the objective of identifying potential interactions or associations between the given entities (drugs, diseases, or proteins). Depending on the use case, these problems can also be formulated as other graph-related tasks, such as graph clustering or node classification.
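As a rough illustration of the link prediction setup, the sketch below scores every pair of proteins that is not yet connected by counting the neighbours they share. Counting common neighbours is a classical baseline that I am assuming here for simplicity; it is not the embedding-based approach discussed later, and the graph and protein names are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(graph):
    """Score each unconnected node pair by the number of neighbours they share."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already a known interaction, nothing to predict
        scores[(u, v)] = len(graph[u] & graph[v])
    return scores


# Hypothetical PPI network: node -> set of interacting proteins.
graph = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P4"},
    "P3": {"P1", "P4"},
    "P4": {"P2", "P3", "P5"},
    "P5": {"P4"},
}

scores = common_neighbor_scores(graph)
# Highest-scoring pairs are the most plausible candidate interactions.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for pair, score in ranked:
    print(pair, score)
```

In practice the known edges would be split into training and held-out sets, so that the ranking can be evaluated against interactions the model has not seen.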

Protein function prediction and semantic classification can be modeled as node classification tasks.

Check my previous article on graphs and their downstream tasks to understand more about these topics.

Graph Machine Learning

We cannot apply a standard machine learning algorithm directly to a graph because the information stored in it is high-dimensional and non-Euclidean. So, we map the graph entities into a low-dimensional vector space (also called the embedding space) and then apply our favorite machine learning algorithms to the resulting vectors.

Graph ML Pipeline Example (Image by Author)

As seen in the above image, the input to the graph ML pipeline is the biomedical graph. We apply a graph embedding method to map the graph into a low-dimensional space and compute the embeddings, which are later used to solve the defined use case (link prediction, node classification, or a different type of task such as graph clustering).

There exist different graph embedding methods, which we can broadly classify into three types:
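The last stage of this pipeline can be sketched as follows, assuming the embeddings have already been computed by some embedding method. The vectors and entity names below are hand-written and hypothetical, and cosine similarity is only one possible way to score a candidate link.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional node embeddings (normally the output of the embedding step).
embeddings = {
    "drug_A": [1.0, 0.0],
    "disease_X": [0.8, 0.6],
    "disease_Y": [0.0, 1.0],
}


def score_link(u, v):
    """Score a candidate association by the similarity of the two node embeddings."""
    return cosine_similarity(embeddings[u], embeddings[v])


# Rank candidate drug-disease associations from most to least plausible.
candidates = [("drug_A", "disease_Y"), ("drug_A", "disease_X")]
ranked = sorted(candidates, key=lambda pair: score_link(*pair), reverse=True)
print(ranked)
```

A trained classifier on pairs of embeddings would usually replace the plain similarity score, but the structure of the pipeline stays the same.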

  1. Matrix factorization based methods – Here, we factorize the graph (represented as an adjacency matrix) into lower-dimensional matrices while retaining the topological information from the original matrix. There are many variants of this method, such as Graph Factorization, GraRep, and HOPE.
  2. Random walk based methods – These methods take much inspiration from the word2vec model, a popular technique in natural language processing. Word2vec uses sentences to learn embeddings for words, and a similar approach is used to generate node embeddings in graphs. Usually, a node is selected (randomly or using a condition) and we move (or perform a "walk") to another node at random for a defined number of steps. This way, we obtain node sequences of a defined length (same as the number of steps), which are then used to learn node embeddings. Some examples include DeepWalk and node2vec.
  3. Neural network based methods – In recent years, more research has gone into neural network based methods, as they have shown more promising results than the other methods. Many types of neural networks, such as the multilayer perceptron (MLP), autoencoders, generative adversarial networks (GAN), and graph convolutional networks (GCN), have been employed for computing graph embeddings. LINE and SDNE are some examples of this method.
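The random walk idea from the list above can be sketched in a few lines of plain Python. This is a simplified, uniform walk in the spirit of DeepWalk, not a faithful reimplementation of DeepWalk or node2vec; the graph, function names, and parameters are my own assumptions. Here `walk_length` counts the nodes in each generated sequence.

```python
import random


def random_walk(graph, start, walk_length, rng):
    """Return a node sequence of up to `walk_length` nodes from a uniform random walk."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = sorted(graph[walk[-1]])
        if not neighbours:
            break  # dead end: an isolated node has nowhere to go
        walk.append(rng.choice(neighbours))
    return walk


def generate_walks(graph, walks_per_node, walk_length, seed=0):
    """Start `walks_per_node` walks from every node, visiting start nodes in random order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length, rng))
    return walks


# Hypothetical PPI network: node -> set of interacting proteins.
graph = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2", "P4"},
    "P4": {"P3"},
}

walks = generate_walks(graph, walks_per_node=2, walk_length=4)
for walk in walks:
    print(" -> ".join(walk))
```

The resulting walks play the role of sentences: they could be passed to a word2vec-style model, which then learns one embedding per node.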

All of the above-mentioned methods compute static graph embeddings: static in the sense that they are computed for a graph at a fixed point in time, one that is not changing or evolving.

Dynamic Biomedical Graph

Consider this, for the case of PPIs: inside a human cell at time T=0, a cluster of proteins was involved in a biochemical reaction "A", and at T=1, a different set of proteins contributed to a different reaction, say "B". As time passes, different combinations of proteins get involved in a variety of reactions. This is what actually happens in a cell. Hence, there is a need to study biomedical graphs with temporal information.

What we considered before was the simplest case of a complex real-life process. By eliminating the time factor, we were able to simplify the problem considerably.

If we add the time aspect to the graph, it turns into a dynamic graph. These are more complex (but more interesting) than static graphs. You can think of a dynamic graph as a list of static graphs collected from time=0 to time=T, in that order (a series of graph snapshots). We have to use dynamic graph embedding methods to map the dynamic graph into a low-dimensional space and then perform the tasks. The ML pipeline used here is quite similar to the previous one.
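The snapshot view of a dynamic graph can be sketched as a plain list of edge sets, one per time point. The proteins and interactions below are hypothetical, and `edge_changes` is a helper I made up to show how the graph evolves from one snapshot to the next.

```python
# Each snapshot holds the undirected interactions observed at one time point.
# Edges are stored as sorted tuples so that (u, v) and (v, u) are the same edge.
snapshots = [
    {("P1", "P2"), ("P2", "P3")},  # time = 0
    {("P2", "P3"), ("P3", "P4")},  # time = 1
    {("P3", "P4"), ("P1", "P4")},  # time = 2
]


def edge_changes(snapshots):
    """For each consecutive pair of snapshots, return (added_edges, removed_edges)."""
    changes = []
    for previous, current in zip(snapshots, snapshots[1:]):
        changes.append((current - previous, previous - current))
    return changes


for t, (added, removed) in enumerate(edge_changes(snapshots), start=1):
    print(f"time {t}: added {sorted(added)}, removed {sorted(removed)}")
```

Each element of `snapshots` is an ordinary static graph, which is why static embedding methods can be applied to it directly.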

A naive approach to dynamic graph embedding is to treat each time point independently and compute a static embedding for each one. For instance, if we have a dynamic graph with 12 time points, we would compute 12 separate static embeddings, one per time point. This is not a good idea because it does not capture how the graph evolves over time.

There exist more advanced techniques in which the influence of previous time points is taken into account when calculating the embedding at the current time point. In this way, we capture the evolution of the graph through time and obtain better-quality embeddings.

I have written a separate article explaining dynamic graphs in detail.
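One simple way to let earlier time points influence the current one is to give every edge a weight that decays with its age, and then embed the resulting weighted graph. This scheme is assumed here only for illustration and is not taken from any specific published method; the `decay` parameter, the helper name, and the toy snapshots are all hypothetical.

```python
def decayed_edge_weights(snapshots, decay=0.5):
    """Combine snapshots into one weighted edge dict, down-weighting older snapshots.

    An edge seen at the latest time point contributes 1.0, one step earlier
    contributes `decay`, two steps earlier `decay ** 2`, and so on.
    """
    weights = {}
    for edges in snapshots:
        # Age everything accumulated so far by one time step.
        weights = {edge: weight * decay for edge, weight in weights.items()}
        for edge in edges:
            weights[edge] = weights.get(edge, 0.0) + 1.0
    return weights


# Hypothetical dynamic PPI graph with three time points.
snapshots = [
    {("P1", "P2"), ("P2", "P3")},  # time = 0
    {("P2", "P3"), ("P3", "P4")},  # time = 1
    {("P3", "P4"), ("P1", "P4")},  # time = 2
]

weights = decayed_edge_weights(snapshots, decay=0.5)
for edge, weight in sorted(weights.items()):
    print(edge, weight)
```

Interactions that persist across time points end up with larger weights than ones that appeared once long ago, so an embedding computed from these weights reflects the history of the graph rather than a single snapshot.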

Final Thoughts

The research output on graph machine learning, especially graph neural networks, is very high these days, and we get to see impressive results obtained by these methods. Formulating different tasks on biomedical graphs enables us to tackle some bottlenecks of traditional lab experiments. In lab experiments, it is not possible to identify every possible protein-protein interaction, DDI, or DDA, because the physical measurement tools have technical limitations.

This is where ML comes in. We train the model using the data with already identified interactions (through lab experiments) and try to predict the potential interactions or associations. This is exciting as it opens doors for faster simulations and a speedy drug discovery process.

The COVID-19 pandemic has shown us how valuable biomedical research is, and graph ML is a tool that aids this research. If you are just getting into graph ML, it might be worth checking out the biomedical field for some interesting and challenging problems.


Thanks for reading and cheers!
