Sparsifying Knowledge-Graph Using Target Information

Sparsifying knowledge graphs for supervised tasks, using PMI to remove irrelevant edges; with concrete example using medical data

Sria Louis
Towards Data Science

--

“Medical Knowledge Graph” — generated at deepai.org

The following is a simplified snippet from a research project I conducted while working at Microsoft Research, under the guidance of Professor Elad Yom-Tov.

Prerequisites: Familiarity with supervised machine learning and a basic grasp of Knowledge Graphs, including their utilization in Feature Engineering.

TL;DR

You have a supervised ML task: predicting a target variable y from a high-dimensional binary feature space, and you want to utilise a knowledge graph in which each node represents one feature.

The problem is that the knowledge graph is dense. Specifically, there are many irrelevant edges (anecdote: Wikipedia overlink crisis).

The trick: Assess the relevance of each edge by measuring Pointwise Mutual Information (PMI) between its occurrence and the target variable. Eliminate edges with low relevance. Voilà!

Now, let’s move into the details.

In the first chapter we will explore the need for graph sparsification. In Chapter 2 we’ll make it concrete using an example from the medical domain. Then, in the final chapter, we’ll delve into the core topic: graph sparsification using PMI. Enjoy!

First Chapter: The Need

Graphs, and specifically Knowledge Graphs (KGs), are ubiquitous. The technology is ripening with advances both in Graph DB infrastructure and in theoretical ML tools (e.g., node embeddings).

A critical challenge when working with Knowledge Graphs in the real world is — and in my humble opinion forever will be — density. Even algorithms with polynomial complexity might fail on dense graphs. Looking into the future, as hardware advances, Knowledge Graphs will grow in both number of nodes and number of edges. [Moreover, I predict and hope that in this arms race we will gradually see hierarchical hypergraphs and other monstrosities joining the game.]

The strength of KGs is that they hold complex connections across a large domain of knowledge, e.g., all medical knowledge or all known astronomical objects, and therefore you can use them to enrich simpler features. For instance, if your features are very sparse (e.g., very rare medical conditions), they can be enriched with “neighboring” pieces of information from the graph.

The enormity comes with a drawback: Knowledge Graphs are often too large. Beyond computational complexity, there is another essential issue here. KGs cover everything in the domain, and therefore most of the information is irrelevant to your specific target. In other words, they have very high sensitivity (aka coverage/recall) — and therefore low specificity (aka precision). You might say that this is a common tradeoff in ML; what’s special about graphs?

The answer is that in typical real-world KGs the degree distribution is heavy-tailed, with huge “hubs”: nodes of extremely high degree, i.e., nodes with many edges. You might be familiar with the effect of such hubs in social networks and physics (“six degrees of separation”) and from a Graph Theory perspective (the diameter of random graphs). In our context this means that a small number of such hubs drastically reduces the distance between most nodes in the graph, so any signal in the graph rapidly diffuses all over it, muting any subtler piece of information that might be relevant to your specific objective.

To be concrete, in the next chapter we will describe a use-case we worked on, and then, in Chapter 3, we will describe a method that overcomes the density problem for supervised tasks by removing irrelevant edges. Feel free to jump straight to the third chapter.

Second Chapter: Concrete example

Let’s consider the following supervised task. We’ve simplified the details to focus on the interesting part — the PMI.

The Research Hypothesis

We can forecast potential medical conditions based on a person’s reading history on medical Wikipedia. For example, if an individual has browsed articles on Headache, Smoking, Coughing, and Tooth Discoloration, they may be at increased risk of a particular lung disease. Once more, this is a simplified research scenario and a toy hypothesis.

The Data

  • The independent variables are given as tabular data, in a design matrix X. Each entry is binary, 0/1.
  • Each row represents one patient — i.e., the Wikipedia reading history of one patient, as binary variables — one variable for each medical Wikipedia article (this is similar to NLP’s bag-of-words encoding).
  • Each column represents a medical term from Wikipedia.
  • For instance, for patient p and a Wikipedia article f (e.g., Pneumonia), X[p, f] = 1 means that patient p visited article f, namely, “The patient read the Pneumonia article”. Clearly, patients do not necessarily visit the exact Wikipedia article related to their condition, but we know from previous work that in some cases they search online for symptoms that are related to the real condition. And this is exactly where the knowledge graph becomes handy: “the person read about symptoms that belong to a neighborhood in the graph that is related to a specific medical condition”.
  • We will denote the number of rows (patients) and the number of columns (binary features) — m and n, respectively.
  • The target (dependent variable) y is future existence of a medical condition, e.g., severe COVID-19.
  • In addition, we have a knowledge graph (KG), the complete medical wikipedia, where nodes are the wikipedia articles of the medical terms and edges are hyperlinks. Note that this is a directed graph.
  • For simplicity, assume that every feature (column) is represented in the graph as a single node.
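To make the encoding concrete, here is a tiny sketch of how such a design matrix could be built; the article names and reading histories are invented purely for illustration:

```python
import numpy as np

articles = ["Headache", "Smoking", "Coughing", "Pneumonia"]   # columns: medical Wikipedia articles
histories = [
    {"Headache", "Coughing"},                 # reading history of patient 0
    {"Smoking", "Coughing", "Pneumonia"},     # reading history of patient 1
]

# m x n binary design matrix: X[p, f] = 1 iff patient p visited article f
X = np.array([[int(a in h) for a in articles] for h in histories])
print(X)
# [[1 0 1 0]
#  [0 1 1 1]]
```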

Using the Knowledge Graph to enrich features is a broad topic, so we won't go into detail here. However, one method worth noting is node embedding, where binary features are transformed into a continuous, lower-dimensional vector space.
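One very simple, non-embedding flavour of such enrichment, sketched here only for intuition (the toy graph, the propagation weight 0.5, and all names are assumptions of mine, not the method used in the original project): let each article’s indicator also collect a fraction of the signal from articles the patient read that link to it.

```python
import numpy as np
import networkx as nx

articles = ["Headache", "Smoking", "Coughing", "Pneumonia"]
idx = {a: i for i, a in enumerate(articles)}

# Toy knowledge graph: directed hyperlinks between medical Wikipedia articles
KG = nx.DiGraph([("Coughing", "Pneumonia"), ("Smoking", "Pneumonia"), ("Headache", "Smoking")])

X = np.array([[1, 0, 1, 0]])                     # one patient who read Headache and Coughing

# Enrich: a feature also gets partial credit from features the patient read that link to it
X_enriched = X.astype(float)
for u, v in KG.edges():
    X_enriched[:, idx[v]] += 0.5 * X[:, idx[u]]  # 0.5 is an arbitrary propagation weight

print(X_enriched)   # Pneumonia picks up signal from Coughing; Smoking picks it up from Headache
```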

Now we can use the medical example scenario to build better intuition for the need for sparsification:

  1. Remember the hubs we mentioned above? While some hubs may hold significance for our target, many consist of irrelevant high-degree nodes, such as “List of medical specialities” or “List of medical symptoms”. Such a node amplifies noise in the graph, muting important information about a specific rare symptom. But how can we effectively discern these hubs and identify which of their edges are irrelevant and safe to remove from the graph?
  2. Furthermore, there could be a need to utilize the same database for predicting two distinct target variables, such as Female Breast Cancer and Prostate Cancer in men, which exhibit different behaviors. In such cases, employing the entire knowledge graph may not be advantageous, as certain edges may lack relevance to our specific target label. Therefore, it might be necessary to exclude edges associated with the opposite gender. But how do we systematically remove edges that are irrelevant to a specific target variable?

Third Chapter: The PMI Trick

Our goal is clear: we seek to remove edges within the knowledge graph that lack relevance to our target variable. While multiple mathematical definitions of relevance are available, we have opted to employ Pointwise Mutual Information (PMI) for its simplicity and intuitiveness.

PMI is a fundamental tool from Information Theory, so let’s talk about it: What exactly is PMI? We’ll begin by outlining its definition and then aim to develop a better intuition.

PMI has been described as “one of the most important concepts in NLP” [see 6.6]

PMI: Pointwise Mutual Information

PMI is the pointwise counterpart of the well-known Mutual Information between two discrete random variables: MI is the expected value of the PMI over all outcomes. Given observed outcomes x and y of two random variables X and Y, we define:

pmi(x; y) = log [ p(x, y) / ( p(x) p(y) ) ] = log [ p(x | y) / p(x) ]
Pointwise Mutual Information (from Wikipedia)

The equalities are immediate results of the definition of conditional probability, providing us with distinct perspectives and, ideally, intuition regarding PMI:

If X and Y are independent, then p(x,y)=p(x)p(y). So, the first term might be understood as the ratio between:

  • p(x, y) = point-estimate of the actual joint distribution with dependency, and
  • p(x)p(y) = the joint distribution, assuming independence between the two variables.

Looking at the last term, you might recognize that PMI quantifies “how the probability of x changes, given knowledge of y”, and vice versa.

Let’s do a small exercise, to get more intuition into PMI:

  • Assume 1% of all patients had severe COVID, p(covid) = 0.01
  • Among patients who had pneumonia in the past, 4% got severe COVID, p(covid|pneumonia) = 0.04
  • Then the probability of COVID given pneumonia is higher than without information about pneumonia, and as a result the PMI is high (using a base-2 log): PMI(covid; pneumonia) = log₂(0.04/0.01) = log₂(4) = 2. Very intuitive, right?
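As a sanity check, here are the same toy numbers in code (base-2 logarithm, as in the calculation above):

```python
import numpy as np

p_covid = 0.01                      # P(severe covid)
p_covid_given_pneumonia = 0.04      # P(severe covid | pneumonia)

# pmi(covid; pneumonia) = log2( P(covid | pneumonia) / P(covid) )
pmi = np.log2(p_covid_given_pneumonia / p_covid)
print(pmi)   # 2.0 -> knowing about past pneumonia raises the probability of severe covid four-fold
```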

PMI is beautiful in its simplicity, yet there’s much more to explore about its features, variations, and applications. One noteworthy variant is the normalized PMI, which ranges between -1 and 1. This property enables comparison and filtering across numerous pairs of random variables. Keep this in mind — it will prove valuable shortly.
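For reference, the normalized variant is commonly defined by dividing the PMI by the negative log of the joint probability (this is the standard definition from the literature, not a formula taken from the original project):

npmi(x; y) = pmi(x; y) / ( -log p(x, y) )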

Back to our task

We have a large, dense graph representing links between our binary features, and we have a target variable. How can we sparsify the graph intelligently?

For an edge e between two features v1 and v2, we define an indicator random variable x_e to be 1 if and only if both features have the value 1 (True), meaning the two medical terms co-occur for a patient. Now, consider the edge and the target variable y, and ask the simple question: is this edge relevant for y? We can answer it simply with PMI: if PMI(x_e; y) is very close to zero, the edge holds no information relevant to our target; otherwise, there is some relevant information in this edge.
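Here is a minimal sketch of that computation, assuming a binary design matrix X (patients × articles), a binary target vector y, and hypothetical column indices v1 and v2 for the two endpoints of an edge; the helper name edge_npmi and the base-2 logs are my choices, not the original implementation:

```python
import numpy as np

def edge_npmi(X, y, v1, v2):
    """Normalized PMI between the edge indicator x_e = (X[:, v1] AND X[:, v2]) and the target y."""
    x_e = (X[:, v1] == 1) & (X[:, v2] == 1)   # the edge "occurs" for a patient iff both articles were read
    y = np.asarray(y, dtype=bool)

    p_xy = np.mean(x_e & y)                   # P(edge occurs, positive target)
    p_x, p_y = np.mean(x_e), np.mean(y)       # marginals

    if p_xy in (0.0, 1.0) or p_x == 0 or p_y == 0:
        return None                           # no usable information about this edge (see Caveat 1 below)

    pmi = np.log2(p_xy / (p_x * p_y))
    return pmi / -np.log2(p_xy)               # normalized PMI, in [-1, 1]

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))       # 1000 patients, 50 binary article features
y = rng.integers(0, 2, size=1000)
print(edge_npmi(X, y, v1=3, v2=7))            # close to 0 here, since the data is random
```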

So, to conclude, we remove all edges e with:

| npmi(x_e; y) | < alpha

where alpha is a hyper-parameter; by tuning it you can control the sparsity of the graph (trading off against generalization error, i.e., the risk of over-fitting).
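Putting it together, a sparsification pass could look like the following sketch; it assumes the knowledge graph is a networkx DiGraph whose node identifiers are column indices of X, and the function and parameter names (sparsify, alpha) are mine:

```python
import numpy as np
import networkx as nx

def sparsify(G, X, y, alpha=0.05):
    """Return a copy of G with every edge removed whose normalized PMI with the target
    is below alpha in absolute value (edges with no co-occurrence information are kept)."""
    y = np.asarray(y, dtype=bool)
    H = G.copy()
    for v1, v2 in list(G.edges()):
        x_e = (X[:, v1] == 1) & (X[:, v2] == 1)
        p_xy, p_x, p_y = np.mean(x_e & y), np.mean(x_e), np.mean(y)
        if p_xy in (0.0, 1.0) or p_x == 0 or p_y == 0:
            continue                                  # not enough information -> keep the edge
        npmi = np.log2(p_xy / (p_x * p_y)) / -np.log2(p_xy)
        if abs(npmi) < alpha:
            H.remove_edge(v1, v2)                     # the edge carries (almost) no information about y
    return H
```

Raising alpha removes more edges; lowering it keeps the graph closer to the original.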

Three Caveats — and potential improvements

Caveat 1) The feature space often exhibits sparsity, resulting in zero values in both the numerator and denominator of the PMI; we had better not remove such edges, as we have no information about them whatsoever.

You might ask: if we usually do not remove such edges, are we really “sparsifying” the graph? The answer is in the hubs. Remember those hubs? Their edges will usually NOT have zero counts, precisely BECAUSE they are hubs, so they remain candidates for removal.

Caveat 2) Another good question is: why define the edge-variable as “both features have a value of 1”? Alternatively, we could check if either of the features has a value of 1. Thus, instead of x_e = (x1 AND x2), we could consider x_e = (x1 OR x2). This is a valid point. These different implementations convey slightly different narratives about your understanding of the domain and may be suitable for different datasets. I suggest exploring various versions for your specific use cases.
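In code, the two definitions differ by a single operator (toy matrix, purely for illustration):

```python
import numpy as np

X = np.array([[1, 0],
              [1, 1],
              [0, 1]])            # 3 patients, 2 features (the two endpoints of one edge)

x_and = X[:, 0] & X[:, 1]         # edge occurs only when both articles were read   -> [0 1 0]
x_or  = X[:, 0] | X[:, 1]         # edge occurs when at least one article was read  -> [1 1 1]
```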

Caveat 3) Even if the probabilities are not zero, in the medical domain they are usually very small, so in order to add stability we can define a conditional PMI:

pmi(x; y | z) = log [ p(x, y | z) / ( p(x | z) p(y | z) ) ]
Conditional PMI

In plain English: we calculate the PMI in a probability subspace where a third event occurs.

Specifically, remember that the Knowledge Graph is directed. We will use the cPMI to check whether an edge between two features e = (v1, v2) is relevant, given that the first feature is positive.

In other words, if v1 never occurs, we claim that we don’t even have enough information about the edge to remove it.
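A minimal sketch of that conditioning, under the same assumptions as the earlier snippets (binary X and y, hypothetical column indices v1 and v2, with v1 as the source of the directed edge); again, the names are mine:

```python
import numpy as np

def edge_cpmi(X, y, v1, v2):
    """PMI between the edge indicator and the target, conditioned on the source feature v1 being positive."""
    mask = X[:, v1] == 1                    # restrict to the probability subspace where v1 occurred
    if not mask.any():
        return None                         # v1 never occurs -> no information; keep the edge
    y_c = np.asarray(y, dtype=bool)[mask]
    x_e = X[mask, v2] == 1                  # within that subspace, the edge occurs iff v2 was also read

    p_xy, p_x, p_y = np.mean(x_e & y_c), np.mean(x_e), np.mean(y_c)
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return None
    return np.log2(p_xy / (p_x * p_y))      # conditional PMI (base 2)
```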

Conclusion

Now that we know what PMI is, we understand that in order to sparsify a knowledge graph we can check the pointwise mutual information between the occurrence of each edge and the target variable, and remove all irrelevant edges. Boom! 🎤
