Part 2 of the PGM series: setting up the conceptual understanding needed to build Bayesian networks

This is part 2 of the PGM series, in which I cover the following concepts to build a better understanding of Bayesian networks:
- Computing Conditional Probability from the joint distribution – Reduction and Renormalization
- Marginalization
- Types of structures – Chain, Fork and Collider
- Conditional Independence and its significance – d-sep and Markov Blanket
- Types of reasoning: Diagnostic, Prognostic and Intercausal
From the previous article introducing probabilistic graphical models (PGMs), we understand that graphical models essentially encode the joint distribution of a set of random variables (or simply, variables). The joint distribution, in turn, can be used to compute two other distributions – the marginal and the conditional distribution.
Intuition behind each of these distributions:
- Marginal probability is the probability of a single event or variable with no reference to any specific range of values of any other variable, e.g. P(A).
- In most real-life examples, there are multiple processes at play. Hence, studying the probability of simultaneous events, e.g. A, B and C, is required to draw inferences. This is called joint probability and is denoted by P(A, B, C).
- How the probability of event A changes given prior information about the occurrence of another event B is measured by conditional probability and is written as P(A|B).
Let’s take a quick real-life example:
- Marginal probability – the probability of Recession without any knowledge of a pandemic hitting the world, P(Recession)
- Joint probability – the probability of 'Recession' and 'Pandemic' occurring together, P(Recession, Pandemic)
- Conditional probability – the probability of Recession given Pandemic, P(Recession|Pandemic). The pandemic showed up first, leading to economic slowdown and recession. Hence, conditional probability in this case is interpreted as – given the evidence of a pandemic, how likely is a recession?
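To make the relationship between the three quantities concrete, here is a minimal Python sketch. The numbers are purely illustrative (assumptions, not real data); the only thing being demonstrated is the identity P(A|B) = P(A, B) / P(B).

```python
# Made-up numbers for the Recession/Pandemic example (illustrative only).
p_recession = 0.20                  # marginal: P(Recession)
p_pandemic = 0.05                   # marginal: P(Pandemic)
p_recession_and_pandemic = 0.04     # joint: P(Recession, Pandemic)

# Conditional probability from the definition P(A|B) = P(A, B) / P(B).
p_recession_given_pandemic = p_recession_and_pandemic / p_pandemic
print(p_recession_given_pandemic)   # ~0.8 -> recession becomes far more likely
                                    # once the pandemic is observed
```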
Let's compute the marginal and conditional distributions from the joint distribution using the example below from the previous post:

Conditional distribution:
Let's assume that the joint probability table of the above network looks like the one below:

Let C_1 be the observed state of a variable C. The conditional distribution of variables A and B given the observed state C_1 is calculated by reducing the joint distribution to the observed state and then renormalizing it.
Mathematically, this can be accomplished in two steps as illustrated below:
- Reduction: Eliminate the rows with the unobserved state, i.e. C_0 in this case:

Note that the numbers in the 'Probability' column do not sum to 1, which means the result is not a probability distribution. Reduction therefore yields an un-normalized measure, which calls for the next step – 'Renormalization'.
- Renormalization: Take the sum of the remaining rows and divide each un-normalized measure by that total to get the normalized probability distribution. Note that the numbers in the 'Norm' column now sum to 1. (A short code sketch of both steps follows below.)

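Here is a minimal Python sketch of the two steps. The joint table below is an assumed toy distribution over binary A, B and C (the values are not the ones shown in the tables above):

```python
# A toy joint distribution P(A, B, C); states are coded 0/1 and the
# probabilities are assumptions, not the values in the table above.
joint = {
    # (a, b, c): probability
    (0, 0, 0): 0.10, (0, 0, 1): 0.05,
    (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20,
    (1, 1, 0): 0.10, (1, 1, 1): 0.25,
}

# Step 1 - Reduction: keep only the rows consistent with the evidence C = 1
# (i.e. drop every row where C = 0).
reduced = {(a, b): p for (a, b, c), p in joint.items() if c == 1}

# The reduced values no longer sum to 1, so this is an un-normalized measure.
total = sum(reduced.values())
print(total)                       # 0.6, not 1.0

# Step 2 - Renormalization: divide by the total to obtain P(A, B | C = 1).
conditional = {ab: p / total for ab, p in reduced.items()}
print(conditional)                 # a proper distribution
print(sum(conditional.values()))   # 1.0
```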
Marginal distribution:
Marginalization is the process of producing a distribution over a single variable or a subset of variables from a larger set of variables, without reference to the values of the remaining variables.
To calculate the marginal distribution over A and B, we sum P(A, B, C) over all the states of C:

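A minimal sketch of marginalization, reusing the same assumed toy joint table as above:

```python
# The same assumed toy joint distribution P(A, B, C) (illustrative values).
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05,
    (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20,
    (1, 1, 0): 0.10, (1, 1, 1): 0.25,
}

# Sum P(A, B, C) over every state of C to obtain the marginal P(A, B).
marginal_ab = {}
for (a, b, c), p in joint.items():
    marginal_ab[(a, b)] = marginal_ab.get((a, b), 0.0) + p

print(marginal_ab)                # P(A, B)
print(sum(marginal_ab.values()))  # 1.0 - still a valid distribution
```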
Common network structures:
Let's illustrate the network structures with examples (curated from the author's personal intuition) from the healthcare domain.
How are variables like Age (A), Diabetes (D), High Cholesterol (C) and Glucose (G) associated with Heart Disease (H)? The notation in brackets is used when writing the joint distribution expressions of the structures.
Based on domain/empirical knowledge, let's reason through a doctor's lens and think about how they would link the knowledge of Age with the Diabetes profile – in the absence of any other information, it is reasonable to assume that as age increases, the chance of being diabetic increases.
This association is represented by an edge from Age to Diabetes, where Age is the causal factor for Diabetes and the two variables are directly connected.
Note that the observed node is shaded grey in the below structures.
Let’s extend this to understand the common types of network structures:
- Chain: High Glucose content can lead to an individual developing Diabetes. This is represented by an edge from Glucose to Diabetes, directly connecting the two. Further, if diabetic patients are more likely to suffer from Heart Disease, then there is an edge between those two as well.
The chain structure representing high Glucose as a cause of Diabetes, which in turn can lead to a higher risk of Heart Disease, is shown below:

We can write the joint probability distribution as:
P(G, D, H) = P(H|D) P(D|G) P(G)
The chain structure is further sub-divided into two categories – 1) causal and 2) evidential.
The example illustrated above represents a causal chain structure. The difference between causal and evidential chain structures mainly stems from the direction of the arrows highlighting the cause-effect relationship.
A causal structure goes top-down, i.e. it starts from the cause and traverses through the effects on the variables following it in a chained structure. An evidential structure, however, asks: given the evidence, what can we infer about its cause? In other words, it is a bottom-up relationship.
Trail: Consider the trail from Glucose to Heart Disease. Once Diabetes is known, it tells us everything relevant about the cause of Heart Disease; knowing the Glucose level gives no extra information about Heart Disease if D is known. So, in this structure, Glucose is independent of Heart Disease given Diabetes.
Another way to interpret this structure and the concept of a trail: if we are aware of a person's diabetic status, then it accounts for everything about the risk of heart disease, thereby blocking the flow of information that the Glucose status could have contributed about heart disease in the absence of the Diabetes information.
Rule of thumb: in a chain, any observed node on the path blocks the flow of information, making the variables it connects independent of each other.
It's vital to comprehend how evidence on one variable makes two (related) variables independent of each other. This will help in understanding the concepts of d-separation and the Markov blanket later in the post.
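To see this rule numerically, here is a small Python sketch. It builds the chain's joint distribution from made-up CPDs (all numbers are assumptions) via the factorization P(G, D, H) = P(H|D) P(D|G) P(G), and checks that once D is observed, G carries no extra information about H:

```python
# Illustrative (made-up) CPDs for the chain Glucose -> Diabetes -> Heart Disease.
p_g = {1: 0.3, 0: 0.7}                      # P(G)
p_d_given_g = {(1, 1): 0.6, (0, 1): 0.4,    # P(D=d | G=1), keyed by (d, g)
               (1, 0): 0.2, (0, 0): 0.8}    # P(D=d | G=0)
p_h_given_d = {(1, 1): 0.7, (0, 1): 0.3,    # P(H=h | D=1), keyed by (h, d)
               (1, 0): 0.1, (0, 0): 0.9}    # P(H=h | D=0)

# Joint distribution from the chain factorization P(G, D, H) = P(H|D) P(D|G) P(G).
joint = {(g, d, h): p_g[g] * p_d_given_g[(d, g)] * p_h_given_d[(h, d)]
         for g in (0, 1) for d in (0, 1) for h in (0, 1)}

def p_h1_given(d, g):
    """P(H=1 | D=d, G=g), computed directly from the joint."""
    return joint[(g, d, 1)] / (joint[(g, d, 0)] + joint[(g, d, 1)])

# Once D is observed, G adds no information about H: the values match per row.
print(p_h1_given(d=1, g=0), p_h1_given(d=1, g=1))   # both 0.7
print(p_h1_given(d=0, g=0), p_h1_given(d=0, g=1))   # both 0.1
```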
- Collider/Common Effect: If we know that a person has High Cholesterol, this knowledge alone does not tell us anything about the likelihood of the person being diabetic. So, the two variables are independent, as one doesn't share any information about the other.
However, when they are connected to Heart Disease as shown in the structure below, observing the common effect (Heart Disease) makes Cholesterol share information about the person's Diabetes profile.
Hence, our takeaway from this structure is that the two causes sit in silos until we observe the common effect, i.e. Heart Disease. Observing it creates a relationship between the two variables – Cholesterol and Diabetes – making them dependent on each other.

The joint probability distribution is written as:
P(C, D, H) = P(H|C, D) P(D) P(C)
Trail: Observing Heart Disease or any of its descendants (a variable reached by following the edges emerging from Heart Disease) brings out the dependency between the two causes, while a lack of knowledge of Heart Disease leaves Cholesterol and Diabetes independent of each other.
This concept is beautifully internalized through the water valve example explained by Prof. Daphne Koller. I would highly encourage you to go through these concepts in depth in her PGM specialization course hosted on Coursera.
The water valve analogy for our example is as follows: observing Heart Disease (or its descendants) opens the channel and allows information to flow between Diabetes and Cholesterol, bringing out the dependency relationship.
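The same behaviour can be checked numerically. The sketch below uses made-up CPDs (assumptions, not values from the post) for the collider and shows that Cholesterol and Diabetes stay independent until Heart Disease is observed:

```python
# Illustrative (made-up) CPDs for the collider Cholesterol -> Heart Disease <- Diabetes.
p_c = {1: 0.3, 0: 0.7}                              # P(C)
p_d = {1: 0.2, 0: 0.8}                              # P(D)
p_h1_given_cd = {(1, 1): 0.8, (1, 0): 0.5,          # P(H=1 | C=c, D=d), keyed by (c, d)
                 (0, 1): 0.6, (0, 0): 0.1}

# Joint from the factorization P(C, D, H) = P(H|C, D) P(D) P(C).
joint = {}
for c in (0, 1):
    for d in (0, 1):
        joint[(c, d, 1)] = p_c[c] * p_d[d] * p_h1_given_cd[(c, d)]
        joint[(c, d, 0)] = p_c[c] * p_d[d] * (1 - p_h1_given_cd[(c, d)])

def posterior_d1(evidence):
    """P(D=1 | evidence), where evidence maps 'C'/'H' to an observed state."""
    def consistent(c, d, h):
        values = {'C': c, 'D': d, 'H': h}
        return all(values[var] == state for var, state in evidence.items())
    num = sum(p for (c, d, h), p in joint.items() if d == 1 and consistent(c, d, h))
    den = sum(p for (c, d, h), p in joint.items() if consistent(c, d, h))
    return num / den

# With H unobserved, knowing C tells us nothing about D (valve closed)...
print(posterior_d1({}), posterior_d1({'C': 1}))                # both 0.2
# ...but observing H opens the valve: C and D become dependent.
print(posterior_d1({'H': 1}), posterior_d1({'H': 1, 'C': 1}))  # ~0.43 vs ~0.29
```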
- Fork/Common Cause: In general, Diabetes and Heart Disease are strongly dependent, but if we introduce the variable Age, then we are asserting that given the age of a person, there is no further relationship between the Diabetes status and the risk of heart disease. Simply put, the hidden variable Age explains all the observed dependence between Diabetes and Heart Disease.
That is, if older people are more prone to being diabetic and carrying the risk of heart disease, then Age becomes the common cause for the two effects as shown in the network below:

The joint probability distribution is written as follows:
P(A, D, H) = P(H|A) P(D|A) P(A)
Trail: Let's follow the same water valve analogy to understand the flow of information in this inverted-V structure of the common cause.
If we observe Age, it stops the flow of information between the two variables Diabetes and Heart Disease, making them independent of each other. Conversely, not knowing anything about the variable Age leaves the channel open and lets the information flow, which makes the variables dependent.
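Again, a small numerical check with made-up CPDs (Age is binarized purely for this sketch): Diabetes and Heart Disease are dependent on their own, but become independent once Age is observed.

```python
# Illustrative (made-up) CPDs for the fork Diabetes <- Age -> Heart Disease.
# Age is binarized (0 = younger, 1 = older) purely for this sketch.
p_a = {1: 0.4, 0: 0.6}            # P(A)
p_d1_given_a = {1: 0.5, 0: 0.1}   # P(D=1 | A=a)
p_h1_given_a = {1: 0.4, 0: 0.05}  # P(H=1 | A=a)

# Joint from the factorization P(A, D, H) = P(H|A) P(D|A) P(A).
joint = {(a, d, h): p_a[a]
                    * (p_d1_given_a[a] if d == 1 else 1 - p_d1_given_a[a])
                    * (p_h1_given_a[a] if h == 1 else 1 - p_h1_given_a[a])
         for a in (0, 1) for d in (0, 1) for h in (0, 1)}

def posterior_h1(evidence):
    """P(H=1 | evidence), where evidence maps 'A'/'D' to an observed state."""
    def consistent(a, d):
        values = {'A': a, 'D': d}
        return all(values[var] == state for var, state in evidence.items())
    num = sum(p for (a, d, h), p in joint.items() if h == 1 and consistent(a, d))
    den = sum(p for (a, d, h), p in joint.items() if consistent(a, d))
    return num / den

# With Age unobserved, Diabetes is informative about Heart Disease (valve open)...
print(posterior_h1({}), posterior_h1({'D': 1}))                # 0.19 vs ~0.32
# ...but once Age is observed, Diabetes adds nothing (valve closed).
print(posterior_h1({'A': 1}), posterior_h1({'A': 1, 'D': 1}))  # both 0.4
```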
Conditional Independence and its significance – d-sep and Markov Blanket
The phenomenon in which two variables become independent of each other once a third variable is observed is called d-separation (d-sep).
As per the definition,
Two nodes u and v are d-separated by Z if all trails between them are blocked given Z. If u and v are not d-separated, they are d-connected.
The notion of d-separation makes it easier to understand another key concept – the Markov blanket: the subset of variables that contains all the information necessary to make inferences about the variable in question (e.g. A in the figure below). In a directed network it consists of the node's parents, its children and its children's other parents; given these variables, nothing else in the network adds information about the node's distribution.

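The blanket can be read directly off the graph. The helper below is written for this post (not from any library), and the edge list is a hypothetical DAG standing in for the figure above:

```python
# Read the Markov blanket of a node off a DAG given as directed (parent, child)
# edges: the blanket is the node's parents, its children and its children's
# other parents.
def markov_blanket(node, edges):
    parents = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    co_parents = {p for p, c in edges if c in children and p != node}
    return parents | children | co_parents

# Hypothetical DAG standing in for the figure above: U -> A -> X <- V, X -> Y.
edges = [('U', 'A'), ('A', 'X'), ('V', 'X'), ('X', 'Y')]
print(markov_blanket('A', edges))   # {'U', 'X', 'V'}; Y lies outside the blanket
```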
But why does conditional independence or study of d-sep and Markov blanket hold such significance in the study of Bayesian networks?
Well, that’s because the network encodes the joint distribution of multiple nodes connected with each other.
Using the conditional independence assumption – "each variable is conditionally independent of its non-descendants, given its parents" – the number of parameters needed to specify the joint distribution reduces drastically, thereby reducing the computational complexity (since each variable's distribution depends only on its parent nodes, ignoring everything else).
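A quick back-of-the-envelope count for the chain G → D → H (all variables binary) shows the saving; the arithmetic below is a generic parameter count, not a figure from the post:

```python
# Full joint table over 3 binary variables: 2**n - 1 free parameters.
full_joint_params = 2 ** 3 - 1                      # 7

# Factorized form P(G) P(D|G) P(H|D): a binary node with k binary parents
# needs 2**k free parameters (one per parent configuration).
num_parents = {'G': 0, 'D': 1, 'H': 1}
factorized_params = sum(2 ** k for k in num_parents.values())   # 1 + 2 + 2 = 5

print(full_joint_params, factorized_params)          # 7 vs 5

# The gap explodes with size: a chain of 20 binary variables needs
# 2**20 - 1 = 1,048,575 joint parameters but only 1 + 19 * 2 = 39 when factorized.
print(2 ** 20 - 1, 1 + 19 * 2)
```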
Alright, so we can now leverage the concept of conditional independence to model the dependencies in Bayesian networks, but to achieve what? What kind of reasoning can we do with Bayesian networks?
1) Diagnostic reasoning: Diagnosis is a backward-looking process. It tells you why something happened in the past by looking deeper into the data, analyzing it and identifying patterns. It's equivalent to finding the root cause behind a certain event.
Let's understand it with respect to the common-effect network above, where High Cholesterol and Diabetes both signal a higher risk of heart disease. We diagnose the cause behind the patient's heart disease.
So, what we observe becomes the evidence, and the variables for which we seek reasoning or inference become the query variables. In this case, Heart Disease is the evidence, and Diabetes and Cholesterol are the query variables.
We can derive the following inferences from such a network:
- P(D_1|H_1): probability of being diabetic given that the patient has heart disease
- P(C_1|H_1): probability of being high on cholesterol given that the patient has heart disease
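As a sketch, both diagnostic queries can be computed by enumeration over the collider network, reusing the illustrative CPDs from earlier (all numbers are assumptions, not values from the post):

```python
# Diagnostic queries on the collider network with the earlier illustrative CPDs.
p_c = {1: 0.3, 0: 0.7}
p_d = {1: 0.2, 0: 0.8}
p_h1_given_cd = {(1, 1): 0.8, (1, 0): 0.5, (0, 1): 0.6, (0, 0): 0.1}

# P(H_1), the normalizing constant shared by both queries.
p_h1 = sum(p_c[c] * p_d[d] * p_h1_given_cd[(c, d)] for c in (0, 1) for d in (0, 1))

# P(D_1 | H_1): Bayes' rule, enumerating over the unobserved cause C.
p_d1_given_h1 = sum(p_c[c] * p_d[1] * p_h1_given_cd[(c, 1)] for c in (0, 1)) / p_h1
print(p_d1_given_h1)   # ~0.43 with these numbers

# P(C_1 | H_1): the same pattern with the roles of C and D swapped.
p_c1_given_h1 = sum(p_c[1] * p_d[d] * p_h1_given_cd[(1, d)] for d in (0, 1)) / p_h1
print(p_c1_given_h1)   # ~0.55 with these numbers
```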
2) Prognostic reasoning: Prognosis is a forward-looking process. It is similar to what machine learning algorithms typically do, i.e. given the historical data and associations, it makes a prediction about the future.
Given that a patient is suffering from Diabetes and High Cholesterol (evidence), what is the probability of the patient developing Heart Disease (query)?
The following inferences can be drawn from such a network:
- P(H_1|D_1): probability of having heart disease when the patient is diabetic
- P(H_1|D_1, C_1): probability of having heart disease if the patient is both diabetic and high on cholesterol
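Prognostic queries follow the direction of the edges, so they are simpler to compute. With the same illustrative CPDs as above, P(H_1|D_1, C_1) is read straight off the table for H, and P(H_1|D_1) averages over the unobserved parent C:

```python
# Prognostic queries on the same collider network (illustrative CPDs as above).
p_c = {1: 0.3, 0: 0.7}
p_h1_given_cd = {(1, 1): 0.8, (1, 0): 0.5, (0, 1): 0.6, (0, 0): 0.1}

# P(H_1 | D_1, C_1): both parents observed, so read the CPD of H directly.
print(p_h1_given_cd[(1, 1)])   # 0.8

# P(H_1 | D_1): only one parent observed, so average over the other one using
# its prior (C is independent of D as long as H is not observed).
p_h1_given_d1 = sum(p_c[c] * p_h1_given_cd[(c, 1)] for c in (0, 1))
print(p_h1_given_d1)           # 0.3*0.8 + 0.7*0.6 = 0.66
```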
3) Intercausal reasoning:
It is different from the two discussed above as it involves the flow of information between the two causes.
Upon arrival of the evidence that the patient has a Family History (F) of high cholesterol without Heart Disease, the probability that the patient's heart disease is caused by Diabetes increases sharply as against it being caused by High Cholesterol.
Initially, the two causes were independent, but the evidence of the new variable 'Family History' explains away High Cholesterol as a cause; the probability of the other cause, Diabetes, increases drastically, suggesting that High Cholesterol is not a major cause of Heart Disease for this patient.

With this, we have reached the end of this article. The concepts of graphical networks are generally deemed difficult; I have tried to build intuition by explaining the fundamentals of Bayesian networks in simple terms while retaining the conceptual essence.
My calling came in the form of the PGM specialization on Coursera.
I hope this post serves as your calling and motivates you to learn more about graphical networks in order to unleash their power.
Stay tuned for the next post where we will construct the Bayesian network, learn the parameters and perform inferences to understand our data better.
As always, happy reading, happy learning!!!
References:
http://www-prima.imag.fr/Prima/Homepages/jlc/Courses/2016/ENSI2.SIRR/ENSI2.SIRR.S13.pdf
https://www.coursera.org/specializations/probabilistic-graphical-models