Hands-on Tutorials

Applying Bayesian Networks to Covid-19 Diagnosis

Probabilistic decision making in highly complex scenarios

Alvaro Corrales Cano
Towards Data Science
11 min read · Feb 24, 2021


Photo by Marcelo Leal on Unsplash

Motivation

It is not news that the Covid-19 pandemic has put under strain parts of our societies that we had long taken for granted. Healthcare services are perhaps the clearest example. Our healthcare professionals have faced a virus that they knew very little about, having to make decisions with often scarce information and limited resources. In this context, the Data and AI community has stepped in to contribute to the fight against Covid-19 by leveraging what it does best: processing data to make informed decisions. This has translated into calls for code, Kaggle challenges and even not-for-profit corporate alliances.

Inspired by this collective effort, in this post we show how Bayesian Networks can help decision makers make sense of complex information using a probabilistic approach. In particular, after a brief introduction to the concept of Bayesian Networks, we show how to apply it to a real-world dataset from the Kaggle challenge Diagnosis of Covid-19 and its clinical spectrum. This dataset contains anonymised medical data from patients that were tested for Covid-19 at the Hospital Israelita Albert Einstein in São Paulo, Brazil, the sponsor of the challenge.

In the organiser’s own words, the motivation for the challenge was that, in the context of an overwhelmed health system with possible limitations on performing tests for the detection of SARS-CoV-2, testing every case would be impractical, and test results could be delayed even if only a target subpopulation were tested. The structured knowledge that Bayesian Networks provide is very well suited to this type of situation. As we show in this blog, with Bayesian Networks complex decision making becomes much easier.

An introduction to Bayesian Networks

Bayesian Networks (BN) are a well-established technique for handling uncertainty within the AI community, to the point that some consider them a capstone of modern AI. As professor Stuart Russell puts it, “BN are as important to AI and ML as Boolean circuits are to computer science.” While one could argue about the extent of this quote, it is clear that BNs are quite popular in the AI community. But what exactly are BNs?

BN are a class of Probabilistic Graphical Models renowned for their capability to reason under uncertainty. They can be seen as a probabilistic expert system: the domain (business) knowledge is modelled as a directed acyclic graph (DAG). The DAG’s links, or arcs, represent the probabilistic dependencies between the nodes/variables in the domain. These dependencies don’t need to be causal; they can represent any form of correlation or dependency. To put it more formally, a BN is a Joint Probability Distribution (JPD) over a set of random variables. It is represented by a DAG, where the nodes are identified with random variables and the arcs express the probabilistic dependence between them.

BNs have been successfully implemented in different industries. Some of their many applications include risk assessment (cancer, water, nuclear safety…), industrial process simulation, monitoring the health of machines (troubleshooting, defect and failure detection, etc.) and predictive maintenance.

More generally, BNs can be used as a tool for diagnosis, prediction (aka prognosis) or probable explanations of an observation. In a diagnosis application, we reason from consequences to causes along the graph (usually against the direction of the arcs in the BN). For example, in our dataset, observations about blood cell counts can be used to update the expert’s belief about the need for admission to ICU. Once we have the BN, we can also start reasoning under uncertainty or, in other words, asking what-if questions through probabilistic inference. This consists of computing a posterior such as P(Covid-19 = positive | Platelets = 3).

Going back to our initial motivation, let’s consider the following thought experiment. Take an Intensive Care Unit (ICU) of a hospital during the current pandemic. Healthcare professionals need to have an idea of whether a person will require intensive care in the immediate future to better allocate their resources in moments of great strain. The admission to ICU can be due to Covid-19 or not. Some bioindicators are currently associated with a higher probability of having Covid-19, and can be relatively easily identified through standard tests (see for example this paper by Lippi and Plebani [2020]). Similarly, other indicators not necessarily linked to Covid-19 can also be related to other comorbidities that increase the probability of being admitted to ICU. Such a situation can be modelled using a BN, whose DAG structure is shown in the image below (the image is purely illustrative):

Illustrative DAG — Image by the authors

The BN associates a conditional probability table (CPT) to each node given its parents in the DAG. In our example, the node in the top layer, Bordetella pertussis, doesn’t have any parents; Adenovirus’s parent node is Bordetella pertussis, while its children are CoronavirusNL63 and Strepto A; and in turn Strepto A doesn’t have any children. In this formalism, we consider variables to be discrete. The values of each variable are the possible modalities, or states, that the variable can take, such as 0 or 1 for covid_19.

More importantly, the BN factorises the JPD 𝑃 as the product of these CPTs. For the network depicted above, our JPD would be:

P (Bordetella pertussis, Adenovirus, CoronavirusNL63, Coronavirus HKU1, … , Parainfluenza 1, Chlamydophilia pneumoniae) = P (Bordetella pertussis) x P (Adenovirus | Bordetella pertussis) x P (CoronavirusNL63 | Adenovirus, Bordetella pertussis) x P (Coronavirus HKU1 | CoronavirusNL63, Bordetella pertussis) x … x P (Parainfluenza 1 | Bordetella pertussis, Rhinovirus/Enterovirus) x P (Chlamydophilia pneumoniae | Parainfluenza 1, Bordetella pertussis)
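To make the factorisation concrete, here is a minimal Python sketch with a hypothetical three-node chain A → B → C and made-up probabilities (none of these numbers come from the dataset): any entry of the JPD is simply the product of the corresponding CPT entries.

```python
# Hypothetical three-node chain A -> B -> C with binary states 0/1.
# All numbers are invented for illustration.
p_a = {0: 0.7, 1: 0.3}                       # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},          # P(B | A)
               1: {0: 0.4, 1: 0.6}}
p_c_given_b = {0: {0: 0.8, 1: 0.2},          # P(C | B)
               1: {0: 0.3, 1: 0.7}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(A=a) * P(B=b | A=a) * P(C=c | B=b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The factorised entries still sum to 1 over all 2**3 configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(joint(1, 1, 0), 6))  # 0.054
print(round(total, 6))           # 1.0
```

With 111 variables the same chain-rule product applies; only the number of factors grows.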

It is possible to obtain these CPTs in two different ways: we can either use a frequentist approach, learning a structure from a historical dataset, or we can elicit the CPTs from our domain knowledge. Oftentimes, the most powerful approach will be a combination of both. This is an especially useful characteristic of BN: in practice, we would first want to see what the data has to say, so we would use the frequentist approach to learn a structure from it. This will almost always give us new insights into our data and even challenge our previous assumptions about it. Once we have our network, we can apply our domain knowledge to confirm or reject what we have computed via the frequentist approach if we know that certain links have to be there, or that some are just impossible. For instance, in a healthcare context, we may want to impose a structure such that admission to hospital is the consequence rather than the cause of an illness.
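The frequentist side of this is essentially conditional frequency counting: each CPT is estimated from how often each state of a node occurs for each combination of its parents’ states. A small sketch with invented records (the column names are placeholders, not the dataset’s):

```python
import pandas as pd

# Invented records: one parent ("covid") and one child ("ward").
df = pd.DataFrame({"covid": [0, 0, 0, 0, 1, 1, 1, 1],
                   "ward":  [0, 0, 0, 1, 0, 1, 1, 1]})

# P(ward | covid): normalised counts within each parent configuration.
cpt = (df.groupby("covid")["ward"]
         .value_counts(normalize=True)
         .unstack(fill_value=0))
print(cpt.loc[0, 1], cpt.loc[1, 1])  # 0.25 0.75
```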

Another important characteristic of BN is the memory gains we get from them when doing probabilistic reasoning. This comes in quite handy when dealing with wide datasets. In the case of our 111-column dataset, we would need to store 6,347,497,291,776 entries for the full JPD, which is the product of all modality sizes in the JPD. Using the BN, we can encode this compactly as we saw in the factorisation above, meaning that we now only need to store 817 entries. This is a memory gain of about 99.99999999%!
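The arithmetic behind these memory figures can be sketched generically: the full JPD needs one entry per combination of states, while the BN stores one CPT per node, sized by the node’s cardinality times its parents’. The ten-variable chain below is hypothetical, just to show the scaling:

```python
from math import prod

def jpd_size(cards):
    """Entries in the full joint: the product of all modality counts."""
    return prod(cards)

def bn_size(cards, parents):
    """Entries in the CPTs: each node's cardinality times its parents'."""
    return sum(c * prod(cards[p] for p in ps)
               for c, ps in zip(cards, parents))

cards = [2] * 10                            # ten binary variables
parents = [[]] + [[i] for i in range(9)]    # a chain X0 -> X1 -> ... -> X9
full = jpd_size(cards)                      # 2**10 = 1024 entries
compact = bn_size(cards, parents)           # 2 + 9 * 4 = 38 entries
print(full, compact, f"{100 * (1 - compact / full):.1f}% saved")
```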

The graphical structure also encodes very interesting information that can be used to derive insights about the data: every node is conditionally independent of its non-descendants given its parents in the DAG. Based on this conditional independence, one can also show that a node X is conditionally independent of all other nodes given its Markov blanket, i.e. the union of the parents of X, the children of X and those children’s other parents. In other words, all we need to know to update our belief about a variable is its Markov blanket.
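The Markov blanket can be read straight off the DAG: a node’s parents, its children, and its children’s other parents. A short sketch over an invented graph (not the learned network):

```python
def markov_blanket(node, parents):
    """Markov blanket of `node` in a DAG given a dict node -> list of parents."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:           # add the children's other parents
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

# Invented DAG for illustration only.
dag = {"A": [], "B": ["A"], "covid_19": ["A"],
       "ward": ["covid_19", "B"], "C": ["ward"]}
print(markov_blanket("covid_19", dag))  # {'A', 'ward', 'B'} (set order varies)
```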

For instance, in our example dataset, to update our knowledge on the probability that a patient will test positive for Covid-19, we would only need information on the variables Inf A H1N1 2009, Influenza B, Platelets, Coronavirus HKU1, Respiratory Syncytial Virus, regular_ward, and Rhinovirus/Enterovirus.

Markov blanket of covid_19 variable — Image by the authors

BNs also have some caveats. Their power lies in modelling probabilistic interactions in complex systems, where human reasoning is weak. BNs are mathematically sound and grounded in theory, providing a normative approach to dealing with uncertainty. However, they are not particularly powerful tools for classification tasks in general, although they can be used as classifiers. This is due to the loss of information we impose by working with states, i.e. discrete variables. There is ongoing research into more effective ways of discretising continuous variables for use in BNs, but the area is relatively new and outside the scope of this blog.

Application to a real-world dataset

Both the structure and the parameters (CPT) of a BN can be learned from a dataset. In our case, the Hospital Israelita’s dataset is composed of 5644 observations, each corresponding to a patient that was tested for Covid-19. Alongside the results of the test, we’ve got information on whether or not the person was admitted to a regular ward of the hospital, semi-intensive or intensive care, as well as the results of various other health tests. Note that in this application, we will think of our dataset as a set of random variables with probabilistic interactions.

The structure of the dataset reflects the reality of a hospital: no two patients are the same, so doctors often need to prescribe different types of tests for each patient upon admission. Hence, our dataset contains a large number of missing values — in some columns, as much as 99% of the total. There are several well-known techniques in Data Science for dealing with missing values while keeping as much information as possible. On top of the classic mean/median/mode imputation techniques, clustering and PCA algorithms can be used to fill these missing values when working with BNs. Since the purpose of this blog is mainly educational and we want to keep it as simple and clear as possible, we simply drop all columns with more than 95% missing values. Among the remaining variables, discrete ones are filled with the value -999, while continuous ones are filled with the median. This is motivated by the categorical nature of BNs: filling discrete variables with -999 effectively creates another state, leaving the others untouched. Given that all our continuous variables have zero mean and a standard deviation of one, filling them with the median ensures that we are still able to capture the variation from the actual values when we discretise them.
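In pandas, these cleaning rules can be sketched as follows. The mini-DataFrame is invented for illustration; the real dataset has 111 columns:

```python
import pandas as pd

# Invented mini-dataset: a categorical test result, a standardised
# continuous measure, and a column that is entirely missing.
df = pd.DataFrame({
    "influenza_b": ["detected", None, "not_detected", None],
    "platelets":   [0.8, None, -1.2, 0.1],
    "all_missing": [None, None, None, None],
})

# 1) Drop columns with more than 95% missing values.
df = df.loc[:, df.isna().mean() <= 0.95].copy()

# 2) Discrete columns: fill with a sentinel, effectively adding a state.
df["influenza_b"] = df["influenza_b"].fillna(-999)

# 3) Continuous columns: fill with the median.
df["platelets"] = df["platelets"].fillna(df["platelets"].median())
print(df)
```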

As we’ve said before, our BN works better with discrete variables, so we need to discretise our continuous ones. For the purpose of our analysis, we have split the data into four asymmetrically bounded percentile buckets. Given that the amount of missing data is substantial and that we filled values with the median, we are mostly interested in variation at the extremes of the distribution. Therefore, our first bucket is bounded by percentiles 0 and 5, whereas our last bucket is bounded by percentiles 95 and 100. Everything in between is split at the 50th percentile, forming our two middle buckets.
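The percentile bucketing can be sketched with pandas, using quantiles as bin edges (the series below is simulated standard-normal data, not the hospital’s):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=1000))   # a standardised continuous variable

# Asymmetric buckets: [0, 5), [5, 50), [50, 95), [95, 100] percentiles.
edges = values.quantile([0, 0.05, 0.5, 0.95, 1.0]).values
buckets = pd.cut(values, bins=edges, labels=["0", "1", "2", "3"],
                 include_lowest=True)
print(buckets.value_counts().sort_index())  # 50 / 450 / 450 / 50 observations
```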

And now… we are ready to estimate our first BN! As an initial attempt, if we learn the structure from the data alone, without any business or domain knowledge, we get the following:

Learned DAG with no business domain constraints — Image by the authors

The learned structure may contain some relations that don’t make sense from a domain knowledge point of view. The BN framework is flexible enough to allow a domain expert to intervene and introduce human knowledge into the learning process. Additionally, thinking in terms of prediction, we may want to enforce certain arcs from some predictive variables to our target variable(s). After imposing some further constraints, we get a slightly different BN, as shown in the image below:

DAG after including business domain constraints — Image by the authors

Once we have fine-tuned the structure of our BN with our domain knowledge, we can start querying it about the learned CPTs. Let’s have a look at the CPT for our regular ward admission indicator. For this particular structure (not necessarily the optimal one), the CPT is determined only by the Covid-19 and Influenza B diagnoses. In terms of interpretation, we could say that a patient who tested negative for both Covid-19 and Influenza B has a probability of only 2.95% of needing admission to the regular ward of the hospital.

CPT of regular_ward variable — Image by the authors

Remember that our BN allows us to ask about what-if scenarios. For instance, we can ask: what’s the probability that a patient will test positive for Covid-19 given that she presents these levels of indicators A and B? The answer to this question can be found in the posterior for the variable covid_19. Let’s evaluate this posterior for a patient who has tested positive for Influenza B but not for Coronavirus HKU1, and who is in bucket number 1 of the Platelets variable. The probability that the patient has Covid-19 is as high as 52.89%:

Posterior probability of having Covid-19 for a patient with Influenza B but without Coronavirus HKU1 — Image by the authors

It’s worth noting that the probabilistic engine used only nodes from the Markov blanket to derive this information. To verify this, let’s play again with the Markov blanket of the variable covid_19. As we showed above, the corresponding Markov blanket is composed of the variables Inf A H1N1 2009, Influenza B, Platelets, Coronavirus HKU1, Respiratory Syncytial Virus, regular_ward, and Rhinovirus/Enterovirus. For any set of values of these variables, we can get a posterior probability for the variable covid_19. In this particular example, we have given the Markov blanket the following random values {‘Platelets’: ‘3’, ‘Inf A H1N1 2009’: ‘not_detected’, ‘Influenza B’: ‘detected’, ‘Respiratory Syncytial Virus’: ‘detected’, ‘Coronavirus HKU1’: ‘detected’, ‘Rhinovirus/Enterovirus’: ‘not_detected’, ‘regular_ward’: 0}, which yielded the posterior in the image below:

Posterior probability for covid_19 variable given its Markov blanket — Image by the authors

This combination of values in the Markov Blanket implies a probability that the patient tests positive for Covid-19 of 56.21%. Let’s now add a couple of observations of nodes outside of the Markov blanket and see the change. For example, let’s add Metapneumovirus and Influenza A rapid test:

Posterior probability of having Covid-19 given parameters outside its Markov blanket — Image by the authors

…which yields no effect on our posterior probability for Covid-19. Note the power of this property of BNs: in the context of a big and potentially complex network, we only need to know about a smaller subset of nodes to model what-if scenarios about a variable of interest.

Main takeaway

In this blog we introduced Bayesian Networks and highlighted how relevant they are for decision making in uncertain situations, with a specific application in the context of the current pandemic.

As a graphical representation, these expert modelling tools are very intuitive and greatly simplify the analysis of complex systems. Their value resides mainly in the network structure and probabilistic linkages that they unveil in complex datasets. This makes them extremely beneficial for decision makers subject to great uncertainty, as has already been proven in many industries and as we have demonstrated with our example dataset.

The code for this analysis is available in our Github repo.

We used the pyAgrum library to do the analysis. Visit its website to learn more about programming Bayesian Networks in Python.

Authors:

  • Hamza Agli is an international Data Science and AI leader with a PhD in Artificial Intelligence and Decision Making.
  • Álvaro Corrales Cano is a Data Scientist. With a background in Economics, Álvaro specialises in a wide array of econometric techniques, including causal inference, discrete choice models, time series and duration analysis.


