Graph Analytics in Identifying Enduring Effects of Pandemics

Graph analytics framework designed for identifying enduring effects of COVID-19 on Healthcare Systems using Exponential Random Graph Models (ERGMs)

Jinhang Jiang
Towards Data Science


Image by author

Content

  1. Introduction
  2. The Graph Analytics Framework
  3. Explanatory Modeling with Exponential Random Graph Models
  4. Implementation of ERGMs
  5. Conclusion

Introduction

One of our recent studies published in the ACM Transactions on Management Information Systems (ACM TMIS) has shown the potential of graph analytics in identifying the enduring effects of the COVID-19 pandemic. You may read the original article here:

We developed a graph analytics framework to investigate the simultaneous occurrence of multiple diseases in patients, known as disease multimorbidity, before and during the COVID-19 pandemic. By analyzing Electronic Health Records (EHRs) data, we were able to identify patterns and structures that could be used to detect the long-term effects of the virus.

Our study found that graph analytics can provide more accurate and comprehensive insights into the effects of COVID-19 on patients than traditional methods. By capturing complex relationships and dependencies in EHR data, graph analytics can help healthcare providers better understand and address the long-term effects of the virus. These findings also have important implications for the ongoing efforts to combat COVID-19 and can support researchers and policymakers in developing effective interventions and treatments.

The Graph Analytics Framework

Image by author

The proposed framework builds upon existing information systems that analyze hospital discharge records. It adds graph analytics for multimorbidity analysis to traditional analytics practices such as ranking disease prevalence or using data mining to predict outcomes. The framework has three interlinked components: exploratory analysis, explanatory modeling, and predictive modeling. The underlying disease network is built empirically from hospital discharge datasets with diagnosis codes (ICD-10 codes), and it can be converted into an unweighted, undirected network by setting a suitable threshold on the edge weight distribution. In exploratory analysis, the structural properties of the network and its nodes are examined, and newer methods such as DeltaCon or NetSimile can be used to compare networks. Explanatory modeling uses mathematical or statistical graph models (such as ERGMs), while predictive modeling uses embedding vectors of the network's nodes and edges to represent latent features or to fit a high-dimensional non-linear model for prediction.
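To make the network construction step concrete, here is a minimal base R sketch. It is not the study's pipeline: the toy records, the column names, and the threshold `min_count` are all assumptions made up for illustration.

```r
# Toy discharge records: one row per (visit, diagnosis) pair.
# The visit IDs and ICD-10 codes are invented for illustration only.
records <- data.frame(
  visit = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  icd10 = c("E11", "I10", "J44", "E11", "I10",
            "F32", "J44", "I10", "E11", "I10"),
  stringsAsFactors = FALSE
)

# Visit-by-disease incidence matrix (1 if the code appears in the visit).
incidence <- (unclass(table(records$visit, records$icd10)) > 0) * 1

# Disease-by-disease co-occurrence counts: the weighted network.
cooccur <- crossprod(incidence)
diag(cooccur) <- 0

# Keep an edge only if the pair co-occurs at least `min_count` times.
min_count <- 2  # hypothetical threshold, not taken from the study
adjacency <- (cooccur >= min_count) * 1

print(cooccur)
print(adjacency)
```

The resulting 0/1 adjacency matrix is symmetric, so it could then be turned into an undirected network object, for example with network(adjacency, directed = FALSE) from the network package.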

If you are interested in how to get embedding vectors of nodes and edges of the disease network, please refer to the following blog:

Explanatory Modeling with Exponential Random Graph Models

ERGMs are statistical models that analyze networks and explore the patterns and structures that underlie their formation. They are based on the idea that the probability of a specific network structure can be expressed as an exponential function of graph statistics, such as the number of edges or triangles in the network. In this study, we used ERGMs to estimate the probability that two diseases form a link, that is, that they co-occur as a multimorbidity. We then compared the coefficients of different disease categories before and during the COVID-19 pandemic to interpret the enduring effects of the pandemic on healthcare systems.
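For readers who prefer a formula, the general ERGM form can be written as follows. The notation is the conventional one from the ERGM literature and is an assumption here, not copied from the study: y is a network, g(y) is a vector of graph statistics, θ is the coefficient vector, and κ(θ) is the normalizing constant over all possible networks.

```latex
P_{\theta}(Y = y) = \frac{\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y')\}

% Conditional log-odds of a link between nodes i and j,
% holding the rest of the network fixed:
\operatorname{logit} P_{\theta}\left(Y_{ij} = 1 \mid Y_{-ij} = y_{-ij}\right)
  = \theta^{\top} \delta_{ij}(y),
\qquad
\delta_{ij}(y) = g(y^{+}_{ij}) - g(y^{-}_{ij})
```

Here y⁺ and y⁻ denote the network with the link (i, j) present and absent, respectively. The second line is what makes the coefficients interpretable: each one is the change in the log-odds of a link per unit change in the corresponding statistic.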

ERGM parameters are typically estimated with Markov chain Monte Carlo (MCMC) methods, which draw samples from a probability distribution that cannot be computed directly. Here, the MCMC sampler performs a random walk over the space of possible networks, and the parameter estimates are updated iteratively by comparing the statistics of the sampled networks with those of the observed network. The sampler most commonly used for this purpose is the Metropolis-Hastings algorithm, which is itself an MCMC method: it proposes a small change to the current network, such as toggling a single edge, and accepts or rejects the change with a probability determined by the model.
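The sampling idea can be illustrated with a toy Metropolis-Hastings sampler in base R. This is only a sketch under simplifying assumptions, not the sampler implemented in any package: the model has a single edge-count statistic, and the values of `n`, `theta`, `steps`, and `burn` are arbitrary. For this edges-only model the expected density has a closed form, plogis(theta), which gives a convenient sanity check.

```r
set.seed(42)

n     <- 20      # number of nodes (toy size)
theta <- -1      # coefficient on the edge-count statistic
steps <- 50000   # number of proposed toggles
burn  <- 5000    # toggles discarded before recording

adj <- matrix(0, n, n)   # start from the empty undirected graph
densities <- numeric(0)

for (s in seq_len(steps)) {
  # Propose toggling one dyad chosen uniformly at random.
  pair <- sample(n, 2)
  i <- pair[1]
  j <- pair[2]

  # Change in the edge count if the toggle is accepted: +1 or -1.
  delta <- if (adj[i, j] == 1) -1 else 1

  # Metropolis-Hastings acceptance step for a symmetric proposal:
  # accept with probability min(1, exp(theta * delta)).
  if (log(runif(1)) < theta * delta) {
    adj[i, j] <- 1 - adj[i, j]
    adj[j, i] <- adj[i, j]
  }

  # Record the network density every 100 toggles after burn-in.
  if (s > burn && s %% 100 == 0) {
    densities <- c(densities, sum(adj) / (n * (n - 1)))
  }
}

mean_density <- mean(densities)
expected     <- plogis(theta)  # closed form for the edges-only model
cat("sampled density:", mean_density, " expected:", expected, "\n")
```

With dyad-dependent statistics such as triangles, the same loop applies, but `delta` becomes a vector of change statistics and no closed form is available, which is why MCMC is needed in the first place.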

One advantage of ERGMs is that they can incorporate exogenous variables, such as individual-level characteristics, to explain the formation of the network. This allows researchers to test hypotheses about how different factors influence network structure and to make predictions about how the network will evolve over time.

Another benefit of ERGMs is their ability to generate synthetic networks with structural properties similar to the observed data. This can be useful for augmenting limited data sets or simulating the effects of interventions on network structure.
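A minimal sketch of generating synthetic networks in R, assuming the ergm package and its bundled flomarriage example network (Florentine family marriages, not the study's data). The model terms are chosen only to keep the example small.

```r
library(ergm)

# Example data shipped with ergm: marriage ties among Florentine families.
data(florentine)

# A small model: edge count plus a numeric vertex covariate (family wealth).
fit <- ergm(flomarriage ~ edges + nodecov("wealth"))

# Draw 10 synthetic networks from the fitted model.
sims <- simulate(fit, nsim = 10, seed = 42)

# Compare the edge counts of the simulated networks with the observed one.
sim_edges <- sapply(sims, network.edgecount)
obs_edges <- network.edgecount(flomarriage)

print(sim_edges)
print(obs_edges)
```

If the model is reasonable, the simulated edge counts should scatter around the observed count. The same comparison can be made for any other statistic of interest.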

One disadvantage of ERGMs is their computational complexity, particularly for large networks. They can also be unreliable on small or sparse data sets, where there is too little information to estimate the parameters well. The exponential form can sometimes generate unrealistic networks, especially for networks with complex structures or when the model is misspecified, and the results are sensitive to assumptions and modeling choices. This can make it difficult to compare results across studies or to replicate findings with different data sets or methods. Researchers should therefore weigh these limitations and potential biases when using ERGMs to analyze network data.

Implementation of ERGMs

In this section, I will demonstrate some simple code for implementing ERGMs in R. If you are interested in the details, please read the package documentation here: Package ‘ergm’.

The ergm package in R provides a comprehensive set of functions for fitting and analyzing ERGMs. To fit an ERGM, use the ergm() function, which takes a formula with the network object on the left-hand side and the model terms on the right-hand side. For example, the following code fits an ERGM to a network object named g1 with a term for the number of edges and a term for the number of triangles, which captures the transitivity of the network.

library(ergm)

# Fit the model: g1 is a network object; triangle counts closed triads
model <- ergm(g1 ~ edges + triangle)

Here are some additional model terms and arguments that are commonly used with ergm():

  1. nodecov: A model term that adds a single network statistic for each quantitative attribute or matrix column, equal to the sum of attr(i) and attr(j) over all edges (i, j) in the network. (Numeric)
  2. nodefactor: A model term that adds multiple network statistics, one for each of (a subset of) the unique values of the attr attribute (or each combination of the attributes given). (Categorical)
  3. control: An argument, set through control.ergm(), for fine-tuning the fitting procedure. For example, you can set the maximum number of iterations, the random seed, and the number of proposals between sampled statistics.

With these terms and the control argument, the code can be rewritten as follows:

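library(ergm)

# Fit the model: g1 is a network object whose vertex attributes
# include the numeric and categorical variables named below
model <- ergm(
  g1 ~ edges
    + nodecov(~Numeric_Variable1)
    + nodecov(~Numeric_Variable2)
    + nodefactor("Categorical_Variable1", levels = c("a", "b", "c"))
    + nodefactor("Categorical_Variable2", levels = c("a", "b")),
  control = control.ergm(
    MCMC.interval = 10000,  # proposals between sampled statistics
    MCMLE.maxit = 100,      # maximum number of MCMLE iterations
    seed = 42               # random seed for reproducibility
  )
)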

The summary() function can be used to obtain a summary of the fitted model. It takes the fitted model as input and reports the model formula, the coefficient estimates with their standard errors and p-values, and overall fit measures such as the deviance, AIC, and BIC. Diagnostic plots and goodness-of-fit checks are available separately through mcmc.diagnostics() and gof(). For example, the following code generates a summary report for the fitted model:

# Generate a summary report
summary(model)
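As a follow-up, here is a hedged sketch of how the coefficients might be read and how the fit might be checked. It again uses the flomarriage example network bundled with ergm rather than the study's data, and the model is chosen only for illustration.

```r
library(ergm)

# Example data shipped with ergm (not the study's data).
data(florentine)

# Small illustrative model: edge count plus a numeric vertex covariate.
fit <- ergm(flomarriage ~ edges + nodecov("wealth"))

# Coefficients are on the log-odds scale.
coefs <- coef(fit)

# Exponentiating gives the multiplicative change in the odds of a tie
# for a one-unit increase in the corresponding statistic.
odds_ratios <- exp(coefs)

print(coefs)
print(odds_ratios)

# Goodness of fit: simulate networks from the fitted model and compare
# their degree, shared-partner, and distance distributions with the data.
set.seed(42)
gof_result <- gof(fit)
print(gof_result)
```

Calling plot(gof_result) draws the corresponding diagnostic plots. Note that mcmc.diagnostics() only applies to models that were actually fit by MCMC, which happens when the formula contains dyad-dependent terms such as triangle.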

If you are interested in learning how to manipulate the networks in R, please read here:

Conclusion

In this study, we developed a graph analytics framework and an ERGM-based explanatory graph model to examine disease multimorbidity recorded in hospital discharge records in Arizona before and during the COVID-19 pandemic. We observed that while multimorbidity increased by 34.26% and 41.04% for mental disorders and respiratory disorders, respectively, during the peak of the pandemic, the gradients for endocrine diseases and circulatory disorders were not significant. We also found that multimorbidity for acute conditions decreased during the pandemic, while multimorbidity for chronic conditions remained unchanged.

The framework and model can be used on any standardized Electronic Health Records database containing ordered or unordered lists of diagnosis codes. The study informs future research incorporating multimorbidity into problem scenarios such as disease risk prediction and feature engineering for modeling health outcomes. The explanatory graph model can also be used to complement deep learning and data mining modeling approaches. The study has implications for health analytics researchers and policymakers, providing tools for analyzing disease multimorbidity patterns and guidelines for pre-emptive actions for averting public health crises.
