Statistical Analysis of Biomedical Data — an Overview

From p-values and regression to clustering and classification

SudoPurge
Towards Data Science


The last decade has seen an enormous boom in the production of patient data. From wearables like smartwatches logging our heart rates to RNA sequencing for differential expression, our ability to monitor and observe individual patients’ health has never been more data-intensive. While this enables extensive statistical analysis of clinical data for diagnosis and research, statistical interpretations can be quite tricky, to say the least. Being able to understand and use statistical tools correctly is a powerful addition to your arsenal. Former British Prime Minister Benjamin Disraeli supposedly once said, “There are three kinds of lies: lies, damned lies, and statistics”, although the actual origin of this quote is still unclear.

The Basics

Left: Example of a univariate distribution plot. Right: Example of a multivariate distribution plot. Figure created by the author.

All data is either numerical or categorical. Numerical data is either continuous (it can be divided into ever-smaller units, like drug concentrations) or discrete (it cannot, like integer counts). Categorical data, on the other hand, comes in three types: binary (yes or no, present or absent), nominal (named, like the 20 amino acids), and ordinal (named and with an innate order, like disease severity).

Univariate statistics deals with one variable of interest, while multivariate statistics deals with multiple variables. If we look at the distribution of heart rates in a group of patients, for example, that's univariate. But if we plot heart rate on one axis against patient age on another, that's multivariate. Some authors reserve the term multivariate for cases where the dependent variable is measured as a function of two or more independent variables, as opposed to one.

To describe the central tendency of a distribution, statisticians use point estimates like the mean, median, or mode. To describe the spread of the distribution, the standard deviation or variance is commonly used. The standard deviation in particular measures how much the dataset varies around the mean value.
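
A minimal sketch of these descriptive statistics in Python, on made-up heart-rate values:

```python
import statistics
import numpy as np

# Hypothetical resting heart rates (beats per minute) for a small cohort
heart_rates = np.array([62, 68, 71, 71, 74, 76, 80, 85, 90, 110])

mean = heart_rates.mean()                        # central tendency: arithmetic mean
median = np.median(heart_rates)                  # central tendency: middle value, robust to outliers
mode = statistics.mode(heart_rates.tolist())     # central tendency: most frequent value
std = heart_rates.std(ddof=1)                    # spread: sample standard deviation
var = heart_rates.var(ddof=1)                    # spread: sample variance (std squared)

print(f"mean={mean:.1f}, median={median:.1f}, mode={mode}, sd={std:.1f}, var={var:.1f}")
```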

Gaussian “Normal” Distribution. Mean is at 0. Figure created by the author.

This bell-shaped distribution is also called the Gaussian distribution, named after the German mathematician Carl Friedrich Gauss. Much clinical data approximately follows this normal distribution. Because it is observed in so many instances in nature, it has become an underlying assumption in most statistical approaches. Why does it show up so much? The central limit theorem: when many small, independent effects add up (or when means are taken over larger and larger samples), the resulting distribution tends toward the normal curve, and the sample mean gets closer and closer to that of the target population.

A low standard deviation indicates that the data points lie in a narrow range close to the mean. A high standard deviation indicates a wide range, with data points further from the mean. For a normal distribution, 68% of the values fall within one standard deviation of the mean, 95% within two, and 99.7% within three standard deviations. These percentages are the probability that a particular value falls within that range.
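
These percentages come straight from the normal cumulative distribution function; a quick check with SciPy:

```python
from scipy.stats import norm

# Probability mass of a standard normal within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {coverage:.1%}")
# within 1 SD: 68.3%, within 2 SD: 95.4%, within 3 SD: 99.7%
```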

The P-Value Conundrum

Inferential statistics builds upon descriptive statistics to determine whether the observed difference between two means is due to random chance alone or to an underlying hypothesized factor (also called the experimental hypothesis). The classic tool for this, for a single variable, is the t-test. For instance, to determine whether blood oxygen saturation levels differ between smokers and non-smokers, one may use the unpaired t-test. A paired t-test is used when the same group is observed under two different conditions, such as comparing muscle oxygen saturation levels of the same participants before and after exercise.

Blood O2 Saturation Levels. Figure created by the author.
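
A minimal sketch of both flavours with SciPy, on made-up oxygen-saturation values (unpaired for smokers vs. non-smokers, paired for before vs. after exercise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical SpO2 (%) for two independent cohorts: unpaired (independent) t-test
smokers = rng.normal(94, 2, size=30)
non_smokers = rng.normal(97, 2, size=30)
t_unpaired, p_unpaired = stats.ttest_ind(smokers, non_smokers)

# Hypothetical muscle O2 saturation for the *same* participants before and after exercise: paired t-test
before = rng.normal(60, 5, size=25)
after = before - rng.normal(8, 3, size=25)   # same subjects, measured again
t_paired, p_paired = stats.ttest_rel(before, after)

print(f"unpaired: t={t_unpaired:.2f}, p={p_unpaired:.4f}")
print(f"paired:   t={t_paired:.2f}, p={p_paired:.4f}")
```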

The t-test renders a test statistic from which a p-value is computed: the probability of observing a difference at least as large as the one measured if the null hypothesis were true (that is, if there were no real difference due to the underlying hypothesized factor and the observed difference arose purely by random chance). A predetermined threshold is also chosen (usually 0.05 in medical studies); if the p-value falls below it, the observed difference is considered too unlikely to be explained by chance alone. For instance, if the p-value turns out to be 0.04, there is only a 4% chance of seeing a difference this large under the null hypothesis, suggesting that the two means are indeed different. This is when we say the difference is statistically significant and reject the null hypothesis, consequently accepting the experimental hypothesis.

CI = Confidence Interval. Alpha = 0.025+0.025 = 0.05. Notice how the interval ends at 2 standard deviations away from 0 (mean). Recall that 95% of the data can be found under the normal curve within 2 standard deviations, leaving the 5% outside. Figure created by the author.

The predetermined threshold is called the alpha value. It is usually 0.1, 0.05, or 0.01, and it corresponds to a confidence interval (of level 1 − alpha), a range of values expected to contain the true difference between the two groups. If this interval contains 0, we fail to reject (colloquially, accept) the null hypothesis, as the true difference could plausibly be 0. If the interval does not contain 0, there must be some difference between the two groups. A higher alpha value gives a narrower (seemingly more precise) interval, but with lower confidence that the interval actually contains the true difference. There is a trade-off, which is why 0.05 is the most widely accepted value. So one could wonder: if a study ends up with a p-value of 0.09, just 0.01 below the alpha value of 0.1, how significant is the difference in practice? Is it really “significantly” significant? This is when statistics can be really tricky and deceptive. That’s why it is always best to report the exact p-value, regardless of significance.

The t-test assumes random and independent sampling. Because it also assumes a Gaussian distribution, it cannot be used blindly on other types of distributions. The Mann-Whitney U-test (also called the Wilcoxon rank-sum test) can be used instead: it compares the two cohorts by ranks (roughly, their medians) and renders a p-value that can be compared against the alpha value for significance. It is more robust than the t-test because it does not assume normality, although it is typically somewhat less powerful when the data really are normal.
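
A sketch of the rank-based alternative with SciPy, on made-up skewed (log-normal) biomarker concentrations where the t-test's normality assumption would be questionable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical skewed biomarker concentrations (log-normal, so far from Gaussian)
group_a = rng.lognormal(mean=1.0, sigma=0.6, size=40)
group_b = rng.lognormal(mean=1.4, sigma=0.6, size=40)

# Mann-Whitney U (Wilcoxon rank-sum) test: compares the samples via ranks, no normality assumed
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U={u_stat:.1f}, p={p_value:.4f}")
```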

But what if we have too few data points? Another way to test the significance of the difference is a permutation test. First, we calculate the difference between the means of the two cohorts. Then we simply shuffle and mix the data points from the two groups and find the new means and their difference. We repeat this process a few hundred times and plot the distribution of these differences, which is typically centered around zero and roughly bell-shaped.

Original difference = 10. Number of times repeated = 240. Degree of rarity = 1/240 = 0.004 < alpha(=0.05). Therefore original difference is statistically significant. Figure created by the author.

If the original difference is rare enough compared to the differences between the randomly shuffled groups, then the difference between the two groups is statistically significant. Degree of rarity = (number of times the shuffled difference was at least as big as the original difference) / (number of times the difference was calculated); this is effectively an empirical p-value. The degree of rarity is compared against an alpha value, and if it is smaller, the original difference is considered statistically significant.
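
A minimal NumPy sketch of this shuffling procedure (the data and the number of shuffles are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two small hypothetical cohorts
group_a = np.array([12.1, 10.4, 13.8, 11.9, 12.6])
group_b = np.array([9.7, 10.1, 8.9, 10.8, 9.5])
observed_diff = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_perm = 1000

# Shuffle the group labels many times and record the difference of means each time
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Degree of rarity: how often a shuffled difference is at least as large as the observed one
rarity = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"observed difference = {observed_diff:.2f}, degree of rarity (empirical p) = {rarity:.3f}")
```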

So far we have discussed inferential statistics for comparing two cohorts. But more complex studies may have more than two cohorts to compare, each sampled independently and normally distributed. This is when the ANOVA (analysis of variance) test is useful. It compares the variance both within and between groups. There are 1-way, 2-way, 3-way, and n-way ANOVA tests, which determine whether the group means are affected by 1, 2, 3, or n independent variables. These tests render an F-value that follows the F distribution, which is a right-skewed distribution (not a Gaussian). The p-value of the observed F-value is read off this distribution. Similar to the t-test, if the p-value is smaller than alpha, the variation between groups is significant, i.e. at least one group mean differs from the others. Post-hoc tests like Tukey’s or Bonferroni’s are then used to determine exactly which groups are statistically different, once ANOVA rejects the null hypothesis.

Randomly generated F-distribution. F-value = 4. Figure created by the author.
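
A sketch of a one-way ANOVA across three made-up treatment groups with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical response measured in three independently sampled treatment groups
placebo = rng.normal(100, 10, size=20)
low_dose = rng.normal(105, 10, size=20)
high_dose = rng.normal(112, 10, size=20)

# One-way ANOVA: do the group means differ more than within-group variability would suggest?
f_value, p_value = stats.f_oneway(placebo, low_dose, high_dose)
print(f"F={f_value:.2f}, p={p_value:.4f}")

# If p < alpha, a post-hoc test (e.g. Tukey's HSD) pinpoints which groups differ:
# stats.tukey_hsd(placebo, low_dose, high_dose)  # available in recent SciPy versions
```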

Oftentimes there are outliers in the dataset, and these heavily distort the mean. Sometimes the distribution is skewed or multimodal rather than normal. Other times, binomial or exponential distributions are observed. Sequence alignment studies, for example, effectively sample only the extreme tail of a score distribution, which no longer looks normal at all. In these cases, standard statistical tests like the t-test are hard to justify, as they usually assume a normal distribution.

This is when you would want to look into changing the shape of the distribution to make it resemble the normal curve as much as possible, i.e. normalization. Depending on the scope and demands of the study, one may decide to exclude outliers completely, or cap all outliers at a maximum value. Applying a log transformation (plotting the x-axis on a log scale) also makes the distribution more normal, as it pulls the outliers closer to the bulk of the data and makes it more symmetric. Substance concentrations often vary between the milli, micro, and nano scales, resulting in multimodal distributions; using an appropriate, uniform scale can normalize such a distribution into a unimodal one.
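
A sketch of both ideas with NumPy: capping (winsorizing) outliers at chosen percentiles and log-transforming a skewed concentration variable (the cut-offs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical right-skewed concentrations spanning several orders of magnitude
concentrations = rng.lognormal(mean=0.0, sigma=2.0, size=500)

# Option 1: cap outliers at the 1st and 99th percentiles
lo, hi = np.percentile(concentrations, [1, 99])
capped = np.clip(concentrations, lo, hi)

# Option 2: log-transform to pull extreme values toward the bulk of the data
logged = np.log10(concentrations)

print(f"raw:   mean={concentrations.mean():.2f} vs median={np.median(concentrations):.2f}")
print(f"log10: mean={logged.mean():.2f} vs median={np.median(logged):.2f}")
```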

The Correlation Coefficient and Regression Analysis

Another common research objective is to measure how much changes in one variable explain changes in another. Generally, scatter plots are used to illustrate this relation. There are two key statistics in this area: correlation coefficient and linear regression analysis. It’s a common mistake to use them interchangeably but they do have some distinguishing features.

Left: Perfect positive correlation. Middle: Perfect negative correlation. Right: No correlation. Figure created by the author.

The correlation coefficient (R) measures the strength and direction of the relation. Its value ranges from -1 to 1, and the strength of the correlation is represented by how close the R-value is to 1 or -1. If R = 0.9, there is a very strong positive correlation: as X increases, Y increases almost in lockstep. If R = -0.8, there is a strong negative correlation: as X increases, Y decreases. If R ≈ 0, changes in X tell us little about changes in Y. Correlations are useful for assessing predictions, estimators, simulations, or dependencies. Too few samples lead to unreliable estimates, and the statistic can be abused by selecting only the extreme data points and leaving out the ones in the middle to get a misleadingly high R-value. Naturally, the more data points we have, the more confidence we have in the analysis. But this can be abused too: with sufficiently large datasets, a statistically significant p-value is easy to obtain even when the effect is not “significantly” significant in practice.
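
Computing R (and the p-value that comes with it) takes one line with SciPy; a sketch on made-up paired measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements, e.g. age (years) vs. systolic blood pressure (mmHg)
age = np.array([25, 31, 38, 44, 50, 57, 63, 70])
systolic_bp = np.array([112, 115, 121, 124, 130, 133, 139, 145])

r, p_value = stats.pearsonr(age, systolic_bp)   # R in [-1, 1], plus a significance p-value
print(f"R = {r:.3f}, p = {p_value:.4g}")
```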

R is often confused with R² (the coefficient of determination). R² is more closely tied to linear regression: it measures how well the variance of the X (independent) variable explains the variance of the Y (dependent) variable, and it ranges from 0 to 1. Linear regression renders a whole equation that is used to predict this effect of X on Y. Unlike correlation, the variables are not interchangeable (even though both quantities can be computed algebraically either way): the X variable is the explanatory variable and the Y variable is the response variable. X explains the changes in Y; Y responds to the changes in X. Not the other way around.

Left: Linear regression plot. Right: Assessment of predictor function derived from the left plot. Derived predictor linear regression function: y = 0.181x + 0.008. R = 0.973. R² = 0.947. Figure created by the author.

In the example above, we compare the effect of changing enzyme concentration on haemoglobin concentration. There is a strong positive correlation (R = 0.973), and 94.7% of the variation in haemoglobin concentration is explained by the enzyme concentration (R² = 0.947). Linear regression analysis gives us the equation: [haemoglobin] = 0.181[enzyme] + 0.008. Using this equation, we can predict the haemoglobin concentration for any enzyme concentration. In the next plot, we assess the performance of this predictor function against the observed haemoglobin concentrations. Ideally, a perfect predictor function would place all the points on the diagonal from (0,0) to (max(x), max(y)). As most of the points in the graph do fall on this diagonal, the performance of the predictor function is very high, as indicated by the R² value. Linear regression is the simplest form of statistical modeling. Other regression models include logistic, polynomial, Bayesian linear regression, and many more. However, it is important to be wary of overfitting the model to the training dataset, which is a common issue in applications involving machine learning and neural networks.
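
A sketch of the same kind of analysis with scipy.stats.linregress, on hypothetical enzyme/haemoglobin measurements (the numbers are illustrative, not the data behind the figure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical measurements: enzyme concentration (x) vs. haemoglobin concentration (y)
enzyme = np.linspace(0.5, 5.0, 20)
haemoglobin = 0.18 * enzyme + 0.01 + rng.normal(0, 0.02, size=enzyme.size)

fit = stats.linregress(enzyme, haemoglobin)
print(f"[haemoglobin] = {fit.slope:.3f}[enzyme] + {fit.intercept:.3f}")
print(f"R = {fit.rvalue:.3f}, R^2 = {fit.rvalue**2:.3f}")

# Use the fitted equation as a predictor and compare against the observed values
predicted = fit.slope * enzyme + fit.intercept
```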

The Curious Case of Clustering and Classification

From identifying the metabolites in a solution to predicting hospital readmission, and from microarray and GeneChip experiments to protein interaction analysis, clustering and classification are two of the most common statistical tools used in biomedical studies.

Clustering is grouping together objects that share common features, and it aids classification. Classification differs from clustering in that the commonality is predefined: some prior knowledge about the objects is used as the basis for grouping them, whereas in clustering the commonality is discovered from the objects themselves. In machine learning terms, clustering is unsupervised while classification is supervised.

For clustering, we need:

  1. a measure of similarity and dissimilarity between objects
  2. a threshold to determine if two objects are similar, based on the measure
  3. a way of measuring the “distance” between the objects
  4. a cluster seed, where the cluster starts

For measuring similarity, we can use the Pearson correlation coefficient, the Euclidean distance, or simply the difference. There are two main approaches to clustering: K-means and hierarchical.

K-means partitions n objects into k clusters. The figure below illustrates the intuition with a simplified, one-pass version: choose an object as the centroid of the first cluster. For each subsequent object, calculate its similarity to all existing centroids (initially just the first object). If the best similarity meets the predetermined threshold, add the object to that centroid’s cluster and recalculate the centroid; otherwise, make the object a new centroid and start a new cluster. Repeat until all objects are assigned. The standard algorithm instead fixes k up front and iterates: assign every object to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving.

First we compare the distance between the two points and calculate the new centroid (light green dot). The first two red dots are clustered together as their distance meets a predetermined threshold. Then we calculate the distance between the new centroid and the third point at x=6. As they are close enough, they are clustered together and the new centroid calculated (orange dot). Similarly, the two points at the corner would form a second cluster as they are too far from the first one. Figure created by the author.
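
A minimal sketch with scikit-learn's standard KMeans, which fixes k up front and iterates nearest-centroid assignment, rather than the one-pass, threshold-based illustration above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical 2-D measurements forming two loose groups
points = np.vstack([
    rng.normal([2, 2], 0.5, size=(20, 2)),   # first group
    rng.normal([8, 7], 0.5, size=(20, 2)),   # second group
])

# Standard k-means: alternate nearest-centroid assignment and centroid recalculation
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)
```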

Hierarchical clustering is a bottom-up approach that repeatedly merges the most similar objects, and then the most similar clusters, until everything is joined into a single cluster. The result can be illustrated in a tree-like diagram called a dendrogram, often displayed alongside a heatmap. These are very useful for phylogenetic sequence analysis.

Heatmap from hierarchical clustering of genomic data analysis forming 4 distinct clusters (C1-C4). Ogden et al. (2020).
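
A sketch of agglomerative hierarchical clustering and a dendrogram with SciPy, on made-up expression profiles:

```python
import numpy as np
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical expression profiles for 12 samples across 6 genes
profiles = rng.normal(size=(12, 6))
profiles[:6] += 3   # shift the first six samples so they form their own cluster

# Agglomerative clustering: repeatedly merge the closest clusters (Ward linkage, Euclidean distance)
linkage_matrix = hierarchy.linkage(profiles, method="ward")

# The dendrogram shows the order and the distance at which samples/clusters were merged
hierarchy.dendrogram(linkage_matrix, labels=[f"S{i}" for i in range(12)])
plt.show()
```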

Most biomedical data tend to be multi-dimensional, i.e. we measure many variables together, for instance the concentrations of all essential amino acids, or mRNA and protein microarrays. Analyzing multivariate data requires somehow reducing these dimensions to one, two, or three so that they are easier to understand. It also allows classical tools like t-tests, p-values, or ANOVA to be applied to the reduced data.

One powerful tool for dimensionality reduction is Principal Component Analysis (PCA). Essentially, it combines variables that are highly correlated with each other into composite features, the principal components, that best explain the variation in the observed dataset. Hundreds of variables can thus be reduced to two or three key components. The components themselves tend to be uncorrelated with one another, as each describes a different aspect of the variation in the data.

Example of a 3D scatterplot after reduced dimensions by PCA, with 3 clusters. PC=Principal Component (1,2,3). Figure created by the author.

The original dataset is then re-expressed (projected) onto the principal components, with similarly varying data points ending up close together. Generally, if PCA fails to separate the dataset into clusters even to a small degree, it is wise to accept the possibility that there is not much distinguishing structure in the data in the first place.
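
A minimal PCA sketch with scikit-learn, reducing a made-up 50-variable dataset to three principal components (standardizing first, as is usual when variables are on different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical dataset: 100 samples x 50 correlated variables (e.g. metabolite concentrations)
latent = rng.normal(size=(100, 3))                 # 3 hidden sources of variation
loadings = rng.normal(size=(3, 50))
data = latent @ loadings + rng.normal(0, 0.2, size=(100, 50))

# Standardize, then project onto the first three principal components
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=3)
scores = pca.fit_transform(scaled)                 # shape (100, 3): coordinates on PC1-PC3

print("variance explained by PC1-PC3:", np.round(pca.explained_variance_ratio_, 3))
```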

While PCA is a great tool for dimensionality reduction, it should not be applied to categorical variables. Unlike numerical variables, artificially encoded categories do not vary continuously, so their contribution to the overall variance is understated. They appear to explain less of the variation in the dataset than the numerical variables do, and the resulting PCA can therefore be misleading.

Figure taken from Gromski et al. (2015).

However, if we have some prior knowledge about the data at hand and want to classify new samples based on this knowledge, a second tool called Partial Least Squares Discriminant Analysis (PLS-DA) can be used. It requires training on previously labeled data, after which it can predict the class of newly obtained samples. Indeed, it is a precursor to the now widely used machine learning algorithms for supervised classification. PLS-DA is a very common classification technique in metabolomics.
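
scikit-learn does not ship a dedicated PLS-DA class; a common workaround (assumed here) is to one-hot encode the class labels, fit a PLS regression, and assign each sample to the class with the largest predicted response. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)

# Hypothetical training data: 60 labeled samples x 20 measured variables, 2 classes
X = rng.normal(size=(60, 20))
labels = np.array([0] * 30 + [1] * 30)
X[labels == 1, :3] += 1.5                      # class 1 differs in the first three variables

# PLS-DA via PLS regression on one-hot encoded labels
Y = np.eye(2)[labels]                          # one-hot matrix, shape (60, 2)
pls = PLSRegression(n_components=2).fit(X, Y)

# Classify new samples by the column with the largest predicted response
X_new = rng.normal(size=(5, 20))
predicted_class = pls.predict(X_new).argmax(axis=1)
print("predicted classes:", predicted_class)
```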

From a fitted PLS-DA model we can also determine which variables contribute the most, using Variable Importance in Projection (VIP) scores, commonly shown as a VIP plot. The plot quantifies the importance of each variable; as a rule of thumb, a score above roughly 1 to 1.5 indicates an important variable. These scores can then be used for feature selection (a sketch of the computation follows the figure below).

VIP plot from genomic data analysis. Figure taken from Ogden et al. (2020).
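
There is no built-in VIP function in scikit-learn either; below is a sketch of the commonly used VIP formula, computed from a fitted PLS model's scores, weights, and y-loadings on the same kind of made-up data as above (treat the exact formula as an assumption to verify against your preferred reference):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)

# Hypothetical labeled data, as in the PLS-DA sketch above
X = rng.normal(size=(60, 20))
labels = np.array([0] * 30 + [1] * 30)
X[labels == 1, :3] += 1.5
Y = np.eye(2)[labels]

pls = PLSRegression(n_components=2).fit(X, Y)

# VIP: weight each variable's contribution by the Y-variance explained by each component
T = pls.transform(X)                                   # scores, shape (n_samples, n_components)
W = pls.x_weights_                                     # weights, shape (n_variables, n_components)
Q = pls.y_loadings_                                    # y-loadings, shape (n_targets, n_components)
ss_y = (T ** 2).sum(axis=0) * (Q ** 2).sum(axis=0)     # Y-variance explained per component
w_norm = W / np.linalg.norm(W, axis=0, keepdims=True)
vip = np.sqrt(W.shape[0] * (w_norm ** 2 @ ss_y) / ss_y.sum())

print("variables with VIP > 1:", np.where(vip > 1)[0])
```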

The Evaluation Techniques

To evaluate predictive models like PLS-DA, or indeed any other ML classification model that predicts categorical or binary classes, statisticians use the confusion matrix.

Example of a confusion matrix for a binary classification model. Figure created by the author.

The predictions are compared with the observed classes (the ground truth) and then sorted into four groups: true positives, true negatives, false positives (Type I errors), and false negatives (Type II errors). From these counts, the model’s sensitivity and specificity are calculated.

Sensitivity (Sn) = TP rate = TP/(TP+FN) = the fraction of observed positives that were correctly predicted positive

Specificity (Sp) = TN rate = TN/(TN+FP) = the fraction of observed negatives that were correctly predicted negative
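
A sketch of these calculations with scikit-learn, on made-up binary predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions (1 = disease present, 0 = absent)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")
```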

This sort of assessment is also used in evaluating sequence predictions. For instance, if a model predicts the coding and non-coding regions of a length of gene, we can depict them like this:

Figure created by the author.

Oftentimes a model will need to prioritize one over the other. Liberal models (ones that make riskier predictions) tend to be more sensitive but less specific, while conservative models (ones that make fewer risky predictions) tend to be more specific but less sensitive. Whether a model is liberal or conservative mainly depends on the predefined threshold it uses for classification.

If a model estimates a 75% chance of a patient having cardiac disease, a liberal model is more likely to make a positive prediction, as its predefined classification threshold is likely to be lower than that of a conservative model. If a model estimates a 75% chance of a patient requiring a surgical procedure, a conservative model is more likely to make a negative prediction, as its threshold is likely to be higher than that of a liberal model. Models used for screening patients tend to be more liberal, as their suggestions have less serious consequences than those of models used to support surgical decisions. The performance of a model in terms of its sensitivity and specificity is well illustrated by the Receiver Operating Characteristic (ROC) curve.

Example of a binary ROC curve. Figure created by the author.

The red line shows the different combinations of True and False positive rates as the predefined threshold is changed. The black circle is where the model is balanced. The black diagonal shows the rates if we were to randomly guess the binary classes.

Another way to compare the ROC curves of different predictive models is the area under the curve (AUC). Random guessing corresponds to an AUC of 0.5, effectively the worst useful predictor; a perfect predictor has an AUC of 1, and moderate predictors fall somewhere in between.

Perfect AUC of 1 (in red) and AUC of 0.5 (in black). Figure created by the author.
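
A sketch of an ROC curve and its AUC with scikit-learn, on made-up predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.15, 0.20, 0.35, 0.40, 0.45, 0.55, 0.60, 0.65, 0.75, 0.85, 0.95])

# Each point on the ROC curve corresponds to one choice of classification threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", np.round(fpr, 2))
print("TPR:", np.round(tpr, 2))
print(f"AUC = {auc:.2f}")
```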

Another method to assess the performance of classification models like PLS-DA is permutation testing. Similar to what we did before with univariate data, we first train our PLS-DA model on the correctly labeled data and determine the separation coefficient, basically a measure of how well the data can be separated along its latent components. Then we shuffle the labels, rerun the PLS-DA model, determine its separation coefficient, and repeat a few hundred times. The shuffled coefficients should be roughly normally distributed, and if the original coefficient reflects some underlying factor, it should be markedly larger than the rest of the distribution, indicating that it was not a one-off random event.
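
scikit-learn automates exactly this label-shuffling idea with permutation_test_score; a sketch on made-up data, with a plain logistic regression standing in for the PLS-DA model (an assumption made purely for simplicity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(6)

# Hypothetical labeled data with a modest real class difference
X = rng.normal(size=(80, 10))
y = np.array([0] * 40 + [1] * 40)
X[y == 1, :2] += 1.0

# Fit on true labels, then refit on shuffled labels many times; the p-value reflects how
# often a shuffled run scores at least as well as the true labels
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y, n_permutations=200, cv=5, random_state=0
)
print(f"accuracy on true labels = {score:.2f}, permutation p-value = {p_value:.3f}")
```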

Supervised classification methods are powerful and can often appear to separate objects into groups even when there is no real connection between them. It is important not to skip t-tests, regressions, PCA, and other unsupervised methods, as these do not carry the prior bias present in supervised methods. Statistics is formalized intuition; it is important to be able to trust your intuition before blindly following some formula.

This was a very non-mathematical overview of the everyday statistics used in biomedical research. The maths behind each of these tools varies across a wide spectrum, from finding the mean of a set of data points to performing PLS-DA for descriptive, predictive, and discriminative feature-selection purposes. A programmer must know the data structures of a library before using it; understanding its implementation is not necessary. A doctor must know the pharmacokinetics of a vaccine before prescribing it; understanding the specific molecular mechanisms of a vector biomolecule used in the vaccine is not necessary. Similarly, it is not essential to understand the maths, though it obviously helps. What is essential is understanding each method’s underlying assumptions, limitations, and behavior.

If you are interested in learning more about these statistical tools with some mathematical explanations, then this YouTube playlist by Josh Starmer is a great resource.

References:

  1. Ogden, A. J., Wietsma, T. W., Winkler, T., Farris, Y., Myers, G. L., & Ahkami, A. H. (2020). Dynamics of Global Gene Expression and Regulatory Elements in Growing Brachypodium Root System. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-63224-z
  2. Gromski, P. S., Muhamadali, H., Ellis, D. I., Xu, Y., Correa, E., Turner, M. L., & Goodacre, R. (2015). A tutorial review: Metabolomics and partial least squares-discriminant analysis — a marriage of convenience or a shotgun wedding. Analytica Chimica Acta, 879, 10–23. https://doi.org/10.1016/j.aca.2015.02.012
  3. David S. Wishart, An Introduction to Statistics for Metabolomics, The Metabolomics Innovation Center (TMIC), https://www.youtube.com/channel/UC4CHrL7t9brRTIomi9rDLHA

P.S. If you want more short, to-the-point articles on Data Science, programming, and how a biologist navigates his way through the data revolution, consider following my blog.

With thousands of videos being uploaded every minute, it’s important to filter them so that you consume only good-quality content. I will email you hand-picked educational videos on the topics you are interested in. Sign up here.

Thank you!
