In single-cell RNA sequencing data (scRNA-seq), which records the number of mRNA molecules across the genome in individual cells, we frequently encounter low or zero counts of these molecules. This is due to the sequencing machines’ low detection efficiency of more sparsely-expressed genes, resulting in them being falsely labeled as zeros. This adds a lot of noise to scRNA-seq data, which can make downstream analyses such as dimensionality reduction, differential gene expression, and trajectory inference potentially dubious. Specifically, this noise can obscure known biological relations by genes of interest.
Wouldn’t it be great to de-noise that data and bypass that hurdle? This is where Imputation comes in. Imputation is a way to fill in or impute missing data (i.e., zero counts) by making educated guesses at what the true values of these missing data entries should be. De-noising is one type of imputation paradigm, in which you estimate the noise and remove it accordingly. One simple, yet effective approach for achieving that is an algorithm from the Krishnaswamy lab (the same lab that developed PHATE, a dimensionality reduction algorithm with competitive performance to the more commonly-used UMAP) called MAGIC. MAGIC stands for Markov Affinity-based Graph Imputation of Cells. This uses data diffusion on a graphical representation of single-cell data to remove noise.
If you’ve read my PHATE article, the steps below will be pretty familiar, but simply put, the algorithm starts with a matrix of single-cell data (but you could easily substitute in your own dataset of this matrix format) and calculates the Euclidean distances between cells, creating a cell-to-cell distance matrix (depending on the dimensionality of the data, we typically determine these distances in PCA space with about 10–100 Principal Components). This matrix is then transformed from distances to affinities. This measures similarity and is inversely proportional to distance such as that the closer two cells are in Euclidean space (i.e., the smaller their distance), the greater their affinity. This calculation is achieved using an exponential kernel:

where x and y each correspond to the coordinates of cells in PCA space, d(x,y) is the Euclidean distance between x and y, and σ is a bandwidth parameter and measures the "spread" or radius of neighborhoods captured by this kernel. This is key to capturing the local variation within a neighborhood of cells. The choice of σ varies with each set of cells to correct for the differences in densities across the single-cell graph. For instance, some cells may occupy sparse regions with few neighboring cells, while others are packed with a bunch of cells. Thus, we want a "wider" bandwidth for sparse areas, and a shorter one for dense areas to correct for this neighborhood variation. Therefore, we set σ to be the distance between x and its _k_th nearest neighbor:

where xₖ is the _k_th nearest neighbor of cell x. The authors of MAGIC recommend that this k be small to ensure connectivity across the graph. However, because σ is cell-specific, our affinity matrix is no longer symmetric (i.e., there may exist a pair x,y such that A(x,y) ≠ A(y,x)). This can easily be corrected by adding the transpose of our affinity matrix to the original matrix A.

After we’ve calculated our affinity matrix, we convert these values to probabilities by row-normalization, obtaining a Markov transition probability from cell x to cell y in a single step for all cells.

Because the drop-out effect presents a form a technical variation that’s often indistinguishable from biological variation, randomly sampling features (in this case, mRNA molecular counts for each gene) can result in dubious variation between cells that are known a priori to be identical (e.g., stem cells or red blood cells). However, thinking about this data as a graph, where nodes are cells, and edges represent connections between similar cells, you could suppose that similar cells share many neighbors, while dissimilar cells have fewer mutual neighbors. Is there a way we could leverage this information such that cells with sparse counts could learn from their less noisy friends? This is where data diffusion comes in by powering our Markov probability matrix to an exponent t. This can thought of intuitively as the probability of transitioning from cell x to cell y within t steps of a random walk. This t is known as the diffusion time. Thus, the cells have more neighbors to traverse to and learn from in determining its gene expression profile. This powering of the matrix thus increases the affinities of similar cells, while ironing out noisier connections. In the context of graph signaling, you can think of this as a low-pass filter where as we increase our t, we are not only strengthening the connections between related cells through the affinities, but also removing noise by down-weighting false connections that emerged out of the sparsity.
This powered matrix is then directly multiplied by the original matrix of scRNA-seq data to obtain an imputed matrix with the noise filtered.

Where Mᵗ is the probability matrix powered by the diffusion time, t, X is original single-cell matrix of m cells and n genes, and X_impute is the imputed gene expression matrix.
To recap, we first calculate a cell-to-cell distance matrix from the preprocessed original data, convert the distances to affinities, normalize the affinities to probabilities, and then power the resulting probability matrix and multiply it by the original data matrix. Pretty straightforward, right? One of the things I really appreciate about this algorithm is how simple and elegant it is from a mathematical perspective. But enough math, let’s see this algorithm in action!
To test out MAGIC, we will install the Python version using pip:
pip install --user magic-impute
We will also install some single-cell analysis packages for preprocessing the data, as well as PHATE for visualization:
conda install -c bioconda scvelo
conda install -c conda-forge scanpy
pip install scprep phate
We will use a scRNA-seq dataset of neuron differentiation to illustrate MAGIC. You can read my RNA velocity article for more details on the biology, if you’re interested.


To walk you through these figures, on the x-axes is the mRNA abundance of a a gene expressed in neurons called VSNL1; the y-axes are the mRNA abundance of a gene called RUNX1T1, which emerges in a precursor cell type. Each dot is an individual cell, colored by the expression of another neuron marker called RBFOX1. We would expect a clear trend of decreasing expression of RUNX1T1 as cells turn into neurons along the x-axis, corroborated by expression of RBFOX1. The left panel is too noisy to appreciate this trend, but the right panel with MAGIC imputation successfully recapitulates this biological phenomenon.
We can also visualize both the original and imputed data with PHATE as follows:

Here, cells are colored according to RBFOX1 mRNA expression, and we can see a clear gradient on the right-hand figure as the cells turn into neurons.
Conclusion
In this article, we learned about MAGIC, a simple yet effective way to de-noise scRNA-seq data. While designed with biological data in mind, this method could be applied to other types of data, where missing values and noise may also be common. It should be important to note that care should be taken when choosing the exponent t by which to power the Markov matrix. Too high of an exponent can lead to over-smoothing, which could iron out existing biological variation, resulting in false conclusions from downstream analyses. Along this end, common critiques of it highlight how it doesn’t discriminate between true biological zeros and false positives, instead transforming the entire matrix rather than just the missing values. Nevertheless, one of the things I appreciate about this algorithm is how simple and elegant it is from a mathematical perspective. The area of imputation in single-cell biology is a very popular sub-discipline that has spawned an explosion of algorithms (select examples listed in the references below) in the past few years, with MAGIC as one of the pioneering methods that’s still commonly employed in the field.
Thank you for reading this article. If you are not already a Medium member, please consider signing up here for full access to all that this platform has to offer:
References:
[1] D. van Dijk, R. Sharma, J. Nainys, K. Yim, P. Kathail, A. J. Carr, C. Burdziak, K. R. Moon, C. L. Chaffer, D. Pattabiraman, B. Bierie, L. Mazutis, G. Wolf, S. Krishnaswamy, D. Pe’er, Recovering Gene Interactions from Single-Cell Data Using Data Diffusion (2018), Cell [2] K. Moon, D. van Dijk, Z. Wang, S. Gigante, D. Burkhart, W. Chen, K. Yim, A. van den Elzen, M.J. Hirn, R.R. Coifman, N.B. Ivanova, G. Wolf, and S. Krishnaswamy, Visualizing structure and transitions in high-dimensional biological data (2019), Nature Biotechnology [3] M. Huang, J. Wang, E. Torre, H. Dueck, S. Shaffer, R. Bonasio, J. I. Murray, A. Raj, M. Li and N. R. Zhang, SAVER: Gene expression recovery for single-cell RNA sequencing (2018), Nature Methods [4] W. V. Li & J. J. Li, An accurate and robust imputation method scImpute for single-cell RNA-seq data (2018), Nature Communications [5] G. La Manno, R. Soldatov, A. Zeisel, E. Braun, H. Hochgerner, V. Petukhov, K. Lidschreiber, M. E. Kastriti, P. Lönnerberg, A. Furlan, J. Fan, L. E. Borm, Z. Liu, D. van Bruggen, J. Guo, X. He, R. Barker, E. Sundström, G. Castelo-Branco, P. Cramer, I. Adameyko, S. Linnarsson, and P. V. Kharchenko, RNA velocity of single cells, (2018), Nature