The world’s leading publication for data science, AI, and ML professionals.

RNA Velocity: The Cell’s Internal Compass

Finding Direction in Single-cell RNA-sequencing Data

Thoughts and Theory

Photo by AbsolutVision on Unsplash
Photo by AbsolutVision on Unsplash

Single-cell RNA sequencing (scRNA-seq) has revolutionized the way we study cell biology the past decade, ushering in the rapidly growing field of single-cell Genomics. scRNA-seq allows us to profile the transcriptome of individual cells, gaining holistic insight into their function and identity. The transcriptome is the set of all messenger ribonucleic acid (mRNA) molecules of a cell. You’ve likely heard of mRNA frequently over this past year with many of the COVID-19 vaccines relying on this molecule. mRNA represents the readout, or transcript, of a gene. These molecules carry information about the gene of interest that the cell’s machinery uses to build a protein. In the mRNA-based COVID-19 vaccines, they carry information on how to make a benign protein tailored to the SARS-CoV2 virus, effectively teaching our body how to fight off infection. The production of these mRNA molecules is called transcription, and these molecules are sometimes called transcripts accordingly. You’ll often hear biologists refer to mRNA abundance more broadly as gene expression. You can think of it like building IKEA furniture: the instructions are the transcribed mRNA molecule, and the corresponding protein is the furniture. Before the mRNA is translated, however, it has to undergo an additional step. There are segments of mRNA molecules that don’t code for proteins called introns. These are removed prior to protein translation through a process called splicing. The remaining segments are called exons. This will be important to know shortly.

Image by Author
Image by Author

You can think of the transcriptome as the cell’s fingerprint that identifies it and distinguishes it from other cells (e.g., a red blood cell versus a neuron). By sequencing hundreds of thousands of cells for their transcriptome, we can build a holistic, high-dimensional map of the cellular states that stem cells traverse as they turn into more specialized cells like neurons, characterize cancer tumors, and distinguish between healthy and diseased tissue from patients. These sequencing protocols consequently produce gigabytes of data to be analyzed and made sense of, which can be complicated by having so many dimensions in our dataset, each corresponding to a gene (see my first article on one way to resolve this conundrum). Seems really cool, right? However, there is one key drawback to this approach. These sequencing experiments kill cells in the process, preventing us from re-sequencing these cells again at later timepoints. Hence, we are left with a static snapshot of a cell’s mRNA abundance, which can make inferring its future gene expression challenging. How can we resolve this?

This is where RNA velocity comes in. RNA velocity is a simple, yet powerful method that allows us to predict the future gene expression of a cell. Recall that I mentioned mRNA can be distinguished by spliced and unspliced transcripts. While spliced transcripts are the main readout of most scRNA-seq protocols, unspliced expression is also measured. Using this supplemental information, we can build a simple mathematical model to predict future spliced expression.

Image by Author
Image by Author

where u is the number of unspliced mRNA molecules, s is the number of spliced mRNA molecules, α is the rate of transcription, β is the rate of splicing from unspliced to spliced, and γ is the rate of degradation of the spliced mRNA product.

Taking the difference between the future mRNA expression and the current mRNA of a particular gene for a given cell, we can derive a metric of the changing gene expression. These can be aggregated for all genes of a given cell to create a vector of the cell’s future transcriptome, representing the speed and direction of said cellular change – hence the term RNA velocity. A common way to visualize this is by superimposing a vector field onto cells in an embedding as shown below:

RNA Velocity of cells undergoing neurogenesis in PCA embedding from La Manno et al, Nature 560: 494–498 (2018)
RNA Velocity of cells undergoing neurogenesis in PCA embedding from La Manno et al, Nature 560: 494–498 (2018)

This is a PCA plot of cells undergoing neurogenesis – the production of neurons (red) in the brain from stem cell-like cells called radial glia (blue). Each dot represents a cell with an RNA velocity vector superimposed on it, predicting cell transitions. Notice how these arrows get longer in the intermediate stages (neuroblasts and immature neurons) – this is indicative of dynamic changes in gene expression, as genes for neurons are turned on, while those for the radial glia cells are turned off. As these cells successfully turn into neurons, the vectors rapidly shorten, reflective of a "deceleration" of the gene expression of these cells. You can think of these vectors as akin to a cell’s compass: an internal GPS that pinpoints which direction the cell needs to go if it wants to become a neuron.

There are two main models of RNA velocity: the steady-state model and the dynamical model. The steady-state model is the original model published in Nature, whose implementation is used in the above figure. This model assumes a universal, transcriptome-wide splicing rate and that gene expression follows a steady-state: i.e., velocity is estimated as a deviation from the γ-fitted ratio of spliced versus unspliced molecules when ds/dt=0 given by u = γs. I have an example of this below:

Image by Author
Image by Author

This figure plots individual cells from the previously shown neurogenesis PCA according to their spliced vs. unspliced molecule counts transcribed from the gene SOX2, which is a marker of radial glia cells (colored in blue). The solid line is a fit to u = γs derived from a generalized linear regression model (see my last article for an introduction to this statistical model). Cells above this line (u > γs)are considered to have positive velocity. This means the production of unspliced SOX2 mRNA molecules exceeds its degradation of its spliced counterpart: we have a net production of SOX2 mRNA. By contrast, cells below the line (u < γs) have negative velocity: more SOX2 is being degraded than produced, resulting in a net loss. Cells directly on the line have a zero velocity, as they don’t deviate from this γ-fitted ratio of unspliced versus spliced mRNA molecules.

The dynamical model, by contrast, directly solves for the full transcriptional kinetics for each gene, rather than making transcriptome-wide assumptions. Instead of trying to fit the data to a regression model, it estimates the parameters using the Expectation-Maximization algorithm (E-M), which uses maximum likelihood to iteratively approximate α, β, and γ, and learn the spliced/unspliced trajectory for a given gene. This assigns a likelihood as follows to each gene:

Where

which represent the observed unspliced and spliced mRNA molecules, respectively, for a particular gene in cell i, while xₜᵢ represents the unspliced/spliced molecules at time tᵢ based on the inferred parameter set θ = (α, β, γ). Genes with a high likelihood are considered to be major contributors to dynamic changes in the biological phenomenon of interest – in our case, neurogenesis. But enough math, let’s see this in action!

To try out RNA velocity for yourself, install the package scvelo using the following with either pip:

pip install -U scvelo

If you prefer conda, you can install via the bioconda channel as follows:

conda install -c bioconda scvelo

You will also need scanpy, which you can install from conda-forge. This is a popular library for single-cell analysis in Python.

conda install -c conda-forge scanpy

We will use the human neurogenesis dataset described above as an example, where I’ll illustrate and compare the steady-state and dynamical models. This dataset has 1720 cells representing four main cell types in neurogenesis: radial glia, neuroblasts, immature neurons, and neurons.

First we will load the dataset from the internet:

Next, let’s import our packages

We will read our dataset into an Annotated Data (anndata) object. This a popular way to store single-cell data for analysis in Python:

Anndata overview, from Wolf et al, SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018) and https://anndata.readthedocs.io/en/latest/
Anndata overview, from Wolf et al, SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018) and https://anndata.readthedocs.io/en/latest/

Now we will calculate the RNA velocity. scvelo uses two functions for this: scv.tl.velocity() which computes the velocities for each gene, and scv.tl.velocity_graph() which outputs a velocity graph based on cosine similarities between velocities and potential cell state transitions. In other words, it measures how well a change in gene expression of a cell matches its predicted velocity vector, which can be used to derive transition probabilities. We’ll also change the numerical cluster labels in the data to the defined cell types in the paper.

This gives us the following PCA plot, similar to the figure from the paper:

Image by Author
Image by Author

As you can see, each cell has an arrow superimposed on it that predicts its future transcriptome. These arrows in concert give us a sense of global directionality, showing a linear progression from radial glia cells to neurons. You may have noticed, however, that some of the neuroblast cells in green appear to be reverting back to radial glia. There’s similar phenomenon in some of the neurons, albeit to a lesser extent. These backflows can be attributed to two main reasons: 1) as mentioned previously, the steady-state assumption of the original RNA velocity model does not accurately reflect more transient cell populations such as the neuroblasts; 2) scRNA-seq is very sparse. Genes are prone to evading detection by the sequencing platform, resulting in "drop-out", where these mRNA measurements are erroneously recorded as zeros. One way to circumvent this is using imputation, where we computationally infer the missing values. A similar approach is by data smoothing to remove the noise generated by the inherent sparsity, which the authors of the original paper did via k-nearest neighbors pooling. There are drawbacks to each of these methods, as well as debate as to whether they should be used at all for scRNA-seq analysis as the results can be misleading. For the purposes of this article, I will continue present the dataset in its raw form.

We can further explore the gene-specific velocities of cells by generating a phase portrait:

On the far left is a spliced v. unspliced plot of SOX2 expression, a marker of radial glial cells. Each point is a cell colored by its cell type (blue for radial glia, green for neuroblast, dark gold for immature neuron, and red for neuron). As mentioned earlier, the deterministic model solves for future expression by assuming a steady-state model. The line represents the regression fit of the steady-state solution of ds/dt, represented by u = γs. Cells above this line are assumed to have positive velocity: SOX2 is being up-regulated with a net production produced in these cells more than its being degraded. We see that a few radial glia appear to display this phenomenon as shown by the the expression map on the right, where cells in the above PCA plot are colored according to their levels of SOX2 expression. However, most are shutting down SOX2 activity (indicated by negative velocities in the middle plot) as they begin to differentiate, as evidenced by the zero velocity of SOX2 throughout the other cell types: they are no longer transcribing nor degrading the mRNA molecule for it, which explains the minimal levels of expression in those cell types.

As mentioned, while powerful, the steady-state model is limited in that it assumes steady-state expression levels and a constant splicing rate. In more complicated systems, with multiple end states (i.e., different cells a stem cell can terminally differentiate into) and heterogeneous subpopulations, this assumption can fail and yield erroneous interpretations. This is where the aforementioned dynamical model comes in. We can implement the dynamical model using the following code:

Here, we observe greater consensus in direction among neighboring cells (albeit still a bifurcation in direction among the neuroblasts, which can be resolved by data smoothing), with a clear dominant trend towards neurons. Using the likelihood estimates for each gene, we can pull out the genes that display the most pronounced dynamic behavior:

NPAS3        0.994760
CREB5        0.896914
FOS          0.882965
SLC1A3       0.806150
SYT1         0.749349
MYT1L        0.737498
MAPT         0.687999
GLI3         0.663777
DOK5         0.656130
LINC01158    0.651856
RBFOX1       0.608533
TCF4         0.592459
FRMD4B       0.580061
HMGN3        0.577301
ZBTB20       0.575223
Name: fit_likelihood, dtype: float64

This plot shows the unspliced/spliced trajectory of the gene RBFOX1, with points colored according to cell type (blue for radial glia, green for neuroblast, dark gold for immature neuron, and red for neuron), with a fitted curve of the learned dynamics in purple from the E-M algorithm. You can see how the solid curve gives a better fit than the dashed steady-state model, with a rapid increase in the transcription of unspliced mRNA relative to spliced mRNA as the cells go from radial glia to neurons, indicative of this gene being activated in neurons, while inactive in radial glia. We can observe similar dynamics of several genes being up-regulated and down-regulated at different parts of neurogenesis as shown below:

Final Thoughts

A key drawback of scRNA-seq is that it only produces a snapshot of a cell’s transcriptome; you cannot revisit your sample for successive sequencing. RNA velocity is an effective way to circumvent this limitation by leveraging RNA biology to infer future gene expression, drawing out the cell’s innate transcriptional compass and lending a sense of directionality to your data. This can be used to illustrate different pathways a cell may take to reach a given endpoint, how rapidly gene expression is changing, build a holistic trajectory of how stem cells become more specialized, the list goes on. However, it is not without its own limitations. As mentioned earlier, scRNA-seq is often sparse and noisy. Velocities can subsequently be unintuitive or simply incorrect in the lens of ground truth for well-described biological processes. While methods like imputation and data smoothing can help correct these issues, they should be used with care as they are prone to smoothing out actual biological noise that can lend valuable insights, such as potential transitions to cells that don’t exist in our dataset. All that said, it is a powerful method that has provided a lot of clarity to making sense of high-dimensional biological data, and one that will continue to be further refined as the field of data science continues to grow. One of the goals of single-cell genomics this decade is effective integration of high-dimensional datasets measuring different properties of the cell (e.g., RNA, protein, epigenetics, etc). Methods like RNA velocity can be extended to learn the relationship between these different modalities to make powerful inferences as to cell-to-cell transitions, and hence, fine-tune the cell’s compass.

References:

[1] G. La Manno, R. Soldatov, A. Zeisel, E. Braun, H. Hochgerner, V. Petukhov, K. Lidschreiber, M. E. Kastriti, P. Lönnerberg, A. Furlan, J. Fan, L. E. Borm, Z. Liu, D. van Bruggen, J. Guo, X. He, R. Barker, E. Sundström, G. Castelo-Branco, P. Cramer, I. Adameyko, S. Linnarsson, and P. V. Kharchenko, RNA velocity of single cells, (2018), Nature

[2] V. Bergen, M. Lange, S. Peidli, F. Alexander Wolf & F. J. Theis, Generalizing RNA velocity to transient cell states through dynamical modeling, (2020), Nature Biotechnology

[3] V. Bergen, R. Soldatov, P. V. Kharchenko, F. J. Theis, RNA velocity – current challenges and future perspectives, (2021), Molecular Systems Biology


Related Articles