Deep Learning for Single Cell Biology

Image source
Resolving cellular architectures by Deep Learning

This is the second post in the series Deep Learning for Life Sciences. In the previous one, I showed how to use Deep Learning on Ancient DNA. Today it is time to talk about how Deep Learning can help Cell Biology to capture diversity and complexity of cell populations.

Single Cell RNA sequencing (scRNAseq) revolutionized Life Sciences a few years ago by bringing an unprecedented resolution to study heterogeneity in cell populations. The impact was so dramatic that Science magazine announced scRNAseq technology as the Breakthrough of the Year 2018. The major advance was a realization that despite biological cells might seem morphologically alike in the microscope, they can be very different in sense of genes they express, which in turn leads to functional discrepancies between the cells. To capture this cellular diversity, Human Cell Atlas community declared an ambitious goal to build a comprehensive map of the trillions of cells present in the human body.

With the development of 10X Genomics single cell gene expression platform, it is becoming almost a routine to obtain whole transcriptome information from hundreds of thousands and even millions of individual cells. Therefore scRNAseq (together with Biomedical Imaging and Genomics) represents currently a truly Big Data which has a superior statistical power and opens new horizons for applying Machine and Deep Learning for Single Cell data analysis.

Here I will give a brief overview of the filed, formulate major analytical challenges and show how to use Deep Learning with Keras and TensorFlow for unsupervised learning problems in single cell RNA sequencing data analysis.

Why Single Cell Biology is ideal for Deep Learning?

Performing a statistical analysis on some data we typically have to understand the balance between a) number of features p (genes, proteins, genetic variants, pixel of an image etc.), and b) number of observations n (samples, cells, sequences etc.). Human genome has approximately p=20K protein coding genes while recently published 10X scRNAseq data sets include n~1.3M and n~2M individual cells. This implies that n >> p for scRNAseq which is a typical Deep Learning limit, i.e. working in this limit we can go far beyond Linear Algebra based analysis and capture the highly non-linear structure in the scRNAseq data.

Recent studies report unprecedented scRNAseq sample sizes which is ideal setup for Deep Learning

For other limits, i.e. when n << p or n ~ p, Bayesian and Frequentist statistical frameworks, respectively, are more appropriate. In other words, working with scRNAseq data we have a luxury problem: while overfitting and lack of generalizability are common concerns in Computational Biology, here for scRNAseq we have to watch out for underfitting, i.e. how to use the most of the data.

Not only scRNAseq is currently booming in Life Sciences but also technologies delivering other types of information (OMICs in bioinformatics jargon) on a single cell level becoming more and more common. Recent advances in studying chromatin accessibility regions (scATACseq) result in data sets with >100K single cells. While scATACseq alone presumably can not guarantee a novel way of discovering rare cell populations (a primary goal of Single Cell data analysis), it provides tremendous potential for integration with scRNAseq and thus improvement of accuracy of assigning cells to a certain population.

Chromatin accessibility regions (ATACseq) is complementary for scRNAseq analysis

Finally, single cell multi-OMICs technologies (CITE-seq, scNMTseq etc.), i.e. multiple sources of information from the same biological cells, do not yet reach huge sample sizes typical for scRNAseq but are very promising for future Data Integration challenges with Deep Learning.

Reducing dimensions with Deep Learning

Since the primary goal of scRNAseq analysis is to discover novel cell populations, it is an unsupervised analysis in Machine Learning terminology. Hence, two most important analytical techniques used for scRNAseq are dimensionality reduction and clustering. Autoencoder is an unsupervised Artificial Neural Network (ANN) with the funny looking “butterfly” architecture which is often used for dimensionality reduction. In contrast to linear techniques such as Principal Component Analysis (PCA), Multi-Dimensional Scaling (MDS), Factor Analysis (FA) etc., Autoencoders perform non-linear dimensionality reduction and hence can capture highly non-linear structure of Single Cell data.

Autoencoder is an Artificial Neural Network (ANN) with the “butterfly” architecture

Here I will demonstrate the difference in Single Cell resolution between linear (PCA) and non-linear (Autoencoder) dimensionality reduction techniques using the ~8K cord blood mononuclear cells (CBMCs) CITEseq scRNAseq data set as an example. Please note that the code below assumes the input file in tabular format with genes as columns and cells as rows, the last column of the file must be cell annotation obtained using your favorite scRNAseq clustering technique. I recommend using graph-based clustering with Louvaine community detection, which is robust for high-dimensional data, implemented in the popular Seurat scRNAseq workflow.

For comparison, I am also going to add here the t-distributed Stochastic Neighbor Embedding (tSNE) plot, which is a current golden standard non-linear dimensionality reduction technique in scRNAseq area. One problem with tSNE is that it can not handle high dimensional data such as scRNAseq. Therefore, common practice is to perform PCA (linear!) as a pre-dimensionality reduction and feed the output into the tSNE. However, we can do it even better by performing the pre-dimensionality reduction step in a non-linear way with Autoencoders. Let us display the tSNE plots for both strategies:

If you are not used to looking at this kind of plots, here one point is one cell, the colors correspond to different cell types. You are supposed to observe 12 cell populations but you basically can not see them (cells heavily overlap) from the PCA plot because linear dimensionality reduction can not resolve the single cell structure. Autoencoder picture looks better, the different cell populations are clearly detectable. tSNE generally provides even sharper cell blobs. However, for this particular case, tSNE on Autoencoder seems to provide more dense and transparent clusters especially for the purple cell population which was smeared throughout the blue cluster in the tSNE on PCA plot. Thus, Deep Learning is promising for improving resolution of detecting novel cell populations.

Looking for scalable dimensionality reduction

In addition to difficulty handling high-dimensional data, tSNE scales poorly when numbers of cells reach hundreds of thousands and millions. FItSNE is a new promising modification of Barnes-Hut tSNE which seems to scale better for large amounts of data. However, running FItSNE on 1.3M mouse brain cells, I had troubles fitting it into the memory. Particularly, I wanted to obtain tSNE plots for high perplexities in order to check the global structure of the data. However, perplexity = 350 was the maximum perplexity which I was able to reach using a node with 256GB RAM on a computer cluster. To run FItSNE is very simple and similar to the way they perform tSNE in R (again the Cluster column contains cell annotations):

Uniform Manifold Approximation and Projection (UMAP) is another very interesting non-linear dimensionality reduction technique which currently seems to outperform tSNE in many aspects. It is faster than tSNE, equally fast as FItSNE but does not require as much memory, and it seems to capture both local and global structure of scRNAseq data. Perhaps the only drawback I see is a bit nontransparent mathematics behind UMAP which I am currently going through.

Recently a few interesting methods based on Variational Autoencoders have been released. One of them is SCVIS which is a neural network that captures and visualizes the low-dimensional structures in single-cell gene expression data, and in addition preserves both the local and global neighbor structures. To run SCVIS on 1.3M mouse brain cells I used the following command line:

Below I provide comparison of the 4 mentioned dimensionality reduction techniques, i.e. PCA, tSNE / FItSNE, UMAP and SCVIS using the 1.3M mouse brain cells from 10X Genomics:

Comparison of dimensionality reduction techniques on 1.3M 10X mouse brain data set

As for the case of CITEseq above, here we can see that in contrast to PCA the non-linear dimensionality reduction techniques (SCVIS, UMAP and tSNE / FItSNE) are capable of resolving all cell populations in the scRNAseq data. Out of the three techniques, UMAP was the fastest and provided fairly good low-dimensional representation of the data. To be more precise about computational times, SCVIS took ~6h, FItSNE took 3h and lots of memory, UMAP took ~3h on my laptop.

Taking into account the growing amounts of scRNAseq data, I would predict UMAP and Autoencoders to replace tSNE in the future.

Deep Autoencoder for scRNAseq with Keras

Finally, here I will demonstrate how to implement from scratch and run a Deep Autoencoder using Keras. As you will see, it is not that hard. In order to avoid difficulties with loading the whole data set into memory, I selected top 19 Principal Components which I found to be significant via resampling, i.e. shuffling the gene expression matrix and checking percentage of variance explained by the permuted matrix (null hypothesis). The Autoencoder progressively reduced dimensionas from 19 to 2 (bottleneck of the Autoencoder), each hidden layer was reducing one dimension.

Deep Autoencoder for dimensionality reduction of 1.3M 10X mouse brain cell, Keras implementation

We can see that the cell populations are quite distinguishable despite I did not try specifically to find an optimal configuration of the neural network that could potentially have provided a better resolution. A huge advantage of the Deep Autoencoder is that it seems to be very fast and hence scalable for large amounts of scRNAseq data, it took only a couple of minutes on my laptop for the model to converge and to obtain the plot above.

Deep Autoencoder for scRNAseq with TensorFlow

Keras is great, fast and easy but sometimes I get an impression that I control neural networks much better using TensorFlow. For example, with Keras one can never get a feeling of “manually” connecting the nodes, this is the depth in understanding which I appreciate to achieve with TensorFlow. A minor drawback of the latter is that such a useful trick as mini-batch learning has to be coded by hand in TensorFlow and is in contrast automatically included in Keras. Anyhow, here is the code of TensorFlow Deep Autoencoder and the dimensionality reduction plot:

Deep Autoencoder for dimensionality reduction of 1.3M 10X mouse brain cells, TensorFlow implementation

Again, the low-dimensional representation looks promising. With some tweaking of the configuration and other hyperparameters one could potentially obtain a much better resolution. Again, as for Keras, this TensorFlow Autoencoder implementation is super fast, just a couple of minutes compared to hours used by FItSNE, UMAP and SCVIS to produce this kind of dimensionality reduction plot. If resampling procedure is needed for some reason for your analysis, it would not be very feasible with FItSNE, UMAP and SCVIS but should be pretty straightforward to perform with Deep Learning.

Summary

Here we have learnt that Single Cell RNA sequencing (scRNAseq) is rapidly improving our understanding of functional diversity among biological cells. Dimensionality reduction is perhaps the most important analytical tool behind a typical scRNAseq analysis. The golden standard dimensionality reduction technique tSNE is currently experiencing difficulties with scalability due to large amounts of data delivered by scRNAseq technologies. Deep Learning via Autoencoder and UMAP provide currently most flexible and scalable ways to obtain low-dimensional representations of the scRNAseq data and will most likely replace tSNE in the future. Finally, we have learnt how to implement Deep Autoencoders for the gigantic 10X Genomics 1.3M mouse brain cells scRNAseq data set using Keras and TensorFlow.

Hope you enjoyed reading the post, let me know in the comments what Life Science direction would be interesting for you to probe with AI and Deep Learning, I would love to hear your feedback. Follow me at Medium Nikolay Oskolkov, in twitter @NikolayOskolkov, and check out the complete Jupyter notebook for this post on my github. Next post is going to be about using Deep Learning for multi-OMICs biomedical data integration, stay tuned.