Hands-on Tutorials

Integrating Omics using UMAP and Clustering

Joint unsupervised exploration of proteomic and transcriptomic data via dimension reduction and clustering using Python

Egor Vorontsov

Published in

Towards Data Science

9 min readFeb 14, 2021

Scatter plot of the two-dimensional UMAP embedding with subsequent density-based clustering of the joint data set — Joint 2D-embedding of the protemic and transcriptomic data sets with subsequent clustering. Image by author

Life science is driven by data these days. You may have heard the words genomics, transcriptomics, proteomics, even if you are not a biologist or medical researcher. The omics are scientific approaches that attempt to holistically understand a subsystem of a cell, tissue, or organism, and they generate massive amounts of information on the way. The better these techniques are at producing data, the higher is the demand for new computational methods to deal with it.

I have a professional interest in integrating proteomic data with information from other omics, primarily from transcriptomics. In a nutshell, proteomics aims at determining the levels, functions, and interactions of proteins in a living system, while transcriptomics captures a snapshot of the RNA molecules that are a product of gene expression. Gene expression and the functioning of proteins are intricately connected, so exploring them together combines two crucial pieces of the cellular puzzle. The computational approaches for omics integration are in active development: see, for example, Multi-Omics Factor Analysis (MOFA and MOFA+ packages) [1] or mixOmics [2]. These packages cover a range of use cases for supervised and unsupervised integration of diverse data types.

Automated equipment that enables omics technologies. Photo by National Cancer Institute on Unsplash

Recently, I turned my attention to the dimension reduction technique called Uniform Manifold Approximation and Projection, or UMAP [3], [4]. Yes, I am late to the party, but better late than never! The real interest for me came after I had read a TDS blog article by Nikolay Oskolkov in which he leveraged UMAP to construct a combined embedding of the single-cell RNA-seq and proteomic data. It became evident to me that UMAP provides a facility for finding an internal structure in mixed data sets, and I was wondering if that could become an efficient way to jointly cluster the genes according their abundance profiles on the protein and RNA levels, the task complicated by the differences in the ways the data is acquired in mass spectrometry-based proteomics and transcriptomics.

Loading and formatting data

In this post, we will be re-using the data from the paper by Hultqvist and coauthors, including yours truly, published in the peer-reviewed journal Nature Ecology & Evolution [5]. The manuscript is behind the paywall, but the data sets are publicly available, thanks to the publishing policies that encourage transparency and sharing of research data. Mass spectrometry-based proteomic data can be found in the PRoteomics IDEntification database (PRIDE); converting mass spectrometry files into protein-level data can be complicated without prior experience in the field, so I have re-processed them to get a comprehensive protein table, which is placed into the GitHub repository for this article. RNA-seq transcriptomic data is publicly available via Gene Expression Omnibus (GEO), I have downloaded the RPKM (Reads Per Kilobase per Million reads) table from the page. Those of you who are interested in the biological side of the study can refer to the GEO project page for a description of the samples.

The experiment consists of Escherichia coli cells belonging to 5 different conditions, I will call them WT (wild-type, a default cell-culture), ORF, ORF+IPTG, SVI and SVI+IPTG (cells under genetic modification and chemical treatment). There were 2 cell cultures grown in parallel for each condition, we will call them biological replicates. The purpose of growing the samples in duplicate was to assess the variation arising from culturing, as opposed to the variation that is caused by the differences between the conditions.

The complete Jupyter notebook can be found in the Github repository, I will be only showing the most relevant snippets of code here. We will need umap-learn, scikit-learn and the common data analysis libraires:

Load proteomic and RNA-seq data:

(1955, 48)
(4205, 37)

The tables contain a bunch of columns, most of them are not necessary for this analysis. We will add gene names to each protein (linking proteins to genes in RNA-seq), select the abundance columns and log-transform them:

Each row of relative abundances in the proteomic table is represented as a ratio with the WT1 abundance for that protein as denominator. This is rather typical for this particular proteomic technology (so-called isobaric labeling, link to the vendor’s web-site for info), and it emphasizes the relative nature of the abundance data, at the same time conserving the information about variable magnitudes of the abundance changes. The general idea of an experiment using isobaric labeling would be to take equal amount of total protein mass from each sample. Because of that, we would expect the median ratio of protein abundances between samples to be around 1.0, as some proteins go up, some go down, and the total protein amount remains the same. Deviations from the median can be caused by imperfections of the sample preparation, and we could compensate for that by re-normalizing the data so that the median in each samples equals to zero in log-space:

The proteomic distributions are now quite symmetric and centered around zero!

The RPKM values in the RNA-seq table are distributed very differently:

It can be a good idea to format the RNA-seq data in the same way as the proteomic data: use WT1 intensity as denominator for each gene and log-transform the resulting ratios:

Re-normalize to equal medians:

Selecting genes with the strongest variance

Due to the large number of features and the noisy nature of biological data, it makes sense to select only the most informative features for further analysis. You can find great insights into feature selection in another post by Nikolay Oskolkov.

Our data set has 5 biological conditions, 2 replicates in each. Why don’t we leverage the replicates for feature selection here? Simply put, we want the informative features with a “strong change” across biological conditions, and we wish to filter out the features that remain relatively unchanged. We can define “strong change” in a self-contained manner: a change between conditions is “strong” if it exceeds most of the values for the variation between replicates within conditions. We could calculate the log2 difference between replicates for each protein/gene:

The variation is visibly higher in the RNA-seq data set. Let’s now look at the differences between the average values for the conditions:

Average groups in the protein table:

Average groups in the RNA table:

Now we can see that both the differences between replicates and between the conditions are stronger for the RNA-seq data set. This can be because the RNA data is noisier than proteomic data, and/or because of the higher dynamic range of the RNA-seq measurement as compared to the mass spectrometry method that has been used in the study.

With this ready, let’s join the tables by gene names and select the rows where either the maximum change between average protein values is higher than 90th percentile for the replicate differences for proteins or the difference between average RNA values is higher than 90th percentile of the replicate differences for RNA:

Proteins 90th percentile: 0.2658608755331754
RNA 90th percentile: 0.8764194924938945
Before: (1778, 32)
After: (951, 32)

UMAP embedding

The data is ready for UMAP now! We will select average values for proteins and RNA from each row, because we would like to emphasize the differences between biological conditions and omit the variation within conditions. We could also exclude the WT samples, since all the values there are very close to zero by the very nature of our data transformations (WT1 in the denominator). Let’s calculate and visualize the separate embeddings on the protein and RNA, as well as the joint embedding:

Great, we can see clusters forming! The shape and distributions of the clusters will differ depending on the settings, including the important n_neighbors parameter. I played with the n_neighbors in the range of 10 to 100 and settled at 30, which is close to the square root of the number of features (951).

Clustering

Scikit-Learn gives us a range of options for clustering that we can apply to the results of the embedding. Since we see the dense clusters on the plot, it would be natural to try the density-based clustering or DBSCAN. The eps parameter is important and should be tuned to the data. We can also apply other commonly used algorithms, for example, mean shift and agglomerative clustering. The crucial parameters in those cases are the bandwidth and the number of clusters, respectively. Let’s look at the overview of the clustering done by these three methods:

Found clusters 52
Found clusters 43
Found clusters 45

From a bird’s-eye view, the algorithms have highlighted the density clusters and “constellations” of points on the scatter plot in different ways. But did they capture the groups of similarly behaving genes? To judge about that, we can represent the profiles of the relative abundances in each cluster as heatmaps. Let’s take the DBSCAN results as an example:

It does indeed look like the clusters have consistent internal structures! Tuning of the clustering parameters can help to further adjust the size and composition of the resulting groups if necessary. In the context of omics integration, it’s exciting to explore the clusters that show divergence between the protein profiles (P-samples) and RNA profiles (R-samples). Take the clusters 36 and 37 as examples:

We can see some intriguing profiles of the relative abundance where proteins and transcripts are apparently not in sync! Furthermore, the genes in the clusters are presumably related to each other. The exploratory data analysis has thus provided interesting candidates for further scientific investigation!

Conclusion

Joint UMAP embedding and subsequent clustering on the proteomic and transcriptomic data from the same experiment is a straightforward way to highlight the groups of similarly behaving genes. In this post, we have looked at the filtering of the data, UMAP dimensionality reduction using umap-learn package and clustering using three algorithms implemented in scikit-learn.

References

[1] Argelaguet et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data (2020), Genome Biol 21, 111

[2] F. Rohart et al. mixOmics: An R package for ‘omics feature selection and multiple data integration (2017), PLOS Computational Biology 13(11): e1005752

[3] L. McInnes L, J. Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018), ArXiv e-prints 1802.03426, 2018]

[4] E. Becht et al. Dimensionality reduction for visualizing single-cell data using UMAP (2019), Nat Biotechnol 37, 38–44

[5] J. Jerlström Hultqvist et al. A bacteriophage enzyme induces bacterial metabolic perturbation that confers a novel promiscuous function (2018), Nat Ecol Evol 2, 1321–1330