The world’s leading publication for data science, AI, and ML professionals.

How to Start Learning Bioinformatics and Not Get Intimidated(With R)

A Step-by-step Example using Differential Gene Expression Analysis

Introduction

This article is aimed towards people who are looking to "break into" the Bioinformatics realm and have experience with R (ideally using the tidyverse). Bioinformatics can be a scary-sounding concept (as least it is for me) because it is such a vast and fast-developing field that it can be difficult to define exactly what it is. I’ve always thought that bioinformatics was a highly advanced field beyond what I was capable of doing – that I would need years of technical training to begin actually doing it. But like with everything, it doesn’t actually take much to begin doing something (it goes without saying that it does take years to master something).

Acknowledging that I’m oversimplifying, bioinformatics is essentially the in silico (or data-based) approach to answering biological questions. With the advent of more advanced sequencing technology and accompanying developments in statistical algorithms, we now have unprecedented access to biological data at a scale and price previously unheard of as well as the tools to extract insights from this data.

In this article, I aim to provide an example of an easy way that anyone who likes data, likes to work with R, and has an interest in this field, can start doing analyses in the bioinformatics realm.

There are loads of different types of questions and different types of biological data, so for this article, we will be performing a differential Gene expression analysis (DGEA) using RNA-Seq data. Briefly, if you take a trip down memory lane to your high school/college biology classes, DGEA answers the question of whether there are changes in expression levels in genes between different experimental conditions, and RNA-Seq (short for "RNA sequencing") uses next-generation sequencing to quantify the amount of RNA in a given transcriptome (the set of all RNA transcripts). We’ll be getting the data from recount2. Essentially, it is a database maintained by Johns Hopkins that interfaces with R through their package, and the user can query RNA-seq read count data from it. Counts are the number of sequenced reads that align with a particular region of a gene.

Please note that this article is aimed at beginners so a lot of details about the motivation behind certain analytical steps within DGEA or about all the capabilities of the package will not be included. But, it’s my hope that once you’re able to successfully do something, you’ll feel empowered to learn more!

Setup

So, let’s first set up our workspace, which includes loading our libraries and RNA-Seq data into R that we’ve queried from recount. I will be using an RNA-Seq dataset related to coronary artery disease, given that’s my academic interest. In this particular dataset, cases refer to coronary artery smooth muscles cells with the TCF21 gene silenced. Recount comes with a nice Shiny app user interface where you can search for projects that you find interesting and query the data using the associated project ID; however, there are also functions within the recount package that allow you to search for projects without using the app interface.

With most data analysis projects, there are 3 main steps that I like to consider: exploratory data analysis, the analysis, and the visualization. Before we get started, it is important to note that I’ve chosen to use DESeq2 for this analysis, although there exist other packages that do DGEA using different methods. We first have to build a DESeqDataSet object to use in our analysis. This object essentially contains three components: colData contains info about the samples, rowRanges contains info about the genomic ranges and assay contains the matrix of counts.

We use dds <- DESeq2::DESeqDataSet(rse_gene, design = ~ case) to make our DESeqDataSet object where case specifies whether the sample was control or experimental. Let’s get our hands dirty by making some exploratory figures.

Exploratory Data Analysis

First, we can assess overall similarity between samples using Euclidean distance. The details behind this will not be covered in this article. It is important to clarify that the counts will be transformed solely for the purposes of exploration. The point of transformation is to make the data more homoskedastic to aid in exploration (the specific transformation used is the variance-stabilizing transformation, the details of which can be found in this paper). Otherwise, for the statistical analysis, raw counts should be used. This is the accompanying code to produce a heatmap of distances.

Another way to assess similarity between samples is to use a principal component analysis (PCA) plot. Without going into too much detail, it’s essentially a plot where the x-axis explains most of the differences between all the samples, and the y-axis explains the second-most differences between all the samples. Note that these percentages do not add up to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).

Differential Expression Analysis

Now that we’ve visualized our data, let’s run the analysis. The package we’re using makes it super easy; we just run dds <- DESeq(dds). You can see more of what it does by using the help command in R: ?DESe, but briefly it does three things: estimates size factors (controlling for differences in the sequencing depth of the samples, but don’t worry too much about what that means for now), estimates dispersion values (again not going to get into the technical nitty-gritty, think of dispersion as a measure of "spread") for each gene, and fitting a generalized linear model. If those words didn’t make sense to you, that’s ok. My philosophy is do first, ask later, which I understand may be quite controversial, but I think oftentimes students can get bogged down with all the theory and technical details (of which there are a lot to achieve true mastery and understanding) so much so that they never progress to the practical aspect of things.

There are two main estimates for each gene that will be returned in a dataframe: baseMean, the average of the normalized count values, divided by the size factors, and log2FoldChange, effect size estimate. for example, a log2 fold change of 2 means that the gene’s expression is increased by a multiplicative factor of 2²=4. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is zero effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability.

Visualization

One useful thing we can do is visualize the distribution of estimated coefficients in the model across all the genes. We can use an MA plot. The log2 fold change (LFC) for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis. Each gene is represented with a dot. Before making the MA-plot, we use the lfcShrink function to shrink the log2 fold changes for the comparison of cases vs controls. There are three types of shrinkage estimators in DESeq2, which are covered in the DESeq2 vignette. We’re going to use the apeglm method for shrinking coefficients, which is good for shrinking the noisy LFC estimates while giving low bias LFC estimates for true large differences.

It’s also always good practice to perform diagnostics about your p-value histograms. This is a really good article explaining that.

res_shrink %&gt;%
  filter(baseMean &gt; 1) %&gt;%
  ggplot(aes(x = pvalue)) + 
  geom_histogram(col = "white")

Conclusion

So what did we accomplish? We performed our first differential gene expression analysis, and we identified some candidate genes that are upregulated/downregulated when the TCF21 gene is silenced in coronary artery muscle cells! Pretty cool, right?

There are a lot more things you can do with differential expression analysis. I’ve barely scratched the surface. If you want to go into more depth, I recommend looking at this vignette or just googling "differential gene expression R" and going down the rabbit hole. Hopefully, this article was able to demonstrate that starting something entirely new should not be a scary ordeal, and remind you that you always gotta start somewhere. With the right amount of interest and motivation, however, once you start, you’ll soon be googling things you don’t understand or what to learn more about, and you’ll be on your path to mastery!


Related Articles