Bioinformatics? Computational Biology?

What's with all the buzz?

SudoPurge
Towards Data Science


This may seem like a straightforward question, but I wouldn’t be writing an article on it if it really were. A quick Google search leads to Wikipedia, which defines Bioinformatics as “an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex”. Let's dissect this statement!

The first three words that Wikipedia uses to describe it are “an interdisciplinary field”. Think of three broad fields that feed into it: Biology, Computer Science, and Data Science. But what exactly are the disciplines within these three fields? On the Biology side, applications currently span everything from Molecular Biology, Genetics, Genomics, Proteomics, and Metabolomics to Systematics, Evolution, Pharmacology, Biomedicine, and Health Sciences. This is not an exhaustive list. On the Computer Science side, the obvious ones are Software Engineering, Database Management, Information Engineering, Natural Language Processing (NLP), and Image Processing.

The field of Data Science is where all of these get blended together and we get to play mix and match! The common theme, though, is using Machine Learning or other forms of Artificial Intelligence to answer biological questions empirically. For instance, combining Genetics with NLP lets us treat genetic sequences as text and analyze them at scale, leading to insights and knowledge that could not otherwise be unearthed. When the Human Genome Project was launched in 1990, many thought we would be able to cure all genetic diseases once we could sequence the genome. But that was only the beginning. Mother Nature wouldn’t let its secrets out to us just like that, would it? Gene sequencing produces massive datasets of noisy, cluttered, and dirty data. Enter NLP. Genetic analyses using NLP techniques can help detect different regions of the genome, such as repeats, coding and non-coding sites, and homologous regions, to name a few. The cost of sequencing DNA has fallen drastically since the Human Genome Project. An estimated US$95 million was spent in 2001 to sequence one whole human genome, compared to around US$950 today, a roughly 100,000-fold drop. Thanks to Next-Gen Sequencing, sequencing is far more accessible and is therefore producing a LOT more data.

Created by author. Data collected from the National Human Genome Research Institute.
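
To make the Genetics-meets-NLP idea a little more concrete, here is a minimal sketch of one common first step: splitting a DNA sequence into overlapping "words" (k-mers) and counting them, much like counting words in a document. The sequence and the function names `kmers` and `kmer_counts` are made up for illustration; a real pipeline would read sequences from files and use far longer inputs.

```python
from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-letter 'words'."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count how often each k-mer occurs, like a word-frequency table."""
    return Counter(kmers(sequence, k))


# Toy sequence, invented for this example (not real genomic data).
toy_sequence = "ATGCGATATATATGC"
tokens = kmers(toy_sequence, k=3)
counts = kmer_counts(toy_sequence, k=3)

# The most frequent k-mers hint at the repetitive stretch in the middle.
print(counts.most_common(2))
```

Frequency tables like this are only a starting point, but they are the kind of text-style feature that downstream models can work with.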

Another area of huge potential is combining Neural Networks with Proteomics and Pharmacology. The Biopharmaceutical industry has only recently begun to harness the power of Data Science for protein modeling. There are hundreds of thousands of organic molecules that can interact with our physiology and metabolism. 98% of these are not ideal candidates for improving our standard of living, though. However, that still leaves us with thousands of other organic molecules, the remaining 2%, that could indeed be used to treat many of our diseases and conditions, or even to enhance us. But recognizing these potential candidates is the near-impossible part. Enter Neural Networks.

Other, more specific use cases include analyzing large datasets produced by high-throughput experiments, modeling evolution and systems biology, modeling epidemics such as the spread of Covid-19, designing synthetic cells, building nucleic-acid-based information storage systems, and much, much more.

One particularly exciting area of Bioinformatics is its use in Precision and Personalized Medicine. Doctors are learning more and more about the influence of genes on a person’s physiology and metabolism, and more importantly, how drugs interact with their exact phenotype.

By comparing a patient’s personal genetic makeup with a database of thousands of other patients, doctors can make more informed decisions about treatment and therapeutic choices. Some drugs may be more effective for one patient than for others, while the same drugs may cause more unwanted side effects than desired effects in other patients. All of this could theoretically be evaluated, provided the hospital has access to a database of thousands of other patients who have similar genotypes and have been treated with similar drugs, in order to predict the effects in the current patient’s body. The medical sciences would become much more robust, empirical, and causal, rather than correlational and uncertain.
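
As a rough illustration of the "compare with similar patients" idea, the sketch below predicts a drug response by majority vote among the most genetically similar patients in a tiny, entirely invented dataset. The genotype encoding (0, 1, or 2 copies of a variant at each site), the labels, and the function names are all assumptions made for this example; real clinical models are far more involved.

```python
from collections import Counter

# Hypothetical records: (genotype at four variant sites, observed drug response).
# Every value here is invented purely for illustration.
PATIENTS = [
    ((0, 1, 2, 0), "responder"),
    ((0, 1, 2, 1), "responder"),
    ((2, 0, 0, 1), "side_effects"),
    ((2, 0, 1, 1), "side_effects"),
    ((1, 1, 2, 0), "responder"),
]


def distance(a, b):
    """Sum of absolute differences in variant counts across sites."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_response(genotype, records, k=3):
    """Majority vote among the k most genetically similar patients."""
    nearest = sorted(records, key=lambda rec: distance(genotype, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


new_patient = (0, 1, 2, 0)
print(predict_response(new_patient, PATIENTS))
```

A nearest-neighbour vote is used here only because it maps directly onto the intuition in the text; it says nothing about which method a hospital would actually deploy.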

Another area of research that is currently making massive strides is the application of image analysis to radiological images. A few years ago, algorithms like Convolutional Neural Networks (CNNs) were used mainly for general-purpose computer vision. Recently, CNNs have been applied to classifying radiological images such as MRI and fMRI scans, and Clustering and Classification algorithms have improved rapidly as a result. Early on, these algorithms reportedly detected benign and malignant tumors only about half as well as a trained Radiologist. Then they improved to being as good as one. And now, on some tasks, they are reported to do even better than highly trained Radiologists. The possibilities are really endless when you combine Biology, Computer Science, and Data Science.

The term Bioinformatics is often used synonymously with “Computational Biology”. While it is true that the two overlap a great deal, the subtle difference is that Computational Biology leans slightly more towards the use of already available computational tools in Biology, while Bioinformatics leans slightly more towards the development of those tools. However, the overlap between the two disciplines far outweighs their differences. So, practically speaking, the distinction does not make much of a difference.

Biological experiments produce far more data today than they did 20 years ago. With open-access databases like those hosted by NCBI, scientists from different parts of the world can share their data and collaborate with ease. We have now crossed the “Excel barrier”: the datasets have outgrown spreadsheets. Bioinformatics researchers who are adept with statistics, data, and programming, on top of their domain knowledge, are in much greater demand than before. Tools such as pandas, NumPy, Matplotlib, R Shiny, and others are much more effective and efficient for this kind of work, especially with the resurgence of Machine Learning. Previously, Bioinformaticians were viewed more like computer scientists who could manage a database and run some algorithms. But modern Bioinformaticians identify what questions a researcher should ask, how to look for the right answers, and what to do with the large amounts of data produced from those studies. These scientists are increasingly integrated into research teams, putting them at the core of innovation. Biology has ushered in the age of AI and Big Data and, with it, created a new line of scientists: Bioinformaticians, or Computational Biologists.

P.S. For more short, to-the-point articles on Data Science, Programming, and how a biologist navigates his way through the Data revolution, consider following my blog.

Thank you for reading!
