Low-Cost Cell Biology Experiments for Data Scientists

Paper microscopes, public data repositories, and hosted notebook solutions

Paul Mooney
Towards Data Science

--

Introduction:

Can “citizen scientists” do real science without costly laboratory equipment?

In short: yes. In this blog post we describe a low-cost approach to answering biological questions that is suited to the aspiring amateur scientist. Specifically, we highlight the benefits of combining low-cost imaging equipment (FoldScope microscopes) and public image data (Kaggle Datasets) with free tools for computational data analysis (Kaggle Kernels). We also provide a general framework for using these tools to tackle image classification problems in the biological sciences. We hope that these guidelines encourage more data scientists to publish cell biology datasets and to share what they learn.

Part 1: Obtaining image data

High-throughput and super-resolution microscopy methods have been invaluable to the study of cellular biology [1,2], but these technologies are prohibitively expensive for many research labs. The good news is that innovative technologies like FoldScope microscopes can dramatically decrease research costs [3]. FoldScope microscopes are literally paper microscopes, made of a paper frame attached to a high-magnification objective lens (Figure 1). These multi-use microscopes can be attached to the camera of a mobile phone and have proven capable of imaging infectious agents such as Plasmodium falciparum and Schistosoma haematobium [3,4]. FoldScope microscopes currently retail for as little as $1.50 each, and their low cost has inspired an active community of citizen scientists.

In this work we acquired new cell images using a FoldScope microscope. We opted to start with commercially prepared slides, sold by the Carolina Biological Supply Company, containing hematoxylin-stained, dividing Ascaris lumbricoides cells. Images were acquired using a FoldScope microscope equipped with a 500x magnification lens and an 8-megapixel digital camera (iPhone 5). In this way, we generated a dataset of images of cells in different stages of cellular division with minimal investment.

Part 2: Sharing image data

A FoldScope microscope was used to acquire 90 images of dividing cells in hematoxylin-stained Ascaris lumbricoides uterus (Figure 2). These images were shared as a public dataset on Kaggle along with some starter code demonstrating how to work with the data. A dataset of only 90 images is inevitably limited and does not present a suitable challenge for the computational environment. We therefore identified a relatively large dataset of labeled images of Jurkat cells in different stages of cellular division [5] and added it to our Kaggle Dataset as well. With this supplemental dataset of approximately 32,000 cell images, we could adequately test the ability of the free Kaggle Kernels product to perform complex computational analyses.
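Before training on a dataset like this, it can help to check how many images each class contains, since a small or unbalanced dataset limits what a model can learn. The sketch below assumes the images are stored in one sub-folder per label (the layout described later in this post). The function name count_images_per_label and the list of file extensions are illustrative assumptions and are not part of the starter code shared on Kaggle.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image file extensions; adjust to match your own files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def count_images_per_label(root):
    """Count image files in each immediate sub-folder of `root`.

    Assumes one sub-folder per label, e.g. root/Anaphase/img001.jpg.
    """
    counts = Counter()
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue  # ignore loose files next to the label folders
        counts[label_dir.name] = sum(
            1
            for f in label_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling count_images_per_label("../input/") on a folder-per-label dataset returns a Counter keyed by label, which makes rare classes easy to spot before any model is trained.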

Part 3: Analyzing image data

Deep learning algorithms are exciting in part because of their potential to automate biomedical research tasks [6,7]. For example, deep learning algorithms can be used to automate the time-consuming process of manually counting mitotic structures in breast histopathology images [8,9]. Differences in the rates of cellular division and differences in the amount of time spent in each stage of cellular division are both important differentiators between healthy cells and cancer cells [10,11]. Likewise, cancer cells often form erroneous mitotic structures during cellular division and these erroneous structures can contribute to further disease progression [12,13]. As such, the study of cellular division and its machinery has led to the development of numerous anti-cancer drugs [14,15]. In this work we use the cell division dataset from Figure 2 to train a deep learning model that can be used to identify cell cycle stages in images of dividing cells (Figure 3).

Here we present a simple approach that uses a Kaggle Kernel and the cell division dataset from Figure 2 to train a deep neural network to identify cell cycle stages. The free-of-charge Kaggle Kernel combines the data, the code, and the cloud-based computational environment in a way that makes the work easy to reproduce. In fact, you can duplicate this work and run it in an identical cloud-based computational environment just by pressing the “Fork Kernel” button on the Kaggle website. (Additional discussion on research reproducibility and a set of recommended guidelines can be found here.) By training a deep neural network on a dataset of 32,266 images, we hope to have shown that the free Kaggle Kernels environment can perform complex computational analyses that are relevant to the biomedical sciences.

The ML model that was trained under this new approach gave results that were comparable to the original analysis (Figure 4). Interestingly, both models still have room for improvement in identifying some of the more rarely observed cell cycle stages, but this could likely be corrected by generating additional data for those stages. The code in Figure 4B is meant to be generalizable and should work well with diverse image types and image categories: the ImageDataBunch.from_folder() function from fastai.vision can be used to load and process any compatible image, and the create_cnn() function can be used to automatically learn new model features.

# Code to generate Figure 4A: https://github.com/theislab/deepflow
# Code to generate Figure 4B: (below; uses the fastai v1 API)
from fastai import *
from fastai.vision import *
from fastai.callbacks.hooks import *
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
# Images are read from sub-folders of ../input/, one folder per label
img_dir = '../input/'
path = Path(img_dir)
# Hold out 30% of the images for validation; augment with flips, rotations, and lighting changes
data = ImageDataBunch.from_folder(
    path, train=".", valid_pct=0.3,
    ds_tfms=get_transforms(do_flip=True, flip_vert=True, max_rotate=90, max_lighting=0.3),
    size=224, bs=64, num_workers=0).normalize(imagenet_stats)
# Build a ResNet-34 classifier and train it for 10 epochs with the one-cycle policy
learn = create_cnn(data, models.resnet34, metrics=accuracy, model_dir="/tmp/model/")
learn.fit_one_cycle(10)
# Compare predicted and actual labels on the validation set
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(10, 10), dpi=60)
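
A confusion matrix like the one plotted by the code above can also be reduced to one number per class, which makes weak spots on rarely observed cell cycle stages easier to see. The sketch below computes per-class recall (the fraction of true examples of a class that were predicted correctly) from a matrix whose rows are actual labels and whose columns are predicted labels. The function name per_class_recall and the toy numbers are illustrative assumptions, not results from this work. In fastai v1 the raw matrix should be available from interp.confusion_matrix(), but check the row and column orientation for your version.

```python
def per_class_recall(matrix, labels):
    """Return {label: recall} for a square confusion matrix.

    Rows are actual classes and columns are predicted classes, so the
    recall for class i is matrix[i][i] divided by the sum of row i.
    Classes with no actual examples get None instead of a number.
    """
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else None
    return recalls


# Toy numbers for illustration only (not results from this work).
toy_labels = ["Interphase", "Metaphase", "Anaphase"]
toy_matrix = [
    [90, 5, 5],  # actual Interphase
    [2, 6, 2],   # actual Metaphase
    [1, 1, 3],   # actual Anaphase
]
for label, recall in per_class_recall(toy_matrix, toy_labels).items():
    print(f"{label}: {recall:.2f}")
```

In this toy example the common class scores higher than the two rare ones, which is the pattern that additional images of rare stages would be expected to improve.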

This work describes a reusable framework that can be applied to cell biology datasets to solve image classification problems. The specific image classification problem discussed was the automatic identification of cell cycle stages during cellular division, and the approach was notable because of the low cost of the equipment and because of how easy it was to train the model (with free cloud computing and open-source ML algorithms). Data scientists and cell biologists who want to train their own image classification models can easily replicate this approach by creating a dataset of their own FoldScope images, organizing the training data into folders that correspond to each image label, and then attaching that same dataset to a kernel that contains the Fastai code described in Figure 4. We hope that future researchers take advantage of the dataset and the starter code that we have shared in order to quickly fork, modify, and improve upon our models.
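
The folder-per-label layout mentioned above can be prepared with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function name organize_by_label, the {filename: label} mapping, and the example file names are hypothetical, and images are copied rather than moved so that the originals stay untouched.

```python
import shutil
from pathlib import Path


def organize_by_label(source_dir, target_dir, labels_by_filename):
    """Copy each image into target_dir/<label>/<filename>.

    `labels_by_filename` is a hypothetical mapping such as
    {"img_001.jpg": "Metaphase"}. Files that are missing from
    source_dir are skipped and returned so they can be checked by hand.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    missing = []
    for filename, label in labels_by_filename.items():
        src = source_dir / filename
        if not src.is_file():
            missing.append(filename)
            continue
        label_dir = target_dir / label
        label_dir.mkdir(parents=True, exist_ok=True)  # one folder per label
        shutil.copy2(src, label_dir / filename)
    return missing
```

The resulting target folder can then be uploaded as a dataset and passed as the path to ImageDataBunch.from_folder(), as in the code for Figure 4B.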

Conclusion:

Since the first compound microscopes were built in the 1600s, medical imaging and laboratory equipment have been expensive to produce and to own. Analytical software is much newer but in many cases can be equally expensive. More recently, low-cost tools like the $1.50 FoldScope microscope and the $0.00 Kaggle Kernel have been developed that can perform many of these same functions at a fraction of the cost. This work describes a low-cost and reusable framework that can be applied to cell biology datasets to solve image classification problems. We hope that these guidelines will encourage more data scientists to explore and publish cell biology datasets and to share their results and findings.

Works Cited:

[1] Miller MA, Weissleder R. Imaging of anticancer drug action in single cells. Nature Reviews Cancer. 2017. Vol 17(7): p399–414.

[2] Liu TL, Upadhyayula S, Milkie DE, et al. Observing the cell in its native state: Imaging subcellular dynamics in multicellular organisms. Science. 2018. Vol 360(6386).

[3] Cybulski J, Clements J, Prakash M. Foldscope: Origami-Based Paper Microscope. PLoS One. 2014. Vol 9(6).

[4] Ephraim R, Duah E, et al. Diagnosis of Schistosoma haematobium Infection with a Mobile Phone-Mounted Foldscope and a Reversed-Lens CellScope in Ghana. Am J Trop Med Hyg. 2015. Vol 92(6): p1253–1256.

[5] Eulenberg P, Köhler N, et al. Reconstructing cell cycle and disease progression using deep learning. Nature Communications. 2017. Vol 8(1): p463.

[6] Esteva A, Robicquet A, et al. A guide to deep learning in healthcare. Nature Medicine. 2019. Vol 25: p24–29.

[7] Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine. 2019. Vol 25: p44–56.

[8] Li C, Wang X, Liu W, Latecki LJ. DeepMitosis: Mitosis detection via deep detection, verification and segmentation networks. Medical Image Analysis. 2018. Vol 45:

[9] Saha M, Chakraborty C, Racoceanu D. Efficient deep learning model for mitosis detection using breast histopathology images. Computerized Medical Imaging and Graphics. 2018. Vol 64: p29–40.

[10] Sherr CJ. Cancer cell cycles. Science. 1996. Vol 274(5293): p1672–7.

[11] Visconti R, Monica RD, Grieco D. Cell cycle checkpoint in cancer: a therapeutically targetable double-edged sword. J Exp Clin Cancer Res. 2016. Vol 35: p153.

[12] Milunović-Jevtić A, Mooney P, et al. Centrosomal clustering contributes to chromosomal instability and cancer. Current Opinion in Biotechnology. 2016. Vol 40: p113–118.

[13] Bakhoum SF, Cantley LC. The Multifaceted Role of Chromosomal Instability in Cancer and Its Microenvironment. Cell. 2018. Vol 174(6): p1347–1360.

[14] Florian S, Mitchison TJ. Anti-Microtubule Drugs. Methods in Molecular Biology. 2016. Vol 1413: p403–411.

[15] Steinmetz MO, Prota AE. Microtubule-Targeting Agents: Strategies To Hijack the Cytoskeleton. Trends in Cell Biology. 2018. Vol 28(10): p776–792.

[16] The source for the images in Figure 3A, Figure 3B, and the Cover Photo can be found in each respective hyperlink.

--

Paul has a PhD in Molecular & Cellular Life Sciences and he blogs about science research and data science. http://www.kaggle.com/paultimothymooney