
Cellular Image Classification with Intelec AI: A step by step guide


Fluorescence microscopy: visualizing a cell with its components. Image by the author using rxrx library

Have you ever wondered if you could classify images without any knowledge of deep learning? Great free tools now exist, such as the Python libraries Keras and fastai or the Intelec AI software, that abstract deep learning so well that anyone can begin their deep learning journey, even with only one year of coding experience.

This article shows how I took part in a complex Kaggle competition on cellular image classification using just pandas and Intelec AI.


Overview and context

Drugs and medical treatments usually take many years to develop, which makes them expensive and therefore out of reach for much of the public. Recursion Pharmaceuticals believes in the power of artificial intelligence to reduce the time and cost of producing new drugs.

A concrete way to help pharmaceutical research with artificial intelligence is to classify images of cells that have received specific genetic perturbations (called siRNAs). If you can guess the treatment from the images with high accuracy, it means the treatment interacts with the cells in some way.

This is exactly the problem posed by our Kaggle challenge, Recursion Cellular Image Classification: given an image, we should be able to predict which siRNA was applied to the cells in it.

Understanding the data: fluorescence microscopy

Left: Fluorescence microscopy over 6 channels. Right: Transformed image. Image by the author.

The principle of fluorescence microscopy is to use different proteins, each of which is fluorescent only in a certain spectral color range and tends to attach to a specific part of the cell.

The images on the left were taken by the same camera at the same site through six different spectral color ranges. This is possible thanks to the principle of fluorescence microscopy. Each frequency is represented as a black-and-white image. The first goal is to combine them into a single RGB image, as shown on the right.


Introduction to the challenge

Recursion Cellular Image Classification

Kaggle hosted a hard classification challenge: more than 750,000 images (almost 50 GB) to be sorted into more than eleven hundred possible classes. The image dataset is arranged as follows: there are 51 batches. Each batch has four plates, each of which has 308 filled wells. Each well is monitored at two sites, and each site is photographed across six frequencies. That’s a lot! We will see in the next section how to reduce the dataset size while keeping the most relevant components. You can download the data here.

This arrangement yields the following tree data structure:
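As a quick sanity check, the image count follows directly from that layout. The variable names are only illustrative, and the product is an upper bound because it assumes every combination is present:

```python
# Upper bound on the number of image files implied by the dataset layout.
batches = 51
plates_per_batch = 4
wells_per_plate = 308
sites_per_well = 2
channels_per_site = 6

sites = batches * plates_per_batch * wells_per_plate * sites_per_well
files = sites * channels_per_site

print(sites)  # 125664 six-channel sites
print(files)  # 753984 single-channel images
```

Keeping a single site per well, as we do later in this guide, halves both numbers.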

How are our data arranged? Image by the author
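To make the tree concrete, here is a minimal sketch that builds the six per-channel file paths for one site. The helper name channel_paths and the "data" root folder are hypothetical, and the folder and file naming pattern is an assumption based on how the Kaggle download is laid out, so adjust it if your copy differs:

```python
from pathlib import PurePosixPath

def channel_paths(root, split, experiment, plate, well, site, channels=range(1, 7)):
    """Return the per-channel image paths for one site.

    Assumed layout: <root>/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png
    """
    folder = PurePosixPath(root) / split / experiment / f"Plate{plate}"
    return [folder / f"{well}_s{site}_w{c}.png" for c in channels]

paths = channel_paths("data", "train", "HUVEC-08", 4, "K09", 1)
print(paths[0])    # data/train/HUVEC-08/Plate4/K09_s1_w1.png
print(len(paths))  # 6
```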

Understanding the data: What does it look like?

As mentioned before, each site was photographed across six frequencies (denoted by w1, …, w6 in the picture above). It would be nice to have 3 frequencies only, as if we were dealing with an RGB (Red Green Blue) image, because most state-of-the-art neural networks take as input RGB images, i.e. images that have exactly three channels. There are many ways to reduce some data from a higher to a lower dimension. We will explore two of them.

First things first, we will use a small and handy library created by rxrx.ai to deal with this particular data. This library will help us visualize our data in a more meaningful way. You should clone their git repo into the same folder where you expect to run all your data preparation scripts. From a terminal, you can execute:

git clone https://github.com/recursionpharma/rxrx1-utils

Then, open a Jupyter notebook or a Python script and try the following:
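A minimal sketch of such a first test is shown here. It assumes the repository was cloned into a folder named rxrx1-utils next to your script, that its dependencies and matplotlib are installed, and that load_site_as_rgb takes its arguments in the order dataset, experiment, plate, well, site. The wrapper function show_site and its default values are only illustrative:

```python
import sys

def show_site(experiment="HUVEC-08", plate=4, well="K09", site=1,
              repo_dir="rxrx1-utils"):
    """Load one site as an RGB image with rxrx1-utils and display it."""
    # Imports are local so the function can be defined before the cloned
    # repository and its dependencies are in place.
    if repo_dir not in sys.path:
        sys.path.append(repo_dir)
    import matplotlib.pyplot as plt
    import rxrx.io as rio

    # Assumed argument order: dataset, experiment, plate, well, site.
    rgb = rio.load_site_as_rgb("train", experiment, plate, well, site)
    plt.imshow(rgb)
    plt.axis("off")
    plt.show()
    return rgb

# In a notebook cell, call: show_site()
```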

A notebook with all the code is available here.

To make sure you understand the tree structure of the data, try changing the parameters of the load_site_as_rgb function and check that an RGB cell image appears!

Expected output. Image by the author

Preparing the data

Since we would like to use the Intelec AI software, we need to stick to its expected format for the data folders and the label file. We will come back to its installation later. For now, let’s create the label file!

If you want to try it yourself, you don’t have to try both methods below. One is enough!

Now we can finally define our two functions!
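As a rough sketch of this step, the snippet below turns rows of Kaggle's train.csv into a label file. It uses only the standard csv module so that it runs anywhere, and the same thing takes a few lines with pandas. The image naming scheme and the header names image_name and label are assumptions, so check the format Intelec AI expects before relying on them. The two sample rows are illustrative stand-ins for the real file:

```python
import csv
import io

def build_label_rows(train_csv, site=1):
    """Map each row of train.csv to (image file name, siRNA label)."""
    rows = []
    for rec in csv.DictReader(train_csv):
        # Hypothetical naming scheme: one RGB image per well and site.
        rows.append((f"{rec['id_code']}_s{site}.png", rec["sirna"]))
    return rows

def write_labels(rows, out, header=("image_name", "label")):
    """Write the label file; the header names are an assumption."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# Tiny stand-in for train.csv (illustrative values only).
sample = io.StringIO(
    "id_code,experiment,plate,well,sirna\n"
    "HEPG2-01_1_B03,HEPG2-01,1,B03,513\n"
    "HEPG2-01_1_B04,HEPG2-01,1,B04,840\n"
)
rows = build_label_rows(sample)
out = io.StringIO()
write_labels(rows, out)
print(out.getvalue())
```

With the real data, replace the StringIO objects with open("train.csv", newline="") and open("labels.csv", "w", newline="").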

From left to right, both images generated by the author using PCA and Colormap techniques presented below.

PCA

Principal Component Analysis (PCA) is a common technique for applying dimensionality reduction to some data. An additional library is needed: scikit-learn. The corresponding code is the following:

Even though PCA usually gives excellent results when embedding data in lower dimensions, it might not be perfectly suited to this problem: PCA works well when it finds linear dependencies between the features (here, the channels), which might not be the case with these cell images.
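The snippet below is a stand-in sketch of the idea rather than the notebook's code. To stay dependency-light it computes PCA with NumPy's eigendecomposition instead of scikit-learn's PCA class, treating every pixel as a 6-dimensional sample and keeping the three leading components. The function name pca_to_rgb and the per-component rescaling to 0..255 are my own choices:

```python
import numpy as np

def pca_to_rgb(site):
    """Project an (H, W, 6) image onto its 3 leading principal components."""
    h, w, c = site.shape
    pixels = site.reshape(-1, c).astype(np.float64)   # one row per pixel
    pixels -= pixels.mean(axis=0)                     # center each channel
    cov = np.cov(pixels, rowvar=False)                # (6, 6) channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top3 = eigvecs[:, ::-1][:, :3]                    # 3 largest components
    proj = pixels @ top3                              # (H*W, 3)
    # Rescale each component to 0..255 so it can be saved as an RGB image.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    scaled = (proj - lo) / np.where(hi > lo, hi - lo, 1.0) * 255.0
    return scaled.reshape(h, w, 3).astype(np.uint8)

rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(32, 32, 6))         # random stand-in for a site
rgb = pca_to_rgb(demo)
print(rgb.shape, rgb.dtype)  # (32, 32, 3) uint8
```

Note that this fits the components on a single image. Fitting them once on a sample of pixels from many images would give colors that are comparable across the dataset.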

Colormap

A last idea is to apply a colormap that maps each channel to RGB values. For example, one can imagine that the first channel colors the output image in red, the second in green, the third in blue, and the last three in some combination such as magenta. But choosing a good colormap policy requires a deep understanding of the data, which I personally do not have. Fortunately, the rxrx library has this beautiful method called load_site_as_rgb! Remember? We specify a cell image and it applies a custom colormap to output a 3-channel image, exactly as we wanted!

You may notice that I lied a bit: we actually use the method convert_tensor_to_rgb, because the other one calls their online API and is much slower than opening the image ourselves locally.

You might like to try your own ideas. To keep the code modular and open to other approaches, I suggest one last method that lets us select whichever function we want:

Now we need the main loop: iterate through all the data and transform the images into RGB images as described above.
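To illustrate the general colormap idea independently of the rxrx library, here is a small NumPy sketch. The color table is hypothetical and is not the one used by rxrx1-utils. Each channel tints the output with one color, and the six tinted layers are summed and clipped:

```python
import numpy as np

# Hypothetical channel-to-color table (one RGB triple per channel);
# these are NOT the values used by rxrx1-utils.
CHANNEL_COLORS = np.array([
    [0, 0, 255],     # w1 -> blue
    [0, 255, 0],     # w2 -> green
    [255, 0, 0],     # w3 -> red
    [0, 255, 255],   # w4 -> cyan
    [255, 0, 255],   # w5 -> magenta
    [255, 255, 0],   # w6 -> yellow
], dtype=np.float64)

def colormap_to_rgb(site, colors=CHANNEL_COLORS):
    """Tint each channel of an (H, W, 6) uint8 image and sum the results."""
    intensity = site.astype(np.float64) / 255.0      # (H, W, 6), values in 0..1
    rgb = intensity @ colors                         # (H, W, 3)
    return np.clip(rgb, 0, 255).astype(np.uint8)     # saturate, then cast

demo = np.zeros((2, 2, 6), dtype=np.uint8)
demo[0, 0, 2] = 255          # only channel w3 lit -> pure red
demo[1, 1, 4] = 255          # only channel w5 lit -> magenta
out = colormap_to_rgb(demo)
print(out[0, 0].tolist(), out[1, 1].tolist())  # [255, 0, 0] [255, 0, 255]
```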
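One simple way to get this kind of modularity is a dictionary that maps a method name to a function. The sketch below is only an illustration: the names METHODS and get_transform are hypothetical, and the two placeholder functions stand for the PCA and colormap transforms described earlier:

```python
def pca_method(site):
    raise NotImplementedError  # placeholder for the PCA transform

def colormap_method(site):
    raise NotImplementedError  # placeholder for the colormap transform

METHODS = {"pca": pca_method, "colormap": colormap_method}

def get_transform(name, methods=METHODS):
    """Return the transform registered under `name`."""
    try:
        return methods[name]
    except KeyError:
        raise ValueError(
            f"unknown method {name!r}; choose from {sorted(methods)}"
        ) from None

# Registering your own idea takes one line (here: keep the first 3 channels
# of each pixel, with pixels given as plain lists).
METHODS["first_three"] = lambda site: [px[:3] for px in site]
transform = get_transform("first_three")
print(transform([[1, 2, 3, 4, 5, 6]]))  # [[1, 2, 3]]
```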

Some remarks concerning the above code:

  • The first line is what you might change if you want to try a specific suggested method (or to try your own idea of course!)
  • You may notice that at line 9 we fixed the site to "1" and never use the second site. The reason is that we wanted to work with a lighter dataset: this is just a demo, and we are not trying to win the Kaggle competition (but you can use both sites if you’re motivated!)
  • Because the data is not defined for all combinations, some images might not exist for site 1, so we use try-except blocks to keep the program from crashing.

That’s it, folks! I promise, we did the hardest step of our journey! Now that we have our data in a good format, let’s use Intelec AI!
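The pattern behind these remarks can be sketched as follows. This is not the notebook's loop, so the line numbers mentioned in the remarks do not apply to it. The function convert_all and its load, transform and save arguments are hypothetical, and the fake loader only exists so the sketch runs without the dataset:

```python
def convert_all(rows, load, transform, save, site=1):
    """Convert every (experiment, plate, well) in `rows`; return the skipped ones."""
    skipped = []
    for experiment, plate, well in rows:
        try:
            image = load(experiment, plate, well, site)
        except FileNotFoundError:
            # Not every combination exists in the data; skip instead of crashing.
            skipped.append((experiment, plate, well))
            continue
        save(f"{experiment}_{plate}_{well}_s{site}.png", transform(image))
    return skipped

# Stand-ins so the sketch runs without the dataset.
available = {("HEPG2-01", 1, "B03")}

def fake_load(experiment, plate, well, site):
    if (experiment, plate, well) not in available:
        raise FileNotFoundError(well)
    return [[0] * 6]  # a one-pixel, six-channel "image"

saved = {}
skipped = convert_all(
    [("HEPG2-01", 1, "B03"), ("HEPG2-01", 1, "B04")],
    load=fake_load,
    transform=lambda img: [px[:3] for px in img],
    save=saved.__setitem__,
)
print(sorted(saved))  # ['HEPG2-01_1_B03_s1.png']
print(skipped)        # [('HEPG2-01', 1, 'B04')]
```

Catching FileNotFoundError specifically, rather than every exception, keeps genuine bugs visible while still tolerating missing images.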

If you want, you can check out the full notebook here.


Intelec AI installation

Intelec AI can be installed by clicking here.


You will need Docker. If you have trouble installing it, I invite you to check out their website.


Running your first model

Here we are at last! Yes, I know you are impatient to train your deep learning model without any code. Don’t worry, I will show you how to do it.

Upload your data

First you need to upload your chosen data, which can be the PCA version, the colormap version, or any other version of your choice.

  1. Zip your folder.
  2. Go to the "file explorer" tab of the Intelec page.
  3. Upload it.
  4. Right click on it to unzip it.

Choose your model

The free version of Intelec already offers some interesting neural networks. Go to the "training" tab. There is a simple model and a heavier model for classification tasks. My advice is to go straight for the heavier one because this Kaggle challenge is a complex task. The procedure is nearly identical whether you choose the simple or the heavier model:

  1. Give a name to your classifier, e.g. Deep colored cell classifier
  2. Choose the folder called "train", which should contain another folder called "images" and your "labels.csv" file
  3. You can optionally add a test or validation folder, which I personally didn’t do for this specific task. Otherwise, Intelec will set aside part of the training data for validation by default, which is perfectly fine.
  4. You can also set a shrink factor: the higher it is, the lower the resolution, i.e. the faster the model trains but the lower the accuracy. My advice is to start with a high value and then slowly decrease it to see whether it is worth spending more time on a model. Perhaps begin with 4, because the image resolution is already quite low (512x512x3). For my results, I actually used a shrink factor of 1 because I rented a server for this task, but you definitely don’t need to do the same.
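To get a feel for the shrink factor trade-off, here is a small back-of-the-envelope sketch. It assumes the factor simply divides each side of the image, which is my reading of "the higher, the smaller the resolution" and not something confirmed from Intelec's documentation. The function name shrunk_size is illustrative:

```python
def shrunk_size(side=512, factor=1):
    """Side length and pixel count after dividing the resolution by `factor`."""
    new_side = side // factor
    return new_side, new_side * new_side

for factor in (1, 2, 4):
    print(factor, shrunk_size(512, factor))
# 1 (512, 262144)
# 2 (256, 65536)
# 4 (128, 16384)
```

Under that assumption, a shrink factor of 4 leaves 16 times fewer pixels per image than a factor of 1, which is why training gets so much faster.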

There you go! You can save this model. Another page should open, offering to start the training procedure. This might take a while, so make sure that your GPU is correctly set up. You can read this interesting topic if you need advice.

Model ready to be deployed. Image by the author

If you see the "Deploy" button in bold red, it means that you successfully trained your classifier, congrats! Of course, you might be curious to know how well your training went. Just scroll down a bit and you should see something similar to the next picture. The "Training summary" button takes you to further details if you are interested.

Training and Validation loss. Image by the author

To get meaningful metrics, it is really important to always measure your loss on a validation dataset, to make sure you’re not overfitting. You might have specified a test or validation set yourself, but if you didn’t, don’t worry: Intelec will do it for you anyway.

Awesome, your model really learned something! We can see how much the loss decreased!

If you want to go further, you can deploy the model to predict one or several test images. You can also give it a task to predict an entire folder, which is really useful if you want to take part in the Kaggle competition, because almost 20,000 predictions have to be made. A nice video explains how to do all these kinds of things. If you’re curious, have a look!


Conclusion

What a journey! We learned a lot about how the medical and pharmaceutical industries develop new products and, more importantly, how deep learning can come to the rescue. Admittedly, we didn’t see the implementation details of our neural network’s architecture, but that is the purpose of the Intelec AI software: to make deep learning accessible to everyone. We just needed some understanding of the pandas library and a bit of Python experience, and we were able to tackle a difficult problem: a classification task over more than one thousand classes.

