
SAEMI: Size Analysis of Electron Microscopy Images

Part 3, Training a Segmentation Model for Electron Microscopy

Lawrence Fy Wang
Towards Data Science
16 min read · May 29, 2020


This article is part 3 in my series detailing the use and development of SAEMI, a web application I created to perform high throughput quantitative analysis of electron microscopy images. You can check out the app here and its github here. Also, check out part 1 here (where I talk about the motivation behind creating this app) and part 2 here (where I give a walk-through of how to use the app). In this article, I will talk about how I trained a deep learning model to segment electron microscopy (EM) images for quantitative analysis of nanoparticles.

Choosing a Dataset

To train my model, I used a dataset from NFFA-EUROPE containing over 20,000 Scanning Electron Microscopy (SEM) images acquired at CNR-IOM in Trieste, Italy. This was one of the only large databases of EM images I could access for free; as you can imagine, most EM images are subject to strict copyright restrictions. The database is separated into 10 categories (Biological, Fibres, Films_coated_Surface, MEMS_devices_and_electrodes, Nanowires, Particles, Patterned_surface, Porous_Sponge, Powder, Tips). For my training, however, I limited the training images to the Particles category, which consisted of just under 4000 images.

One of the main reasons for this is that for almost all of the other categories, there was no way to obtain a useful size distribution from the image. For example, consider the EM image of fibres shown in Figure 1a. After segmenting the image, I could calculate the size of each fibre within it, but you can also clearly see that the fibres extend past the edges of the image. Any sizes I calculate are therefore limited to what is visible in the EM image, and I cannot determine how long the fibres actually are from the image alone.

Fig. 1 a) EM image of fibres from dataset b) EM image of particles from dataset. source: CNR-IOM (CC-BY)

Compare that to the EM image of particles in Figure 1b, where the sizes shown in the image clearly correspond to entire particles. Not only that, but images in this category tended to have the least occlusion, which made labeling and training much easier.

Within the Particles category, all the images were 768 pixels tall and 1024 pixels wide. Most of the particles were roughly circular in shape, with some images featuring hundreds of particles and others featuring only one. The sizes of the particles in pixels also varied considerably due to differences in magnification, with scales ranging from 1 micron to 10 nanometers. An example of some of the particle images in the dataset is shown in Figure 2 below:

Fig. 2 4 EM images of particles found in the dataset. source: CNR-IOM (CC-BY)

Removing the Banners

To begin the training process, the raw images first had to be preprocessed. For the most part, this meant removing the banners that contained image metadata while retaining as much useful image data as possible. To accomplish this, I used a technique called “reflection padding”: essentially, I replaced the banner region with reflections of the image regions directly above and below it. An example of this is shown in Figure 3, with the reflected regions highlighted in red.

Fig. 3 Example of reflection padding. source: CNR-IOM (CC-BY)

In order to perform this transformation, however, I first had to detect the banner region using OpenCV. To start, I created a binary mask of the image in which all pixel values above 250 become 255 and all pixels at or below the threshold become 0. The code used to perform this is shown below, along with the result in Figure 4.

import cv2

# convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# create a binary mask based on a threshold of 250
ret, thresh = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY)
Fig. 4 a) The original image b) the image after binarization

While thresholding was already fairly effective at finding the banner region, it still picked up some elements of the image that were not part of the banner. To make sure only the banner region was detected and nothing else remained, I used erosion and dilation on the thresholded image to find the vertical and horizontal lines within it. Erosion is an image processing technique in which a kernel of a given size is scanned across an image; as the kernel travels along the image, the pixel at its center is replaced by the minimum pixel value found under the kernel. Dilation is the same operation except that the maximum pixel value is used instead.
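As a quick illustration of these two operations (a toy example of my own, not part of the preprocessing pipeline), here is what erosion followed by dilation does to a small binary array with a 3x3 kernel:

import cv2
import numpy as np

# toy 5x5 binary image with a single 3x3 block of foreground pixels
toy = np.zeros((5, 5), dtype=np.uint8)
toy[1:4, 1:4] = 255

kernel = np.ones((3, 3), dtype=np.uint8)

# erosion: each pixel becomes the minimum value under the kernel,
# so the 3x3 block shrinks down to its single center pixel
eroded = cv2.erode(toy, kernel, iterations=1)

# dilation: each pixel becomes the maximum value under the kernel,
# so the surviving center pixel grows back into a 3x3 block
dilated = cv2.dilate(eroded, kernel, iterations=1)

print(eroded)
print(dilated)

Applied with tall or wide rectangular kernels, the same idea keeps only structures that are at least as tall or as wide as the kernel, which is exactly how the banner lines are isolated below.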

The code used to detect vertical and horizontal lines is shown below, along with the results in Figure 5.

# create a tall, thin kernel
vertical_kernel = cv2.getStructuringElement(
    cv2.MORPH_RECT,
    (1, 13)
)
# create a short, wide kernel
horizontal_kernel = cv2.getStructuringElement(
    cv2.MORPH_RECT,
    (13, 1)
)
# detect vertical lines in the image
img_v = cv2.erode(thresh, vertical_kernel, iterations = 3)
img_v = cv2.dilate(img_v, vertical_kernel, iterations = 3)
# detect horizontal lines in the image
img_h = cv2.erode(thresh, horizontal_kernel, iterations = 3)
img_h = cv2.dilate(img_h, horizontal_kernel, iterations = 3)
Fig. 5 a) Vertical erosion/dilation b) Horizontal erosion/dilation

I then added the two results together and performed a final erosion and binarization on the inverted array. The code and result are shown in Figure 6 below.

# add the vertically eroded and horizontally eroded images
img_add = cv2.addWeighted(img_v, 0.5, img_h, 0.5, 0.0)
# small square kernel for the final erosion (3x3 assumed here)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
# perform the final erosion and binarization on the inverted array
img_final = cv2.erode(~img_add, kernel, iterations = 3)
ret, img_final = cv2.threshold(
    img_final,
    128,
    255,
    cv2.THRESH_BINARY | cv2.THRESH_OTSU
)
Fig. 6 a) Adding the vertically eroded and horizontally eroded images b) performing final erosion and binarization

With the result shown in Fig. 6b, I could finally safely detect the banner and remove it. First, I found where the pixel values in the final result were 0 and stored the coordinates of the top left and bottom right corners in separate variables. I also calculated the height and width of the banner region.

import numpy as np

# coordinates of all pixels belonging to the banner (value 0 in the final mask)
banner = np.argwhere(img_final == 0)
# coordinates of the top left corner
banner_x1, banner_y1 = banner[0, 1], banner[0, 0]
# coordinates of the bottom right corner
banner_x2, banner_y2 = banner[-1, 1], banner[-1, 0]
# calculate width and height of banner
banner_width = banner_x2 - banner_x1
banner_height = banner_y2 - banner_y1

Next, I used these coordinates and the height/width of the banner to find its “reflections” along its upper and lower edges.

# finding the reflection below the banner
bot_reflect = img[
    banner_y2:banner_y2 + banner_height // 2,
    banner_x1:banner_x2,
    :
]
bot_reflect = np.flipud(bot_reflect)
# finding the reflection above the banner
top_reflect = img[
    banner_y1 - (banner_height - len(bot_reflect)):banner_y1,
    banner_x1:banner_x2,
    :
]
top_reflect = np.flipud(top_reflect)

The regions I found are highlighted in green in Figure 7 below.

Fig. 7 Regions used for reflection padding in image (highlighted in green). source: CNR-IOM (CC-BY)

Finally, I concatenated the upper and lower reflections and replaced the banner region with the reflections.

reflect_pad = np.concatenate((top_reflect, bot_reflect), axis = 0)
imgcopy = img.copy()
imgcopy[banner_y1:banner_y2, banner_x1:banner_x2] = reflect_pad

The final result (imgcopy) is shown in Figure 8 below.

Fig. 8 Final result of reflective padding. source: CNR-IOM (CC-BY)

Ground Truth Labels

After removing the banner, I then created ground truth labels using a combination of OpenCV and ImageJ. To aid in creating these labels, I used the watershed algorithm as a first pass.

The watershed algorithm is a segmentation algorithm that treats the image as a topographical surface, where high intensity pixel values are hills and low intensity values are valleys. The idea behind the algorithm is that you fill the “valleys” with water and, as the water level rises, you build barriers along the edges where water from different valleys would merge. These barriers then represent the segmentation boundaries of your image.

While the watershed algorithm is very useful, it also tends to be very sensitive to noise. Oftentimes, you need to perform more preprocessing on the image such as thresholding and distance transforms to obtain a good segmentation. This makes it difficult to create a segmentation algorithm based on watershed that generalizes well to a variety of images. An example of applying the watershed algorithm (with some preprocessing steps) is shown in Figure 9 below.

import numpy as np
from scipy.ndimage import label
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

# remove noise using mean shift filtering
shifted = cv2.pyrMeanShiftFiltering(imgcopy, 21, 51)
# threshold the image
graycopy = cv2.cvtColor(shifted, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(
    graycopy,
    0,
    255,
    cv2.THRESH_BINARY | cv2.THRESH_OTSU
)

# apply a distance transform and find the local maxima
# (min_distance_bt_peaks is a tunable minimum distance between peaks, in pixels)
dt = cv2.distanceTransform(thresh, 2, 3)
localmax = peak_local_max(
    dt,
    indices = False,
    min_distance = min_distance_bt_peaks,
    labels = thresh
)
# seed markers from the local maxima and apply the watershed algorithm
markers, num_seg = label(localmax, structure = np.ones((3, 3)))
labels = watershed(-dt, markers, mask = thresh)
Fig. 9 Result of using the watershed algorithm

As can be seen above, the watershed algorithm alone is not adequate for creating ground truth labels. As a first-pass approximation, however, it did not perform too badly. I therefore used the code above to provide initial labels for as many images as I could, then went over the labels again with ImageJ to clean them up so they matched the images more accurately.

ImageJ is an image processing and analysis tool written in Java designed for scientific images. An example of the ImageJ menu is shown in Figure 10 below.

Fig. 10 ImageJ Menu

I used the five selection tools on the left side of the menu to select regions of the image with my mouse, then chose Edit → Selection → Create Mask to clean up the first-pass ground truth label produced by the watershed algorithm. In order to label as many images as I could, I limited the labeling effort: as long as enough particles in an image were labeled to give a good approximation of the size distribution, I considered the ground truth label acceptable. Some examples of this are shown in Figure 11.

Fig. 11 Inadequate ground truth labels. source: CNR-IOM (CC-BY)

At first, I labeled approximately 750 images using this method. However, I quickly found during training that labels of this quality did not train well, and I had to go over them a second time with ImageJ much more carefully. This time, I made sure that every particle in the image was labeled with as close to pixel-perfect accuracy as I could manage.

To ensure data quality, the ground truth labels were overlaid on their corresponding images and inspected by eye to determine how well they matched. Then, continuing to use ImageJ’s selection tools, I touched up the labels while inspecting the overlay at 200–300x magnification. An example of these new ground truth labels is shown in Figure 12.

Fig. 12 Final ground truth labels. source: CNR-IOM (CC-BY)

In total, approximately 250 images were fully labeled the second time around. These were put into separate directories, with one directory for the raw images (from now on referred to as path_img) and another for the ground truth labels (from now on referred to as path_lbl). Although this is a much smaller dataset than before, it gave much better performance when tracking the training loss, because the labels now matched the images far more accurately.

Creating a DataBunch

Once enough ground truth labels were created to form my training set, I could begin training my deep learning model. To do so, I used fastai, a robust library built on top of PyTorch that makes training deep learning models more intuitive and streamlined. The first step was to organize my training set in a way that fits the fastai framework, which requires putting the training data and ground truth labels into a DataBunch object. The code is shown below and explained in more detail in the following paragraphs.

# use smaller image sizes to deal with memory issues
size = (192, 256)
# batch size
bs = 24
# create DataBunch
data = (SegmentationItemList.from_folder(path_img)
        .split_by_rand_pct(0.2)
        .label_from_func(lambda x: path_lbl/f'{x.stem}.png')
        .transform(
            get_transforms(flip_vert=True, max_warp=None),
            tfm_y=True,
            size=size
        )
        .databunch(bs=bs)
        .normalize())

The main part of this code begins at the “create DataBunch” section, where the variable data becomes the DataBunch object. It starts by gathering the raw images from the path_img directory and splitting them into training and validation sets. This is done in the line .split_by_rand_pct(0.2), where 20% of the data is randomly selected as the validation set so that overfitting on the training set can be detected.

The corresponding ground truth label for each image (i.e. the y-value used in the loss function) is then gathered in .label_from_func. This applies a function to every item collected from path_img and puts all the outputs into a list. The function I applied, lambda x: path_lbl/f'{x.stem}.png', takes the image filename and looks for a matching filename (with a .png extension) in the path_lbl directory.

A series of image transformations is then applied to the training and validation sets as a form of data augmentation. The transforms I used include (but are not limited to) flipping the image along either axis, rotating it and zooming in on it. For a full list of the transforms I used, check here. Note that I also set tfm_y=True, meaning that whatever transform is applied to the raw image is also applied to the corresponding label. This is crucial when training an image segmentation model.

A resizing transformation was also applied to all of the images, ensuring they were all 192x256 pixels. Unfortunately, this was necessary because with larger images I would run out of CUDA memory during training. For a similar reason, the batch size used for training was only 24 images at a time.

Finally, the pixel values were normalized to prevent large outliers from affecting the training process.

The Dice Coefficient

The final step before training was to decide on a metric to determine how well the model is performing. One common metric for image segmentation models is the Dice coefficient. It is equivalent to the F1 score, although expressed slightly differently.

The Dice coefficient is calculated by the formula:

Dice = 2TP / (2TP + FP + FN)

where TP is the number of true positives, FP is the number of false positives and FN is the number of false negatives. In this case, a true positive is a positive (particle) pixel in the prediction that occurs in the same position in the ground truth label. For instance, suppose that for a 7x7 image, 11 ones appear in the same positions in both the prediction and the ground truth label (11 true positives), one extra one appears in the prediction that is not present in the ground truth label (1 false positive), and two ones in the ground truth label are not picked up by the prediction (2 false negatives). The Dice coefficient would then be:

Dice = (2 × 11) / (2 × 11 + 1 + 2) = 22 / 25 = 0.88
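To make the calculation concrete, here is a short sketch (my own illustration, not code from SAEMI) that computes the Dice coefficient for two binary masks with NumPy. The masks are made up, but chosen so that they reproduce the counts in the example above:

import numpy as np

def dice_coefficient(pred, truth):
    # Dice = 2*TP / (2*TP + FP + FN) for two binary masks
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # true positives
    fp = np.logical_and(pred, ~truth).sum()   # false positives
    fn = np.logical_and(~pred, truth).sum()   # false negatives
    return 2 * tp / (2 * tp + fp + fn)

# illustrative 7x7 masks with 11 true positives, 1 false positive, 2 false negatives
truth = np.zeros((7, 7), dtype=np.uint8)
truth[2:5, 1:5] = 1       # a 3x4 block of particle pixels
truth[5, 1] = 1           # plus one extra particle pixel

pred = np.zeros((7, 7), dtype=np.uint8)
pred[2:5, 1:5] = 1        # the prediction finds most of the block...
pred[4, 4] = 0            # ...but misses one pixel
pred[0, 0] = 1            # and adds one spurious pixel

print(dice_coefficient(pred, truth))  # 0.88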

Training — U-Net

Now that a DataBunch object was created and a metric chosen, I could finally begin training. A common architecture used for image segmentation tasks is the U-Net. A U-Net consists of two main parts: the down-sample path (a.k.a. the encoder) and the up-sample path (a.k.a. the decoder). It has many of the features of an auto-encoder, but with the notable addition of skip connections between the encoding and decoding sections. I will explain the skip connections later, but first let’s discuss its auto-encoder-like structure. Figure 13 below shows the U-Net architecture taken from the original U-Net paper, which you can find here.

Fig. 13 U-Net architecture. Source: https://arxiv.org/abs/1505.04597

The left half of the network is a standard CNN with convolutional layers followed by ReLU activations and max pool layers. This is the classic CNN often used for image classification problems, where the objective is to take an input of image pixel values and output a single classification. In image segmentation, however, the goal is to classify every pixel within the image, so the final tensor should have the same size and shape as the original image (in our case, 192x256). This is done using up-convolutional layers in the up-sample portion of the architecture.

You can think of up-convolution (also known as transposed convolution) as convolution with extra zero padding inserted between the pixel values of the input. An example of this is shown in Figure 14.

Fig. 14 a) Input for up-sampling b) Input padded with zeros between pixel values c) Output matrix after convolution with a 3x3 kernel

Note that after this padding, performing a convolution results in an output with larger spatial dimensions than the input. This is what allows the final output to have the same size and shape as the original input.
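As a quick sanity check on the shapes, here is a sketch using PyTorch’s ConvTranspose2d directly (just an illustration, not how fastai builds its up-sample layers). With a kernel size of 2 and a stride of 2, the transposed convolution exactly doubles the spatial dimensions of its input:

import torch
import torch.nn as nn

# a 2x up-convolution like the ones in the U-Net up-sample path
upconv = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=2, stride=2)

x = torch.randn(1, 64, 24, 32)   # (batch, channels, height, width)
y = upconv(x)
print(y.shape)                   # torch.Size([1, 32, 48, 64])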

Now let’s talk about the skip connections. Going back to Figure 13, note the grey arrows that connect layers in the down-sample path to layers in the up-sample path. These are called skip connections, and they concatenate the output of a down-sample layer onto the corresponding up-sample layer. The purpose of these skip connections is to transfer the features and spatial information learned in the down-sample path to the up-sample path. This allows the up-sample path to remember the “where” as well as the “what” when performing segmentation.
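In code, a skip connection amounts to a channel-wise concatenation of an encoder feature map with the up-sampled decoder feature map before the next convolution. Below is a minimal PyTorch sketch of one such decoder block (an illustration of the idea, not fastai’s actual implementation):

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    # one decoder step: up-sample, concatenate the skip connection, convolve
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # up-sample the decoder features
        x = torch.cat([x, skip], dim=1)  # skip connection: concatenate encoder features
        return self.conv(x)

# decoder features (smaller, deeper) and matching encoder features (larger, shallower)
decoder_feat = torch.randn(1, 128, 24, 32)
encoder_feat = torch.randn(1, 64, 48, 64)
out = UpBlock(in_ch=128, skip_ch=64, out_ch=64)(decoder_feat, encoder_feat)
print(out.shape)   # torch.Size([1, 64, 48, 64])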

Fortunately, the fastai library provides a function to create this entire architecture in one single line.

learn = unet_learner(data, models.resnet18, metrics=dice)

For my training, I used a resnet18 CNN as the down-sample path. The data argument is the DataBunch we created earlier, and the metric used to evaluate the model is the Dice coefficient discussed above.

The final output after passing through the entire U-Net is a binary mask in which each pixel is classified as either background (0) or particle (1).
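In practice, for a two-class (background vs. particle) setup like this one, the network produces two scores per pixel and the binary mask comes from taking the argmax over the class dimension. A small sketch with a dummy output tensor (my illustration, not the exact fastai internals):

import torch

# dummy per-pixel class scores: (batch, classes, height, width),
# with class 0 = background and class 1 = particle
logits = torch.randn(1, 2, 192, 256)

# the predicted mask is the class with the highest score at each pixel
mask = logits.argmax(dim=1)       # shape (1, 192, 256), values 0 or 1
print(mask.shape, mask.unique())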

Training — Hyperparameter Tuning

When training a model, you invariably go through a lot of hyperparameter tuning. One of the most important hyperparameters to get right is the learning rate. Fortunately, fastai provides several tools that make choosing a learning rate much easier.

After creating the U-Net, I performed a mock training in which the loss was calculated over a range of learning rates from 1e-7 to 10. This was accomplished using the learning rate finder from the fastai library. The result is plotted in Figure 15 below.

lr_find(learn)
learn.recorder.plot()
Fig. 15 Plot of the learning rate finder

When choosing a learning rate, you generally want a value just before the plot reaches its minimum. Keep in mind, however, that this is just a starting point and you may still need to adjust the learning rate as training progresses; for instance, if your loss begins increasing after subsequent iterations, you may need to choose a lower learning rate. The final learning rate used to train the model was a cyclical learning rate between 2e-5 and 5e-5.

Another hyperparameter to consider is the degree to which you freeze or unfreeze your layer groups. When creating a U-Net through fastai, the encoder weights come already pretrained on the ImageNet database. The idea behind this is that since many images share common features, weights learned on one dataset can often be transferred to another dataset with fairly good results. This allows you to train a relatively good model with a smaller number of images. For my training, I froze the first two of the three layer groups.

learn.unfreeze()
learn.freeze_to(2)

Finally, the last hyperparameters to determine were the number of epochs and the weight decay. After multiple tests and experiments, the best values for these hyperparameters were found to be 50 epochs and a weight decay of 0.1.

learn.fit_one_cycle(50, slice(2e-5, 5e-5), wd=1e-1)

A graph of the training and validation loss during training is shown in Figure 16.

Fig. 16 Training and validation loss during training

Evaluation

The final step was to evaluate the effectiveness of the model. After training, a third, held-out test set was created that was not used during training at all. The Dice coefficient calculated from the predictions and ground truth labels on this test set was 97.5%.

In addition, the final predictions were overlaid on their raw image counterparts and inspected by eye to determine how well the model performed on a given EM image. An example of this is shown in Figure 17 below:

Fig. 17 The prediction overlaid against the raw image. Screenshot taken from SAEMI

While the model does a good job of predicting where the particles are, it has difficulty separating particles that are very close together in an image. Currently, the app allows you to perform some post-processing to work around this issue, but in the future I would like to improve the model by training an instance segmentation model on a more balanced dataset with a larger number of training images.

Summary

A training set of SEM images taken from NFFA-EUROPE was preprocessed to remove metadata banners while retaining useful image data. Ground truth labels were made using a combination of OpenCV and ImageJ. Training was done with the fastai library using a U-Net architecture and the Dice metric. The final Dice score, determined on a separate test set, was 97.5%.

If you want to know more about the motivation behind this training please check out part 1 in my series here. If you want to understand how to use the app I created which deploys this model, please check out part 2 in my series here. Thank you so much for reading this article and I hope you were able to get a lot of valuable information from it!
