Harnessing the power of transfer learning for medical image classification

Ryan Burke
Towards Data Science
20 min read · Nov 26, 2020


image source — https://unsplash.com/@bacila_vlad

This is my first Medium article! (Yay!!!) The project is based on my capstone project for my master’s in data science. I had a lot of fun and I learned so much. I would love to hear your feedback on this project if you have some free time. You can find the complete project here: https://github.com/ryancburke/blindness_detection

1. Introduction

A quick PubMed search for artificial intelligence revealed more than 15,000 scientific papers published last year. The field has evolved rapidly because of the significant contributions it makes at both the global and the individual scale. At the global level, AI is being used to enhance the energy, agriculture, and education sectors, to name a few. At the individual level, AI is woven into most of our daily lives, whether it is a spam filter for our email or facial/fingerprint recognition on our phones. The medical sector is also experiencing a tremendous expansion of AI, from wearable devices tracking our health in real time to governmental policies being shaped by the predictive power of modern computer algorithms. Another powerful application of AI in medicine is digital pathology, within which computer vision is the branch dedicated to teaching computers to “see”. While we may take the ability to see for granted as humans, teaching computers to “see” is a significant challenge and will be the focus of this article. Since this is a project focusing on computer vision, it seems fitting that the data used involve predicting pathological changes in vision associated with diabetes.

According to the Mayo Clinic, diabetic retinopathy (DR) is a condition affecting patients with either type I or type II diabetes (figure 1). When blood glucose levels remain chronically elevated, damage occurs to the small blood vessels nourishing the eye. One of the first consequences is that the vessel walls begin to weaken, leading to small bulges known as aneurysms. As the disease progresses, these bulges rupture (hemorrhage), leaking blood into the eye. In later stages, the eye begins to grow new vessels to compensate for the reduced blood flow; however, these abnormal vessels are fragile and, together with the breakdown of local nerve fibers, leak lipids and proteins into the area (cotton wool spots and hard exudates). Finally, in the most severe stages, scar tissue develops and may cause the retina to detach from the eye. Ultimately, this leads to significant damage to the optic nerve fibers and blindness.

Figure 1 — Examining the consequences of DR. Left: image of a healthy retina. Right: image of pathological consequences resulting from diabetic retinopathy (Shutterstock — Image ID: 328135913).

This project had two main goals. The first was to build a convolutional neural network (CNN) from the ground up, using a fairly elaborate approach that tested the effect of modulating various parameters on classification accuracy for each layer added to the CNN architecture. The second goal was to demonstrate the power of transfer learning in the context of computer vision tasks. Transfer learning is a valuable tool in deep learning that gives individuals access to the knowledge learned by very complex models trained on millions of labeled images. Training such models can take weeks, even with multiple GPUs, making them impractical for the average individual to build from scratch.

2. Methods

2.1 Environment

All scripts were written in Python 3 in a Jupyter notebook hosted on the Kaggle website. Kaggle provided access to an NVIDIA Tesla K80 GPU, which increased training speed by 12.5x according to the specifications on the website.

2.2 Data set

The dataset (https://www.kaggle.com/c/aptos2019-blindness-detection) consisted of 3,662 labeled, high-resolution color images for the training set and 1,928 unlabeled images for the test set. Images were classified into 5 groups according to the severity of DR present. Label 0 represents the control group, and labels 1–4 represent mild, moderate, severe, and proliferative DR, respectively. A visual summary of the distribution of diagnoses is found below (figure 2a). A clear imbalance of group sizes is evident, with more than 1,800 images representing the control group (label = 0) and fewer than 300 in the most severe category (label = 4). Although this imbalance is expected in real-world data, it poses a problem for many machine learning models. In addition to the unbalanced classes, images within the dataset differ in size (figure 2b).

Figure 2 — Left a) Distribution of diagnoses in the dataset. 0 = no DR (n = 1805), 1 = mild DR (n = 370), 2 = moderate DR (n = 999), 3 = severe DR (n = 193), 4 = proliferative DR (n = 295). Right b) Random sample of images present in the training set highlighting the different shapes present. (image by author)

2.3 Image preprocessing

2.3.1 Resizing

When defining the model architecture, which will be explained in detail in a subsequent section, one of the requirements is to define a fixed input shape. When performing this task, it is important to keep in mind that there is a trade-off between computation speed and information loss. When you reduce the size of an image, you remove information (pixels). Less information means faster training times; however, it may also mean reduced overall accuracy. To be compatible with the pretrained models used for transfer learning, an image size of 224 x 224 was selected.

2.3.2 Data augmentation

Unbalanced group sizes are a common problem in machine learning. When training our models, the goal is to see an improvement in accuracy over successive iterations (epochs). Since the model learns by finding patterns that distinguish groups from one another, underrepresented groups are seen less often and are therefore not learned as well as their overrepresented counterparts. To mitigate the consequences of over/underrepresentation, data augmentation is used. By adjusting specific parameters, random changes are applied to the original training images. These random alterations are applied each epoch, meaning the model trains on “different” images every iteration. The specific protocol used for this project was selected from the complete list found at https://keras.io/preprocessing/image/. In summary, a range was selected for random rotations (up to 30°) and zoom (up to 10%), and horizontal and vertical flips were also set to true (figure 3); a code sketch of this setup follows the figure.

Figure 3 — Left: an original image from the dataset. The four smaller images are visualization examples of the data augmentation protocols used. Top left — horizontal flip, top right — rotation, bottom left — vertical flip, bottom right — zoom.
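The augmentation protocol above can be set up with Keras’ ImageDataGenerator. The sketch below is a minimal illustration, assuming a TensorFlow 2.x Keras environment; the directory layout, batch size, and validation split are placeholders rather than the project’s exact setup (the APTOS labels actually come from a CSV file, for which flow_from_dataframe would be used instead).

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    rotation_range=30,       # random rotations up to 30 degrees
    zoom_range=0.1,          # random zoom up to 10%
    horizontal_flip=True,
    vertical_flip=True,
    validation_split=0.2,    # hold out a fraction of the labeled data for validation
)

train_generator = train_datagen.flow_from_directory(
    "train_images/",         # hypothetical folder of class-labeled subfolders
    target_size=(224, 224),  # also handles the resizing described in section 2.3.1
    batch_size=32,
    class_mode="categorical",
    subset="training",
)

val_generator = train_datagen.flow_from_directory(
    "train_images/",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
    subset="validation",
)
```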

2.4 Building a convolutional neural network (CNN)

This section provides an overview of how CNNs learn and describes how I approached building a model from scratch, including the key parameters selected and manipulated within the CNN architecture to maximize prediction accuracy.

2.4.1 How a CNN learns

A CNN is composed of several layers chained together to generate predictions from input data. To visualize how this works, refer to figure 4 below. Images are stored as a 4D tensor (figure 4a) with the shape (samples, height, width, depth), where depth refers to the number of channels. A sliding window (figure 4b) then moves across each 3D feature map (image), stopping at every possible location to extract a patch of features. Each patch is transformed by a convolution kernel (a learned weight matrix) into a 1D vector, and all vectors are then reassembled into a 3D output map. Information learned in this layer is fed into the next layer, and the process repeats until the final convolution layer, which connects to a fully connected (dense) layer where predictions are made. The above process repeats on small batches of input data until the entire training set has been seen; one full pass is referred to as an iteration (epoch) (figure 4c). Next, a loss function computes a loss score by comparing predictions to target responses. This score is used by an optimizer to update the weights through a specified variation of stochastic gradient descent so as to minimize the loss score.

Figure 4a-c. Visualizing how a convolutional neural network learns. a) Left: Image data is stored as a 4D tensor with the shape of (samples, height, width, depth). Depth refers to the number of channels (1 for greyscale, 3 for RGB). b) Center: a sliding window (black square) moves across the feature map extracting information from every possible location. c) Right: Information learned from each convolution layer propagates forward until a loss function compares predicted and target responses which gives a loss score. This score is used by the optimizer to update weights. (image by author)

2.4.2 Layers, filters, optimizers, activation functions

2.4.2.1 Layers

A typical CNN model contains a convolutional base and a densely connected classification/prediction layer. Although there are no concrete rules for building the best CNN architecture for every task, in general the base consists of alternating convolution and max pooling layers. Max pooling is an aggressive down-sampling method used to reduce the number of feature-map coefficients and to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows (in terms of the fraction of the original input they cover). Whereas the sliding window used in convolution is generally 3 x 3 or 5 x 5, the window size in a max pooling layer is typically 2 x 2. To understand the consequence of these window sizes and how they are implemented, consider the following (figure 5). The original input images have the shape (224, 224, 3). Each convolution layer reduces the image size by 2 in both dimensions while the depth value also changes. In contrast, each max pooling layer reduces both dimensions by a factor of 2 (figure 5a). How does the max pooling layer reduce the feature map more than a convolution when the former has a 2 x 2 window and the latter a 3 x 3 window? This occurs because of a parameter called stride. In a convolution layer, the stride is generally set to 1, meaning the window stops at every location where it fits entirely within the feature map. In a 6 x 6 feature map (figure 5b) there are 16 possible locations from which a 3 x 3 window can extract information. Conversely, max pooling uses a stride equal to the window size; for a 6 x 6 feature map and a 2 x 2 window, there are 9 possible locations for the window (figure 5c). In sum, the convolution operation results in a 4 x 4 output feature map (a reduction of 2 in each dimension), whereas max pooling results in a 3 x 3 output feature map (a reduction by a factor of 2).

Figure 5a-c. How the feature space is altered by convolution and max pooling. a) Left: Reduction in feature space. A reduction in feature space is observed following each convolution and max pooling layer. b) Top right: A sliding window (black square) extracts information from every possible location while maintaining its borders within the feature map space. c) Bottom right: Max pooling uses a stride equal to the window size resulting in down-sampling by a factor of 2 (represented by different colors). (image by author)
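As a concrete illustration of this shape arithmetic (not the project’s actual architecture), the short Keras sketch below stacks two convolution/max-pooling pairs; model.summary() prints the output shape of each layer, matching the reductions described above.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 3 x 3 convolution, stride 1, no padding: trims 2 pixels from each spatial dimension
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(224, 224, 3)),  # -> (222, 222, 16)
    layers.MaxPooling2D((2, 2)),  # 2 x 2 window, stride 2: halves each dimension -> (111, 111, 16)
    layers.Conv2D(32, (3, 3), activation="relu"),                             # -> (109, 109, 32)
    layers.MaxPooling2D((2, 2)),                                              # -> (54, 54, 32)
])
model.summary()  # prints the output shape and parameter count of every layer
```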

2.4.2.2 Filters

In the previous section, I mentioned that the depth dimension changes in addition to the width and height dimensions. While depth refers to the channels in the input layer, after the initial convolution it refers to a parameter called filters. A filter represents a learned pattern (weights having a spatial relationship). To visualize examples of these filters, refer to figure 6. The leftmost image is an example from the dataset, the center grid shows one of the learned filters, and the feature map (right) is the result of applying that filter to the image (or to another feature map for deeper layers). Referring back to figure 5a, a shape of (222, 222, 16) indicates that we requested 16 filters in that layer. Typical values for the first convolution layer are 16 or 32, with each subsequent layer increasing by a factor of 2. Deeper layers have greater numbers of filters because of the increasing complexity of the features being extracted. Filters in the first layer extract information like dots, edges, or corners, while filters in subsequent layers combine patterns from previous layers to create larger, more complex patterns. To put this into perspective, imagine the information a 3 x 3 grid can extract from a 222 x 222 image. As the image size is reduced through convolutions and max pooling, the fraction of the image seen by the 3 x 3 grid increases, thereby increasing the complexity of the information extracted.

Figure 6 — Visualizing CNN filter and feature map. A filter is a 2-dimensional grid of learned weights that have a spatial relationship. The central 3 x 3 grid was extracted from the original image (left). The feature map (right) represents the result of applying the filter to the input image (or another feature map). Lighter pixels indicate greater weight.
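For readers who want to reproduce a figure like the one above, the snippet below is a minimal sketch of pulling a learned filter out of a trained network; it assumes `model` is a Keras Sequential model whose first layer is a Conv2D layer (as in the sketches in this article).

```python
import matplotlib.pyplot as plt

# Conv2D weights have shape (kernel_height, kernel_width, input_channels, n_filters),
# e.g. (3, 3, 3, 16) for 16 filters applied to an RGB input.
filters, biases = model.layers[0].get_weights()

first_filter = filters[:, :, 0, 0]     # 3 x 3 weights of the first filter, first input channel

plt.imshow(first_filter, cmap="gray")  # lighter pixels correspond to larger weights
plt.title("First-layer filter (channel 0)")
plt.show()
```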

2.4.2.3 Activation function

One of the reasons deep neural networks can learn such complex information is the inclusion of non-linear activation functions. Activation functions define the information that propagates from one layer to the next. Whereas linear functions are easier to train, they lack the ability to learn complex mapping functions; for this reason, linear activations are used only in the output layer for specific problems such as regression. Traditionally, the sigmoid and hyperbolic tangent (tanh) functions were widely adopted, with the latter providing easier training and better predictive performance (Nwankpa, Ijomah, Gachagan, & Marshall, 2018). In more recent years, these have been replaced in hidden layers by other functions that have demonstrated better performance, and the sigmoid and tanh are now mostly restricted to the output layer, where they are used for binary classification problems. Using a base model, the following activation functions were compared to determine which would be used when fine-tuning the model: elu, relu, selu, softmax, softplus, and softsign. For greater detail on each of these activation functions, refer to the activations section of the Keras webpage (https://keras.io/activations/).

2.4.2.4 Optimizers and loss functions

For the first epoch, we have no information about which weights would minimize the error; therefore, they are randomly initialized. Thus, for the initial output (prediction), we expect a large error (loss score). This loss score is calculated using our loss function. Since this project is a multi-class, single-label classification problem, categorical cross-entropy was the appropriate loss function. The loss score provided by the loss function is then used by the optimizer to update the weights for subsequent epochs. Using a base model, the following optimizers were compared to determine which would be used when fine-tuning the model: adadelta, adagrad, adam, adamax, nadam, RMSprop, and SGD. For greater detail on each of these optimizers, refer to the optimizers section of the Keras webpage (https://keras.io/optimizers/). As an overview, each of these optimizers is a variation of the gradient descent algorithm. In gradient descent, the gradient gives the direction of steepest change in the loss, and the weights are stepped in the opposite (downhill) direction. Following the steepest descent direction is important for minimizing the number of iterations required to reach the lowest loss score (the minimum). In principle this sounds simple, yet many factors must be considered (figure 7). First, the size of the step (learning rate) must be chosen. A very small learning rate will require many iterations to reach the minimum, which will ultimately slow the model down. Conversely, a learning rate that is too large may overshoot and never find the minimum. But what happens when multiple minima are present? This situation is much more complex: if the learning rate is too small, the model may get stuck in a local minimum.

Figure 7 — Optimizing loss function using gradient descent. Using variations of the gradient descent algorithm, optimizers attempt to find the most efficient path toward the minimum loss function. Care must be taken to select the appropriate learning rate because if too small it may take a long time to reach the minimum loss (left) or in more complex tasks (right) may get stuck in a local minimum. Conversely, if too high (center) the model may never find the minimum loss function. (image by author)

Optimizers use two main tools to overcome this issue, although their approaches may differ. First, a parameter called momentum can be applied. Classical momentum adds a velocity component that grows with larger drops in the loss score; as an analogy, we can imagine pushing a ball down a hill to help prevent it from getting stuck in a local minimum. An extension of this is Nesterov momentum: whereas classical momentum corrects its velocity using the gradient at the current location before taking its next step, Nesterov momentum takes an intermediate step in the velocity direction and then makes a correction based on that future location. The second tool is to adjust the learning rate. This parameter can be set manually, or, for adaptive optimizers, the learning rate and momentum can be continually updated based on network performance.
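As a brief illustration, the two tools look like this with Keras optimizer objects (a TensorFlow 2.x Keras environment is assumed; the values shown are illustrative defaults, not the settings used in this project):

```python
from tensorflow.keras.optimizers import SGD, Adam

# Manual learning rate with classical momentum and the Nesterov correction.
sgd_nesterov = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# An adaptive optimizer that maintains per-parameter learning rates automatically.
adam = Adam(learning_rate=0.001)
```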

2.4.2.5 Fine-tuning top performing model

Kernel size

When we discussed the role of the sliding window for convolution, it was stated that 3 x 3 and 5 x 5 were the most typical sizes; 7 x 7 windows are also used, albeit rarely. After building my top-performing model, I tested each of these kernel sizes to determine which would produce the highest-performing model.

Dropout

When building a deep learning network, one common guideline is to continue adding layers until the model overfits. Overfitting refers to a model displaying a significant performance discrepancy between the training and validation sets. It occurs because the model is beginning to memorize the training set, which means it will not generalize to data it has not previously encountered. Dropout is commonly used to reduce overfitting. As the name suggests, during training a random subset of neurons is dropped (figure 8). The result is akin to training many neural networks with different architectures in parallel, because each update is performed with a different set of randomly dropped neurons. Typical dropout layers use rates ranging from 10–50%. For this project, 10%, 30%, and 50% dropout were explored during the fine-tuning process to determine the effect on network performance.

Figure 8 — Dropout as a tool for regularization. A visual depiction of a 50% dropout layer on a 4 x 4 grid. (image by author)
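As a minimal sketch (the feature-map shape, dense width, and the 50% rate shown here are illustrative), a dropout layer is simply inserted between the flattened convolutional features and the classifier:

```python
from tensorflow.keras import layers, models

classifier_head = models.Sequential([
    layers.Flatten(input_shape=(12, 12, 128)),  # illustrative feature-map shape from the base
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                        # randomly zero 50% of the units at each update
    layers.Dense(5, activation="softmax"),      # 5 DR severity classes
])
```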

Batch normalization

One reason training deep neural networks can be challenging is that the inputs to a layer can change after each batch as the weights of earlier layers are updated. This can be problematic for very deep models, resulting in an increased number of required iterations and reduced performance; it is referred to as internal covariate shift. Batch normalization is a method that standardizes the inputs to a layer for each batch. The result is an acceleration in training and, in some cases, a reduction in error. This technique was evaluated for its effect on performance during the fine-tuning of my model.
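The sketch below shows one common placement of a batch normalization layer, inside a convolution block between the convolution and the pooling layer; the exact placement used in the project is not specified in the text, so treat this as an assumption.

```python
from tensorflow.keras import layers, models

block = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(111, 111, 16)),
    layers.BatchNormalization(),   # standardize the activations over each mini-batch
    layers.MaxPooling2D((2, 2)),
])
```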

2.5 Transfer learning

A highly effective approach to performing computer vision tasks on small datasets is to employ a technique known as transfer learning. Here, a network that has been pre-trained on a very large image dataset is saved and reused. ImageNet is one such database (http://image-net.org/about-overview), providing access to tens of millions of hand-labeled images for computer vision tasks. Each year ImageNet hosts a challenge meant to advance computer vision by promoting research into state-of-the-art algorithms. The top models each year are highly complex architectures ranging from several dozen to several hundred layers, and without access to vast amounts of computing power, training them on a personal CPU is impractical. Because these models have been trained on such large datasets, the spatial hierarchy of features learned by the pretrained model can act as a generic model of the visual world. There are two common ways to harness the power of transfer learning: feature extraction and fine-tuning.

2.5.1 Feature extraction

Feature extraction refers to taking the convolutional base of a pretrained network and feeding its output into a new classifier (dense layer) trained from scratch. The dense layer of the pretrained model is not reused because it is specific to the task on which the model was trained. The convolutional base, however, contains useful maps of generic concepts that can transfer between tasks even if they are unrelated. The convolutional bases from the Visual Geometry Group network (VGG16), ResNet50, and Xception were evaluated to determine which resulted in maximum accuracy.

2.5.2 Fine-tuning

As previously discussed, early layers in the convolutional base tend to extract less abstract concepts such as edges and dots, whereas higher layers tend to extract features that are more task-specific. Fine-tuning a pre-trained model involves unfreezing a few of the top layers while leaving the rest of the base frozen. Doing so may improve the model’s performance by adjusting these abstract layers to be more relevant to the problem at hand. Since ImageNet consists mostly of everyday items such as animals, people, and cars, fine-tuning was hypothesized to be a valuable tool for improving model accuracy. For each of the three pretrained models tested, the effect of unfreezing the final two convolution blocks on performance was evaluated.

2.5.3 Classification layer

Two methods were evaluated for the classification layer added after the pretrained base. The first involved adding a densely connected (dense) layer that fed into the classifier; in addition, the effects of batch normalization and dropout regularization were evaluated. The second method used a global average pooling layer plus a dropout layer instead of the dense layer. Global average pooling is similar to max pooling, but it instead reduces each feature map from the previous layer to its average value. A sketch of the two heads is shown below.
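The sketch below illustrates the two heads attached to a frozen pretrained base; `conv_base` stands in for any of the Keras application models (e.g. VGG16 without its top layers), and the dense width and dropout rates are illustrative assumptions.

```python
from tensorflow.keras import layers, models

# Method 1: dense layer -> dropout -> classifier
dense_head_model = models.Sequential([
    conv_base,                          # frozen pretrained convolutional base
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])

# Method 2: global average pooling -> dropout -> classifier
gap_head_model = models.Sequential([
    conv_base,
    layers.GlobalAveragePooling2D(),    # average each feature map down to a single value
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])
```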

3. Results

3.1 Building a CNN from scratch

3.1.1 Selecting activation and optimizing functions

The first step in building my CNN from scratch was to determine which optimizers and activation functions performed best on the task at hand. Each of the seven optimizers was tested in combination with each of the six activation functions used for the classification layer. The 42 resulting networks were tested on a base model consisting of 3 alternating convolution and max pooling layers, with the number of convolution filters increasing (16, 32, 64) with each layer. Output from the final layer fed into a dense layer and subsequently into the classification layer. The activation function for all hidden layers remained constant (relu). Refer to table 1 below for the results of each trial; a sketch of the training loop follows the table. In summary, the softmax activation function outperformed the rest, while the optimizers adam and adamax outperformed the others. Despite adamax performing marginally better than adam, I opted to continue with adam because of its prevalence in the literature, which makes the results easier to compare with other models.

Table 1 — Results following training on a base model after 20 epochs prior to data augmentation. A = accuracy, V = validation accuracy. Activation functions listed refer to the classification layer only. All hidden layers used relu for consistency. A white square was used to highlight the activation function and optimizer that will be used for further testing. (image by author)
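The exact training script is not reproduced here, but the loop below sketches how the grid of trials described above might be set up; `build_base_model` is a hypothetical helper returning the 3-layer base CNN with the requested classification activation, and `train_generator`/`val_generator` are the data generators from section 2.3.2. A TensorFlow 2.x Keras environment is assumed.

```python
results = {}
activations = ["elu", "relu", "selu", "softmax", "softplus", "softsign"]
optimizer_names = ["adadelta", "adagrad", "adam", "adamax", "nadam", "rmsprop", "sgd"]

for activation in activations:
    for optimizer in optimizer_names:
        model = build_base_model(output_activation=activation)  # hypothetical helper
        model.compile(
            optimizer=optimizer,                # Keras accepts optimizer names as strings
            loss="categorical_crossentropy",    # multi-class, single-label problem
            metrics=["accuracy"],
        )
        history = model.fit(train_generator, validation_data=val_generator, epochs=20)
        results[(activation, optimizer)] = max(history.history["val_accuracy"])
```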

3.1.2 Determining optimal number of convolutional layers

After selecting the top optimizer and activation function, the next step was to determine how many layers to include in the convolutional base. Results from models containing 3 to 6 convolution layers are presented below (Table 2). Despite all performing very similarly, the model with four layers appears to present an advantage. Note that in the plots within the table, at the end of the 4-layer model’s trial, both the accuracy and loss curves suggest that greater performance would be expected with additional epochs. This is in contrast to the other three models, which seem to be plateauing toward the end of their respective trials.

Table 2 — Determining the optimal number of convolution layers. After trials using 3 to 6 layers for the convolutional base, the 4-layer model appears to suggest increased performance with additional epochs. A = accuracy, L = loss, V = validation accuracy, VL = validation loss. (image by author)

3.1.3 Evaluating the role of kernel size in model performance

With a decently performing model in hand, the influence of kernel size on performance was evaluated next. The results from comparing 3 x 3, 5 x 5, and 7 x 7 kernel sizes are presented in Table 3 below. All three models performed very similarly in terms of both accuracy and loss score. Visually, however, the 3 x 3 and 7 x 7 kernels appear to have a greater slope, suggesting that performance would continue to increase with additional training epochs. Toward the end of the 5 x 5 trial, both accuracy and loss appear to be plateauing; moreover, the training and validation curves seem to be diverging, indicating that this network is beginning to overfit. Choosing between the top-performing kernels was somewhat arbitrary. In the end, the 3 x 3 kernel size was selected: when classifying images, it is generally the small local features that are important for distinguishing one class from another, and the 3 x 3 kernel maximizes the extraction of this information.

Table 3 — Determining the optimal kernel size. After trials using 3 x 3, 5 x 5, and 7 x 7 kernel sizes, the model with the 3 x 3 kernel was selected. A = accuracy, L = loss, V = validation accuracy, VL = validation loss. (image by author)

3.1.4 Refining the model using dropout and batch normalization

Up to this point, the image classification model has been tuned as far as possible. Although it is not an outstanding model, the fact that it performs equally well on the validation set suggests that it may be generalizable. Although the model does not appear to be overfitting, dropout rates of 10%, 30%, and 50% and batch normalization were explored to see whether they offered any performance advantage. The results from these trials are summarized in Table 4 below. Once again, there is evidence that the model has reached its maximum performance. On visual inspection, the trials using 10% and 30% dropout appear to show a divergence between their respective training and validation curves; this divergence is not evident with 50% dropout. Finally, whereas batch normalization alone had no effect on network performance, when combined with a dropout of 50% it produced a significant reduction in variability between epochs.

Table 4 — Fine-tuning the CNN model using dropout and batch normalization. Dropout rates of 10, 30, and 50% were compared for their effects on network performance. All three performed similarly; however, a divergence between training and validation data was evident only in the groups using 10% and 30% dropout. No improvement was observed using batch normalization alone, yet when combined with a 50% dropout rate, there was a significant reduction in inter-epoch variability. A = accuracy, L = loss, V = validation accuracy, VL = validation loss. (image by author)

3.1.5 Summary

This section explored the steps of building a CNN from scratch for a multi-class computer vision task. Beginning with the selection of non-linear activation functions and optimizers, it was determined that the softmax activation function combined with the adam optimizer performed best. Next, alternating convolution and max pooling layers were added sequentially to determine the best general architecture for our task; a convolutional base with 4 layers outperformed bases with 3, 5, and 6 layers. After investigating the effect of kernel size on the base model, a 3 x 3 window was selected for maximal extraction of local information. Finally, when comparing batch normalization and a range of dropout rates (10–50%), the 50% dropout rate appeared to perform best upon visual inspection. Section 3.2 will explore transfer learning, comparing the VGG16, ResNet50, and Xception pre-trained models. Below is the code for the top performing model built from scratch.
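The original code embed is not reproduced here; the sketch below reconstructs the model as described in the text: four 3 x 3 convolution/max-pooling blocks with relu activations, a dense head with batch normalization and 50% dropout, a softmax classifier, the adam optimizer, and categorical cross-entropy. The filter count of the fourth block and the width of the dense layer are assumptions.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),  # assumed filter count for the 4th block
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),           # assumed dense width
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),          # 5 DR severity classes
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```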

3.2 Transfer learning

This section explores how transfer learning can be used to enhance diagnostic success for computer vision tasks using two techniques. The first, feature extraction, freezes the convolutional base and trains a new head consisting of either a dense layer (dense ➜ dropout ➜ classifier) or a global average pooling layer (pooling ➜ dropout ➜ classifier). The second technique, fine-tuning, compares performance after unfreezing the final two convolution blocks.

3.2.1 Feature extraction

When the dense-layer head was compared to the global average pooling head, the former performed significantly better for all three pretrained models (Table 5). In each case, global average pooling performed worse than the model presented in the previous section. Among the dense-layer models, ResNet50 and Xception both achieved greater accuracy; however, both significantly overfit, as is evident when comparing training and validation scores and losses. VGG16 with the dense layer was the only model to avoid overfitting the data, yet it fell short of the performance obtained by the model I built. Below is an example of the code to load and run the VGG16 model for feature extraction.
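The original embed is not reproduced here; the sketch below shows a typical way to load VGG16 for feature extraction with Keras: the ImageNet-trained convolutional base is frozen and only the new head (dense ➜ dropout ➜ classifier) is trained. The head width, dropout rate, and epoch count are assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

conv_base = VGG16(weights="imagenet",        # weights learned on ImageNet
                  include_top=False,         # drop the original ImageNet classifier
                  input_shape=(224, 224, 3))
conv_base.trainable = False                  # freeze the base: feature extraction only

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_generator, validation_data=val_generator, epochs=20)
```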

Table 5 — Results from trials comparing feature extraction between pretrained models. Trials using a dense layer performed significantly better than those using global average pooling for all pretrained models. Overfitting was problematic for ResNet50 and Xception. A = accuracy, L = loss, V = validation accuracy, VL = validation loss. (image by author)

3.2.2 Fine-tuning the convolution base

In contrast to the feature extraction results, it was the global average pooling layer that outperformed the dense layer when the top two convolution blocks were unfrozen. All models using a dense layer overfit the data, as evident from the divergence between training and validation curves; this was also true for those using a global average pooling layer, albeit to a lesser extent. Despite this, the highest accuracy was observed with ResNet50 using a dense layer, which achieved 94% accuracy on the training set and 82% on the validation set. Below is an example of the code to fine-tune the final two convolution blocks of the VGG16 model.
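Again, the original embed is not reproduced; the sketch below shows one way to unfreeze the last two convolution blocks of VGG16 (block4 and block5) while keeping the earlier layers frozen, with a global-average-pooling head as used in these trials. The small learning rate and the epoch count are assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

conv_base.trainable = True
for layer in conv_base.layers:
    # Unfreeze only the last two convolution blocks; freeze everything earlier.
    layer.trainable = layer.name.startswith(("block4", "block5"))

model = models.Sequential([
    conv_base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),
])

model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),  # small steps so pretrained weights are only nudged
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_generator, validation_data=val_generator, epochs=20)
```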

Table 6 — Results from trials comparing fine-tuning between pretrained models. Global average pooling outperformed the dense layer as a whole; however, it was a dense layer with ResNet50 that achieved the highest overall accuracy at 94%. A = accuracy, L = loss, V = validation accuracy, VL = validation loss. (image by author)

3.2.3 Summary

Whereas feature extraction with the pretrained models resulted in greater training accuracy than the model built from scratch, those models were prone to significant overfitting and performed poorly on the validation set. Overall, models using a dense layer to feed the classifier significantly outperformed those using a global average pooling layer. The VGG16 network was the only one to avoid overfitting the data; however, it did not perform better than the model I built in section 3.1. The advantage of transfer learning becomes clear when the pre-trained models are fine-tuned: all of them outperformed my model in training accuracy, by approximately 5% for VGG16 up to almost 20% for ResNet50, and all increased validation accuracy by approximately 10%.

4. Discussion

The purpose of this project was to demonstrate the usefulness of transfer learning for computer vision tasks. Whereas feature extraction was prone to overfitting the data and performed poorly on the validation set, fine-tuning the pretrained models gave impressive results. That feature extraction performed poorly should not come as a surprise when considering the database on which VGG16, ResNet50, and Xception were trained: ImageNet is a massive image database, but its images are everyday scenes of animals, people, and so on. The weights in the deeper layers of these models therefore represent very different features than those in the present images. By unfreezing some of the later layers, those task-specific features could adapt to the new data while the most useful, generalizable features in the earlier layers were retained. The ability to harness the power and depth of state-of-the-art models is currently shifting the medical landscape. One recent study published in npj Digital Medicine, on a task similar to the one presented here, reported 97% sensitivity for their deep learning model compared to 74% for professionals (ophthalmologists, ophthalmic nurses, and technicians) with more than 2 years of experience (Ruamviboonsuk et al., 2019). Over the next decade, as medical image data becomes more available, the field of digital pathology will explode, ushering in the emerging field of personalized medicine on a global level.

5. References

  1. Nwankpa, C. E., Ijomah, W., Gachagan, A., & Marshall, S. (2018). Activation Functions: Comparison of Trends in Practice and Research for Deep Learning. arXiv, 1–20.
  2. Ruamviboonsuk, P., Krause, J., Chotcomwongse, P., Sayres, R., Raman, R., Widner, K., … Webster, D. R. (2019). Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. npj Digital Medicine, 2, 25.

6. Helpful links

  1. I would like to thank the authors found in the link below for their insight and wisdom. I learned SO much! https://www.kaggle.com/c/aptos2019-blindness-detection/discussion/105305
  2. Jason Brownlee has great tutorials on essentially everything! Check out his work here: https://machinelearningmastery.com/
  3. I can’t recommend Andrew Ng enough. I have audited some of his courses on coursera and he’s by far the best teacher on this subject I have come across. He also has YouTube videos if you’re interested. https://www.youtube.com/watch?v=PPLop4L2eGk
