If you were asked to build a bookshelf, your guess at what you would need would not be far off, but can you imagine what you would need to build a two-story house, or even a building like the Sagrada Familia? I had no idea how each building block was designed and carved until I joined a tour. The Convolutional Neural Network (CNN) is surely a major building block for Computer Vision models, like the stones of the Sagrada Familia; however, it is not enough for analyzing the architecture of SoA (state-of-the-art) models. In this blog post, I want to invite you on a tour of each building block of the current SoA model in Computer Vision (as of Feb 1, 2021). The target audience is people who do not routinely track SoA models, and the purposes are to provide an anatomy of the modern model architecture without showing math equations and to help readers find a domain they want to study further based on their background and passion.
The architectures of SoA models in Computer Vision have been getting exponentially more complicated, and the progress in this domain comes not only from improvements in the model architecture but also from data augmentation. The Google Research and Brain teams submitted a paper applying a random Copy-Paste data augmentation, and with it they took first place on the Object Detection leaderboard for the COCO test-dev dataset, overtaking YOLOv4-P7 on paperswithcode.com as of Feb 1, 2021.
The name of the model is Cascade Eff-B7 NAS-FPN with Self-training and Copy-Paste. Each term in the model name reflects the contribution of great researchers. To explain the model architecture in a structured way, I have split the topics into two domains: Model Architecture and Data Augmentation. Let's dive into each building block.
First, this lengthy-named model is decomposed into Cascade Eff-B7 NAS-FPN (top half) and Self-training Copy-Paste (bottom half). Second, Cascade Eff-B7 NAS-FPN is further broken down into Cascade R-CNN (2018), EfficientNet B7 (2019), and NAS-FPN (2019). Third, NAS-FPN includes two components: NAS (Neural Architecture Search, 2017) and FPN (Feature Pyramid Network, 2017). Then, let's look at the bottom half for the Data Augmentation. Self-training Copy-Paste (2020) builds on two lines of research: Self-training (2020) and Copy-Paste (2018). Our tour will visit these components in the following order.
Table of Contents
i. Data Augmentation
- Copy-Paste Augmentation
- Self-training Copy-Paste
- When should we apply Self-training Copy-Paste?
ii. Model Architecture
- EfficientNet
- FPN (Feature Pyramid Network)
- NAS (Neural Architecture Search)
- NAS-FPN
- Cascade R-CNN
iii. Closing
Data Augmentation
The purpose of data augmentation is to increase the size of the dataset by applying synthetic transformations to the existing images. A traditional approach to data augmentation is a combination of random cropping, rotation, scaling, and horizontal or vertical flipping. Other options are changing the contrast or brightness of the images. More advanced approaches can add a rain effect, sun flare, or even adversarial noise. When you compound these transformations, the augmented data look like the examples below. As a result of appropriate data augmentation, a computer vision model is expected to become more robust and accurate.
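To make this concrete, here is a minimal sketch of such a compound pipeline using the albumentations library. The specific transforms, crop size, and probabilities are illustrative choices of mine, not settings from any paper.

```python
import albumentations as A
import cv2

# An illustrative compound augmentation pipeline (my own example settings).
transform = A.Compose([
    A.RandomCrop(width=480, height=480),     # random cropping (image must be >= 480x480)
    A.Rotate(limit=15, p=0.5),               # small random rotation
    A.HorizontalFlip(p=0.5),                 # horizontal flip
    A.RandomBrightnessContrast(p=0.3),       # brightness / contrast jitter
    A.RandomRain(p=0.1),                     # synthetic rain effect
    A.RandomSunFlare(p=0.1),                 # synthetic sun flare
])

image = cv2.imread("example.jpg")            # any image file
augmented = transform(image=image)["image"]  # one augmented copy of the image
```

Running the same pipeline repeatedly on one image produces many different training samples, which is exactly the point of augmentation.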
If you have not used data augmentation, the post by Sumit Sarin is a good starting point because it shows introductory examples with code. You can also find a variety of traditional and modern data augmentation approaches for both images and audio in AgaMiko's repository.
Copy-Paste Augmentation
We are finally going to visit the paper, "Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation." Some readers who have been tracking this domain might think a copy-paste approach has been around since 2018! The authors note that the differences from related works are 1) large scale jittering (LSJ), 2) self-training copy-paste, 3) simple random copy-paste without taking the context of the image background into account, and 4) no geometric transformations (such as rotation) applied. Let me dig deeper into the first two points. As you can see in the image below, the cropped instances (not the entire bounding boxes) from one image are pasted at random locations on another image with large jittering. The standard scale jittering usually ranges from 0.8 to 1.25, but the authors got significantly better model performance by applying large scale jittering from 0.1 to 2.0, which is pretty interesting.
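To give a feel for the mechanics, here is a heavily simplified NumPy/OpenCV sketch of pasting one masked instance onto another image with large scale jittering. It is only the core idea; it ignores label updates, occlusion handling, and blending in the authors' actual implementation.

```python
import numpy as np
import cv2

def copy_paste(src_img, src_mask, dst_img, scale_range=(0.1, 2.0)):
    """Paste the masked instance from src_img onto dst_img at a random location
    with large scale jittering. A simplified sketch, not the paper's code."""
    scale = np.random.uniform(*scale_range)
    h, w = src_img.shape[:2]
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    inst = cv2.resize(src_img, (new_w, new_h))
    mask = cv2.resize(src_mask.astype(np.uint8), (new_w, new_h),
                      interpolation=cv2.INTER_NEAREST).astype(bool)

    out = dst_img.copy()
    H, W = out.shape[:2]
    # Random top-left corner; clip the instance so it fits inside the destination.
    y0 = np.random.randint(0, max(1, H - 1))
    x0 = np.random.randint(0, max(1, W - 1))
    y1, x1 = min(H, y0 + new_h), min(W, x0 + new_w)
    inst = inst[: y1 - y0, : x1 - x0]
    mask = mask[: y1 - y0, : x1 - x0]
    out[y0:y1, x0:x1][mask] = inst[mask]  # overwrite only the masked pixels
    return out
```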
Self-training Copy-Paste
The copy-paste approach is applied to additional unlabeled images as well as to the supervised data with labels. The research used the labeled COCO dataset (118K images) as well as the unlabeled COCO dataset (120K images) plus the Objects365 dataset (610K images). The process is straightforward.
- Train a supervised instance segmentation model with Copy-Paste augmentation on labeled data;
- Generate pseudo labels on unlabeled data using the trained model in step 1;
- Paste ground-truth instances into pseudo labeled and supervised labeled images and train a model on this new data.
The key in step 3 is that we only paste instances from the ground truth, not pseudo-labeled instances. You can learn more about self-training from the paper Rethinking Pre-training and Self-training by Google Brain. According to this paper, self-training can enable machine learning methods to work better with less data. For some complicated instances, the pseudo labels produced by self-training may even be better than human annotations.
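The three steps above can be summarized in a short pseudocode-style sketch. All of the function names below are placeholders of mine, not a real API.

```python
# Pseudocode-style sketch of self-training with Copy-Paste.
# train(), predict(), and paste_ground_truth_instances() are placeholders.

def self_training_copy_paste(labeled_data, unlabeled_data):
    # Step 1: supervised training with Copy-Paste augmentation on labeled data.
    teacher = train(labeled_data, augmentation="copy_paste")

    # Step 2: generate pseudo labels on the unlabeled images with the teacher.
    pseudo_labeled = [(img, teacher.predict(img)) for img in unlabeled_data]

    # Step 3: paste only ground-truth instances (never pseudo-labeled ones)
    # onto both pseudo-labeled and labeled images, then train a new model.
    combined = paste_ground_truth_instances(labeled_data,
                                            labeled_data + pseudo_labeled)
    student = train(combined, augmentation="copy_paste")
    return student
```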
Please look at the section "Visualization of Pseudo Labels in Self-training" in the appendix of the paper, where many more examples of pseudo labels are available.
When should we apply Self-training Copy-Paste?
As you might have already realized, you need to annotate both bounding boxes and instance masks even if your end goal is to develop an object detection model. The authors found that the gain in Box AP (bounding-box average precision) from applying copy-paste with self-training, compared with the corresponding baseline models, is larger when less labeled data is available. The gain was 4.8 Box AP over the baseline model on top of large scale jittering when only 10% of the labeled COCO dataset was used. Because the annotation effort is significant, it is more realistic to use this approach when your dataset is small and your project has abundant annotation resources.
Model Architecture
We have learned about the data augmentation techniques used in "Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation." Let's continue our tour to understand the engine of the paper, Cascade Eff-B7 NAS-FPN. I will focus only on the big picture of how each component works, and I have ordered the topics below so that the more important building blocks come first.
EfficientNet
EfficientNet was developed by Google in 2019 as an amazingly powerful model creation approach. It is designed to determine the optimal architecture of the neural network for the given dataset. These used to be common questions we faced:
Should we increase the depth of the neural network? Should we increase the number of channels in each layer? Or, should we increase the resolution in each layer?
EfficientNet controls depth, width, and input resolution to efficiently scale up a baseline neural network. In addition to these control variables, we need to pay attention to a resource constraint, as in any optimization model. For example, a budget is a constraint in a revenue optimization model; if we could ignore the budget, we would simply select more labor and more capital to reach a higher production level. In EfficientNet, the product of the depth, width, and resolution scaling factors must stay below a certain threshold. With this resource constraint, we can efficiently reach the optimal neural network architecture while limiting the computation power required. For example, if the model increases both depth and width, it must decrease the resolution to satisfy the resource constraint so that the model does not demand more computation resources than we can afford.
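As a rough illustration of compound scaling, here is a tiny sketch. The base coefficients are approximately the ones reported in the EfficientNet paper, but note that the real B1–B7 configurations are tuned slightly rather than computed exactly this way.

```python
# Compound scaling sketch. ALPHA, BETA, GAMMA are approximately the base
# coefficients from the EfficientNet paper; their product with squared width
# and resolution terms is kept close to 2 as the resource constraint.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi):
    """Return depth/width/resolution multipliers for a given scaling level phi."""
    depth_mult = ALPHA ** phi        # more layers
    width_mult = BETA ** phi         # more channels per layer
    resolution_mult = GAMMA ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

# Example: how the multipliers grow relative to the B0 baseline (phi = 0).
for phi in (0, 1, 7):
    print(phi, compound_scaling(phi))
```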
ConvNets Scaled Efficiently is a great resource to understand the concept of EfficientNet.
EfficientNet-B7
You might still have a question: what is the B7 at the end? The complexity of the model can be described by FLOPs (floating point operations), i.e., the number of additions and multiplications executed in one forward pass. For example, EfficientNet-B0 does 0.39 billion FLOPs whereas B7 does 37 billion FLOPs. This means B7 has a higher budget/resource in the constraint, but B7 also requires more computation power to find the optimal value in the search space. You can remember that a larger value after "B" requires a higher-spec machine and usually a larger dataset as input to fine-tune the pre-trained model.
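If counting FLOPs is new to you, a common back-of-the-envelope estimate for a single convolution layer looks like this (one multiply and one add per weight applied at each output position; this is a standard approximation, not the exact figure a profiler would report).

```python
def conv2d_flops(h_out, w_out, c_in, c_out, kernel_size):
    """Approximate FLOPs of one 2D convolution layer:
    one multiply + one add per weight at each output position."""
    return 2 * h_out * w_out * c_in * c_out * kernel_size ** 2

# Example: a 3x3 conv with 64 input and 128 output channels on a 56x56 map.
print(conv2d_flops(56, 56, 64, 128, 3) / 1e9, "GFLOPs")  # ~0.46 GFLOPs
```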
FPN (Feature Pyramid Network)
Detecting objects of different sizes in an image has been a fundamentally difficult problem in computer vision. Before neural networks, people used the Scale-Invariant Feature Transform (SIFT) to solve the same problem. Feature pyramids are used to detect objects at different scales: FPN helps the system recognize that a small dog standing far in the background and a large dog sitting in front of the photographer belong to the same "dog" category. SSD (Single Shot MultiBox Detector, 2016) was another useful approach to this problem because it creates feature maps at different scales. According to the FPN (2017) paper, FPN has two advantages. First, it reuses the differently scaled feature maps in the bottom-up pathway, carrying over the benefits of SSD. Second, FPN merges two encoded signals to boost object detection performance: one that is semantically strong but coarse in resolution from the top-down pathway, and one that is semantically weak but high in resolution from the lateral connections.
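To see how the top-down pathway and the lateral connections fit together, here is a stripped-down FPN sketch in PyTorch. The channel counts are illustrative and it omits many details of the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Stripped-down FPN sketch: 1x1 lateral convs + top-down upsampling.
    The channel counts below are illustrative, not tied to a specific backbone."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.outputs = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, feats):  # feats: bottom-up maps, highest resolution first
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample each coarse map and add it to the finer lateral.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # 3x3 convs smooth the merged maps into the final pyramid levels.
        return [out(l) for out, l in zip(self.outputs, laterals)]
```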
NAS (Neural Architecture Search)
NAS (2017) is an approach to automate the process of finding, out of all possible architectures, the one that maximizes a key metric using search algorithms. As a recap for people who are not familiar with modern architectures, classical neural networks were linearly connected (left model below), while modern neural networks have branches and skip-connections (right model below).
Let me use the optimization process of DARTS (Differentiable Architecture Search) to illustrate an example of a cell search algorithm. The four figures (a) to (d) below show the steps of the optimization. First, (a) is the initial state of the optimization. Each numbered box is a node, and nodes are connected by directed edges, which are one-way paths. Node 0 is an input and node 3 is an output.
Before running the search algorithm, we do not know how the nodes should be connected, including skip-connections. An example of a skip connection is the edge between nodes 0 and 3. The skip connection is a very important concept used in ResNet (2015) to overcome the vanishing gradient problem, and this is a good resource for studying it. The different colors of the edges between nodes represent different types of convolution and max-pooling layers. DARTS identifies the optimal architecture in the following steps. In state (a), the operations on the edges are unknown. In state (b), DARTS places a mixture of candidate operations on each edge; in state (c), it solves a bilevel optimization problem over these candidate operations; and in state (d), it induces the final architecture. The baseline network of EfficientNet was developed with MNAS (Mobile Neural Architecture Search).
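The core trick in DARTS is the continuous relaxation in state (b): each edge holds a softmax-weighted mixture of candidate operations, and the mixture weights (the architecture parameters) are learned. A minimal PyTorch sketch of one such mixed edge might look like this; the candidate set is illustrative, not the full DARTS search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One DARTS-style edge: a softmax-weighted sum of candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # 3x3 convolution
            nn.MaxPool2d(3, stride=1, padding=1),         # 3x3 max pooling
            nn.Identity(),                                # skip connection
        ])
        # Architecture parameters (alphas), one per candidate operation.
        self.alphas = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alphas, dim=0)
        # Weighted sum of all candidate operations on this edge.
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

After the bilevel optimization, the operation with the largest alpha on each edge is kept, which is how the final discrete architecture in state (d) is induced.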
NAS-FPN
The authors of NAS-FPN (2019) took on the challenge of optimizing the architecture of FPN. The architectures designed by NAS-FPN are no longer symmetrical like the original FPN architecture, but they still contain top-down and bottom-up connections to fuse features across scales. The architecture optimization combines a scalable search space with the Neural Architecture Search (NAS) algorithm to overcome the large search space of pyramid architectures. A key decision in the optimization process is whether two arbitrary feature maps (high-level features with strong semantics and low-level features with high resolution) should be merged by one of two binary operations: sum or global pooling. Figure (a) below is the plain-vanilla FPN, and the different FPNs discovered by the NAS algorithm are shown below it, with average precision (AP) as the detection accuracy used to compare architectures.
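A single merging step can be pictured as: pick two feature maps, resize one to match the other, and combine them with one of the two binary operations. The sketch below shows that idea; the "sum" branch follows the paper's description, while the global-pooling branch is only my rough approximation of the attention-style operation in NAS-FPN.

```python
import torch
import torch.nn.functional as F

def merge_features(fa, fb, op="sum"):
    """Merge two feature maps of possibly different resolutions (a sketch)."""
    # Resize fb to fa's spatial size so the two maps can be combined.
    fb = F.interpolate(fb, size=fa.shape[-2:], mode="nearest")
    if op == "sum":
        return fa + fb
    # Rough approximation of the global-pooling op: use globally pooled fb as
    # channel-wise attention over fa, then add fb back.
    attn = torch.sigmoid(F.adaptive_max_pool2d(fb, 1))
    return fa * attn + fb
```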
Reinforcement Learning is used to optimize the model architecture. The decisions of how to merge nodes are made by a controller Recurrent Neural Network (RNN) because the order of building blocks matters. The controller samples child networks with different architectures. As you can imagine, these experiments require tremendous computational power, and the authors used 100 Tensor Processing Units (TPUs) in their experiments. The resulting AP on a held-out validation set is used as the reward to update the controller. Most of the unique architectures converged after approximately 8,000 steps. Finally, the following is the architecture with the highest AP among all sampled architectures during RL training in the experiments.
Cascade R-CNN (Region-based CNN)
Cascade R-CNN was developed in 2018. A traditional problem in Object Detection is choosing the right intersection-over-union (IoU) threshold for training a model and running inference. If an object detector is trained with a low IoU threshold (such as 0.5), it usually produces noisy detections, but detection performance also tends to degrade when the IoU threshold is simply raised, because a higher IoU threshold tends to create false negatives by tightening the boxes. The major contribution of Cascade R-CNN is to alleviate this problem with a sequence of detectors trained with increasingly higher IoU thresholds, based on the architecture below.
Cascade R-CNN has four stages: one RPN and three R-CNN detection stages with increasingly higher thresholds. The experiments in the paper used IoU thresholds of 0.5, 0.6, and 0.7 at the three stages. The authors overcame overfitting by handing a coarse bounding-box proposal to the next stage, which refines it under a higher IoU threshold. As a result, the performance in both localization and detection is better than that of models with a single IoU threshold. A detailed paper review is available at Reading: Cascade R-CNN – Delving into High Quality Object Detection (Object Detection).
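Conceptually, the cascade is a loop in which each stage refines the boxes handed over by the previous one and is trained with a higher IoU threshold. Here is a pseudocode-style sketch; the function and module names are placeholders, not an actual detection API.

```python
# Conceptual sketch of a Cascade R-CNN forward pass. backbone, rpn, and the
# stage modules are placeholders standing in for the real detector components.
IOU_THRESHOLDS = [0.5, 0.6, 0.7]  # thresholds used at the three detection stages

def cascade_forward(image, backbone, rpn, stages):
    features = backbone(image)        # shared backbone (+ FPN) features
    boxes = rpn(features)             # coarse proposals from the RPN
    scores = None
    for stage, iou_thr in zip(stages, IOU_THRESHOLDS):
        # Each stage classifies and refines the boxes produced by the previous
        # stage; at training time that stage is supervised with iou_thr.
        scores, boxes = stage(features, boxes)
    return scores, boxes
```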
Closing
I had originally planned to write a blog post about Scaled YOLOv4 with the Cross Stage Partial (CSP) network right after it became the best model in Object Detection; however, Cascade Eff-B7 NAS-FPN with Self-training Copy-Paste has outperformed the rest of the models on the leaderboard. I have been thrilled to see it ranked in first place since the middle of December 2020. The competition in Object Detection in 2020 was intense, and the model in first place usually changed at least once a month. On top of this competition among CNN models, the Vision Transformer was introduced by Google, and I expect this transformer-based architecture to prosper in 2021. Please read my previous post if you're interested in the Transformer architecture in NLP.
I hope you enjoyed the tour. Even though Scaled YOLOv4 with CSP is no longer the best model, I like the framework of the YOLO series for training models because it is pretty user-friendly. I have some ideas for our next tour. It could be a tutorial on fine-tuning models with Scaled YOLOv4 with CSP, including code, or a paper review of a new EfficientNet-based model like EfficientNet Cascades. I have listed links to the GitHub repositories of the useful models introduced in this post so that you can play with them!