A Compact CNN for Weakly Supervised Textured Surface Anomaly Detection

In this article, I’ll be discussing a paper [1] that proposes a compact convolutional neural network (CNN) for detecting anomalies/defects from weakly/coarsely labelled data. The article is organized as follows.
- Introduction
- Methodology
  ◦ Segmentation Network
  ◦ Classification Network
  ◦ Architectural Specifications
- Experimental Setup
  ◦ Loss functions
  ◦ Optimizer
  ◦ Dataset
  ◦ Training setup
- Results
  ◦ Quantitative Results
  ◦ Qualitative Results
- Discussion
  ◦ Classification Network Performance
  ◦ Segmentation Network Performance
- Suggested Modifications
- Conclusion
- GitHub code
- References
Introduction
Surface defect detection is an essential task in the manufacturing process to ensure that the end product meets the quality standards and works the way it is intended to. A common property of these surface defects is that their visual texture is inherently different from the defect-free surface [2]. That is why visual inspection systems are used for detecting these defects. The manual task of looking at objects and finding those anomalies is difficult and tedious. The appearances of these defects, such as cracks, dents, smudges, and impurities, can differ in terms of pixel intensities, geometrical constraints, and visual appearance as a whole [1]. Traditional handcrafted or engineered features work on specific textures. However, they are challenging to create and don’t generalize to other tasks. With the staggering advancements in deep learning, CNNs have emerged as great tools for different types of computer vision tasks such as classification, segmentation, regression, object detection, etc.
However, the following challenges make the use of CNNs difficult:
- The number of parameters in state-of-the-art methods is generally in the order of tens to hundreds of millions. This forces the requirement of having a large GPU or a cluster of GPUs for training these models.
- If you have a small amount of data, larger models can easily overfit, which leads to the model just memorizing the training data and performing poorly on actual test data.
- Normal segmentation architectures require pixel-level annotations that are time-consuming, costly and tedious to create. A large training set is almost always required for training in deep learning. But defects/anomalies are rare by definition, which means we have only a small set of examples.
In this paper, the authors propose a CNN architecture (relatively compact at 1.1 M parameters) that is trained on weakly annotated data and outputs an anomaly segmentation mask and a classification score for textured surfaces. It addresses the above problems to some extent. Let us now look at the overall methodology used in the paper.
Methodology
The authors proposed the following network architecture for anomaly detection.
![Fig.1. The network architecture consisting of the segmentation and classification subnetworks. Their corresponding output is a 128x128 mask and a regression value between 0 to 1 [1].](https://towardsdatascience.com/wp-content/uploads/2020/12/1-THFWVVVrUKJeQZkj5ZjJg.png)
It comprises two stages, namely segmentation and classification. The classification score is a value between 0 and 1 and represents the confidence of the network that the input sample has a defect. The segmentation network outputs a heatmap mask.
Segmentation Network
It has three convolutional blocks consisting of three convolutional layers each. The number of features increases by a factor of two in each convolutional block, while the filter size decreases progressively from 11×11 to 7×7 to 3×3. I know, 11×11 kernels have been out of fashion since AlexNet. However, the authors argue that this large kernel size is required for their network to be able to enclose part of the anomaly. The subsequent filter sizes are chosen such that the ratio between consecutive filter sizes is kept roughly the same. The network input is 512×512 and the output of the SegLayer (which comes after the 3 convolutional blocks and uses a 1×1 convolutional kernel to output a single-channel mask) is 128×128, i.e. four times smaller along each dimension and sixteen times smaller in pixel count than the input image. This subnetwork is fully convolutional as it has no pooling or fully connected layers. It is the major component of the architecture and the classification network relies completely on it. Even though the authors argue that the increase in stride and reduction in kernel size keeps the number of parameters under control, I found interesting results which I’ll present in the discussion section.
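As a rough sanity check of the numbers above, the following sketch counts the convolutional parameters and tracks the spatial size through the segmentation network. The layer layout is an assumption reconstructed from this article (a single-channel input, 32/64/128 filters, three layers per block, "same"-style padding). In particular, the placement of the stride-2 layers is hypothetical and is chosen only so that 512×512 reduces to 128×128. Batch normalization parameters and the classification network are not counted.

```python
# (filters, kernel size, stride of the first layer in the block).
# The stride placement is an assumption, not taken from the paper.
BLOCKS = [
    (32, 11, 2),
    (64, 7, 2),
    (128, 3, 1),
]


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of one kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out


def conv_out_size(size, kernel, stride):
    """Spatial output size of a convolution with padding = kernel // 2."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1


def segmentation_summary(input_size=512, in_channels=1):
    """Return (parameter count, output size) for the assumed layout."""
    total, size, c_in = 0, input_size, in_channels
    for filters, kernel, first_stride in BLOCKS:
        for layer in range(3):  # three convolutional layers per block
            stride = first_stride if layer == 0 else 1
            total += conv_params(c_in, filters, kernel)
            size = conv_out_size(size, kernel, stride)
            c_in = filters
    total += conv_params(c_in, 1, 1)  # SegLayer: 1x1 conv to a single channel
    return total, size


params, out_size = segmentation_summary()
print(f"{params:,} parameters, {out_size}x{out_size} output")
```

Under these assumptions the count lands at roughly 1.1 M, which is in line with the figure quoted for the network.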
Classification Network
This part relies heavily on the segmentation network finding the hotspots related to the anomalous regions. It uses the SegLayer output and additional features derived using a convolutional layer. Global statistics are then calculated using max and average pooling operations. These features are concatenated and passed through what the authors call an S-neuron, i.e. a 1×1 kernel that outputs a single channel.
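The pooling-and-concatenation idea can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the feature maps are plain nested lists, and the `weights` and `bias` values are hypothetical stand-ins for what the S-neuron would learn.

```python
import math


def global_max(fmap):
    """Global max pooling over a 2D feature map (list of rows)."""
    return max(max(row) for row in fmap)


def global_avg(fmap):
    """Global average pooling over a 2D feature map (list of rows)."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def s_neuron(feature_maps, weights, bias=0.0):
    """Concatenate the global statistics and squash them to (0, 1)."""
    stats = []
    for fmap in feature_maps:
        stats.extend([global_max(fmap), global_avg(fmap)])
    if len(weights) != len(stats):
        raise ValueError("expected one weight per pooled statistic")
    z = sum(w * s for w, s in zip(weights, stats)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation


# One toy 2x2 "heatmap" with a single hot pixel: max = 4, mean = 1.
heatmap = [[0.0, 0.0], [0.0, 4.0]]
score = s_neuron([heatmap], weights=[1.0, 1.0], bias=-5.0)
print(score)
```

Because the max statistic reacts to a single strong hotspot while the average reflects how much of the map is active, the two together give the classifier a compact summary of the segmentation output.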
Architectural Specifications
- All the layers in the network use a ReLU activation followed by a Batch Normalization layer.
- The only exceptions are the SegLayer and the S-neuron, which use linear and sigmoid activations respectively.
- All network weights are initialized using a standard normal distribution.
Now that we know about the architecture, let us move to the experimental setup.
Experimental Setup
The network details are mentioned in Fig. 1. The authors propose a two-stage training approach.
- Segmentation stage: In this stage, only the segmentation network is trained for 25 epochs. All the classification network layer weights are frozen during this stage.
- Classification stage: In this stage, only the classification network is trained for 10 epochs. This is performed after the segmentation stage training. During this stage, the segmentation layers are frozen. This sequence of training and layer freezing is very important since it ensures that the classification network would be trained on meaningful segmentation representations.
Loss functions
1. Segmentation stage: The mean squared error is used as the loss function, given by the following equation.

$$L_{\mathrm{MSE}} = \frac{1}{n \cdot p} \sum_{i=1}^{n \cdot p} \left(x_i - \hat{x}_i\right)^2$$

where n denotes the number of examples, p is the number of pixels, xᵢ the annotated pixel value and x̂ᵢ the predicted pixel value.
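2. Classification stage: The standard binary cross-entropy loss is used in this stage. It is defined by the following equation.

$$L_{\mathrm{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$

where n denotes the number of examples, yᵢ the ground truth and ŷᵢ the regression output.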
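To make the two losses concrete, here is a minimal pure-Python sketch of the standard mean squared error and binary cross-entropy. It works on flat lists rather than tensors, and the `eps` clipping is an assumption added for numerical safety, not something stated in the paper.

```python
import math


def mse_loss(targets, predictions):
    """Mean squared error over all pixels of all examples (flattened)."""
    squared = [(x - x_hat) ** 2 for x, x_hat in zip(targets, predictions)]
    return sum(squared) / len(squared)


def bce_loss(labels, scores, eps=1e-7):
    """Binary cross-entropy between 0/1 labels and scores in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(labels, scores):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return -total / len(labels)


# Segmentation stage: one of four mask pixels is predicted wrongly.
print(mse_loss([0, 1, 1, 0], [0, 1, 0, 0]))

# Classification stage: an undecided network scores 0.5 for every example.
print(bce_loss([1, 0], [0.5, 0.5]))
```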
Optimizer
The Adadelta optimizer was used with the default settings suggested in its paper [3].
Dataset
DAGM [4] is a synthetic dataset for industrial optical inspection and contains ten classes of artificially generated textures with anomalies. The authors conducted experiments on all 10 classes. It comprises diverse surface classes with diverse anomalies emulating cracks, dents, smudges, and impurities, each generated by a different texture and defect model, as shown in Fig. 2. For every image, a weakly labelled annotation in the form of an ellipse covering the entire defect is available. This is a coarse annotation that includes defect-free pixels. Since the ellipses cover a significant amount of normal texture in addition to the defect, the detection task becomes challenging.
![Fig. 2. Sample images from the DAGM dataset. Each surface exhibits an intra-class variation of the background texture. Red ellipses present the coarse surface anomaly labelling, i.e., weakly labelled ground truth annotations as these include areas which do not correspond to anomalies [1].](https://towardsdatascience.com/wp-content/uploads/2020/12/1jnl-TCHzDDHYEvBVr51QQ.png)
The authors refer to a given example as positive if it contains an anomaly and negative otherwise. The train and test distributions of the dataset are shown in the following table.
![Table 1. Distribution of train and test examples over the dataset [1].](https://towardsdatascience.com/wp-content/uploads/2020/12/1dXE3TmozzSzVVhvQ66gTuQ.png)
Training setup
They conducted experiments in four different types of setup which are explained in the following table.
Table 2. Different training setups used in the paper [1].
The augmentation is performed by applying a 180-degree rotation and flipping around the horizontal and vertical axes. Thus there are 3 more samples per input image leading to a four-fold training set.
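The four-fold augmentation can be sketched in a few lines of pure Python, treating an image as a list of rows. The function names are illustrative, and in practice the same transform would also have to be applied to the corresponding annotation mask.

```python
def flip_left_right(img):
    """Mirror the image around its vertical axis."""
    return [row[::-1] for row in img]


def flip_up_down(img):
    """Mirror the image around its horizontal axis."""
    return [row[:] for row in img[::-1]]


def rotate_180(img):
    """Rotate by 180 degrees, i.e. flip around both axes."""
    return [row[::-1] for row in img[::-1]]


def augment(img):
    """Return the original image plus its three augmented variants."""
    return [img, rotate_180(img), flip_left_right(img), flip_up_down(img)]


sample = [[1, 2],
          [3, 4]]
for variant in augment(sample):
    print(variant)
```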
Results
The authors performed a quantitative evaluation of the classification score and a qualitative evaluation of the segmentation output.
Quantitative Results
Evaluation Metric: They use the area under the curve (AUC) of the receiver operating characteristic (ROC) [5] as the evaluation metric. AUC, or AUROC, is a reliable measure of the separability achieved by any binary classifier (here, the classification score that separates examples with an anomaly from those without). It provides an aggregate measure of the model’s performance across all possible classification thresholds. An excellent model has an AUROC value close to one, which means that the classifier is virtually agnostic to the choice of a particular threshold.
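For intuition, AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in pure Python. It is an illustration of the metric, not the evaluation code used in the paper, and the scores are made-up toy values.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Perfectly separated scores give an AUROC of 1.
print(auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))

# One positive scored below one negative: 3 of 4 pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of examples, which is fine for a test set of this size. A rank-based computation would be the usual choice for larger sets.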
The results are documented in the following figure.
![Fig. 3. Receiver Operating Characteristic (ROC) curve for each surface class as shown in Fig. 2 [1].](https://towardsdatascience.com/wp-content/uploads/2020/12/14lP8mBn7o0EnkBJBo6Weqw.png)
"The curves are obtained by thresholding anomaly score predictions from the testing set for each given example from 0 to 1, and classifying an example as positive if the score prediction exceeds the threshold. The numbers below the curves indicate the area under the curve (AUC). The figures indicate the likelihood that an example with an anomaly will be classified correctly (True positive rate) vs. the likelihood that an example without an anomaly will be classified falsely (False positive rate) when learning on different training setups. [1]"
Qualitative Results
The qualitative results for sample images showing the segmentation outputs along with the classification score are shown below.
![Fig. 4. Segmentation outputs for a few samples along with the classification score value [1].](https://towardsdatascience.com/wp-content/uploads/2020/12/1RsalwwMeyHxWN1YwO-FiNA.png)
Discussion
I discuss the classification network followed by the segmentation network.
Classification Network Performance
We see that the classification scores are off the charts, since the AUROC values are virtually 1 everywhere in Fig. 3. This is an impressive classification performance. However, to test how a simple CNN architecture with few layers would perform on this dataset, I conducted experiments. The network was just a bunch of convolutional layers stacked one after the other. To my surprise, I was able to get the same AUROC values of 1. This indicates that while DAGM is a challenging classification dataset for traditional handcrafted features, it is an easy dataset for CNNs.
Segmentation Network Performance
As can be seen in Fig. 4, the segmentation output is poor for all the training configurations except PosNeg-aug. The models trained in the PosNeg-aug configuration seem to be adept at finding the anomalies. However, they overfit to the weakly labelled annotations and produce imprecise output masks that are over-dilated and cover a lot of non-anomalous pixels, similar to the training data. This can be attributed to the unnecessarily large number of filters per convolution block: the number of filters progressively increases from 32 to 64 to 128 across the three convolution blocks, which gives the intermediate features an extremely high dimension. As a result, GPU training is slow and requires a lot of memory, since all these intermediate features and their gradients need to be stored.
Suggested Modifications
I did a thorough evaluation of their approach and made the following modifications to improve the segmentation network and remove unnecessary limitations imposed on the network.
- Since the segmentation network is fully convolutional, instead of fixing the input size to 512×512, I used a Height×Width×1 input, i.e. the input dimensions are inferred during training/inference.
- The linear activation proposed in the paper leads to poor performance. Since we have just two classes, I applied a tanh activation to the SegLayer instead. It restricted the output to [-1, 1] and led to better separation of the anomalies from the texture.
The results are documented in my thesis [2].
Conclusion
The network proposed in this paper is a very good starting point for weakly supervised anomaly detection. When I was conducting my research, their paper seemed to be the only CNN-based approach available for this. The network seems to be able to learn to output segmentation masks similar to the weak annotations.
However, it has a few shortcomings as follows.
- Even though the model is fully convolutional, the input size is fixed, which makes the model incapable of handling inputs of varying size.
- The network does not learn to detect the actual shape of the anomaly from the weak labelling since it overfits to the training data.
- The segmentation output has sixteen times fewer pixels than the input image (128×128 versus 512×512). This can introduce errors in the localization of the anomaly and in the calculation of metrics that are based on the shape and size of the defect.
- The network is not tested on any real-world dataset, which raises concerns regarding its practical application.
After extensive research, I proposed an architecture called AnoNet which addressed all these shortcomings [6]. The paper "AnoNet: Weakly Supervised Anomaly Detection in Textured Surfaces" is available at https://arxiv.org/abs/1911.10608. I’ve proposed a novel filter bank initialization technique that leads to faster training (i.e. training for fewer epochs). It learns from as few as 53 images to detect the actual shape of the anomalies, is extremely compact with just 64 thousand parameters (i.e. a reduction of 94% w.r.t. this network) and is tested on 4 datasets with diverse types of anomalies/defects.
Thank you for reading my article, hope you enjoyed it! I would love to connect via LinkedIn.
GitHub code
I’ve made the code of my implementation of the above paper available in PyTorch and Keras here.
https://github.com/msminhas93/CompactCNN
References
[1] D. Racki, D. Tomazevic, and D. Skocaj. A compact convolutional neural network for textured surface anomaly detection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1331–1339, March 2018.
[2] Manpreet Singh Minhas (2019). Anomaly Detection in Textured Surfaces. UWSpace. http://hdl.handle.net/10012/15331
[3] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[4] T. H. Matthias Wieler, "Weakly supervised learning for industrial optical inspection," https://hci.iwr.uni-heidelberg.de/node/3616.
[5] Charles X. Ling, Jin Huang, and Harry Zhang. AUC: A statistically consistent and more discriminating measure than accuracy. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI’03, pages 519–524, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.
[6] AnoNet: Weakly Supervised Anomaly Detection in Textured Surfaces, arXiv:1911.10608, Available: https://arxiv.org/abs/1911.10608