Vast amounts of automotive data are collected every day, but how do we identify relevant samples in our data pool, and which samples are worth labeling?
After training a 2D object detection algorithm on the Audi Autonomous Driving Dataset (A2D2) [13], we observed that the algorithm does not detect vehicles below highway traffic signs during inference. To improve the algorithm, typical active learning methods suggest strategies such as pool-based sampling to enrich the training dataset [1]. Since active learning is an iterative procedure, this can be inefficient for big data.

We therefore propose a technique to select samples from big data even before labeling: content-based image retrieval is a fast and efficient method to enrich your training dataset with the samples that are most likely to improve your algorithm.
Content-Based Image Retrieval
Content-based image retrieval (CBIR), also known as query by image content (QBIC), was first introduced in the 1990s. With the increasing volume of image data, the demand for efficient image retrieval systems has grown and is more relevant than ever. In the early years, CBIR systems were based on feature extractors such as the Discrete Cosine Transform [2] and the Scale-Invariant Feature Transform [3]. Today, most CBIR systems rely on feature encodings produced by deep neural networks. However, the underlying process of content-based image retrieval, illustrated in Figure 2, has not changed dramatically over the years.

The image retrieval process is composed of a feature extraction block, a search engine, and a database. The feature encodings of all images in the database are computed in advance and stored in the feature database. The query image is propagated through the feature extraction block on demand. The search engine then compares the query encoding against all encodings in the database and retrieves the best matches. In order to encode content-based information in images, many CBIR systems [4] rely on pretrained classification networks such as ResNet50 [5] and VGG16 [6]. These feature encoders are trained with labeled data, often on ImageNet [7]. During inference, they are applied to data that might not originate from the training data domain. This can lead to a drop in performance caused by what is known as the domain gap [8].
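Conceptually, the retrieval step reduces to a nearest-neighbor search over precomputed encodings. The following Python sketch illustrates this under simplifying assumptions; `encode` stands in for an arbitrary feature extractor (for example, a pretrained ResNet50 backbone) and is a placeholder, not part of any specific library.

```python
import numpy as np

def build_feature_db(images, encode):
    """Precompute and L2-normalize the encodings of all database images."""
    feats = np.stack([encode(img) for img in images])            # (num_images, dim)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def search(query_image, feature_db, encode, top_k=16):
    """Encode the query on demand and return the indices of the best matches."""
    q = encode(query_image)
    q = q / np.linalg.norm(q)
    scores = feature_db @ q                                       # cosine similarities
    return np.argsort(-scores)[:top_k]
```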
Unsupervised Content-Based Image Retrieval
The proposed image retrieval system consists of a feature encoder, a pooling block, and a cosine-similarity-based k-nearest-neighbors search. In the following, we describe each block in detail.

Feature Encoder
To overcome the domain gap, we propose a feature encoder that can be trained with any unlabeled dataset. Recent research has shown that self-supervised feature encoders, as proposed in "Contrastive Predictive Coding" (CPC) [9] and "A Simple Framework for Contrastive Learning" (SimCLR) [10], can match the performance of their supervised counterparts on downstream classification tasks on ImageNet. Both frameworks show similar performance, so we chose the simpler one, SimCLR, as the feature extractor for image retrieval.

The SimCLR framework includes a fully convolutional encoder, two augmentation stacks, and a projection head, as illustrated in Figure 4. The two augmentation stacks transform a mini-batch of N input images into two differently augmented views each, i.e. 2N views in total. Both augmentation stacks stem from the same family of augmentations. The encoder extracts representation vectors from the 2N augmented views. Finally, the 2N representation vectors are propagated through the projection head. The following loss term encourages the encoder to map both augmented views of an image to the same representation vector: Suppose sim(u, v) denotes the cosine similarity between the vectors u and v. Then the loss function for a positive pair of example vectors (i, j) is defined as:
$$\ell(i, j) = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\lambda\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\lambda\right)}$$
Here, λ denotes the temperature parameter and z stands for the representation vector produced by the projection head. The loss can be interpreted as a classification in the latent space: the positive example pair must be classified correctly against 2(N-1) negative samples from the same mini-batch. Note that the information encoded in the representation vectors strongly depends on the set of augmentations. Typically, input data is augmented by different color and noise transformations, or by flipping or cropping, to implicitly create an invariance in the algorithm: the same input is presented, but augmented differently, over several epochs during training. The framework presented here enforces invariance to the augmentations explicitly through its loss term. Thus, these simple augmentations are powerful tools that help control the flow of information into the representation vectors. Due to this effect, the final representation of the projection head is almost color invariant. Since the hidden features h still contain color information [10], these features are used for the image retrieval system.
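As a rough illustration of this loss term, the following PyTorch sketch computes the contrastive (NT-Xent) objective for one mini-batch of projection vectors. The function name and the default temperature value are assumptions for this example, not part of the original setup.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent loss sketch: z1 and z2 are the (N, dim) projection vectors
    of the two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # unit length -> dot product = cosine similarity
    sim = z @ z.t() / temperature                         # (2N, 2N) similarity logits
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))            # exclude self-similarity
    # The positive for sample i is its second augmented view at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # Classify the positive view against the 2(N-1) negatives from the same mini-batch.
    return F.cross_entropy(sim, targets)
```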
Pooling
To obtain translation-invariant features from the feature extractor, CBIR systems apply some form of pooling to the extracted feature map. The pooling can be simple, such as mean or max pooling, or more complex, such as R-MAC [11] or SCDA [12]. Here, we focus on selective convolutional descriptor aggregation (SCDA). SCDA is an intuitive and fast aggregation of convolutional feature vectors. An aggregation map A is obtained by summing up all N feature maps. A binary mask M is then calculated as follows:
$$M_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} > \alpha \\ 0 & \text{otherwise} \end{cases}$$
where α denotes the average of A. In the original method, the mask M is further post-processed to eliminate interference caused by noisy parts: a flood-fill algorithm detects and selects the largest connected region, which SCDA uses to find the main object in the image. Since we apply the algorithm to more complex scenes from the automotive context with more than one central object, this step is skipped here. The mask M is, however, still used to mask the feature map. Finally, global average pooling is applied to the masked convolutional features.
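A minimal NumPy sketch of this pooling step, assuming a single (H, W, N) feature map as input and omitting the largest-connected-component step mentioned above; averaging over the selected positions is one possible choice for the final pooling:

```python
import numpy as np

def scda_pool(feature_map):
    """SCDA-style pooling: threshold the aggregation map at its mean and
    average only the selected convolutional descriptors."""
    A = feature_map.sum(axis=-1)                      # aggregation map, shape (H, W)
    alpha = A.mean()                                  # threshold: average of A
    M = (A > alpha).astype(feature_map.dtype)         # binary mask of salient positions
    masked = feature_map * M[..., None]               # zero out non-salient descriptors
    return masked.sum(axis=(0, 1)) / max(M.sum(), 1)  # global average over selected positions
```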
User Feedback
Humans and neural networks often interpret the main features of an image quite differently; the two interpretations can have little in common. Therefore, the proposed CBIR system allows for manual interaction with the search engine. The user provides feedback by selecting the most fitting results of an initial search trial. For the second trial, the query and the selected representation vectors are aggregated and used as the new query representation. Assume we are searching for a specific truck on a highway, and the system returns many results containing vehicles on wide roads but only one satisfactory result showing a truck on a highway. By averaging the features of the query and the satisfactory result, the feature description of the truck is strengthened.
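A possible implementation of this feedback step is simply to average the encodings and re-normalize; the function below is an illustrative sketch, not part of an existing API.

```python
import numpy as np

def refine_query(query_vec, selected_vecs):
    """Aggregate the query encoding with the encodings of user-selected
    results to form the query for the next search trial."""
    stacked = np.vstack([query_vec, *selected_vecs])
    refined = stacked.mean(axis=0)
    return refined / np.linalg.norm(refined)   # keep unit length for cosine similarity
```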
Qualitative Experiment
The augmentation stacks of the SimCLR framework contain random contrast, saturation, brightness, and hue transformations as well as random crop, rotation, and left/right flip (a possible composition is sketched below). The combination of random cropping and random color distortion is important: since random crops from the same image often share a similar color distribution, a model could otherwise maximize the agreement simply by matching color histograms. Note that by designing the augmentation stacks you define the concept of image similarity, because the cosine similarity sim(u, v) is a central part of the contrastive loss. Although large batch sizes are beneficial for contrastive learning, we are limited to a batch size of N=64 on a single RTX 8000. Returning to our initial example, the undetected trucks below highway traffic signs: we trained a Yolo-V3 detection algorithm on the A2D2 dataset [13], with the 2D labels extracted from the semantic instance segmentation.
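One way to assemble such an augmentation stack, assuming a torchvision-based pipeline; the exact parameter values and image size are assumptions for this sketch.

```python
import torchvision.transforms as T

# Illustrative SimCLR-style augmentation stack (parameter values are assumptions).
augment = T.Compose([
    T.RandomResizedCrop(224),                 # random crop, rescaled to a fixed size
    T.RandomHorizontalFlip(),                 # random left/right flip
    T.RandomRotation(degrees=10),             # small random rotation
    T.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8, hue=0.2),  # color distortion
    T.ToTensor(),
])
```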
As can be seen in the video in Figure 1, the trained object detection algorithm does not detect trucks below the highway traffic signs. We assume that the detection algorithm has not seen a sufficient number of contexts like highway traffic signs during training. To enrich the training dataset of the Yolo-V3 model, we search for similar images in an unlabeled dataset. These images are then labeled, augmented, and added to the training set, and the Yolo-V3 model is retrained on the enriched dataset. Augmentations such as saturation, grayscale, hue, and horizontal flipping are assumed to be beneficial without further evaluation. The 16 results in Figure 5 were found quickly and efficiently in big data with the proposed CBIR system and user feedback.
Note that the sequence from the problem statement itself is excluded, since it is not valid to add test examples to the training set and then validate on them.

This process resembles pool-based sampling in active learning [14], with the similarity measure playing the role of an informativeness score. In regular active learning approaches, all images in the database must be propagated through the model to obtain these scores, and since active learning is an iterative algorithm, this must be repeated several times until the model converges. In our approach, the images from the data lake are propagated through the CBIR feature extractor only once, when they are added to the database.
Finally, validating the retrained model on the same sequence again shows the following results:

Summing up, we have found relevant samples in big data with the proposed unsupervised content-based image retrieval system, and these samples help improve the 2D object detection algorithm.
Daniel Hasenklever, Research Engineer at the dSPACE AI-Team
References:
[1] O. Cohen, "Active Learning Tutorial", 2018.
[2] G. Strang, "The Discrete Cosine Transform", 1999.
[3] D. G. Lowe, "Object Recognition from Local Scale-Invariant Features" in International Conference on Computer Vision, Corfu, 1999.
[4] B. Hu et al., "PyRetri: A PyTorch-based Library for Unsupervised Image Retrieval by Deep Convolutional Neural Networks" in International Conference on Multimedia, 2020.
[5] K. He et al., "Deep Residual Learning for Image Recognition", 2015.
[6] K. Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition" in ICLR, 2015.
[7] Dataset "ImageNet"
[8] Apple Blog "Bridging the Domain Gap for Neural Models", 2019
[9] A. v. d. Oord et al., "Representation Learning with Contrastive Predictive Coding", 2019.
[10] T. Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations", 2020.
[11] G. Tolias et al., "Particular Object Retrieval with integral max-pooling of CNN activations" in ICLR, 2016.
[12] X.-S. Wei et al., "Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval", 2017.
[13] "Audi Autonomous Driving Dataset," Audi
[14] Wikipedia Article "Active Learning"