Multi-View Image Classification

From Logistic Regression to Multi-View Convolutional Neural Networks (MVCNN)

Samy TAFASCA
Towards Data Science



Introduction

Not long ago, I took part in a machine learning hackathon hosted by Daimler-Benz. The problem we were presented with was rather interesting and not so common. So I decided to write an article about it, in case my approach(es) can help someone else faced with a similar task.

In the first chapter of this article, I will try to present my winning approach, walk you through my thought process, justify some design choices and elaborate on some concepts of interest. The second chapter, on the other hand, will be dedicated to a more sophisticated neural network-based solution that is inherently adapted to this kind of problem. At the bottom of the page, you can find the link to my GitHub repository where I share the code to support the information presented in this post.

The problem we had to solve was to classify 3D models of Car Plugs into one of five available categories, based on 2D visual information. Solving this task is of great value, especially in the context of autonomous planning of assembly processes for car manufacturing.

The most interesting aspect of the problem was the availability of 8 images for each Car Plug, 6 of which correspond to orthographic projections of the object (i.e. different views: top, bottom, front, rear, left and right) while the other 2 are random isometric projections. Figure 1 below shows an example of orthographic projections.

Figure 1: Orthographic projections.

The Challenges

There were three main challenges in this contest, and scoring well rested on identifying these obstacles and addressing them properly. The first, and probably the most obvious, is a case of class imbalance where two classes are a lot less common in the dataset than the other three. It was important to take this into account, because the performance metric used in the competition was an extremely uneven weighted accuracy that factors in the distribution of the classes (i.e. rare classes are comparatively more important to classify correctly, sometimes by a factor of 80). The second challenge was that we had a timeframe of five hours to code everything up, without access to the cloud or any other means of accelerated computing (unless our personal laptops came equipped with Nvidia GPUs, which most did not). Lastly, the availability of multiple images for each object to classify posed the question of how to combine the visual information gathered from all 8 images in order to make a well-informed prediction. This last challenge is the main motivation behind this post.

For the problem of class imbalance, resampling techniques were not a good option. Indeed, downsampling was practically impossible as we only had 833 Car Plugs in total (totaling 833x8 = 6664 images) and the rare classes had a very low number of samples. Furthermore, using data augmentation, though technically possible, did not seem like a good idea to me, simply because the orthographic projections that produced the image views are a type of engineering drawing with a strict set of rules. Consequently, applying any sort of geometric transformation would result in images that do not belong to the original distribution of the data: using those for training is basically like pushing your model to learn how to predict images it will never face, thus unnecessarily increasing the complexity of the problem. It is also worth noting that colors have no meaning in our images. To get a better sense of this, Figure 2 below provides an example of the different raw images of a Car Plug.

Figure 2: Different views of a Car Plug (code name anonymized).

The way I decided to address the class imbalance problem is through Cost-Sensitive Learning, which basically means modifying the cost function of the machine learning algorithm by introducing a set of weights, to assign heavier penalties for mistakes that occur in “important” classes. This change helps steer model training towards making fewer of those costly mistakes at the expense of potentially making more of the “cheaper” mistakes. To illustrate this point further, imagine we have 50 data points, 49 belonging to class A and 1 belonging to class B. For the sake of simplicity, we also assume the 0–1 loss (0 for a correct prediction, 1 otherwise) instead of the prevalent cross-entropy loss. Under normal circumstances, where the weights of all the classes are 1, the model will try to learn how to predict most of the data points correctly, irrespective of their class. But if we assume that misclassifying a sample of class B costs 100 times as much as misclassifying a sample of class A, and we modify our cost function to take this into account, then the model will end up choosing to classify that one data point of class B correctly, even if it means failing on all 49 others, because that is the setting that minimizes the overall cost: 49*1 + 0*100 = 49. Instead, if we made a mistake on the class B data point and got all the class A samples right, the cost would be 49*0 + 1*100 = 100, which is still much higher than the previous case. Using cost-sensitive learning in practice is as simple as passing a weight dictionary to the algorithm constructor (for the algorithms that support it, at least).
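
With scikit-learn, for example, this boils down to a one-liner. Below is a minimal sketch; the class labels and weight values are hypothetical, the real weights were dictated by the competition's class distribution.

    # Cost-sensitive learning via class weights (sketch, hypothetical weights).
    from sklearn.linear_model import LogisticRegression

    # Heavier penalties for mistakes on the rare, "important" classes.
    class_weights = {0: 1, 1: 1, 2: 1, 3: 40, 4: 80}

    clf = LogisticRegression(class_weight=class_weights, max_iter=1000)
    # clf.fit(X_train, y_train)  # X_train, y_train: feature matrix and labels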

For the time constraint and absence of suitable computational resources, I decided to stay away from any neural network-based approaches, simply because it would take too much time to configure properly and train on a CPU. Instead, I opted for a more traditional computer vision solution, combining popular hand-engineered feature extractors with classical machine learning algorithms. And this, folks, is the tale of how a Logistic Regression won the 1st prize in a computer vision hackathon.

First, we will introduce the approach I used during the hackathon. Then, we will present a potentially better alternative based on a neural network architecture that can inherently handle the multi-view problem. I had thought of this second solution as well during the contest, but I knew I wouldn’t be able to implement it given the time and resource restrictions. However, I tried it afterwards, and it provided better results as expected.

Image Processing

As we can see in Figure 2, there are a couple of issues we need to deal with before feeding these images or their vector representations to any machine learning algorithm. First, we need to get rid of the unnecessary visual artifacts on the right and top sides (search bar, menu, etc.). Second, we need to remove the shades of gray in the background, which should help training significantly by decreasing noise levels in the data.

It turns out we can do both of these things using a simple process: we first detect edges using the Canny Edge Detector, then we apply two successive morphological operations: Dilation to make the edges larger and reconnect the parts that got disconnected after edge detection, followed by Erosion to thin the edges back to normal (a dilation followed by an erosion is usually called a Closing, and is used to fill holes that are smaller than the kernel of those operations). Afterward, it is just a matter of finding all the contours and keeping the largest one, which will correspond to the Car Plug in the middle. This way, we also discard all the extra artifacts around the borders. Figure 3 below shows images of a Car Plug before and after this process.

Figure 3: Preprocessing of images: Canny edge detection followed by two morphological operations (Dilation and Erosion), followed by contour detection to isolate the object in the middle.
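
In OpenCV, the whole procedure only takes a few lines. Here is a rough sketch; the Canny thresholds, kernel size and number of iterations are assumptions that would need to be tuned on the actual images.

    # Isolate the Car Plug: edges -> closing -> largest contour (sketch).
    import cv2
    import numpy as np

    def isolate_object(image_bgr):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 50, 150)                   # detect edges

        kernel = np.ones((5, 5), np.uint8)
        dilated = cv2.dilate(edges, kernel, iterations=2)  # thicken and reconnect edges
        closed = cv2.erode(dilated, kernel, iterations=2)  # thin them back (closing)

        # Keep only the largest contour, i.e. the Car Plug in the middle.
        contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        largest = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(largest)

        # Crop around the object, dropping the background and border artifacts.
        return image_bgr[y:y + h, x:x + w]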

At this point, we may be thinking that since color has no meaningful value, we might as well convert these images to grayscale. In the first approach, based on classical machine learning algorithms, we will do just that. But for the following neural network-based one, we keep the 3 channels since we will make use of transfer learning.

Approach 1: Feature Extraction & Machine Learning

Before the era of deep learning, researchers used to manually craft feature extractors to characterize images. But what do we mean exactly by a feature extractor? In essence, it is a function that extracts a numeric representation of an image, or a part of it, designed to capture some of its distinguishing characteristics. Imagine we had an RGB image, consisting of three 2D matrices, one for each color channel, with values between 0 and 255. We could quantize each of these matrices into, say, 4 bins (0–63, 64–127, 128–191, 192–255), where each bin contains the pixel values within the associated range. This way, we can represent a 2D matrix with a vector of 4 numeric values (x_1, x_2, x_3, x_4) where each x_k is the number of values in the matrix that fall within bin number k. We can repeat this process for the 3 channels of the RGB image, and after concatenation, we would end up with a vector of size 12 (4 bins * 3 channels) representing this image. We just created a simple feature extractor that captures the color information of an image. Although simple in nature, such a technique can actually be used for tasks where color is the most informative factor. Figure 4 shows this feature extractor in action. Notice how the bins of each image’s dominant color stand out in the feature vector. This representation definitely captures the differences between the three images.

Figure 4: Example of a color feature extractor.
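
For the curious, here is a minimal implementation of this toy color extractor with NumPy: each channel is quantized into 4 bins, giving a 12-dimensional feature vector per image.

    # Toy color feature extractor: 4 bins per RGB channel -> 12-dim vector.
    import numpy as np

    def color_histogram(image_rgb, bins=4):
        features = []
        for channel in range(3):                           # R, G, B
            counts, _ = np.histogram(image_rgb[..., channel],
                                     bins=bins, range=(0, 256))
            features.append(counts)
        return np.concatenate(features)                    # shape: (12,)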

The academic community came up with a variety of manually designed feature extractors for different purposes over the years. A few important ones are HOG (Histogram of Oriented Gradients), SIFT (Scale Invariant Feature Transform), SURF (Speeded-Up Robust Features), LBP (Local Binary Patterns), etc. The main focus in the design of these extractors is to ensure robustness to certain properties of the images: scale, illumination, partial occlusion, rotation, distortion, etc. These methods were a central pillar of traditional computer vision until the advent of deep learning, where they were essentially replaced by neural networks that learn their own feature extraction mechanism, end-to-end, in relation to the task they need to solve. Unsurprisingly, this helped them yield superior results overall.

Alright, enough chit-chat! So how can we use these feature extractors to solve our problem? First, we will discuss how they were traditionally used to solve standard image classification problems, then we shall present how I adapted this process to the multi-view task.

Most feature extractors operate by first detecting key points of interest in an image, then producing a vector description for each key point. These descriptors are usually numeric vectors of a fixed size.

Descriptors from all images are collected, and then clustered to come up with a set of prototypes (i.e. centroids) that act as a “Vocabulary of Visual Words”. Then, we perform image quantization: we find the visual words in each image and create the count feature vector, similar to how we would do it in the Bag of Words approach typically encountered in text classification tasks. The steps of the process are detailed next:

  1. Extract all features from all “training” images (potentially many features per image), after converting them to grayscale, into a matrix of shape FxM where F is the number of features found and M is the dimension of the feature extractor (typically 64 or 128).
  2. Cluster the FxM matrix using a clustering algorithm (e.g. K-means) where K, the number of clusters, will represent our vocabulary of visual words found in the entire dataset. Each cluster is then a visual word.
  3. For each image, extract all the features, predict their clusters, and count how many features fall under each cluster. In other words, we count the occurrence of each visual word of the vocabulary (also called codebook) in the image. The result is a feature vector of size (1, K) representing the image. Applied to all the images in the dataset, we end up with a matrix of shape (N, K) where N is the number of images and K is the size of the vocabulary (i.e. number of clusters). In other words, a matrix where rows represent our images (data points) and columns represent the attributes or features. Each cell of this matrix represents the number of features found in the row image, that belong to the column cluster.
  4. Use any classical machine learning algorithm on this matrix for classification: Logistic Regression, SVM, Random Forest, MLP, etc. That said, tree-based methods should be avoided in such situations because the problem is likely linear, as is the case in most text classification tasks. (A short code sketch of this pipeline follows this list.)
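
Here is a condensed sketch of steps 1 to 3 in Python. Note that I use SIFT here because SURF lives in opencv-contrib and may not be available in a default install; during the hackathon I used SURF with 64-dimensional descriptors, and the codebook size of 200 is the one I settle on later in this post.

    # Bag of Visual Words: descriptors -> codebook -> per-image count vectors (sketch).
    import cv2
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    sift = cv2.SIFT_create()

    def extract_descriptors(gray_image):
        _, descriptors = sift.detectAndCompute(gray_image, None)
        return descriptors                                # (num_keypoints, 128) or None

    def build_codebook(gray_images, k=200):
        all_desc = [d for img in gray_images
                    if (d := extract_descriptors(img)) is not None]
        all_desc = np.vstack(all_desc)                    # the F x M matrix of step 1
        return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_desc)

    def bag_of_visual_words(gray_image, kmeans):
        desc = extract_descriptors(gray_image)
        if desc is None:
            return np.zeros(kmeans.n_clusters, dtype=int)
        words = kmeans.predict(desc)                      # assign each descriptor to a visual word
        return np.bincount(words, minlength=kmeans.n_clusters)  # the (1, K) count vector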

Because a picture is worth a thousand words, let us further explain this process through a simple flow diagram. Notice the keypoints detected (colored circles) on the grayscale images after Step 1.

Figure 5: Flow diagram of the process of converting a set of images into a feature matrix where each row is an image and each column is an attribute or feature.

Armed with this knowledge, we can come up with a way to adapt this approach to our multi-view classification problem. The change we need to introduce is actually very straightforward, and happens at step 3: instead of extracting features from a single image and turning them into a bag-of-words feature vector, we extract features from all 8 images, then turn them into a single feature vector representing the Car Plug. If we were to apply the same approach to a multi-view text classification task (where a concept consisting of multiple texts needs to be classified), this would amount to combining the words found in all the texts related to the given concept, and then building the word occurrence feature vector. Figure 6 visualizes the process.

Figure 6: Combining features extracted from all 8 view images into a single feature vector for the Car Plug.
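
Building on the sketch above, the multi-view adaptation is a one-liner: pooling the descriptors of all 8 views before counting is equivalent to summing the per-view count vectors. Here, views is assumed to be a list of 8 preprocessed grayscale images of the same plug and kmeans the codebook fitted earlier.

    # One feature vector per Car Plug by summing the histograms of its 8 views.
    import numpy as np

    def plug_feature_vector(views, kmeans):
        histograms = [bag_of_visual_words(v, kmeans) for v in views]
        return np.sum(histograms, axis=0)                 # shape: (K,)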

Alright, now let us talk about some practicalities. Most of my decisions were made in an effort to optimize for both time and performance. In terms of feature extractors, SIFT is a pretty solid choice, but I actually used SURF, which is basically a faster version of SIFT with relatively similar performance. For the dimension of the SURF features, I chose 64 instead of 128, mainly to speed up any training further down the line. I expect 128 to provide slightly better performance though. The resulting matrix had about 800K features of size 64. Choosing the number of clusters K in K-means was the next important decision, because K represents the number of visual words that will make up the columns of the final feature matrix. Oftentimes in such situations, the higher the value the better, but the law of diminishing returns applies, and past a certain point, the marginal improvement will not be worth the complexity increase. In my case, I opted for K=200, but values up to 800 are standard and should be experimented with. After this point, I just created the final feature matrix of size 833x200 (number of Car Plugs x number of clusters). Figure 7 shows two examples of feature vectors and one randomly selected image view for each plug.

Figure 7: Two examples of key-point detection on a single view and the final feature vector of the associated Car Plug.

After training a Logistic Regression model, I ended up with around 87% weighted accuracy on average (the unweighted accuracy is easily in the 90–95% range). This is a pretty decent score, but can we do better? We can definitely try!
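
For reference, the weighted accuracy is just an accuracy computed with per-class sample weights. A sketch of how it could be computed is shown below; the weight values are made up, the real ones were fixed by the organizers.

    # Weighted accuracy with per-class sample weights (hypothetical weights).
    import numpy as np
    from sklearn.metrics import accuracy_score

    class_weights = {0: 1, 1: 1, 2: 1, 3: 40, 4: 80}

    def weighted_accuracy(y_true, y_pred):
        sample_weights = np.array([class_weights[c] for c in y_true])
        return accuracy_score(y_true, y_pred, sample_weight=sample_weights)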

Approach 2: Multi-View Convolutional Neural Networks

This section is devoted to a custom neural network architecture that handles the multi-view problem. If we step back and take a second to think about the problem, how would we use a neural network to map our inputs to their class? Well, in a standard image classification problem, we’d take the base of a popular CNN architecture (ResNet, Inception, VGG, etc.) as a feature extractor, then add a classifier block on top of it, consisting of one or multiple dense layers, as a form of transfer learning. If we tried to use a similar principle in our multi-view image task, we could imagine creating the base block of the network (the feature extractor) multiple times, one for each image view. This way, we extract features from all 8 images separately using transfer learning. However, we will still be left with the same question we had before: instead of having to combine the information from the images at the beginning, we delayed the matter, and now we have to combine the information from multiple sets of feature maps (i.e. the feature maps representing each view, produced by the base CNNs). One way to deal with this problem is to stack those feature maps together. For example, if each view’s feature maps had a shape of KxKxC, we could stack the 8 of them along the channel dimension into one big set of feature maps of shape KxKx8C. Figure 8 illustrates this process.

Figure 8: Combining the views’ feature maps by stacking them across the channel dimension.
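
In PyTorch terms, this naive stacking would look something like the following (the shapes are purely illustrative).

    # Naive stacking of 8 views' feature maps along the channel dimension.
    import torch

    view_maps = [torch.randn(1, 512, 7, 7) for _ in range(8)]   # one map per view
    stacked = torch.cat(view_maps, dim=1)                       # shape: (1, 8*512, 7, 7)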

The problem with this approach, however we choose to stack, is that it assumes there is an order to the image views, when in fact there isn’t one. If we stack the feature maps of the top view first and the bottom view last, then we are expected to do the same thing for each and every Car Plug. However, there is no reference specifying what is top for a Car Plug and what is bottom; any view could be considered any side. Put differently, if we take two Car Plugs and one view of the first, there is no way to find the “equivalent” view of the second. Thus, any assumption based on order should be avoided.

We are left with a second option: finding an operation that combines information from multiple values into one. It turns out there is one such function in CNNs, and we call it Pooling. We could use some form of Pooling across the view dimension to combine the 8 sets of feature maps into one of the same shape, thereby combining information from all image views. This is all good and well, but it’s just a theoretical idea so far, is there any evidence that this can actually work? Well, one such proof can be found in text classification, where we can combine the word embeddings of every word in a text into a single embedding, for example by taking their average. We are trying to do something similar for our image views here. But perhaps a better proof would be to find a paper that attempted such an approach and provides experimental results supporting it. A quick Google search will reveal that there is one such paper (not the only one though), and it’s conveniently called: Multi-View Convolutional Neural Networks for 3D Shape Recognition. Figure 9 shows the architecture proposed in the paper, which is almost exactly what we explained above.

Figure 9: Each object view is fed to the same CNN1 base, then we apply a pooling operation across the view dimension. The pooled view is then fed into a second CNN2 before being classified. The authors use the second CNN in order to learn a compact representation of the 3D shape, which they use for both classification and retrieval. For simple classification, we can do away with it [1].
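
The view-pooling step itself is a single tensor operation. In the paper it is an element-wise maximum across the view dimension, which makes no assumption about the order of the views.

    # View pooling as in the MVCNN paper: element-wise max across the view dimension.
    import torch

    view_maps = torch.randn(8, 512, 7, 7)        # (views, channels, H, W) for one object
    pooled = view_maps.max(dim=0).values         # shape: (512, 7, 7)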

I implemented this MVCNN architecture in PyTorch and used a ResNet34 Base for the CNN1. After an Average View Pooling layer, I added a block on top, consisting of a series of Dense layers and Dropout regularization which I call the Classifier Block. For training, I initially fixed the base CNN1 and only trained the Classifier Block: this is typically called feature extraction in transfer learning. Then, I unfroze all the weights in the entire network, reduced the learning rate to a very low value, and trained the whole MVCNN end-to-end: this is referred to as fine-tuning in transfer learning. Ultimately, my weighted accuracy was hitting the 98% mark … pretty impressive, huh? Not so much actually. Aside from the weird performance metric and class imbalance, high scores should be expected because this task is arguably not very challenging for machine learning.
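
Below is a simplified PyTorch sketch of this architecture and the two-stage training schedule. The layer sizes, dropout rate and learning rates are assumptions for illustration, not the exact values from my notebook.

    # MVCNN sketch: shared ResNet34 base, average view pooling, classifier block.
    import torch
    import torch.nn as nn
    from torchvision import models

    class MVCNN(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            resnet = models.resnet34(weights="IMAGENET1K_V1")
            self.base = nn.Sequential(*list(resnet.children())[:-1])  # CNN1, shared by all views
            self.classifier = nn.Sequential(                          # classifier block
                nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
                nn.Linear(256, num_classes),
            )

        def forward(self, x):                    # x: (batch, views, 3, H, W)
            b, v, c, h, w = x.shape
            feats = self.base(x.view(b * v, c, h, w)).view(b, v, -1)  # (batch, views, 512)
            pooled = feats.mean(dim=1)                                # average view pooling
            return self.classifier(pooled)

    model = MVCNN()

    # Stage 1 (feature extraction): freeze the base, train only the classifier block.
    for p in model.base.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)

    # Stage 2 (fine-tuning): unfreeze everything and train end-to-end with a much lower lr.
    for p in model.parameters():
        p.requires_grad = True
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)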

Yeah, yeah I know you’re probably here just for the code, and I’m happy to oblige! You will find everything you need in this github repo: code. The notebook is well commented, but unfortunately, I cannot share the data. If you’re keen on experimenting with the techniques proposed here, and you don’t have a proper dataset, think of using the benchmark for multi-view image classification: ModelNet40. If you’re more interested in the first approach, any normal image classification dataset would do.

That’s it for me folks, I hope you learned a thing or two. Don’t hesitate to come back if you enjoyed the content, I’ll be pushing more articles soon.

Until then, sayonara ~

References

[1] Su, Hang, et al. “Multi-view convolutional neural networks for 3d shape recognition.” Proceedings of the IEEE international conference on computer vision. 2015.
