A survey of computer vision in fine-art classification

Jonathan C.T. Kuo
Towards Data Science
10 min read · Mar 20, 2021


Image from Cetinic et al. (2018)

Overview

With the emergence of online galleries and online fine-art marketplaces, digitized fine-art painting collections have become increasingly in demand. Google Arts & Culture is an excellent example of an educational and recreational online platform that makes fine art more accessible, even letting people immerse themselves in virtual art galleries through its VR technology. Another example is Artsy, an online fine-art brokerage that uses a search system to link works of art based on their relationships with each other. Many other applications also require fine-art classification techniques. Therefore, improving the search algorithms for digitized art and the processes for documenting and managing this cultural heritage remains essential and meaningful. However, at least a few challenges remain to be resolved.

Problem

A significant challenge is the shortage of large datasets of labeled digitized artworks. Without substantial data, even a well-established classification model will not yield good results. The design of the classification models and the implementation of model training are equally critical. Therefore, in the following sections, I will first review a data augmentation method proposed by Smirnov & Eguizabal (2018) that tries to overcome the lack of labeled fine-art data, together with their approach to refining transfer learning on some renowned CNNs (Convolutional Neural Networks). Next, a novel approach introduced by Rodriguez et al. (2018) that uses image patches and transfer learning to classify fine-art images will be discussed. Finally, I will compare the aforementioned works with the design proposed by Cetinic et al. (2018), who discovered that instead of focusing on object detection for the classification task, scene recognition and sentiment prediction could generate better results.

Dataset

The datasets used in the three papers mentioned above are:

  • PASCAL VOC
  • Painting dataset
  • TICC Printmaking Dataset
  • Web Gallery of Art (WGA)
  • WikiArt

PASCAL VOC includes 20 classes of photographic images, without stylistic transformation, annotated for object recognition and segmentation tasks. The Painting dataset contains 10 classes of fine-art paintings with labels taken from PASCAL VOC. The TICC Printmaking Dataset is a collection of digitized photographic reproductions of prints made on paper. WGA comprises fine-art paintings from the early 3rd to the 19th centuries. Apart from WGA, WikiArt is one of the largest online digitized painting collections, including more than 250,000 artworks by 3,000 artists from the 19th to the 20th centuries.

Algorithms and Methods

Data augmentation

Compared with some of the best-known datasets, such as ImageNet (14M+ images with 20,000+ categories) and Tencent ML-Images (~18M images with 11,166 categories), far less data is available for fine-art paintings. Fortunately, a group of researchers has found a way to tackle this problem.

Smirnov & Eguizabal (2018) proposed applying style transfer to real-world images obtained from PASCAL VOC. The idea is to combine the content of natural images from PASCAL VOC with images in specific artistic styles, such as Impressionism or Realism. This allows a wide variety of new images to be generated with the same content (objects) but with different artistic styles and textures. An example of the process is shown in the figure below.

Figure 1. An example of the style transfer process. Images (e), (f), and (g) have had styles (b), (c), and (d) applied, respectively. All resulting images present the same content as image (a) but with unique artistic styles. Image from Smirnov & Eguizabal (2018)
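
To make the augmentation idea concrete, here is a minimal sketch of optimization-based style transfer (in the spirit of Gatys et al.) using a pre-trained VGG-19 from torchvision. This is not the exact setup from the paper; the file names, layer choices, iteration count, and loss weights are illustrative assumptions.

```python
# A minimal style-transfer sketch: optimize an image so that its VGG features
# match the content photo while its Gram matrices match the style painting.
# "content.jpg" (e.g., a PASCAL VOC photo) and "style.jpg" (a painting) are
# placeholder file names.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

def load(path, size=256):
    tf = transforms.Compose([transforms.Resize((size, size)), transforms.ToTensor()])
    return tf(Image.open(path).convert("RGB")).unsqueeze(0).to(device)

content, style = load("content.jpg"), load("style.jpg")

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

capture = {1, 6, 11, 20, 29}  # ReLU outputs after each block's first conv

def features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in capture:
            feats.append(x)
    return feats

def gram(f):
    _, c, h, w = f.shape
    f = f.view(c, h * w)
    return (f @ f.t()) / (c * h * w)

content_feats = features(content)
style_grams = [gram(f) for f in features(style)]

img = content.clone().requires_grad_(True)  # start from the content image
opt = torch.optim.Adam([img], lr=0.02)
for _ in range(300):
    opt.zero_grad()
    feats = features(img)
    content_loss = F.mse_loss(feats[2], content_feats[2])   # a mid-level layer
    style_loss = sum(F.mse_loss(gram(f), g) for f, g in zip(feats, style_grams))
    (content_loss + 1e4 * style_loss).backward()
    opt.step()
```

Each (content, style) pair yields a new labeled training image, so a single PASCAL VOC photo can be multiplied into many stylized variants.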

Transfer learning with convolutional neural networks (CNNs)

To benefit from the features learned by pre-trained CNNs and to save computational cost, all three papers adopt transfer learning with different CNN architectures to classify fine-art paintings.

1. CaffeNet and its variations

Cetinic et al. (2018) used the base architecture of CaffeNet, which is a slight modification of AlexNet. It is a shallow network with only eight layers: five convolutional layers followed by three fully connected layers. There are also three max-pooling layers, after the first, second, and fifth convolutional layers. ReLU is the activation function for all layers, and the output layer is connected to a softmax layer. The figure below depicts CaffeNet’s architecture.

Figure 2. CaffeNet’s architecture. Image from https://medium.com/coinmonks/paper-review-of-alexnet-caffenet-winner-in-ilsvrc-2012-image-classification-b93598314160

Since transfer learning aims to take advantage of pre-trained weights, the authors retained the weights of all but the last layer of the pre-trained CaffeNet and replaced the last fully connected layer with a new one whose number of neurons corresponds to the number of target classes in their dataset. They also tested different combinations of frozen layers (layers whose weights are kept the same as in the pre-trained model) to evaluate how freezing affects the final performance of the classification task. The intuition is that the first few CNN layers discover general features such as blobs and edges, while the remaining layers extract more detailed, image-specific features. Interestingly, according to the table below, they found that optimal performance is achieved by freezing only the first one or two convolutional layers’ weights and retraining all other layers.

Figure 3. Comparison of test accuracies for different fine-tuning scenarios. Figure from Cetinic et al. (2018)
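
Here is a minimal sketch of this fine-tuning setup in PyTorch, using torchvision’s AlexNet as a stand-in for CaffeNet (which torchvision does not ship); num_classes and n_frozen_convs are illustrative parameters, not the paper’s exact configuration.

```python
# Freeze the first few conv layers (general features such as edges and blobs),
# retrain the rest, and replace the final FC layer to match the target classes.
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes, n_frozen_convs=2):
    model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
    conv_layers = [m for m in model.features if isinstance(m, nn.Conv2d)]
    for conv in conv_layers[:n_frozen_convs]:
        for p in conv.parameters():
            p.requires_grad_(False)  # keep these pre-trained weights fixed
    # New classification head sized to the fine-art dataset.
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
    return model

model = build_finetune_model(num_classes=27)  # e.g., 27 style classes (illustrative)
```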

Besides focusing on object recognition, the study also discovered that combining scene and object recognition using a pre-trained CNN yielded better results than object recognition alone; the network used for this was pre-trained on a mixture of the Places and ImageNet datasets. This experiment indicates that fine-art classification might benefit from scene features to boost classification accuracy.

2. Image patches or segmentation

Rodriguez et al. (2018) suggest segmenting each image from the dataset into five equal pieces to serve as the CNN model’s input. By combining the weighted classification results of these patches, a higher classification accuracy can be expected than from classifying unsegmented images.

The three main steps are as follows. First, each image from the training set is resized to double its current size. Each image is then segmented into five pieces of equal size: four corner pieces with the same height and width as the original image, and a fifth center piece that overlaps exactly twenty-five percent of each corner piece.

Figure 4. An example of segmenting the original image into five equal-size pieces. Image from Rodriguez et al. (2018)
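
A minimal sketch of this segmentation step with PIL; the file name is a placeholder:

```python
# Double the image size, then crop four corner patches plus a center patch,
# each with the original image's dimensions. The center patch overlaps 25%
# of every corner patch by construction.
from PIL import Image

def five_patches(path):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    big = img.resize((2 * w, 2 * h))
    boxes = [
        (0, 0, w, h),                               # top-left corner
        (w, 0, 2 * w, h),                           # top-right corner
        (0, h, w, 2 * h),                           # bottom-left corner
        (w, h, 2 * w, 2 * h),                       # bottom-right corner
        (w // 2, h // 2, w // 2 + w, h // 2 + h),   # center patch
    ]
    return [big.crop(b) for b in boxes]

patches = five_patches("painting.jpg")  # five patches, each of size w x h
```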

Second, the new set of images (image patches) is fed into several pre-trained CNN models to classify the style of each segment. The pre-trained CNN models used in this experiment are AlexNet, VGG-16, VGG-19, GoogLeNet, ResNet-50, and InceptionV3. It is worth noting that they only replace the last three layers of these networks to match the target number of classes. The focus here is not on the performance of each CNN but on observing whether weighted image patches can bring higher classification accuracy.

Finally, an optimization procedure is performed for each original image to discover the optimal weights of the class probabilities of its segments, based on the classification results from the previous step. Since each image patch is classified independently, meaning patches from the same original image do not influence each other’s classification, different patches of the same image can be assigned different target classes (artistic styles). Hence, the authors applied a genetic algorithm to iteratively search for an optimal combination of weights that determines the final decision class for each original image. The algorithm and its process are illustrated in the figure below.

Figure 5. The weight optimization process for each image patch. Image from Rodriguez et al. (2018)

Here, C_{n,k} is the CNN output for the n-th patch of the k-th image, w_n denotes the corresponding weight of C_{n,k}, and C_{T_k} is the total result vector for the k-th original image. With the optimal weight for each patch found, the final step computes the total probability vector as the weighted sum of the patch probability vectors, C_{T_k} = Σ_n w_n · C_{n,k} (a dot product between the weight vector and the patch outputs), and the class with the maximum total probability is selected. Each patch weight is constrained to be at most 1 (100%), so that no individual patch can dominate the others.
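
To make the aggregation step concrete, here is a minimal sketch of the weighted combination. The genetic-algorithm search for the weights is omitted; the shapes and the example weights are illustrative:

```python
# Given per-patch class probability vectors C[n] from the CNN, compute the
# weighted sum C_T = sum_n w_n * C_n and pick the argmax as the final class.
import numpy as np

def aggregate(patch_probs, weights):
    """patch_probs: (5, num_classes) array; weights: (5,), each 0 <= w_n <= 1."""
    total = weights @ patch_probs   # dot product: C_T_k = sum_n w_n * C_n_k
    return int(np.argmax(total))    # final decision class for the image

probs = np.random.rand(5, 10)
probs /= probs.sum(axis=1, keepdims=True)   # dummy per-patch probabilities
w = np.array([0.6, 0.6, 0.6, 0.6, 1.0])     # e.g., emphasize the center patch
print(aggregate(probs, w))
```

The final style classification results are shown in the figure below.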

Figure 6. Style classification results of a combination of different models and scenarios. Image from Rodriguez et al. (2018)

From the bar chart, we can observe that using the weighted image patches yields higher accuracy in style classification than using only original images as input. Segmentation alone does not solve the problem; however, combining the individual segments with optimal weights shows how image patches can boost classification accuracy by contributing higher-resolution detail. The study therefore concluded that higher-resolution images can add valuable information, such as artistic style cues, to fine-art style classification tasks.

3. CNN fusion with SVM

A common problem in fine-art classification is the lack of data; without enough data, any classification or object detection task becomes difficult. To tackle this problem, Smirnov & Eguizabal (2018) proposed the data augmentation method mentioned above along with a training method that fuses two CNNs with an SVM. The concept is to first train two CNNs, then concatenate their output vectors and feed them into an SVM for final object detection. The flowchart is illustrated in the figure below. One CNN handles object detection and is trained on the PASCAL VOC dataset; the other handles style classification and is trained on the WikiArt dataset. VGG-19 is used as the model for both CNNs. The weights of both CNNs are kept the same as the original pre-trained VGG-19, except for the last fully connected layer, which is tailor-made for the later concatenation of the output feature vectors. The test results show that this approach outperforms traditional object detection and classification methods by five percent. The study demonstrates the efficacy of augmenting fine-art image datasets by applying artistic styles to everyday images, and the feasibility of fusing two CNNs trained on different subjects (objects and styles) with an SVM.

Figure 7. The flow diagram of two CNNs with SVM. Image from Smirnov & Eguizabal (2018)
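
Here is a minimal sketch of that fusion, assuming torchvision VGG-19 backbones and scikit-learn’s SVM; the dummy batch and labels are illustrative stand-ins for real preprocessed images:

```python
# Extract 4096-d feature vectors from an "object" CNN and a "style" CNN,
# concatenate them, and train an SVM on the fused 8192-d features.
import numpy as np
import torch
from torchvision import models
from sklearn.svm import SVC

def make_extractor():
    vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).eval()
    vgg.classifier = vgg.classifier[:-1]  # drop the last FC layer
    return vgg

object_cnn, style_cnn = make_extractor(), make_extractor()

@torch.no_grad()
def fused_features(batch):  # batch: (N, 3, 224, 224) preprocessed images
    return torch.cat([object_cnn(batch), style_cnn(batch)], dim=1).numpy()

X = fused_features(torch.randn(8, 3, 224, 224))  # dummy batch for illustration
y = np.random.randint(0, 10, size=8)             # dummy object labels
svm = SVC(kernel="linear").fit(X, y)
```

In the paper, the two backbones are adapted to PASCAL VOC and WikiArt respectively before their outputs are fused; here both are plain ImageNet-pre-trained models for brevity.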

Future research trends

Several directions could improve current approaches to classifying or detecting fine-art images. In their study, Cetinic et al. (2018) recommend that besides working on fine-art image classification, researchers could also explore the intrinsic interrelationships of image features between digitized artworks. Targets like emotion, quality, and aesthetics in photography and images could also be considered in future research. In addition, it is always a good idea for researchers to collaborate with art historians and experts in related fields to better understand the fine-art domain.

Other studies suggest investigating additional neural networks or CNNs to evaluate their computational cost and classification performance, and to find the optimal balance between the two for this task. For instance, although the InceptionV3 model yields higher accuracy in classifying images, it takes longer to train on the same dataset than AlexNet and other simpler CNNs. Hence, this trade-off is an optimization topic worth investigating.

A creative idea

A common problem found in fine-art classification is the high misclassification rate between Impressionist and Expressionist artworks, because the latter evolved from the former. I came up with an idea, a two-stage classification, that could possibly tackle this problem. Initially, all images from the dataset are classified coarsely. Then, a finer classification specifically designed to distinguish Impressionist from Expressionist images is performed. Before building the second classifier, we need some commonly known characteristics of the two art styles: Impressionism focuses on changes of light, while Expressionism pays particular attention to the expressions on human faces. With this prior information, we can build a classifier that extracts pixel values and facial expressions. If the input image shows a strong indication of facial expressions, we can assume it is an Expressionist artwork. On the other hand, if an image is detected with abundant shifts in pixel values and without facial features, we can say it is an Impressionist work.
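
As a rough illustration of the second stage, here is a heuristic sketch using OpenCV’s face detector as the “facial expression” signal and overall brightness variation as a crude proxy for shifting light. The decision rule and the threshold are entirely hypothetical:

```python
# Stage-two refinement: only re-examine images the coarse classifier placed
# in the two commonly confused styles.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def refine_style(image_bgr, coarse_label):
    if coarse_label not in {"Impressionism", "Expressionism"}:
        return coarse_label
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        return "Expressionism"  # strong facial presence -> likely Expressionist
    # Abundant shifts in pixel values without faces -> likely Impressionist.
    # The threshold 40 is a hypothetical value, not a tuned one.
    return "Impressionism" if gray.astype(np.float32).std() > 40 else coarse_label
```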

Conclusion

Computer vision in fine-art classification is an interesting topic that could bring great value to the real world. With digitized artworks becoming more prevalent in recent years and the market for online art auctions booming, fine-art classification has started to play a crucial role in these trends. With this technology, it would be easier for all parties to identify artworks dating from ancient history to the modern world, saving the human labor needed to confirm, classify, or even authenticate each piece. However, further studies are needed to fully harness the power of computer vision so that people can benefit from this technology.

References

Cetinic, E., Lipic, T., & Grgic, S. (2018). Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications, 114, 107–118.

Rodriguez, C. S., Lech, M., & Pirogova, E. (2018, December). Classification of style in fine-art paintings using transfer learning and weighted image patches. In 2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS) (pp. 1–7). IEEE.

Smirnov, S., & Eguizabal, A. (2018, October). Deep learning for object detection in fine-art paintings. In 2018 Metrology for Archaeology and Cultural Heritage (MetroArchaeo) (pp. 45–49). IEEE.
