Clothes and color extraction with Generative Adversarial Network

In this post I’m going to talk about clothes and color extraction from images using a Generative Adversarial Network (GAN).

Aytan Abdullayeva
Towards Data Science


Let’s briefly go through the key notions used in this article.

Image segmentation: In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels). The goal of segmentation is to simplify the representation of an image into something that is more meaningful and easier to analyze.

There are two levels of granularity within the segmentation process:

  • Semantic segmentation — classifies an object’s features in the image and divides sets of pixels into meaningful classes that correspond with real-world categories.
  • Instance segmentation — identifies each instance of each object featured in the image instead of categorizing each pixel like in semantic segmentation. For example, instead of labeling nine chairs as a single “chair” region, it will identify each individual chair (see image below).
Semantic Segmentation vs Instance Segmentation (source)

A neural network is a computational learning system that uses a network of functions to understand and translate a data input of one form into a desired output, usually in another form. The concept of the artificial neural network was inspired by human biology and the way neurons in the human brain function together to understand inputs from the human senses.

A Generative Adversarial Network (GAN) is a specific type of neural network (invented by Ian Goodfellow and his colleagues in 2014) which consists of two networks: a generator and a discriminator. During training these networks compete with each other and thereby improve one another. Training starts from random noise: the generator produces an output from this noise (since the input is random, the initial outputs are noisy as well). The discriminator compares that output with samples from the dataset, recognizes that it is far from the original data, and returns negative feedback to the generator. Over time, the generator produces better outputs according to the feedback collected from the discriminator. In the end, the discriminator finds it hard to tell whether a sample produced by the generator is real or fake. The key point of this training process is to approximate the probability density function of the training set and to use it to generate new samples which do not exist in the dataset.
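
To make the generator/discriminator interplay concrete, here is a minimal sketch of a single adversarial training step in tf.keras. It uses toy MLP networks and random “real” data purely for illustration; it is not the model used later in this article.

```python
import tensorflow as tf

LATENT_DIM, DATA_DIM = 32, 64

# Toy generator: noise vector -> fake sample.
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(DATA_DIM, activation="tanh"),
])
# Toy discriminator: sample -> single real/fake logit.
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(DATA_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_batch):
    noise = tf.random.normal([tf.shape(real_batch)[0], LATENT_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_batch = generator(noise, training=True)
        real_logits = discriminator(real_batch, training=True)
        fake_logits = discriminator(fake_batch, training=True)
        # Discriminator: label real samples 1 and generated samples 0.
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
        # Generator: try to make the discriminator output 1 for its fakes.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss

# In practice real_batch would come from the training data; random values stand in here.
print(train_step(tf.random.uniform([16, DATA_DIM], -1.0, 1.0)))
```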

Many interesting applications of GANs exist: text-to-image and image-to-image translation, image resolution enhancement, detection of fake paintings, NVIDIA’s fake face generator, etc.

In this post I’m going to focus on image-to-image translation.

Clothes segmentation model

For the training process I used the Clothing Co-Parsing (CCP) dataset, which contains:

  • 2,098 high-resolution street fashion photos with 59 tags in total
  • A wide range of styles, accessories, garments, and poses
  • All images have image-level annotations
  • 1,000+ images have pixel-level annotations
Clothing Co-Parsing dataset

Only the 1,000 pixel-wise segmented images are used for training. As the network model I used a pix2pix implementation of GAN in Keras from the following repo: source. I changed the initial code slightly: the size of the model’s input/output was 256x256 (the input image resolution), and for better quality I changed it to 512x768.
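
For illustration, here is a minimal sketch of where the input resolution enters a pix2pix-style Keras generator. This is not the repo’s actual code, just a toy encoder-decoder, and it assumes 512x768 means height=512 and width=768; swap the two numbers if the convention is the other way round.

```python
from tensorflow import keras
from tensorflow.keras import layers

IMG_H, IMG_W = 512, 768   # was 256, 256 in the original implementation

def build_generator():
    inp = keras.Input(shape=(IMG_H, IMG_W, 3))
    # Encoder: each stride-2 conv halves the spatial size, so both
    # dimensions must stay divisible by 2 at every level.
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(inp)
    skip = x
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)
    # Decoder: mirror the encoder with transposed convolutions.
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([x, skip])   # U-Net style skip connection
    out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(x)
    return keras.Model(inp, out, name="generator")

print(build_generator().output_shape)  # (None, 512, 768, 3)
```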

Why did I decide to use a GAN? Segmentation tasks can also be done by plain CNNs, but compared to them a GAN can learn from a small amount of data and still generate good results. During training, the generator saw both the original images and their paired pixel-wise segmented versions from the training set.
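
The “paired” part is what makes this an image-to-image (pix2pix-style) GAN: the discriminator looks at the photo and a segmentation map together and decides whether that segmentation belongs to that photo. A simplified sketch of such a paired discriminator (again, not the repo’s exact code) could look like this:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_paired_discriminator(h=256, w=256):
    photo = keras.Input(shape=(h, w, 3))   # original street-fashion photo
    seg = keras.Input(shape=(h, w, 3))     # real or generated segmentation map
    x = layers.Concatenate()([photo, seg]) # condition the discriminator on the pair
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2D(1, 4, padding="same")(x)  # PatchGAN-style per-patch real/fake logits
    return keras.Model([photo, seg], out)
```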

First, I tried to train the network on the original CCP dataset, i.e. with all 59 classes. The experiment failed: the distribution of clothing types in the dataset is heavily imbalanced, and the training set contains only a few instances of some classes. For example, there are about 400 images with trousers/pants, but fewer than a hundred with t-shirts. This imbalanced and small dataset caused poor segmentation quality in inference mode. Then I experimented with merging some classes together (pants/trousers/jeans, all types of shoes, all types of bags, etc.). This reduced the number of classes to 29, but the segmentation quality was still not as good as desired.

Segmentation results for 29 classes

Finally, I succeeded with a model for “four classes”: background, skin, hair, and all clothes and accessories on the person. This kind of dataset is balanced, because all images in the training set contain instances of all four classes (hair is present in almost all of them). The model was trained for 3,000 epochs on the 1,000 images.

Sample image from dataset for “four classes”
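
The four-class ground truth can be produced by collapsing the original CCP labels. The sketch below shows the idea; the skin/hair label ids and the output color palette are illustrative placeholders, not the actual values from the CCP annotation files.

```python
import numpy as np

BACKGROUND_ID = 0            # assumed background id; check the CCP label list
SKIN_ID, HAIR_ID = 41, 19    # hypothetical ids, look them up in the CCP label list

def to_four_classes(label_map: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of CCP label ids to 0=background, 1=skin, 2=hair, 3=clothes."""
    out = np.full(label_map.shape, 3, dtype=np.uint8)   # default: clothes/accessories
    out[label_map == BACKGROUND_ID] = 0
    out[label_map == SKIN_ID] = 1
    out[label_map == HAIR_ID] = 2
    return out

# pix2pix learns an image-to-image mapping, so the four classes are rendered
# as a color-coded target image (the colors below are arbitrary).
PALETTE = np.array([[128, 128, 128],   # background
                    [255, 224, 189],   # skin
                    [80, 50, 30],      # hair
                    [0, 128, 255]],    # clothes and accessories
                   dtype=np.uint8)

def to_color_mask(label_map: np.ndarray) -> np.ndarray:
    return PALETTE[to_four_classes(label_map)]
```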

While testing in inference mode on hundreds of images, I realized that the model is sensitive to the background: it fails on inputs where the background is not blurred, or where the clothes and the background have similar colors. For an artificial neural network, such failure cases are normal, since it doesn’t learn things the way the human brain does. Because the network learns from shapes, colors, and many features that are not obvious to humans, we can’t predict with certainty how it will behave in different cases.

Left pair: the background is not blurred; right pair: the background color is similar to the clothes

But our model works quite well for images with blurry backgrounds:

Positive results

So how can we solve this problem? I tried to replace the background of the original image with a solid color manually and realized that with this kind of input the model produces much better results. But how can this job be automated? Mask-RCNN to the rescue!

Mask R-CNN is an extension of the object detection algorithm Faster R-CNN with an extra mask head. This mask head lets us segment each object pixel-wise and extract each object separately, without any background.

A sample output of a Mask R-CNN model trained on the COCO dataset (source)

I’m going to use the person mask generated by the Mask R-CNN model. I used the following Keras implementation of Mask R-CNN: source

So before the input image is fed into our model, a preprocessing step is performed. First, the person is segmented from the input image using the Mask R-CNN model and the mask it produces. With this mask we can substitute all pixels which do not belong to the person with a solid color, in our case grey. I tried three background colors (black, white, grey), and only grey passed the tests (the reason is not clear to me, it’s magic! I hope to find the answer soon).
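
The background replacement itself is a simple masking operation. Here is a minimal sketch of the idea, assuming the person mask has already been obtained from the Mask R-CNN detections as a boolean array:

```python
import numpy as np

GREY = np.array([128, 128, 128], dtype=np.uint8)   # the background color that worked best

def replace_background(image: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) uint8, person_mask: (H, W) bool -> copy with a grey background."""
    out = np.empty_like(image)
    out[...] = GREY                          # start from an all-grey canvas
    out[person_mask] = image[person_mask]    # keep only the pixels belonging to the person
    return out

# person_mask is assumed to come from the Mask R-CNN detections, e.g. the mask
# of the highest-scoring "person" instance at the image resolution.
```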

Thus, instead of feeding the original image to the model, the background-replaced version is provided as input.

From left to right: original image, segmentation result by our model, result after preprocessing with Mask-RCNN model. (case1 image source)

Model use cases

So far so good. But how can we use these pixel-wise segmented images?

It’s possible to cut the clothes out of the image and, in addition, to get the main colors of the clothes as well as the skin and hair color. The different classes are cut out of the segmented image by color thresholding, which gives us a mask for each of them. For this purpose I used the HSV (Hue, Saturation, Value) color space. The reason for using it instead of RGB is that HSV describes colors similarly to the way the human eye tends to perceive them. In RGB all three components are related to the color of the light, whereas in HSV the hue varies relatively little with changes in external lighting. For example, two shades of red may have similar hue values but totally different R, G and B values.

Use-cases of model result (image source)
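
As a rough illustration of this use case, the sketch below thresholds the color-coded segmentation in HSV with OpenCV to get a per-class mask, then reads off the dominant color of that region from the original photo with k-means. The HSV ranges are illustrative placeholders, not the exact thresholds used here.

```python
import cv2
import numpy as np

def class_mask(segmentation_bgr: np.ndarray, lower_hsv, upper_hsv) -> np.ndarray:
    """Return a 0/255 mask of pixels whose segmentation color falls in the given HSV range."""
    hsv = cv2.cvtColor(segmentation_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, np.array(lower_hsv, np.uint8), np.array(upper_hsv, np.uint8))

def dominant_color(photo_bgr: np.ndarray, mask: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster the masked pixels with k-means and return the largest cluster's BGR center."""
    pixels = photo_bgr[mask > 0].astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3, cv2.KMEANS_RANDOM_CENTERS)
    return centers[np.bincount(labels.flatten()).argmax()].astype(np.uint8)

# Hypothetical usage, assuming the "clothes" class is rendered in a blue-ish color
# in the segmentation output:
# photo = cv2.imread("person.jpg")
# seg = cv2.imread("person_segmented.png")
# clothes = class_mask(seg, (100, 100, 100), (130, 255, 255))
# print(dominant_color(photo, clothes))                # main clothes color in BGR
# cutout = cv2.bitwise_and(photo, photo, mask=clothes) # clothes cut out of the photo
```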

References:

  • https://missinglink.ai/guides/neural-network-concepts/instance-segmentation-deep-learning/
  • https://deepai.org/machine-learning-glossary-and-terms/neural-network
  • https://medium.com/@fractaldle/mask-r-cnn-unmasked-c029aa2f1296
