
Zalando Dress Recommendation and Tagging

Utilize images and textual descriptions to suggest and tag products

Photo by Becca McHaffie on Unsplash

In Artificial Intelligence, Computer Vision techniques are massively applied. A nice field of application (one of my favorites) is the fashion industry. The availability of resources in terms of raw images allows for developing interesting use cases. Zalando knows this (I suggest taking a look at their GitHub repository) and frequently develops amazing AI solutions or publishes juicy ML research studies.

In the AI community, the Zalando research team is also known for the release of Fashion-MNIST, a dataset of Zalando’s article images that aims to replace the traditional MNIST dataset in the study of Machine Learning. Recently they released another interesting dataset: Feidegger, a dataset composed of dress images and related textual descriptions. Like the previous one, this data was donated by Zalando to the research community to experiment with various text-image tasks such as captioning and image retrieval.

In this post I make use of this data to build:

  • a Dress Recommendation System based on image similarity;
  • a Dress Tagging System based only on the textual description.

THE DATASET

The dataset itself consists of 8732 high-resolution images, each depicting a dress available on the Zalando shop against a white background. For each image, five textual annotations in German were provided, each generated by a separate user. The example below shows 2 of the 5 descriptions for a dress (the English translations are given only for illustration and are not part of the dataset).

Source: Zalando

Initially, the dataset stores one entry per description together with the related image (as a URL), so a single dress appears in multiple entries. We start by merging the descriptions that refer to the same dress, to operate on images more easily and reduce duplicates.

import pandas as pd
# one row per description: merge all descriptions of the same dress
data = pd.read_csv('./FEIDEGGER.csv').fillna(' ')
newdata = data.groupby('Image URL')['Description'].apply(lambda x: x.str.cat(sep=' ')).reset_index()

DRESS RECOMMENDATION SYSTEM

In order to build our dress recommendation system, we make use of transfer learning. In detail, we utilize a pre-trained VGG16 network to extract relevant features from our dress images and build a similarity score on top of them.

from keras.applications import vgg16
from keras.models import Model
vgg_model = vgg16.VGG16(weights='imagenet')
feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

We ‘cut’ VGG16 at the second-to-last fully connected layer (fc2), so for every single image we obtain a feature vector of dimension 1×4096.
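As a minimal sketch of the extraction step (assuming the dress images have already been downloaded locally; the extract_features helper is an illustrative name, not part of the original code):

import numpy as np
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input

def extract_features(img_path):
    # load the image at the 224x224 resolution VGG16 expects
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)        # batch of one: (1, 224, 224, 3)
    x = preprocess_input(x)              # VGG16-specific channel preprocessing
    return feat_extractor.predict(x)[0]  # feature vector of shape (4096,)

At the end of this process we can plot all our features in a 2D space: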

TSNE on VGG features
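The projection above can be reproduced with scikit-learn’s t-SNE (a sketch, where features is assumed to be the matrix stacking all the 4096-dimensional vectors, one row per dress):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# project the 4096-dimensional VGG features down to 2D for visualization
features_2d = TSNE(n_components=2, random_state=42).fit_transform(features)
plt.scatter(features_2d[:, 0], features_2d[:, 1], s=3)
plt.title('TSNE on VGG features')
plt.show()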

To test the goodness of our system we hold out a portion of our dresses (around 10%). The rest is used to build the similarity matrix. As similarity score we’ve chosen the cosine similarity. Every time we pass a dress image to our system, we compute its similarity with all the dresses stored in ‘train’ and then select the most similar ones (those with the highest similarity scores).

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(train, test[test_id].reshape(1,-1))
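From there, selecting the recommendations is just a matter of sorting (a sketch; sim holds one similarity score per training dress):

import numpy as np

# indices of the 5 training dresses most similar to the query image
top5 = np.argsort(sim.ravel())[::-1][:5]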

Here I report some examples, where the ‘original’ image is a dress coming from the test set. The dresses on the right are the 5 most similar to the ‘original’ dress we passed in.

Not bad! The VGG is very powerful and does a very good job!

DRESS TAGGING SYSTEM

The approach we followed to develop the dress tagging system differs from the one used for dress similarity. This scenario is also different from a classical tagging problem, where we have images and the related tags in the form of single words. Here we only have the text descriptions of the dresses and we have to extract information from them. This is a little bit tricky because we have to analyze free text written by humans. Our idea is to extract the most significant words from the descriptions in order to use them as tags for the images.

Our workflow is summarized in the graph below:

The image descriptions are written in basic German… Zum Glück spreche ich ein wenig Deutsch (luckily I speak a little German), so I decided to work with the German text and to ask Google Translate in case of difficulty.

Our idea is to develop two different models: one for nouns and another one that deals with adjectives. To perform this separation we first apply POS tagging to the image descriptions of our original dataset.

import nltk
import spacy
from tqdm import tqdm

tokenizer = nltk.tokenize.RegexpTokenizer(r'[a-zA-ZäöüßÄÖÜ]+')
nlp = spacy.load('de_core_news_sm')

def clean(txt):
    # keep only alphabetic (German) tokens
    text = tokenizer.tokenize(txt)
    text = nlp(" ".join(text))
    adj, noun = [], []
    for token in text:
        # collect lemmatized adjectives and nouns longer than 2 characters
        if token.pos_ == 'ADJ' and len(token) > 2:
            adj.append(token.lemma_)
        elif token.pos_ in ['NOUN', 'PROPN'] and len(token) > 2:
            noun.append(token.lemma_)
    return " ".join(adj).lower(), " ".join(noun).lower()

adj, noun = zip(*map(clean, tqdm(data['Description'])))

Then we combine all the adjectives that refer to the same image (and do the same with the nouns).

data['adj_Description'] = adj  # attach the extracted ADJs as a new column (same for nouns)
newdata = data.groupby('Image URL')['adj_Description'].apply(lambda x: x.str.cat(sep=' XXX ')).reset_index()

At this point, to extract significant tags for every image, we apply TF-IDF and take the most important ADJs/NOUNs based on this score (we’ve selected the 3 best ADJs/NOUNs; if no words are found, we return a series of ‘xxx’ placeholders, only for convenience). I also compiled a list of ambiguous ADJs/NOUNs to exclude as stop words.

from sklearn.feature_extraction.text import TfidfVectorizer

def tagging(comments, remove=None, n_word=3):
    # one 'document' per original description
    comments = comments.split('XXX')
    try:
        counter = TfidfVectorizer(min_df=2, analyzer='word', stop_words=remove)
        counter.fit(comments)
        # overall importance of each word across the descriptions
        score = counter.transform(comments).toarray().sum(axis=0)
        word = counter.get_feature_names_out()
        vocab = pd.DataFrame({'w': word, 's': score}).sort_values('s').tail(n_word)['w'].values
        return " ".join(list(vocab) + ['xxx'] * (n_word - len(vocab)))
    except ValueError:
        # empty vocabulary: fall back to placeholder tags
        return " ".join(['xxx'] * n_word)
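A hypothetical usage sketch (stop_adj, the list of ambiguous adjectives to exclude, and the adj_tags column are illustrative names):

# top-3 adjective tags for every dress, padded with 'xxx' where needed
newdata['adj_tags'] = newdata['adj_Description'].apply(
    lambda c: tagging(c, remove=stop_adj, n_word=3))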

For every dress, we end up with at most 3 ADJs and 3 NOUNs… We are ready to build our models!

To feed our models we reuse the features previously extracted with VGG. In our case, every dress appears at most 3 times, with at most 3 different labels (one for each of its ADJs/NOUNs). The two models are very simple and share the same structure, as shown below:

from keras.layers import Input, Dense, Dropout
from keras.models import Model

inp = Input(shape=(4096,))
dense1 = Dense(256, activation='relu')(inp)
dense2 = Dense(128, activation='relu')(dense1)
drop = Dropout(0.5)(dense2)
dense3 = Dense(64, activation='relu')(drop)
out = Dense(y.shape[1], activation='softmax')(dense3)
model = Model(inputs=inp, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
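A sketch of the training step, assuming X stacks the VGG features (one row per dress/tag pair) and y one-hot encodes the corresponding ADJ or NOUN labels; the epoch and batch-size values are illustrative:

# train on the (feature, tag) pairs; repeat for the ADJ and NOUN models
model.fit(X, y, epochs=30, batch_size=256, validation_split=0.1)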

Let’s see some results!

We test our models on the same dresses used before and plot the two labels with the highest probability for both ADJs and NOUNs (I also provide translations). The results are great! Taken together, our models describe the dresses shown in the images well.
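For reference, a sketch of how the two most probable labels can be read off the softmax output (test_features and label_names, an array mapping class indices back to words, are hypothetical names):

import numpy as np

probs = model.predict(test_features)              # shape: (n_test, n_classes)
top2 = np.argsort(probs, axis=1)[:, ::-1][:, :2]  # two highest-probability classes
print(label_names[top2[0]])                       # best two tags for the first dress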

SUMMARY

In this post, we made use of transfer learning to develop a content-based recommendation system. In the second stage, we tagged dresses by extracting information only from their textual descriptions. The results achieved are appealing and easy to inspect, and they might even come in handy if you’d like to renew your wardrobe.


CHECK MY GITHUB REPO

Keep in touch: LinkedIn

