Kaggle Avito Demand Challenge: 18th Place Solution — Neural Network

Kung-Hsiang, Huang (Steeve)
Towards Data Science
4 min read · Jun 30, 2018


A few days ago, my teammates and I won a silver medal in a Kaggle competition hosted by Avito, a Russian classified ads company, finishing in 18th place. The goal of the challenge was to predict the demand for an online classified ad based on the data provided. In this writeup, I will illustrate my approach, a Neural Network (NN), which I worked on exclusively (my teammates mainly took care of tree-based and other linear models). Then, I will talk about the lessons I learned from the top winners' solutions.

My Approach

NN Structure

As shown in the image above, my NN model is composed of four modules that together use all the data provided by the organizer: image, categorical, continuous, and text data. I will explain each module in the following paragraphs.

Continuous

This is the least surprising section. The input tensor of continuous features is directly concatenated with the outputs of the other modules. One thing to note is the handling of null values: for missing continuous data, I fill in either 0 or the column mean.
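A minimal sketch of that imputation step, assuming a pandas DataFrame and hypothetical column names (the real feature list differs):

```python
import pandas as pd

# Hypothetical continuous columns used for illustration only.
CONT_COLS = ["price", "item_seq_number"]

def fill_missing(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Fill NaNs in continuous columns with 0 or the column mean."""
    df = df.copy()
    for col in CONT_COLS:
        fill_value = 0.0 if strategy == "zero" else df[col].mean()
        df[col] = df[col].fillna(fill_value)
    return df

train = pd.read_csv("train.csv")  # assumed file name
train = fill_missing(train, strategy="mean")
```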

Categorical

For categorical data, an embedding layer is applied to learn latent representations of the discrete values. This may not be a new idea, but it was my first time using categorical embeddings, since I had never used an NN on structured/tabular data before. The concept is similar to word embeddings: categorical values are mapped to learnable embedding vectors so that those vectors carry meaning in the latent space. This helps avoid the sparsity of one-hot encoded categorical features and improves the model's performance.
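A minimal sketch of one such categorical branch in Keras; the vocabulary size and embedding dimension below are made-up placeholders, not the values I actually used:

```python
from tensorflow.keras import layers

# Assumed cardinality and embedding size for a single categorical feature.
VOCAB_SIZE, EMB_DIM = 3000, 32

cat_input = layers.Input(shape=(1,), name="category_id")
# Learnable lookup table: each integer id maps to a dense EMB_DIM vector.
cat_emb = layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM)(cat_input)
cat_vec = layers.Flatten()(cat_emb)  # (batch, EMB_DIM), ready to concatenate with the other modules
```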

Text

The text section of my NN is simpler than those of the other top winners. There is no sophisticated recurrent unit or convolution layer, nor does it use pre-trained embeddings. I am not sure why, but none of these worked in my NN model. The only trick here is a shared embedding layer, motivated by the second-place solution in the Mercari challenge. The two text fields, title and description, are embedded with the same embedding matrix. This not only speeds up training, but also leads to faster convergence and lower loss.
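A sketch of the shared-embedding idea in Keras: a single Embedding layer instance is applied to both text inputs, so title and description share one embedding matrix. The sequence lengths, vocabulary size, and the pooling after the embedding are assumptions for illustration, not my exact architecture:

```python
from tensorflow.keras import layers

MAX_TITLE_LEN, MAX_DESC_LEN = 30, 100  # assumed sequence lengths
VOCAB_SIZE, EMB_DIM = 100_000, 100     # assumed vocabulary / embedding sizes

title_in = layers.Input(shape=(MAX_TITLE_LEN,), name="title")
desc_in = layers.Input(shape=(MAX_DESC_LEN,), name="description")

# One layer object, reused for both inputs -> shared weights.
shared_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)

title_vec = layers.GlobalAveragePooling1D()(shared_emb(title_in))
desc_vec = layers.GlobalAveragePooling1D()(shared_emb(desc_in))
text_features = layers.Concatenate()([title_vec, desc_vec])
```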

Image

My first approach to the image data was to use pre-trained ImageNet models to extract features, with or without the heads of those models. I tried ResNet50 and InceptionV3; unfortunately, neither of them worked. With around two weeks to go in the competition, someone in the discussion forum mentioned that his model included several convolution layers trained on the raw images together with the other features. I therefore started rewriting my code to use a generator that reads image and tabular data in batches, as it was impossible to load all the image data into RAM. After trying out a few structures, I found that one InceptionV3 cell plus a couple of convolution layers worked best for me (since I only had a K80 GPU on GCP, it took a very long time to validate even a few experiments).
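A rough sketch of such a batch generator using a Keras Sequence, assuming aligned arrays of image paths, tabular features, and targets (names and image size are placeholders):

```python
import numpy as np
from tensorflow.keras.utils import Sequence
from tensorflow.keras.preprocessing.image import load_img, img_to_array

class ImageTabularSequence(Sequence):
    """Yields (image, tabular) batches so images never have to sit in RAM all at once."""

    def __init__(self, image_paths, tabular, targets, batch_size=128, img_size=(128, 128)):
        self.image_paths, self.tabular, self.targets = image_paths, tabular, targets
        self.batch_size, self.img_size = batch_size, img_size

    def __len__(self):
        return int(np.ceil(len(self.targets) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        imgs = np.stack([
            img_to_array(load_img(p, target_size=self.img_size)) / 255.0
            for p in self.image_paths[sl]
        ])
        return [imgs, self.tabular[sl]], self.targets[sl]
```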

Lessons Learned from Top Solutions

  1. The first-place NN solution also suffered from the poor performance of features extracted from most pre-trained ImageNet models. They wound up using the VGG top layers + ResNet50 middle layers. The biggest difference between their approach and my earlier one is that, before the extracted image features were concatenated with the other inputs, they applied average pooling and added a dense layer.
  2. Categorical feature interaction: concatenate two categorical features and treat the result as a new feature (see the sketch after this list).
  3. Unsupervised learning: use an autoencoder to extract vectors from categorical data.
  4. Validation strategy: make sure the overlap of feature values between folds (especially user id in this competition) is similar to the overlap between the train and test sets.
  5. Loss function: all of the top three solutions used binary cross-entropy as the loss function, while I used MSE for the entire competition. I should have tried more loss functions, such as BCE and Huber loss.
  6. Stacking: we started stacking one week before the competition ended, so we only had a few base models and shallow stacking. Almost all of the top solutions used a large number of models for wider and deeper stacking (the second-place winner used 6 layers …).
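For the categorical feature interaction trick in point 2, a minimal sketch with pandas; the two column names here are just illustrative examples of categorical fields:

```python
import pandas as pd

# Toy frame with two hypothetical categorical columns.
df = pd.DataFrame({
    "region": ["Moscow", "Moscow", "Kazan"],
    "parent_category": ["Electronics", "Services", "Electronics"],
})

# Concatenate the two columns into a new interaction feature,
# then label-encode it so it can be fed to an embedding layer or a tree model.
df["region_x_parent_cat"] = df["region"].astype(str) + "_" + df["parent_category"].astype(str)
df["region_x_parent_cat"] = df["region_x_parent_cat"].astype("category").cat.codes
```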

I had a lot of fun in this competition. I would like to thank my teammates and all the people who shared their ideas and solutions publicly. I learned a great deal from you! I would also like to thank Kaggle and the organizers for hosting such a great competition. Without you, I would not have been able to sharpen my machine learning skills.

If you want to know more about my solution, you can refer to this GitHub repo.
