10 Lessons Learned From Participating in Google AI Challenge

Eric Bouteillon
Towards Data Science
15 min read · Dec 10, 2018


Quick, Draw! Doodle Recognition Challenge was an artificial intelligence challenge sponsored by Google AI and hosted by Kaggle.

My team ended up in 46th position in this competition, out of 1,316 teams and 1,584 competitors. Here are some context, insights and feelings I would like to share with you, as well as how we missed the gold medal.

Challenge Overview

The purpose of this challenge is pretty simple: guess what someone is drawing. Well, not you: your computer has to do the guessing. Sounds fun, right?

In other words, is your computer smart enough to determine what is drawn below?

Sample images from the quickdraw-dataset

A bit simplistic as a definition? Here is the official description:

“Quick, Draw!” was released as an experimental game to educate the public in a playful way about how AI works. The game prompts users to draw an image depicting a certain category, such as ”banana,” “table,” etc. The game generated more than 1B drawings, of which a subset was publicly released as the basis for this competition’s training set. That subset contains 50M drawings encompassing 340 label categories.

Sounds fun, right? Here’s the challenge: since the training data comes from the game itself, drawings can be incomplete or may not match the label. You’ll need to build a recognizer that can effectively learn from this noisy data and perform well on a manually-labeled test set from a different distribution.

Your task is to build a better classifier for the existing Quick, Draw! dataset. By advancing models on this dataset, Kagglers can improve pattern recognition solutions more broadly. This will have an immediate impact on handwriting recognition and its robust applications in areas including OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) & NLP (Natural Language Processing).

Team ✎😭😴

First, I really want to thank all my teammates; this would not have been possible without them!

Second, let me explain the context: I love learning new things about software engineering, science, programming… and artificial intelligence/machine learning has been my latest learning passion for about a year now. I am still a novice in this field, but I find it truly fascinating. So, to live my new passion, I started another challenge on Kaggle (the 6th to be precise), which was the Quick, Draw! Doodle Recognition Challenge. So far I had participated solo in all my challenges, and I felt ready to join a team, so I published this post on the Kaggle forum; I was around 84th on the leaderboard by then. I got a couple of answers from other Kagglers also without a team, either through the forum or via direct messages, asking to join. It was hard to decide who to team up with, so I built a team with Kagglers at about the same level as I was on the leaderboard at that time.

Here is the final magic team:

  • ebouteillon: myself, charmed team leader
  • phun: team’s blending alchemist
  • kalili: team’s deep model wizard
  • YIANG: team’s machine learning magician

We are from China, France and the United States.

For a while, the team carried my name, but in the end we moved to “✎😭😴”. A fun team name; I still don’t know how to pronounce it, but at least it lets you imagine our team story.

Lesson 1: Build a team with people at about the same level as you in the competition. That way everyone feels comfortable working together.

During the challenge, I was really impressed by how dedicated my teammates were. Each time a task showed up (train a new model, investigate a solution), someone stood up to do it.

(Deep) Learning is Key For Me

I am a learner, and so I decided to use new hardware, new artificial intelligence frameworks and new technologies for this competition. That was pretty fun!

“We keep moving forward, opening new doors, and doing new things, because we’re curious and curiosity keeps leading us down new paths.” (Walt Disney)

New Hardware

In previous Kaggle competitions I participated in, I had no GPU with decent compute capability at home, so I usually used the free Kaggle kernels or Google Colaboratory. They are free to use and offer an old and slow but reliable K80 GPU, but they are limited in time. So I decided it was time to pimp up my computer with a brand new graphics card with a decent GPU, I mean a GPU I could use for fun new things like deep learning. So my computer welcomed an RTX 2080 Ti, launched by NVIDIA on September 27, 2018. To give you an idea, the competition started on September 26, 2018 and lasted two months. I couldn’t have picked anything more recent.

For this competition, hardware was key due to the size of the dataset (50 million images). The winners of this competition showed they had access to amazing compute capabilities.

And I can tell you, it is frustrating not to be able to try new algorithms simply because you lack access to a GPU.

Lesson 2: Choose the competition you want to participate in wisely, based on your compute resources (local or cloud).

New Custom Binaries

The hardware was pretty new, but it came with some issues. To use my graphics card with all its capabilities, I had to use NVIDIA’s CUDA 10, which no machine learning framework officially supported at the time, so no pre-built binaries were available for it.

First, I had to reinstall my Linux computer with Ubuntu 18.04, because the NVIDIA 410.72 drivers, CUDA 10.0, cuDNN 7.3.1 and NCCL 2.3.5 were only supported on this LTS (long-term support) version of Ubuntu.

So I had to recompile Google’s TensorFlow from source with CUDA 10. Here is a tutorial explaining how to do it.

I also had to compile PyTorch, maintained by Facebook and Uber, from source with CUDA 10.

Note: compiling software for your CPU architecture (GCC -march option) may provide a huge boost. In my tests, enabling AVX2 in TensorFlow divided CPU computation time by a factor of 2.

Lesson 3: Optimize your software stack for the task. E.g. recompile TensorFlow from source, as the pre-built downloadable version is compiled for maximum CPU compatibility, not maximum performance.

New Deep Learning Framework

I had already used Keras, which is a great framework for beginners, as well as Google’s TensorFlow, which is extremely powerful either to complement Keras or to use on its own. These are both pretty enjoyable and mature frameworks to work with.

So I decided to try something new by using the fastai 1.0 library. It is a high-level library on top of the PyTorch framework, so I saw a good opportunity to learn new things.

And in the end, I found this fastai library very interesting: it offers a high-level interface and a simple, streamlined workflow. But as the library is pretty recent (the first 1.0 version was released on October 16, 2018), the version I used (1.0.26) has many quirks (e.g. you cannot perform prediction if lr_find was not run before it), bugs (many functions throw exceptions when mixed precision is enabled), and the documentation has many gaps (e.g. how do you put a custom head on a deep learning model?). On top of that, semantic versioning is not respected, so expect code compatibility issues even between patch versions (my code written for 1.0.26 is incompatible with 1.0.28).

But again, this library has very good things built into it, like the 1cycle policy, a learning-rate finder and (partial) mixed-precision support. It is also fast-moving, as a new version is released almost daily at the time I am writing this.
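To give you an idea of how these features fit together, here is a minimal fastai 1.0-style sketch. It assumes a DataBunch named data and a PyTorch model named model have already been built; the names are illustrative and this is not the exact competition code.

```python
# Minimal fastai 1.0 sketch (assumes `data` is a DataBunch and `model` a PyTorch
# model built elsewhere; illustrative only, not the exact competition code).
from fastai.vision import *

learn = Learner(data, model, metrics=[accuracy])
learn = learn.to_fp16()               # (partial) mixed-precision support
learn.lr_find()                       # learning-rate finder
learn.recorder.plot()                 # inspect the loss/LR curve to pick max_lr
learn.fit_one_cycle(4, max_lr=1e-2)   # train with the 1cycle policy
```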

Lesson 4: Read the source Luke. You will learn how these libraries and frameworks work, especially if the documentation is incomplete.

New State-Of-The-Art Technologies

Mixed precision is one of the key elements that made me go for an RTX 2080 Ti rather than a cheaper GTX 1080 Ti. The RTX 2080 Ti is the first non-professional graphics card from NVIDIA with native support for 16-bit floating point (fp16).

You may ask: what is this mixed precision I keep talking about? It is a set of tricks that allow the use of 16-bit floating point instead of 32-bit in a deep neural network, thus reducing memory usage by a factor of two and increasing computation throughput by the same factor, or more.

For a more precise definition of mixed precision you can refer to the NVIDIA developer blog:

Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including image processing and understanding, language modeling, language translation, speech processing, game playing, and many others. DNN complexity has been increasing to achieve these results, which in turn has increased the computational resources required to train these networks. Mixed-precision training lowers the required resources by using lower-precision arithmetic, which has the following benefits.

Decrease the required amount of memory. Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Lowering the required memory enables training of larger models or training with larger minibatches.

Shorten the training or inference time. Execution time can be sensitive to memory or arithmetic bandwidth. Half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers. NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers.

Since DNN training has traditionally relied on IEEE single-precision format, the focus of this post is on training with half precision while maintaining the network accuracy achieved with single precision […]. This technique is called mixed-precision training since it uses both single- and half-precision representations.
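In plain PyTorch, the core idea looks roughly like the sketch below: run the forward and backward passes in fp16, keep an fp32 master copy of the weights for the optimizer step, and scale the loss so that small fp16 gradients do not underflow. This is an illustrative sketch of the technique, not the fastai implementation used in the competition.

```python
# Minimal mixed-precision sketch: fp16 compute, fp32 master weights, static loss scaling.
# Assumes `model` (an nn.Module) and `loader` (a DataLoader) already exist.
import torch
import torch.nn.functional as F

model = model.half().cuda()                              # fp16 copy used for forward/backward
master_params = [p.detach().clone().float() for p in model.parameters()]
for p in master_params:
    p.requires_grad_(True)
optimizer = torch.optim.SGD(master_params, lr=1e-2)      # optimizer updates the fp32 weights
loss_scale = 512.0                                       # static scaling factor (illustrative)

for x, y in loader:
    loss = F.cross_entropy(model(x.half().cuda()), y.cuda())
    model.zero_grad()
    (loss * loss_scale).backward()                       # scale up before backward
    for p_master, p_model in zip(master_params, model.parameters()):
        p_master.grad = p_model.grad.float() / loss_scale  # unscale into fp32 gradients
    optimizer.step()
    for p_master, p_model in zip(master_params, model.parameters()):
        p_model.data.copy_(p_master.data)                # copy fp32 weights back to the fp16 model
```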

NVIDIA also provides a 6-minute video explaining how to implement it using TensorFlow:

Lesson 5: Take full advantage of your hardware, use mixed-precision if your hardware is compatible.

Key Points of My Work

Disclaimer: I will present only a portion of the code I wrote for this competition; my teammates are absolutely not responsible for my awful and buggy code. A portion of this code is inspired by great Kagglers sharing their insights and code in Kaggle kernels and forums. I hope I did not forget to mention any of them where credit is due.

Full source code is available in my GitHub repository.

This part may contain some technical terms.

Use of SQLite

For this challenge a training dataset was provided. It contains 50 million images with details for each image: the category the image falls into, a unique identifier (key_id), the country the drawing comes from (country) and the stroke points (drawing) needed to reproduce the drawing.

The challenge training dataset is provided in the form of CSV files (340 files exactly), one file per category (one file for the airplane category, another one for watermelon…). This is not the most convenient format when you want to randomly shuffle samples to train a deep learning model, so I had to convert them into something more convenient.

First, I went for fastai’s usual and deeply integrated practice, which is to have one image file per sample. But with such a big dataset, that solution raised another challenge: my Linux filesystem wasn’t configured to support this huge number of inodes (50 million files added to the filesystem means 50 million new inodes), and I ended up with a full filesystem without having used all the gigabytes available. Another solution had to be found. There are many file formats that could hold the data (pickle, hdf5, dask, CSV…). After some experiments, I settled on a SQLite database.

The database is generated using the Jupyter notebook available on my GitHub repository and named 1-concat-csvs-into-sqlite.ipynb. The data structure implemented in the database is pretty simple:

  • train table with the training samples
  • test table with the test samples
  • classes table with each available category encoded with an integer
Overview of the database structure
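For reference, a hedged sketch of how such a database could be laid out is shown below; the column names are illustrative, and the actual script is the 1-concat-csvs-into-sqlite.ipynb notebook mentioned above.

```python
# Illustrative sketch of the database layout described above (not the exact notebook code).
import sqlite3

conn = sqlite3.connect("quickdraw.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS classes (class_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE IF NOT EXISTS train   (key_id  INTEGER PRIMARY KEY, drawing TEXT,
                                    country TEXT, class_id INTEGER);
CREATE TABLE IF NOT EXISTS test    (key_id  INTEGER PRIMARY KEY, drawing TEXT,
                                    country TEXT);
""")
# Each per-category CSV can then be appended in chunks, e.g. with
# pandas.read_csv(...).to_sql("train", conn, if_exists="append", index=False),
# after mapping the category name to its class_id.
conn.commit()
```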

This approach allowed me to keep the whole training and test datasets in a single file. During my various experiments, I also noticed that the SQLite approach had the added advantage of reducing the number of I/O operations on my SSD compared to the commonly used one-image-file-per-sample approach.

Lesson 6: Don’t be afraid to use a database. If a database server is overkill, then think about SQLite.

If you look at my code below, you may notice that I am using parallel workers to convert a list of strokes into an image on the fly, and that it is done using a global lock to access the database. My initial intention was to have the parallel workers access the database in read-only mode by providing one SQLite file object per worker, but before going that way I first tried a global lock on a single SQLite object, and it surprisingly gave decent results immediately (GPU used at 98%), so I did not even bother to improve it.

Lock in image generator callback
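A hedged sketch of that global-lock idea is given below: a single shared SQLite connection whose reads are serialized with a lock inside the dataset’s item getter. The names are illustrative, and the sketch assumes thread-based workers; with process-based DataLoader workers each process would need its own read-only connection instead.

```python
# Illustrative sketch: one shared SQLite connection guarded by a global lock.
import json
import sqlite3
import threading

from torch.utils.data import Dataset

DB = sqlite3.connect("quickdraw.db", check_same_thread=False)
LOCK = threading.Lock()

class DoodleDataset(Dataset):
    def __init__(self, n_samples):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        with LOCK:  # serialize all reads on the shared connection
            drawing, class_id = DB.execute(
                "SELECT drawing, class_id FROM train WHERE rowid = ?", (idx + 1,)
            ).fetchone()
        # strokes_to_image is a strokes-to-image function like the one sketched further down
        return strokes_to_image(json.loads(drawing)), class_id
```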

Strokes to Image

An important part of this competition was to turn a list of strokes into something useful. There were two main approaches described in the forums: either use the strokes as a variable-size input, for which a Recurrent Neural Network (RNN) can be used (with LSTM or GRU layers), or convert them into an image, for which we can use a Convolutional Neural Network (CNN). I personally only explored the latter.

The encoding of strokes into an image is simple: it only requires drawing lines between stroke points. But to provide as much information as possible to the CNN, we need to help it distinguish the drawn strokes from each other, for instance by using different colors.

To do so I took inspiration from Beluga’s kernel, which served as a baseline for many competitors:

Beluga’s stroke to image function

But I noticed some issues with this piece of code:

  • Drawing the image at 256x256 px and then resizing it to 64x64 px is CPU costly and makes the image blurry.
  • Images generated are on only one channel.
  • This piece of code does not properly handle image borders, small portions of the drawing are off the image.
  • If you want to train a neural network with larger images, then line width shall be chosen wisely to deal with the image resizing.
Images generated, sample taken from Beluga’s kernel

The following improvements were made:

  • Avoid drawing lines outside of the image.
  • Compute line positions directly in the final space, avoiding useless image resizing.
  • Add a small border (2 px) so that the first convolution layer (3x3) can more easily detect lines.
  • Use 3 channels to encode colors. The color palette is selected so that lines do not appear on every channel, but consecutive strokes have at least one channel in common.
  • The first strokes generally represent the global shape, which helps to identify the object, while the last strokes are mere details. So strokes are drawn in reverse order, so that the last strokes do not overlap the first ones.
  • Line width is a hyper-parameter independent of the final image size, which ensures faster transfer learning on larger images.

Note: I did not encode the velocity or time information provided; it would probably add more information that the CNN could use.

My strokes to images code
Image generated
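A hedged sketch of a strokes-to-image function along the lines described above is shown below. It assumes the simplified Quick, Draw! stroke format (per-stroke x and y lists with coordinates in 0–255) and uses an illustrative palette; it is not the exact competition code.

```python
# Illustrative strokes-to-image sketch: draw at the target resolution, keep a
# 2 px border, use a 3-channel palette, and render strokes in reverse order.
import cv2
import numpy as np

def strokes_to_image(strokes, size=128, border=2, line_width=3):
    img = np.zeros((size, size, 3), dtype=np.uint8)
    scale = (size - 2 * border - 1) / 255.0          # map 0-255 coordinates into the bordered space
    # No color lights up all three channels, and consecutive colors share one channel.
    palette = [(255, 255, 0), (0, 255, 255), (255, 0, 255)]
    for i, (xs, ys) in enumerate(reversed(strokes)): # reverse order: first strokes end up on top
        pts = np.array(
            [[int(x * scale) + border, int(y * scale) + border] for x, y in zip(xs, ys)],
            dtype=np.int32,
        ).reshape(-1, 1, 2)
        cv2.polylines(img, [pts], isClosed=False,
                      color=palette[i % len(palette)], thickness=line_width)
    return img
```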

Lesson 7: Kernels shared by other competitors are often a great baseline. Take inspiration from them and improve their work to get you to the top.

Convolutional Neural Network Model

Now that the data is in a ready-to-use format, I had to choose which artificial intelligence algorithm to use. Deep learning with convolutional neural networks is the state of the art for image recognition, so obviously I chose to go that way.

Image credit: https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55

In this blog post and in the shared source code, I will describe a single deep learning model to avoid useless complexity, so I am using a resnet-18 model. This is a “shallow” deep learning model, and deeper models would definitely give better results on this dataset, e.g. se-resnext-50 or xception. I decided not to present them here, as they require much more time to train and would not provide extra value for the purpose of this quick presentation. Of course, our team trained deeper models to reach the 46th position. 😄

This resnet-18 model was trained from scratch, without any pre-trained weights (e.g. based on imagenet). I made this decision because the dataset, with its 50 million images, is pretty huge, and imagenet images (real-world pictures) are pretty far from a sketch drawn in 10 seconds. This assumption was confirmed by my empirical tests: models with pre-trained weights gave good results in the first epoch, but with more epochs they were outperformed by “scratch” models with randomly initialized weights. I guess that if I had run even more epochs on both the pre-trained and randomly initialized models, they would probably have given similar results in the end. But for the allocated time, “scratch” models were better.

resnet-18 layers

The model presented is a stock torchvision model, but with a custom head similar to the original one, in which I replaced the final pooling layer with an adaptive counterpart (AdaptiveAvgPool2d) to gracefully handle images of different resolutions. I also deliberately removed the custom head added by fastai, which is good for pre-trained models but felt unnecessary for models with randomly initialized weights (non-pre-trained).

The fastai documentation does not clearly state that a custom head is added, nor how to replace it with your own. Here is how I solved it:

Customized head for scratch resnet-18
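For reference, a minimal sketch of that kind of customization on a stock torchvision resnet-18 could look like the code below. It is illustrative rather than the exact competition code; note that recent torchvision versions already ship resnet with adaptive pooling, which makes the replacement a no-op there.

```python
# Illustrative sketch: randomly initialized resnet-18 with a 340-way output and
# adaptive average pooling so different image resolutions are handled gracefully.
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(pretrained=False, num_classes=340)  # trained from scratch, 340 categories
model.avgpool = nn.AdaptiveAvgPool2d(1)              # replace the fixed pooling layer
```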

Lesson 8: Choosing models is a key part in a deep-learning solution. Do not hesitate to adapt them. It is often a trade-off between training dataset size, compute power and time.

Data-inception

This model was trained using 128px images and then fine-tuned using 144px images.
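In fastai 1.0 style, that resolution bump can be sketched as below, reusing the learn object from the earlier sketch; get_data(size) is a hypothetical helper that rebuilds the DataBunch at the requested resolution, and the epoch counts and learning rates are illustrative.

```python
# Illustrative progressive-resizing sketch (get_data is a hypothetical helper).
learn.fit_one_cycle(4, max_lr=1e-2)   # initial training on 128 px images
learn.data = get_data(size=144)       # swap in a 144 px data pipeline
learn.fit_one_cycle(2, max_lr=1e-3)   # fine-tune at the higher resolution
```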

Results

To show you how our resnet-18 performs, below are the first test images we had to classify, along with what our resnet-18 thinks are the three most probable categories:

Sample results from our resnet-18

This extremely simple resnet-18 model already gives you a bronze medal (around 104th position on the leaderboard).

Private Leaderboard for our sample resnet-18

I am still amazed by the super-human results obtained with deep learning. Our final score is 0.94493, which can be roughly interpreted as our solution getting the right category 94.5% of the time. This result was obtained by stacking different models (kalili is by far our model master).

How We Missed the Gold Medal

During the competition I noticed a fun fact that I shared with my teammates: the number of images whose category we had to guess was one off 340 x 330. This was surprising because there were 340 possible categories, so I thought the test set could be perfectly balanced, made up of about 330 images per category. YIANG probed the leaderboard by submitting a solution with all images set to a single category; the score he got back was 0.002, which means there were between 224 and 336 images for that probed category, and which of course was inconclusive for determining whether the test set was balanced. This idea of a balanced test set was then forgotten as the end of the competition got closer.

After the end of the competition, the winners posted a description of their solution on the Kaggle forum. They described a “secret sauce”, a.k.a. “щепотка табака”, which gave them an extra 0.7% boost on the leaderboard. What was that “secret sauce”? Well, they exploited the fact that the test set was balanced. When I read that, I told myself that I should have spent more time trying to re-balance our solution. 😢

kalili did try to re-balance our best solution after the end of the competition, and he got an extra boost of 0.6%, which (in theory) would have put us as high as 9th position. That position earns a gold medal as per the Kaggle Progression System.

Private leaderboard results with and without re-balancing the solution
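To make the idea concrete, here is one simple, hedged way such a re-balancing could be done; it is only an illustration of the general idea, not the winners’ method nor the exact re-balancing kalili performed. It greedily assigns the most confident (image, category) pairs while capping each category at its expected quota of roughly 330 images; the same principle can be extended to the full top-3 prediction list.

```python
# Illustrative re-balancing sketch: cap every category at ~330 test images.
import numpy as np

def rebalance_top1(probs, quota=330):
    """probs: (n_images, n_classes) array of predicted probabilities."""
    n_images, n_classes = probs.shape
    # All (image, class) pairs ordered by decreasing predicted probability.
    order = np.dstack(np.unravel_index(np.argsort(-probs, axis=None), probs.shape))[0]
    counts = np.zeros(n_classes, dtype=int)
    top1 = np.full(n_images, -1, dtype=int)
    for img, cls in order:                         # most confident pairs first
        if top1[img] == -1 and counts[cls] < quota:
            top1[img] = cls
            counts[cls] += 1
    return top1
```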

It made me sad to have missed this opportunity to get our first gold medal, particularly when we had the key idea and did not take advantage of it. Anyway, we earned a silver medal, and that is already pretty cool.

Lesson 9: Take time to explore ideas. Think twice and don’t do everything in a rush even when time is running out.

Conclusion

It was a great pleasure to take part in this contest, and I really appreciated being on a team for the first time. I wrote this blog post because I always look forward to the winners describing their solutions in the forums; they often have neat tricks to learn from. So here is my last lesson learned.

Lesson 10: Learn from the best, and share back with others. Writing things down often clarifies your thoughts and may also help others.

Thanks for reading!


Product Owner at Ingenico. He has a Master of Science in Physics and a Master of Advanced Studies in Photonic, Image and Cybernetics. Kaggle Master.