
Real-World ML: Warehouse Recognition System

A Step-by-Step Description of a Real ML Project


Source image from pxhere under Creative Commons CC0 Licence. Small images at Top-Left are by Author.

One of the projects I recently had the opportunity to work on as a Team Lead of the Rapid AI Solutions Prototyping Group at ICL Services [1] was implementing a Warehouse System for Recognition of Stored Factory Parts. The problem is easy to understand: warehouse workers (especially newcomers) often cannot recognize new incoming items and cannot find their storage locations fast enough. When you have tens of thousands of different item types, the task becomes nontrivial and turns into flipping through catalogs and walking the warehouse rows in the hope of coming across the item you are interested in, which might take up to half an hour and is a clear waste of time.

A solution that immediately comes to mind is a system driven by Computer Vision that tells you the name of the item of interest and shows the location in the warehouse where similar items are stored.

The direct solution is to take tens of photos of each of the roughly 10 thousand parts and use them to train a parts classifier, and then, as new parts keep being added to the catalog, take more photos and retrain the system. This would work, but… creating such a training dataset would take several months of labor and would then require permanent upkeep of the growing dataset. Customers usually want results faster, cheaper, and without having to regularly pay for expensive manual retraining of the system.

Is there anything we can do?

We were lucky, and the answer in our case was yes. The nuance that allowed us to significantly reduce the cost and duration of the project is that we are dealing with a factory warehouse, and the stored parts are all flat, cut from steel sheets of different thicknesses. More importantly, there are CAD models for all the parts.

Fig. 1. An example of a CAD model in Blender [2] (Image by Author)

Accordingly, the optimal solution we came up with is to train the system on synthetic images generated from CAD models, and then use it on real photos. Such an approach eliminates the need to collect a huge dataset of photographs of real objects. In our case this becomes possible precisely because all the parts are flat.

To make this happen, we use a pipeline of two models:

  • the Segmentation Model, which produces a mask for an input photo of an object (the mask of a flat object uniquely defines a part; examples of masks can be found below in Fig. 5);
  • the Classification Model, which takes that mask as input and recognizes the part (a minimal sketch of how the two models fit together follows this list).
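To make the data flow concrete, here is a minimal sketch of how such a pipeline could be wired together in PyTorch. It is illustrative only: `segmenter`, `classifier`, and `part_names` are placeholder names for a trained segmentation model, a trained classifier, and the parts catalog, not the production code.

```python
# Illustrative two-stage pipeline: photo -> binary mask -> part class.
# `segmenter`, `classifier`, `part_names` are placeholders, not production code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize_part(photo: torch.Tensor,
                   segmenter: torch.nn.Module,
                   classifier: torch.nn.Module,
                   part_names: list,
                   top_k: int = 16):
    """photo: float tensor of shape (1, 3, H, W), normalized like the training data."""
    # Stage 1: binary mask of the part (1 = part pixel, 0 = background).
    mask = (torch.sigmoid(segmenter(photo)) > 0.5).float()       # (1, 1, H, W)

    # Stage 2: classify the mask itself, not the original photo.
    logits = classifier(mask.repeat(1, 3, 1, 1))                  # ResNet expects 3 channels
    probs = F.softmax(logits, dim=1)

    top_p, top_i = probs.topk(top_k, dim=1)
    return [(part_names[i], p) for i, p in zip(top_i[0].tolist(), top_p[0].tolist())]
```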

The Classification Model is the classic ResNet-50 [3], pre-trained on ImageNet [4]. The dataset is created in a rather straightforward way. Using the available CAD models and scripts for Blender [2], we render masks of our parts with a variety of offsets from the center of the scene and at different camera angles relative to the vertical (this is necessary because, even though the parts are flat, the camera can shoot them at different angles; we allowed a deviation of up to 30 degrees from the vertical). The number of classes is equal to the number of parts in the catalog. When new parts are added to the catalog (that is, when new CAD models are detected on a dedicated network share), the model is automatically retrained, which takes several hours and runs on the same on-premise GPU that is used for inference in parallel.
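Rendering such a mask dataset is mostly a matter of scripting Blender. Below is a heavily simplified sketch of the idea; the object name, output path, and number of views are placeholders, and the real script also sets up materials, re-aims the camera at the part, and configures compositing so the render comes out as a clean binary silhouette.

```python
# Run inside Blender (bpy). Simplified sketch of rendering one part's masks
# at random offsets and camera tilts (up to 30 degrees off the vertical).
# "Part" and the output path are placeholders; scene/material setup is omitted.
import bpy, math, random

scene = bpy.context.scene
cam = scene.camera
part = bpy.data.objects["Part"]           # the imported CAD model

for i in range(200):                      # a couple hundred views per part
    # random offset of the part from the scene centre
    part.location.x = random.uniform(-0.2, 0.2)
    part.location.y = random.uniform(-0.2, 0.2)

    # random camera tilt of up to 30 degrees off the vertical,
    # plus a random spin around the vertical axis
    tilt = math.radians(random.uniform(0.0, 30.0))
    spin = math.radians(random.uniform(0.0, 360.0))
    cam.rotation_euler = (tilt, 0.0, spin)

    scene.render.filepath = f"//masks/part_{i:04d}.png"
    bpy.ops.render.render(write_still=True)
```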

The Segmentation Model is a bit more complicated. The model has to be trained to segment the parts using synthetic data in such a way that segmentation then works accurately on real photographs at different illumination levels, and is tolerant to changes in the material texture of the parts, in the background, and in the shadows.

The Segmentation Model is a classic U-Net [5] (a binary segmenter that decides for each pixel of the image whether it belongs to a part or not, trained with Dice Loss [6]) built on the same ResNet-50 [3] pre-trained on ImageNet [4]. Building a synthetic dataset for training such a segmenter is nontrivial; we did it by simulating the variety of possible appearances of parts, composing each example from a random combination of the following (a rough code sketch of this compositing appears right after the list):

  • part masks (strictly speaking, any masks will work, not only masks of factory parts);
Fig. 2. Examples of masks for generating synthetic images of parts (Image by Author)
  • background textures (from any free Clip Art collections);
Fig. 3. Examples of background textures for generating synthetic images of parts. (Left) Photo by Jason Dent from Unsplash. (Right) Photo by Author
  • part material textures (from any free Clip Art collections);
Fig. 4. Examples of material textures for generating synthetic images of parts. (Left) Photo by Malik Skydsgaard from [Unsplash](https://unsplash.com/photos/kG-ZwDuQ8ME). (Right) Photo by Annie Spratt from Unsplash
  • how glare on the edges might look (see examples of the generation below in Fig. 6);
  • how the part edges might cast shadows onto the background (see examples of the generation below in Fig. 6);
  • how shadows might be cast onto the whole scene (see examples of the generation below in Fig. 6).
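In code, the core of such a compositor boils down to pasting a randomly textured part over a random background through its mask, and then layering glare and shadows on top. Here is a rough sketch; the file names are placeholders, and the shadow model shown is deliberately crude compared to the real generator.

```python
# Rough sketch of composing one synthetic "pseudo-photograph" from a part mask
# plus random textures. The real generator also models edge glare and several
# kinds of shadows; paths below are placeholders.
import numpy as np
from PIL import Image

def compose_example(mask_path, background_path, material_path, size=(512, 512)):
    mask = np.array(Image.open(mask_path).convert("L").resize(size)) / 255.0   # 1 = part
    bg   = np.array(Image.open(background_path).convert("RGB").resize(size), dtype=np.float32)
    mat  = np.array(Image.open(material_path).convert("RGB").resize(size), dtype=np.float32)

    m = mask[..., None]                               # (H, W, 1) for broadcasting
    image = m * mat + (1.0 - m) * bg                  # part texture over background

    # crude global shadow: darken the scene with a horizontal brightness gradient
    shade = np.linspace(1.0, np.random.uniform(0.5, 1.0), size[0])[None, :, None]
    image = np.clip(image * shade, 0, 255).astype(np.uint8)

    return Image.fromarray(image), Image.fromarray((mask * 255).astype(np.uint8))
```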

The topic is quite complex, but a decent computer graphics specialist will easily solve the problem of generating synthetic examples for segmentation in a reasonable time. I’m not going to present the formulas here and will only give examples of masks (Fig. 5), which are relatively easy to obtain from CAD models, and examples of procedurally generated pseudo-photographs from such masks (Fig. 6).

Fig. 5. Examples of part masks for generating synthetic images of parts (Image by Author)
Fig. 6. Examples of synthetic images of parts generated from masks (Image by Author)

Let’s add some augmentation [7]. Now the examples for training the segmenter look like this (a small sketch of such an augmentation pipeline follows the figure).

Fig. 7. Augmented examples for training the Segmentation Model (Image by Author)
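For completeness, a typical augmentation pipeline of this kind might look as follows. The library (albumentations) and the specific transforms and parameters here are illustrative, not our exact configuration.

```python
# Illustrative augmentation pipeline; the exact transforms used in the project differ.
import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
    A.GaussNoise(p=0.3),
    A.Blur(blur_limit=3, p=0.2),
])

# image: (H, W, 3) uint8 numpy array, mask: (H, W) numpy array;
# both are transformed consistently in one call
augmented = augment(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```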

Once the training data is ready, the rest is straightforward. We train the Segmentation Model to segment images and train the Classification Model to classify the part masks. Then we check the classification accuracy on a test set (in our case, we had 177 real photos of parts).
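For reference, instantiating the two models described above might look roughly like this. The segmentation_models_pytorch and torchvision packages are assumptions made for this sketch, not necessarily the exact stack used in the project.

```python
# Sketch of the two models: a U-Net segmenter with a ResNet-50 encoder (Dice loss)
# and a ResNet-50 classifier, both pre-trained on ImageNet. Libraries and the
# class count are assumptions for illustration only.
import torch.nn as nn
import torchvision
import segmentation_models_pytorch as smp

NUM_PARTS = 10_000   # placeholder: one class per part in the catalog

# Segmentation Model: one output channel (part vs. background), trained with Dice loss.
segmenter = smp.Unet(encoder_name="resnet50", encoder_weights="imagenet",
                     in_channels=3, classes=1)
dice_loss = smp.losses.DiceLoss(mode="binary")

# Classification Model: plain ResNet-50 with the final layer replaced
# to output one logit per catalog part.
classifier = torchvision.models.resnet50(weights="IMAGENET1K_V1")
classifier.fc = nn.Linear(classifier.fc.in_features, NUM_PARTS)
ce_loss = nn.CrossEntropyLoss()
```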

Fig. 8. Results of classification on several examples of the test set (Image by Author)

And we get:

Correct predictions: 100.00% (177 out of 177)

Well, here we were simply lucky, and the entire test set was recognized correctly, although the results may differ slightly from run to run. The randomness comes from the Test Time Augmentation (TTA) [8] technique that we use for both segmentation and classification, since TTA can reduce the error by about 10%. The classification process is therefore non-deterministic and depends on the random seed of the TTA. Averaged over 10 runs, the accuracy turns out to be about 99% (we will get more objective figures later, when we get access to a large number of real images and compose a full-scale test set).
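A minimal sketch of what TTA means for the classifier: run the same mask through the network several times under random augmentations and average the predicted probabilities. The transforms here are illustrative, and in the project TTA is applied to segmentation as well.

```python
# Illustrative test-time augmentation for the classifier: average softmax output
# over several randomly augmented copies of the input. Transforms are examples only.
import torch
import torch.nn.functional as F
import torchvision.transforms as T

tta_transform = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    T.RandomHorizontalFlip(p=0.5),
])

@torch.no_grad()
def predict_with_tta(classifier, mask_batch, n_runs: int = 8):
    """mask_batch: (N, 3, H, W) tensor of part masks."""
    probs = torch.stack([F.softmax(classifier(tta_transform(mask_batch)), dim=1)
                         for _ in range(n_runs)])
    return probs.mean(dim=0)    # averaged class probabilities, shape (N, num_parts)
```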

All that remains is to wrap it all up in a simple user interface, which looked like this in the first version.

Fig. 9. The User Interface of the first version of the system (Image by Author)

Here we see the image from the camera (in this case, the test-bench camera, under which we place a printed black-and-white photo of a part), the result of processing the part by the Segmentation Model, and the Top 16 model predictions (if the model occasionally makes a mistake in the Top 1 prediction, then with a probability greater than 99.9% the correct part will be found in the Top 16 list). For the predicted part, we see its name and its position on the warehouse map, with an additional indication of the shelf number.
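Behind that Top 16 list is nothing more exotic than taking the k largest class probabilities and mapping them to the part catalog and the warehouse map. A hypothetical sketch, where `part_names` and `part_locations` are placeholder lookup tables:

```python
# Hypothetical sketch of the UI backend step: Top 16 predictions mapped to
# part names and shelf locations. Lookup tables are placeholders.
import torch

def top_k_report(probs: torch.Tensor, part_names, part_locations, k: int = 16):
    """probs: (num_parts,) averaged class probabilities for one photo."""
    top_p, top_i = torch.topk(probs, k)
    report = []
    for p, i in zip(top_p.tolist(), top_i.tolist()):
        name = part_names[i]
        report.append({"part": name,
                       "shelf": part_locations[name],
                       "confidence": round(p, 3)})
    return report
```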

Thus, the warehouse workers now always have a way to quickly understand what kind of part they have come across and where such parts are stored. To do this, they just need to place the part in the field of view of the video camera at the entrance to the warehouse.

PS:

This article is the first in a planned "Real-World ML" series, in which I plan to share stories of creating ML-powered products that we build with our teams of data scientists.

I also run a microblog, https://twitter.com/AiParticles, in which I review key points and ideas from recent cutting-edge work in the Machine Learning field.

Feel free to leave your feedback in the comments. Good luck to all!

References

[1] ICL Services – https://icl-services.com/eng

[2] Blender – https://www.blender.org

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. https://arxiv.org/abs/1512.03385

[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. https://arxiv.org/abs/1409.0575

[5] Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/abs/1505.04597

[6] Fausto Milletari, Nassir Navab, Seyed-Ahmad Ahmadi. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. https://arxiv.org/abs/1606.04797

[7] Data Augmentation – https://en.wikipedia.org/wiki/Data_augmentation

[8] Test Time Augmentation – https://stepup.ai/test_time_data_augmentation

