A high-speed computer vision pipeline for the universal LEGO sorting machine

Daniel West
Towards Data Science
8 min read · Aug 1, 2019


For the past few years I’ve been designing and building a machine that can recognize and sort LEGO bricks. A key part of the machine is the Capture Unit — this is a small mostly-enclosed chamber that features a belt, a light, and a camera.

You’ll see the light a bit later.

The camera takes photographs of LEGO parts coming along the belt, then sends images of the parts wirelessly to a server which runs an AI algorithm to identify the part out of thousands of possible LEGO elements. I’ll give more details about the AI algorithm itself in future articles, but this article will be dedicated to the processing that occurs between the camera’s raw video output and the input to the neural network.

The core problem I need to solve is converting a live video stream of the belt into individual images of the parts that the neural network can use.

The final goal: Go from raw video (left) to a bunch of uniformly sized images (right) to send to the neural network. (Gif is slowed down to about 50% speed vs realtime)

This is a great example of a problem which appears straightforward on the surface, but actually presents a lot of interesting roadblocks, many of them specific to computer vision systems.

Extracting relevant parts of an image in this way is often called ‘object detection’. That’s exactly what I need to do: detect the presence of objects, their location, and size, so that I can generate bounding boxes for every part on every frame.

The key is to find good bounding boxes (shown in green)

I’ll go over three aspects of the solution:

  • Setting myself up for success by eliminating extraneous variables
  • Building the pipeline out of straightforward CV operations
  • Maintaining good performance on the limited Raspberry Pi platform

Eliminating extraneous variables

For problems like these, it’s best to eliminate as many variables as possible before attempting to apply computer vision techniques. I don’t want to care about environmental conditions, different camera positions, or loss of information due to occlusions, for example. It is possible (if very difficult) to address all of these variables in software if necessary, but luckily for me I’m designing this machine from scratch, so I can set myself up for success by removing these kinds of variables before any code is even written.

The first step is to force a fixed camera position, angle, and focus. That’s simple: the rig holds the camera locked in place above the belt. I don’t need to worry about occlusions either; unwanted objects are unlikely to start wandering into the capture unit. Slightly harder, but very important, is enforcing consistent lighting conditions. I don’t want my object detector to falsely interpret a passerby’s shadow as a physical object. For many computer vision applications, enforcing lighting is very difficult or impossible. Thankfully, the capture unit is super small (the camera’s entire field of view is smaller than a loaf of bread!) so I have more control over the environment than usual.

View from inside the capture unit. The camera is in the top third of the frame.

One option would be to make the box completely enclosed so that no light from the outside environment can enter. I tried this, using LED light strips as the source of light. Unfortunately it’s very finicky — one tiny hole in the enclosure and light can come pouring in, throwing off any object detection.

In the end, the best solution was to ‘outcompete’ other light sources by absolutely blasting the tiny chamber full of light. It turns out that the sort of lights that can illuminate an entire room are very cheap and simple to use.

Take that, shadows!

When directed into the tiny chamber, the light far overwhelms any potential interference. As a happy side-effect, this wealth of light means that the camera can use a very fast shutter speed, taking perfectly crisp images of parts even as they race along the belt.

The object detector

So how do I take this nice, consistently lit video and turn it into useful bounding boxes? If you’re an AI practitioner, you might suggest implementing an object detection neural network like YOLO or Faster R-CNN. These neural networks could easily accomplish the goal. Unfortunately, I’m running the object detection code on a Raspberry Pi. Even a high-end computer would struggle to run these convolutional neural networks at the ~90 FPS frame rate I need. There’s no way that a Raspberry Pi, lacking any AI-compatible GPU hardware, would be able to run even a super stripped-down version of one of these AI algorithms. I could stream the video off the Pi to another computer, but real-time video streaming is very finicky, with both latency and bandwidth limitations causing serious issues, especially at the high data rates required.

YOLO is really cool! But I don’t need all this functionality.

Luckily, I can avoid a complicated AI solution by looking to ‘old-school’ computer vision techniques. The first is background subtraction, which attempts to highlight any part of an image that is changing. In my case, the only things the camera sees that are moving are the LEGO parts. (Of course the belt is moving, but since it has a uniform color, to the camera it doesn’t appear to be moving.) Isolate those LEGO parts from the background, and I’m halfway there.

For background subtraction to work, the foreground objects have to be substantially different to the background in order to be picked up. LEGO parts come in a huge array of colors, so I need to choose the background color very specifically to be as un-LEGO as possible. That’s why the belt under the camera is made of paper — not only does it need to be very uniform, but it can’t be made of LEGO or it would be the same color as some of the bricks it needs to recognize! I chose a pale pink, but any other pastel color that isn’t similar to a common LEGO color would do.

The wonderful OpenCV library has a number of algorithms for background subtraction baked right in. The MOG2 background subtractor is the most sophisticated and works blazingly fast, even on the Raspberry Pi. However, feeding the video frames directly to MOG2 doesn’t quite work. Light gray and white pieces are too similar in brightness to the pale background and get lost. I needed a way to distinguish the belt more clearly from what was on it, by telling the background subtractor to pay more attention to *color* than to *brightness*. All I had to do was increase the saturation of the image before passing it to the background subtractor, and the results improved substantially.
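
In code, that boost-then-subtract step looks roughly like the sketch below. The saturation gain and MOG2 settings are placeholder values of mine, not the machine’s exact numbers:

```python
import cv2
import numpy as np

# Boost saturation, then let MOG2 separate foreground from background.
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def foreground_mask(frame_bgr, sat_gain=1.8):
    # Work in HSV so saturation can be scaled without touching brightness.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    s = hsv[:, :, 1].astype(np.float32) * sat_gain
    hsv[:, :, 1] = np.clip(s, 0, 255).astype(np.uint8)
    boosted = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
    # MOG2 flags pixels that differ from its learned background model.
    return subtractor.apply(boosted)
```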

After the background subtraction, I use morphological operations to eliminate as much noise as possible from the resulting mask. OpenCV’s findContours() can then pick out the contours of the white (foreground) regions. After a simple heuristic discards contours that are too small to be real parts, based on contour area, converting the remaining contours to the final bounding boxes is straightforward.
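
A sketch of that cleanup-and-box step; the kernel size and minimum-area threshold here are again placeholders rather than the machine’s real values:

```python
import cv2
import numpy as np

def bounding_boxes(mask, min_area=150):
    kernel = np.ones((3, 3), np.uint8)
    # Morphological opening removes small specks of noise from the mask.
    cleaned = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # OpenCV 4.x returns (contours, hierarchy).
    contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        if cv2.contourArea(c) >= min_area:     # discard tiny noise contours
            boxes.append(cv2.boundingRect(c))  # (x, y, w, h)
    return boxes
```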

Performance

The neural network is a hungry beast. For the best possible classification results, it demands the highest-resolution images, and as many of them as possible. This means I need to capture at a very high frame rate while also keeping image quality and resolution high. I’ll be pushing the camera, and the Raspberry Pi’s GPU, to the absolute limit.

The superbly exhaustive picamera documentation shows that the V2 camera chip can output 1280x720 pixel images at a maximum of 90 frames per second. This is an incredible amount of data, and even though the camera can generate it, that doesn’t mean the computer can deal with it. If I were processing raw 24-bit RGB images, that’s ~237 MB/s of bandwidth, way too much for the poor Pi’s GPU or SDRAM to handle. Even using GPU-accelerated JPEG compression, 90 FPS is impossible to achieve.

The Raspberry Pi camera is, however, able to output a raw, unfiltered ‘YUV’ image. Even though it’s harder to work with than RGB, YUV actually has a lot of nice properties. Most importantly, it only uses 12 bits per pixel (vs. 24 for RGB).

Every 4 ‘Y’ bytes share one ‘U’ byte and one ‘V’ byte — that comes out to 1.5 bytes per pixel.
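
To make that layout concrete, here’s roughly how a planar YUV420 (I420) buffer splits into its three planes; this sketch glosses over any row padding the camera may add at certain resolutions:

```python
import numpy as np

# Split a planar YUV420 (I420) buffer into its Y, U, and V planes.
def split_yuv420(buf, width, height):
    y_size = width * height
    uv_size = y_size // 4  # U and V are half resolution in both dimensions
    y = np.frombuffer(buf, np.uint8, count=y_size).reshape(height, width)
    u = np.frombuffer(buf, np.uint8, count=uv_size, offset=y_size)
    v = np.frombuffer(buf, np.uint8, count=uv_size, offset=y_size + uv_size)
    half = (height // 2, width // 2)
    return y, u.reshape(half), v.reshape(half)
```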

This means I can process twice as many YUV frames as RGB frames, not even counting the time the GPU would otherwise have to spend encoding the RGB image.
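
The back-of-the-envelope numbers (my arithmetic, matching the figures quoted above):

```python
# Bandwidth at 1280x720, 90 frames per second.
width, height, fps = 1280, 720, 90

rgb_bytes = width * height * 3       # 24 bits per pixel
yuv_bytes = width * height * 3 // 2  # YUV420: 12 bits per pixel

print(rgb_bytes * fps / 2**20)  # ~237 MB/s of raw RGB
print(yuv_bytes * fps / 2**20)  # ~119 MB/s of raw YUV420
```
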
All of this places some unusual restrictions on the actual processing pipeline. Most operations on a full-size frame of video are going to be extremely memory- and CPU-intensive. Even decoding a full-size YUV frame is impossible within my strict time constraints.

Luckily, I don’t actually need to process the entire frame! Object detection doesn’t need bounding boxes to be exact, only fairly close — so the entire object detection pipeline can be done on a much smaller resized frame. The downscaling operation doesn’t need to consider all the pixels in the full-size frame, so, with care, the frame can be resized very quickly and cheaply. The resulting bounding boxes are then scaled back up and used to take crops out of the full-sized YUV frame. This way, I never need to decode or otherwise process the entire high-resolution frame.
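
Roughly, that flow looks like the sketch below. For readability it works on a decoded BGR frame and reuses the helpers from the earlier sketches; the real pipeline keeps everything in YUV, and the scale factor is a placeholder:

```python
import cv2

SCALE = 4  # detection runs on a frame 1/4 the size in each dimension

def detect_and_crop(full_frame):
    # Cheap downscale: nearest-neighbour sampling doesn't touch every pixel.
    small = cv2.resize(full_frame, None, fx=1 / SCALE, fy=1 / SCALE,
                       interpolation=cv2.INTER_NEAREST)
    mask = foreground_mask(small)            # from the earlier sketch
    crops = []
    for x, y, w, h in bounding_boxes(mask):  # from the earlier sketch
        # Scale the small-frame box back up to full-resolution coordinates.
        crops.append(full_frame[y * SCALE:(y + h) * SCALE,
                                x * SCALE:(x + w) * SCALE])
    return crops
```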

Thankfully, because of how the YUV format is stored (see above), it’s actually very easy to implement fast crop and downscale operations that work on the raw YUV data directly. Finally, the entire pipeline can be multithreaded across the Pi’s 4 cores without too much trouble (you do need a lock around the background subtraction step). I found, however, that not all cores are used to their full extent, which indicates that the bottleneck is still memory bandwidth. Even so, I am able to achieve 70–80 FPS in practice. A more in-depth analysis of memory usage could probably speed things up even further.
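
A sketch of that threading layout, reusing the earlier helpers; the queue sizes and the sentinel shutdown scheme are my own choices, not necessarily what runs on the machine:

```python
import queue
import threading

frames = queue.Queue(maxsize=8)  # raw frames from the camera
results = queue.Queue()          # bounding boxes, one list per frame
bg_lock = threading.Lock()

def worker():
    while True:
        frame = frames.get()
        if frame is None:  # sentinel: shut this worker down
            break
        # MOG2 keeps internal state, so access to it must be serialized.
        with bg_lock:
            mask = foreground_mask(frame)
        results.put(bounding_boxes(mask))

threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()
```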

If you’re interested in learning more about the project, check out my previous article, How I created over 100,000 labeled LEGO Training Images.

You can follow me on Twitter: @JustASquid
