The AI Illustrated Guide
Why is Object Detection so Messy?
TLDR: Neural networks have fixed sized outputs
Those working with Neural Networks know how complicated Object Detection techniques can be. It is no wonder there is no straight forward resource for training them. You are always required to convert your data to a COCO-like JSON or some other unwanted format. It is never a plug and play experience. Moreover, no diagram thoroughly explains Faster R-CNN or YOLO as there is for U-Net or ResNet. There are just too many details.
While these models are quite messy, the explanation for their lack of simplicity is quite straight forward. It fits in a single sentence:
Neural Networks have fixed-sized outputs
In object detection, you can’t know a priori how many objects there are in a scene. There might be one, two, twelve, or none. The following images all have the same resolution but feature different numbers of objects.
The one million dollar question is: How can we build variable-sized outputs out of fixed-sized networks? Plus, how are we supposed to…