The AI Illustrated Guide

Why is Object Detection so Messy?

TL;DR: Neural networks have fixed-sized outputs

Ygor Serpa
Towards Data Science
7 min read · Oct 1, 2020

Photo by You X Ventures on Unsplash.

Those working with Neural Networks know how complicated Object Detection techniques can be. It is no wonder there is no straightforward resource for training them. You are always required to convert your data to a COCO-like JSON or some other unwanted format. It is never a plug-and-play experience. Moreover, no diagram explains Faster R-CNN or YOLO as thoroughly as the ones that exist for U-Net or ResNet. There are just too many details.

While these models are quite messy, the reason for their lack of simplicity is quite straightforward. It fits in a single sentence:

Neural Networks have fixed-sized outputs

In object detection, you can’t know a priori how many objects there are in a scene. There might be one, two, twelve, or none. The following images all have the same resolution but feature different numbers of objects.

Photo by You X Ventures on Unsplash. Each image has a different number of objects.
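The mismatch described above can be seen in a few lines of code. This is only a toy sketch, not a real detector: `make_toy_network`, the 4x4 "images", and the hand-written box annotations are all hypothetical stand-ins chosen for illustration. It shows that a network with a fixed output layer emits the same number of values for every scene, while the number of annotated objects changes from scene to scene.

```python
import random


def make_toy_network(num_inputs, num_outputs, seed=0):
    """Build a hypothetical one-layer 'network' with a fixed number of outputs."""
    rng = random.Random(seed)
    # One row of weights per output value; the output size is fixed here.
    weights = [
        [rng.uniform(-1.0, 1.0) for _ in range(num_inputs)]
        for _ in range(num_outputs)
    ]

    def forward(pixels):
        # Plain matrix-vector product: one number per row of weights.
        return [sum(w * x for w, x in zip(row, pixels)) for row in weights]

    return forward


# Hypothetical scenes with the same resolution (4x4, flattened to 16 values)
# but different numbers of annotated boxes, each given as (x, y, width, height).
scenes = {
    "empty_room": [],
    "one_person": [(1, 1, 2, 2)],
    "two_people": [(0, 0, 2, 2), (2, 2, 2, 2)],
}

network = make_toy_network(num_inputs=16, num_outputs=4)
pixel_rng = random.Random(1)

output_sizes = {}
for name, boxes in scenes.items():
    pixels = [pixel_rng.random() for _ in range(16)]  # stand-in image data
    prediction = network(pixels)
    output_sizes[name] = len(prediction)
    print(f"{name}: {len(boxes)} object(s), network emits {len(prediction)} numbers")
```

Four numbers are enough to describe one box, but the same four numbers come out for the empty room and for the scene with two people. The output size is decided when the network is built, not when the image is seen.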

The million-dollar question is: how can we build variable-sized outputs out of fixed-sized networks? Plus, how are we supposed to…


