In this blog post, I share an initial idea on how to localize objects more precisely than with a bounding box, without significantly changing the deep learning model training pipeline.
"Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them." J. Redmon
When I read these words in the "YOLOv3: An Incremental Improvement" paper, I thought it was maybe too harsh to say this about such a common structure as the bounding box. What's the problem with it? After all, object detection and instance segmentation models are created to solve different tasks, aren't they? That's correct, but only up to a point. Both aim to localize a certain object, just with a different level of precision. A segmentation mask captures an object's shape regardless of its complexity. Bounding boxes are much simpler. To expand on the word 'stupid': a bounding box tells nothing about the shape of an object or the area it actually occupies, and it very often captures too much background noise.

Sometimes this is a big deal for a Computer Vision solution. If an object's cropped image is then consumed by another model in the pipeline, these drawbacks of the bounding box may significantly hurt performance. On the other hand, object detection networks are often deployed on edge devices for real-time processing, and running complex per-pixel segmentation networks in an edge environment is a non-trivial task. Here we face a dilemma: we'd like to extract something richer than a simple rectangle, but the models that can do it require unaffordable computational resources. Knowing all this, the question becomes: can we extend object detection so that it localizes an object with segmentation-level precision while keeping the real-time performance of lightweight object detection networks?
Let’s think about it. A polygon could be a good replacement for a bounding box. However, the number of polygon vertices varies and depends on the shape complexity, so a polygon cannot be a direct output of a neural network with fixed output dimensions. As far as I can see, there are many research directions tackling polygon prediction in deep learning, and the practical applicability of these models really depends on the project specifics. I’d like to highlight one approach I’ve come across recently: PolarMask. Before we dive into the details, let’s recap the basics of the polar representation that PolarMask is built on: instead of absolute x and y values, a point is represented by an angle and a distance.
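To make the polar representation concrete, here is a minimal sketch of converting a point between absolute Cartesian coordinates and an (angle, distance) pair relative to some center. The function names are mine, just for illustration:

```python
import math

def to_polar(x, y, cx, cy):
    """Represent an absolute point (x, y) as (angle, distance) relative to a center (cx, cy)."""
    dx, dy = x - cx, y - cy
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx)  # radians in (-pi, pi]
    return angle, distance

def to_cartesian(angle, distance, cx, cy):
    """Convert an (angle, distance) pair back to absolute (x, y) coordinates."""
    return cx + distance * math.cos(angle), cy + distance * math.sin(angle)

# Round trip: point (3, 4) relative to the origin (0, 0)
angle, dist = to_polar(3.0, 4.0, 0.0, 0.0)
print(dist)  # 5.0
x, y = to_cartesian(angle, dist, 0.0, 0.0)
print(round(x, 6), round(y, 6))  # 3.0 4.0
```

The round trip is lossless, which is exactly why the representation works: once the angles are fixed, a single distance per angle pins down each point.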

Here is the idea of PolarMask: find the center of an object and cast a set of rays from it at a fixed angular interval (e.g. every 10 degrees, 360/10 = 36 rays in total). The points where the rays intersect the contour of the object are the vertices of our target polygon. The number of such points is fixed because the angles are pre-defined, so the only thing the model needs to predict is the distance from the origin (the center) along each ray. The problem of a varying number of polygon vertices is thus solved in a quite elegant way: we convert target data of undefined and diverse dimensionality into a simple, uniform representation.
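A rough sketch of how such a fixed-length target could be built from a binary segmentation mask: march outward from the center along each pre-defined angle and record where the ray leaves the object. This is my own naive illustration, assuming a filled boolean mask; the actual PolarMask implementation computes the intersections differently:

```python
import numpy as np

def mask_to_distances(mask, center, num_rays=36, max_dist=None):
    """For each of num_rays fixed angles, step from the center outward and
    record the distance (in pixels) at which the ray leaves the object mask.
    mask: 2-D boolean array (True inside the object); center: (row, col)."""
    h, w = mask.shape
    cy, cx = center
    if max_dist is None:
        max_dist = int(np.hypot(h, w))
    angles = np.linspace(0, 2 * np.pi, num_rays, endpoint=False)
    distances = np.zeros(num_rays)
    for i, theta in enumerate(angles):
        dy, dx = np.sin(theta), np.cos(theta)
        for r in range(1, max_dist):
            y, x = int(round(cy + r * dy)), int(round(cx + r * dx))
            if y < 0 or y >= h or x < 0 or x >= w or not mask[y, x]:
                distances[i] = r - 1  # last step that was still inside
                break
    return angles, distances

# Toy example: a filled 11x11 square centered at (10, 10) on a 21x21 grid
mask = np.zeros((21, 21), dtype=bool)
mask[5:16, 5:16] = True
angles, dists = mask_to_distances(mask, (10, 10))
print(dists[0])  # the ray at 0 degrees hits the right edge after 5 pixels: 5.0
```

Whatever the shape looks like, the output is always a vector of 36 numbers — exactly the kind of fixed-size target a regression head can learn.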

Another piece of information the algorithm needs is the center of the object, which is also expected in the model output. Basically, it is just a regression of two values. The tricky part, however, is which point exactly we take as the center. The middle of the bounding box is not the best option for free-form objects; it is better to take the "center of mass" as the ground-truth point, which makes it more likely that every ray reaches a good intersection point with the contour.
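The center of mass of a binary mask is just the mean of its foreground pixel coordinates. A quick illustration with a hypothetical L-shaped object, where the two candidate centers clearly diverge:

```python
import numpy as np

def mask_centroid(mask):
    """Center of mass of a binary mask: the mean of all foreground pixel
    coordinates. For free-form (e.g. concave or L-shaped) objects this is
    usually a better ray origin than the middle of the bounding box."""
    ys, xs = np.nonzero(mask)
    return ys.mean(), xs.mean()

# L-shaped object: the bounding-box middle is (4.5, 4.5), near the inner
# corner, while the center of mass is pulled toward the bulk of the shape
mask = np.zeros((10, 10), dtype=bool)
mask[0:10, 0:3] = True   # vertical bar of the "L"
mask[7:10, 3:10] = True  # horizontal bar of the "L"
print(mask_centroid(mask))
```

Note that for strongly concave shapes even the center of mass can fall outside the object, in which case some rays cross the contour more than once; that is a known limitation of the polar representation.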
So, the model output for each object is the center coordinates plus a vector of distances over a fixed angle grid.
Why do I personally find the above-mentioned idea interesting:
- From an inference-time perspective, it is almost the same as bounding box regression; we just have a few more values to regress. Moreover, a bounding box can be considered a special case of a polygon with 4 vertices. In other words, we get much more complex and generalized predictions almost for free.
- The polar representation of a polygon can be applied to a wide variety of neural network architectures because it’s just a flexible way of building the model output. For example, well-known object detection architectures such as YOLO or FCOS can be modified to produce an object polygon instead of a bounding box with relatively little effort.
- It gives an output structure that sits midway between object detection and instance segmentation, so it can serve as a compromise solution with no need to significantly alter the entire pipeline for per-pixel segmentation.
- It’s cool to see practical solutions appear in Deep Learning, a field overwhelmed with theoretical talk and benchmark fighting. It is really exciting when engineering minds arrive at such ideas and connect different pieces of human knowledge into a useful, working solution.
In the next article, I will describe my own experiments with this approach. Nothing really groundbreaking, but I wanted to check how easy it is to implement such polygon regression from scratch. Let’s keep in touch.