
Probability of accurate detection - An alternative metric for object detection

A metric for object detection designed for practical applications, combining precision and recall in an intuitive way

Photo by Joshua Sortino on Unsplash

Introduction

Object detection neural networks are usually evaluated on the COCO dataset, using metrics such as mean average precision (mAP) and mean average recall (mAR). However, such metrics may have limited meaning when deploying practical real-world applications. For example, consider a system composed of a camera feed and a robot that tags objects of interest. Let’s say your company makes money based on the number of objects tagged by the robot. To make it a little harder, let’s say there are several types of objects (several classes) and you are only interested in a few specific classes. A solid model would ideally have high recall and high precision, i.e., it does not miss many objects and does not incorrectly tag many objects, respectively. Two questions arise when using metrics such as mAP and mAR:

  1. Does it make sense to use average values over several threshold values for precision and recall? Or would it be better to use a single threshold?
  2. Is precision or recall more important? How should we weigh them against each other?

A first choice for a metric that combines recall and precision is the F1 score, defined as the harmonic mean of the two: F1 = 2 · precision · recall / (precision + recall). Harmonic mean? What the @#$ does that mean? Precisely… it is hard to grasp the physical meaning of a metric like the F1 score, which makes it a less obvious choice.

In light of these questions, let’s go back to our original problem. What we really want to know is: given a certain number of targets, what proportion of them is correctly identified? A number like 80–90% could be considered a good one. At this point, one might think this is just the recall metric. However, recall does not take false positives into account, so a model that tags everything it sees would have a recall of 100% but would definitely be a terrible model. In order to have a metric that is truly meaningful for the original problem, I have derived a metric called probability of accurate detection, which we will discuss throughout the rest of this article.

Probability of accurate detection

In order to obtain the metric we are looking for, let’s think about this probabilistically. I call the metric probability of accurate detection because that is exactly what it is: the probability of detecting the object (not missing it) and, at the same time, detecting it correctly (with the correct class). We could express this as:

PAD = P(A ∩ D) = P(D) · P(A | D)

where D stands for a detection event and A stands for an accurate (correctly classified) detection.

Ok, we already did half of the work by dividing this probability into two terms, P(D) and P(A|D). Now let’s figure out how to calculate them.

Important Definitions

To calculate the desired probabilities we need to know a few quantities, which can be determined experimentally by evaluating the model on a test dataset using different confidence thresholds. Each threshold will generate different numbers, and they can be calculated separately for each class (a small sketch of how these counts could be organized follows the definitions below).

GT – Number of ground truth target objects

TP – True positives, or number of correctly identified objects

FN – False negatives, or number of objects missed by the model

FP – False positives, or number of incorrectly identified objects (background classified as a certain class)

LM_GT – Label mismatches with respect to the ground truth

LM_D – Label mismatches with respect to the detection

To clarify the difference between LM_GT and LM_D, suppose there are two classes, class 1 and class 2. If an object with a ground truth of class 1 is classified as class 2, it counts as one LM_GT for class 1 and one LM_D for class 2.
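To make these quantities concrete, here is a minimal sketch (Python, with hypothetical names) of a per-class, per-threshold container for the counts; how predictions are matched to ground truth (e.g. by an IoU criterion) is left to whatever evaluation pipeline you already use.

```python
from dataclasses import dataclass

@dataclass
class ClassCounts:
    """Counts for one class at one confidence threshold."""
    gt: int = 0      # GT: ground truth objects of this class
    tp: int = 0      # TP: detected and classified correctly
    fn: int = 0      # FN: ground truth objects missed entirely
    fp: int = 0      # FP: background detected as this class
    lm_gt: int = 0   # LM_GT: ground truth of this class detected as another class
    lm_d: int = 0    # LM_D: detections of this class whose ground truth is another class
```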

Probability of detection – P(D)

P(D) is the probability of not missing the object. Given the definitions provided in the last section, it can be calculated as:

P(D) = (TP + LM_GT) / GT

This term is almost the same as the recall, with the addition of LM_GT in the numerator: an object counted as a label mismatch was still detected, just assigned the wrong class, so it should not count as a miss.

Probability of accurate given detection – P(A|D)

If you think carefully, this term is the definition of precision, computed over every detection assigned to the class (correct, background, or label mismatch). Therefore:

P(A | D) = TP / (TP + FP + LM_D)
Full expression for probability of accurate detection

Now the final expression becomes:

PAD = P(D) · P(A | D) = [(TP + LM_GT) / GT] · [TP / (TP + FP + LM_D)]
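As a sanity check on the algebra, here is a minimal Python sketch of the single-frame computation; the function name pad and the example numbers are my own, and the formulas simply mirror the expressions above.

```python
def pad(gt, tp, fp, lm_gt, lm_d):
    """Single-frame probability of accurate detection for one class at one threshold."""
    if gt == 0:
        return 0.0
    # P(D): the object is found, possibly with the wrong label
    p_d = (tp + lm_gt) / gt
    # P(A|D): precision over all detections assigned to this class
    detections = tp + fp + lm_d
    p_a_given_d = tp / detections if detections > 0 else 0.0
    return p_d * p_a_given_d

# Hypothetical counts: 100 ground truth objects, 70 correct detections,
# 10 background false positives, 5 misclassified ground truths, 8 mismatched detections
print(pad(gt=100, tp=70, fp=10, lm_gt=5, lm_d=8))  # ~0.60
```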
Considering multiple frames

We are almost there: the expression derived so far works if we are analyzing single frames. However, if we are considering a video feed, the network might have N chances to detect the object. N can be calculated from the inference time of the network and how long the object stays within the camera’s field of view. This is where it becomes important to analyze the trade-offs between more accurate versus faster models. Also check my post on speeding up inference with parallel model runs.
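As a rough illustration (all numbers here are assumed, not from the article), N follows directly from those two quantities:

```python
import math

def frames_available(time_in_view_s, inference_time_s):
    """Number of chances the network gets to detect the object."""
    return max(1, math.floor(time_in_view_s / inference_time_s))

# e.g. an object visible for 2 s and a model that takes 50 ms per inference
n = frames_available(2.0, 0.05)  # 40 chances
```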

The final metric PAD_mf (I included the mf subscript to indicate multiple frames) can be calculated by considering that the object will either be detected on the first frame, or missed on the first frame and then detected on the second, or missed on the first two and then detected on the third, and so on. Assuming each of the N frames gives an independent chance at probability PAD, this geometric series sums to:

PAD_mf = 1 − (1 − PAD)^N

There you go. For each class, all we need to do is select the threshold that maximizes PAD_mf and compare models based on this metric.
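A minimal sketch of that last step, assuming you have already computed the single-frame PAD for a class at a handful of thresholds (the values below are made up for illustration):

```python
def pad_mf(pad_single_frame, n_frames):
    """Probability of accurate detection over n_frames independent attempts."""
    return 1.0 - (1.0 - pad_single_frame) ** n_frames

def best_threshold(pad_by_threshold, n_frames):
    """pad_by_threshold maps a confidence threshold to its single-frame PAD."""
    return max(pad_by_threshold, key=lambda t: pad_mf(pad_by_threshold[t], n_frames))

# Hypothetical single-frame PAD values for one class
pad_by_threshold = {0.3: 0.62, 0.5: 0.71, 0.7: 0.66}
print(best_threshold(pad_by_threshold, n_frames=40))  # -> 0.5
```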

