
Average Precision (AP) and Mean Average Precision (mAP) are the most popular metrics used to evaluate object detection models, such as Faster R-CNN, Mask R-CNN, and YOLO, among others. The same metrics are also used to evaluate submissions in competitions like the COCO and PASCAL VOC challenges.
We can derive other metrics from AP, for example, mAP, AP50, AP75, and AP@[.5:.05:.95] (see Figures 1 and 2 below).
In this article, we will go through these concepts so that, by the end, you understand them clearly. The following figures (Fig 1 and Fig 2) show how these metrics are used in some state-of-the-art (SOTA) models.

![Fig 1 and Fig 2: AP metrics as reported in two SOTA models, the second being YOLOv3 (Source).](https://towardsdatascience.com/wp-content/uploads/2020/08/1OyDTMBek5ighamvwKxG9hg.png)

At a low level, evaluating the performance of an object detector boils down to determining if the detection is correct.
Definition of terms:
- True Positive (TP) – Correct detection made by the model.
- False Positive (FP) – Incorrect detection made by the detector.
- False Negative (FN) – A Ground-truth missed (not detected) by the object detector.
- True Negative (TN) – A background region correctly left undetected by the model. This count is not used in object detection because background regions are not explicitly annotated when preparing the ground truth.
We need another helper metric called Intersection over Union (IoU) to define those terms in a more quantifiable way.
Intersection over Union (IoU)
The IoU metric in object detection evaluates the degree of overlap between the ground truth (gt) and the prediction (pd). The ground truth and the prediction can be of any shape (a rectangular box, a circle, or even an irregular shape). It is calculated as follows:

IoU(gt, pd) = area(gt ∩ pd) / area(gt ∪ pd)
Diagrammatically, IoU is defined as follows: the area of the intersection divided by the area of the union of the ground-truth and predicted boxes.

IoU ranges between 0 and 1, where 0 means no overlap and 1 means perfect overlap between gt and pd. The IoU metric becomes useful through thresholding; that is, we need a threshold (α, say) to determine whether a detection is correct.
For an IoU threshold α, a True Positive (TP) is a detection for which IoU(gt, pd) ≥ α, and a False Positive (FP) is a detection for which IoU(gt, pd) < α. A False Negative (FN) is a ground truth that goes undetected, i.e. one for which no detection has IoU(gt, pd) ≥ α.
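In code, the IoU of two axis-aligned boxes can be sketched as below. The (x1, y1, x2, y2) corner format is an assumption made for illustration, not something fixed by the metric:

```python
def iou(gt, pd):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(gt[0], pd[0]), max(gt[1], pd[1])
    ix2, iy2 = min(gt[2], pd[2]), min(gt[3], pd[3])
    # Intersection area is zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_pd = (pd[2] - pd[0]) * (pd[3] - pd[1])
    union = area_gt + area_pd - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333..., a partial overlap
```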
Is that clear? If not, the following Figure should make the definitions clearer.
With the IoU threshold α = 0.5, TPs, FPs and FNs can be identified as shown in Fig 4 below.

Remark: The decision to mark a detection as TP or FP, and a ground truth as FN, is completely contingent on the choice of the IoU threshold α. For example, in Fig 4 above, if the IoU threshold is raised above 0.86, the detection in the first image becomes an FP, and if it is lowered below 0.24, the detection in the second image becomes a TP.
Precision and Recall
Precision is the degree of exactness of the model in identifying only relevant objects. It is the ratio of TPs over all detections made by the model.
Recall measures the ability of the model to detect all ground truths, i.e. the proportion of TPs among all ground truths.

Precision = TP / (TP + FP) and Recall = TP / (TP + FN)
A model is said to be good if it has high precision and high recall. A perfect model has zero FNs and zero FPs (precision=1 and recall=1). Often, attaining a perfect model is not feasible.
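As a quick worked example (the counts below are hypothetical, not taken from the article's example):

```python
# Hypothetical counts, not from the article's example.
tp, fp, fn = 7, 3, 2

precision = tp / (tp + fp)   # 7 / 10 = 0.70 -> 70% of the detections are correct
recall = tp / (tp + fn)      # 7 / 9  ≈ 0.78 -> 78% of the ground truths are found
print(precision, recall)
```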
Precision x Recall Curve (PR Curve)
Heads up: Before going on, look at this Figure and read its caption.

Just like the IoU described earlier, the confidence score is also thresholded. Raising the confidence threshold means the model misses more objects (more FNs, therefore lower recall and higher precision), whereas lowering it means the model makes more FPs (hence lower precision and higher recall). This means we need to find a good trade-off between precision and recall.
The precision-recall (PR) curve is a plot of precision and recall at varying confidence values. For a good model, precision and recall stay high even when the confidence score varies.
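A minimal sketch of how the PR-curve points are obtained: sort the detections by confidence in descending order, label each one TP or FP at the chosen IoU threshold, and take cumulative counts. The detections and labels below are made up purely for illustration:

```python
import numpy as np

# Each detection: (confidence, is_true_positive at the chosen IoU threshold).
# The values below are illustrative only.
detections = [(0.99, True), (0.95, False), (0.90, True), (0.80, True),
              (0.70, False), (0.60, True), (0.50, False)]
n_ground_truths = 6  # assumed total number of ground-truth boxes

detections.sort(key=lambda d: d[0], reverse=True)   # sort by confidence (descending)
tp = np.cumsum([d[1] for d in detections])          # cumulative TP count
fp = np.cumsum([not d[1] for d in detections])      # cumulative FP count
precision = tp / (tp + fp)                          # precision at each confidence cut-off
recall = tp / n_ground_truths                       # recall at each confidence cut-off

for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")
```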
Average Precision
AP@α is the Area Under the Precision-Recall Curve (AUC-PR) evaluated at the IoU threshold α. Formally, it is defined as follows:

AP@α = ∫₀¹ p(r) dr

Notation: AP@α or APα means that AP is evaluated at the IoU threshold α. If you see metrics like AP50 and AP75, they mean AP calculated at IoU=0.5 and IoU=0.75, respectively.
A high area under the PR curve means both high precision and high recall. Naturally, the raw PR curve is a zig-zag-like plot; that is, it is not monotonically decreasing. We can remove this property using interpolation. We will discuss two interpolation methods below:
- 11-point interpolation method
- All-point interpolation approach
11-point interpolation method
The 11-point AP is the average of the interpolated precision values at 11 equally spaced recall levels, namely 0.0, 0.1, 0.2, ..., 1.0. It is defined as

AP@α = (1/11) Σ_{r ∈ R} p_interp(r)

where R = {0.0, 0.1, 0.2, ..., 1.0} and

p_interp(r) = max_{r' ≥ r} p(r'),

that is, the interpolated precision at recall value r is the highest precision for any recall value r' ≥ r. If this doesn't make full sense yet, I promise it will once we go through an example.
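In the meantime, here is a minimal sketch of the 11-point rule, assuming `precision` and `recall` are the arrays produced by sweeping the sorted detections as in the PR-curve sketch above:

```python
import numpy as np

def ap_11_point(precision, recall):
    """11-point interpolated AP from per-detection precision/recall arrays."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):              # recall levels 0.0, 0.1, ..., 1.0
        mask = recall >= r
        # Interpolated precision: highest precision at any recall >= r (0 if none).
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

print(ap_11_point(precision, recall))  # using the arrays from the PR-curve sketch
```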
All-point interpolation method
In this case, interpolation is done at all positions (recall values), that is,

AP@α = Σ_n (r_{n+1} − r_n) · p_interp(r_{n+1}),  where p_interp(r_{n+1}) = max_{r' ≥ r_{n+1}} p(r')
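A sketch of the all-point version under the same assumptions: the precision values are first made monotonically decreasing from right to left, and AP is the area under the resulting step curve:

```python
import numpy as np

def ap_all_point(precision, recall):
    """All-point interpolated AP (area under the monotone precision envelope)."""
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Make precision monotonically decreasing, sweeping from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Add up the rectangles where recall actually changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(ap_all_point(precision, recall))  # again using the arrays from the PR-curve sketch
```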

Mean Average Precision (mAP)
Remark (AP and the number of classes): AP is calculated individually for each class, which means there are as many AP values as there are classes. These per-class AP values are averaged to obtain the mean Average Precision (mAP) metric.
Definition: The mean Average Precision (mAP) is the average of the AP values over all classes:

mAP = (1/n) Σ_{i=1}^{n} AP_i,  where AP_i is the AP of the i-th class and n is the number of classes.
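In code, going from per-class AP values to mAP is just an average; the class names and AP numbers below are made-up placeholders:

```python
# Hypothetical per-class AP values at a fixed IoU threshold (e.g. IoU = 0.5).
ap_per_class = {"car": 0.72, "person": 0.65, "dog": 0.81}

mAP = sum(ap_per_class.values()) / len(ap_per_class)
print(f"mAP = {mAP:.3f}")  # (0.72 + 0.65 + 0.81) / 3 ≈ 0.727
```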

Remark (AP and IoU): As mentioned earlier, AP is calculated at a given IoU threshold α, so AP can also be calculated over a range of thresholds. Microsoft COCO calculates the AP of a given category/class at 10 different IoU thresholds ranging from 50% to 95% with a 5% step size, usually denoted AP@[.50:.05:.95]. Mask R-CNN reports the average of AP@[.50:.05:.95] simply as AP. It says
"We report the standard COCO metrics including AP (averaged over IoU thresholds), AP50,AP75, and APS, APM, APL(AP at different scales)" – Extract from Mask R-CNN paper
To make all these things clearer, let us go through an example.
Example
Consider the 3 images shown in Figure 5 below. They contain 12 detections (red boxes) and 9 ground truths (green). Each detection is labelled with a letter and its model confidence score. In this example, assume that all the detections belong to the same object class and that the IoU threshold is set to α = 0.5 (50 per cent). The IoU values are shown in Table 1 below.

Remark (Multiple detections): Some ground truths attract multiple detections, e.g. c and d in image 1, g and f in image 2, and i and k in image 3. For multiple detections like that, the detection with the highest confidence is labelled a TP, provided it has an IoU ≥ threshold with the ground-truth box, and all the other detections are marked as FPs (a code sketch of this rule follows the quotes below). This means that:
- c and d both become FPs because neither of them meets the threshold requirement: c and d have IoUs of 47% and 42%, respectively, against the required 50%.
- g is a TP, and f is FP. Both have IoUs greater than 50%, but g has a higher confidence of 97% against the confidence of 96% for f.
- What about i and k?
Multiple detections of the same object in an image were considered false detections e.g. 5 detections of a single object counted as 1 correct detection and 4 false detections – Source: PASCAL VOC 2012 paper.
Some detectors can output multiple detections overlapping a single ground truth. For those cases the detection with the highest confidence is considered a TP and the others are considered as FP, as applied by the PASCAL VOC 2012 challenge. -Source: A Survey on Performance Metrics for Object-Detection Algorithms paper.
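A sketch of that matching rule, reusing the `iou` function sketched earlier; each detection is assumed to be a (confidence, box) pair, detections are processed in decreasing order of confidence, and each ground truth can be claimed at most once:

```python
def match_detections(detections, ground_truths, alpha=0.5):
    """Label detections as TP/FP following the PASCAL VOC rule."""
    # Process detections in decreasing order of confidence.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(ground_truths)   # each ground truth can be claimed once
    labels = []                              # True = TP, False = FP
    for conf, box in detections:
        # Find the ground truth with the highest IoU for this detection.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths):
            overlap = iou(gt, box)           # iou() from the sketch earlier
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_iou >= alpha and not matched[best_j]:
            matched[best_j] = True
            labels.append(True)              # TP: meets the threshold, gt not yet claimed
        else:
            labels.append(False)             # FP: low IoU, or a duplicate of a
                                             # higher-confidence detection
    false_negatives = matched.count(False)   # ground truths that were never claimed
    return labels, false_negatives
```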

Important: Before filling in the cumTP, cumFP, all_detections, precision, and recall columns, you need to sort the table rows by confidence in descending order. precision is cumTP/all_detections and recall is cumTP/number_of_ground_truths. We have nine ground truths.
11-point interpolation
To calculate the approximation of AP@0.5 using 11-point interpolation, we need to average the precision values at the recall levels in R (see Equation 3), that is, at recall values 0.0, 0.1, 0.2, ..., 1.0, as shown on the right of Fig 6 below.



All-point interpolation
From the definition in Equation 5, we can calculate AP@50 using all-point interpolation as follows.



Simply put, all-point interpolation involves calculating and summing the areas of the four regions (R1, R2, R3 and R4) in the figure above, that is,

And that is all!
Remark: Recall that we said AP is calculated for each class. Our calculations are for one class. For several classes, we can calculate mAP by simply taking the average of the resulting AP values for the different classes.
Conclusion
After going through this article, I hope you now understand AP, mAP and the PR curve well. You will also have noticed that IoU is an important concept: none of these metrics can be defined without an IoU-based threshold. We have also learnt that AP is calculated per class, and the average of the resulting per-class AP values is the mAP. AP can also be calculated at different IoU thresholds.
It is also important to note that the prediction mask may not necessarily be a rectangular box. It can come in any shape, specifically the shape of the detected object. The following Figure shows an example of detections being irregular in shape.

If you found this article helpful and would like to know about Python implementation, please write to me. Email on my profile.
Join Medium at https://medium.com/@kiprono_65591/membership to get full access to every story on Medium.
You can also get the articles into your email inbox whenever I post using this link: https://medium.com/subscribe/@kiprono_65591
As always, thank you for reading 🙂