Object Detection Fundamentals

Object detection is the task of predicting a class label and a bounding box for each object in an image.

Output format

output = [(class_id, x, y, w, h, score), ...]

class_id: index of the object class
(x, y, w, h): bounding box (conventions vary: center or corner coordinates, normalized or pixel units)
score: confidence of the detection
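Because box conventions vary, converting between them is a common source of bugs. The following is a minimal sketch, assuming `(x, y, w, h)` holds center coordinates and size normalized to [0, 1]; the `Detection` class and `to_corners` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    class_id: int
    x: float      # box center x, normalized to [0, 1] (assumed convention)
    y: float      # box center y, normalized to [0, 1]
    w: float      # box width, normalized
    h: float      # box height, normalized
    score: float  # confidence in [0, 1]


def to_corners(det, img_w, img_h):
    """Convert a normalized center-format box to pixel corners (x1, y1, x2, y2)."""
    x1 = (det.x - det.w / 2) * img_w
    y1 = (det.y - det.h / 2) * img_h
    x2 = (det.x + det.w / 2) * img_w
    y2 = (det.y + det.h / 2) * img_h
    return x1, y1, x2, y2
```

If a dataset stores corner coordinates in pixels instead, the same conversion runs in reverse, so it helps to record the convention next to the data.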

Three major families

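Family	Characteristics
Two-stage	Region proposals, then classification and box regression per proposal. Typically more accurate but slower.
One-stage	Dense prediction at every feature-map location, with no proposal stage. Fast and well suited to real-time use.
DETR family	Treats detection as set prediction over object queries. No NMS required.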

Anchor

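Classic one-stage detectors tile the image with anchor boxes (reference boxes of various scales and aspect ratios) and predict a class and box offsets for each anchor. Anchor design strongly affects accuracy, and recent anchor-free detectors and DETR-style models tend to avoid anchors altogether.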
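Anchor tiling can be sketched as follows. This is an illustrative example rather than the layout of any particular detector: the `make_anchors` function, the single feature-map stride, and the choice of keeping the anchor area fixed across aspect ratios are all assumptions.

```python
import math


def make_anchors(feat_w, feat_h, stride, scales, ratios):
    """Return anchors as (cx, cy, w, h) in pixels, one set per feature-map cell."""
    anchors = []
    for j in range(feat_h):
        for i in range(feat_w):
            # Center of cell (i, j) mapped back to input-image pixels.
            cx = (i + 0.5) * stride
            cy = (j + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is width / height; the area stays s * s for every ratio.
                    w = s * math.sqrt(r)
                    h = s / math.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return anchors
```

The number of anchors grows as `feat_w * feat_h * len(scales) * len(ratios)`, which is one reason anchor design has such a large effect on both cost and accuracy.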

NMS

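Several predicted boxes often cover the same object, so Non-Maximum Suppression (NMS) removes the duplicates: it keeps the highest-scoring box and discards lower-scoring boxes whose IoU with it exceeds a threshold.

DETR-style models use set prediction, with one-to-one matching between predictions and ground-truth objects during training, so they do not need NMS.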
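The greedy NMS procedure described above can be sketched in plain Python. The corner box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions for illustration; in practice NMS is usually applied per class and with a vectorized library routine.

```python
def iou(a, b):
    """IoU of two boxes in corner format (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep
```

For example, with boxes `(0, 0, 10, 10)` and `(1, 1, 11, 11)` the IoU is 81 / 119, which exceeds 0.5, so only the higher-scoring of the two survives.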

IoU and AP

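The following metrics are commonly used for evaluation.

IoU (Intersection over Union): overlap between the predicted box and the ground-truth box
AP (Average Precision): area under the precision-recall curve
mAP: AP averaged over classes
AP50 / AP75: AP at an IoU threshold of 0.5 / 0.75

The primary metric of the COCO benchmark is AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 and over all classes; COCO uses the terms AP and mAP interchangeably.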
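To make "area under the precision-recall curve" concrete, here is a minimal sketch of AP for one class at one IoU threshold. It assumes detections are already sorted by descending score and already labeled as true or false positives; the `average_precision` name is hypothetical. It integrates the full interpolated curve, whereas the COCO evaluation samples precision at 101 recall levels, so the numbers will not match COCO tooling exactly.

```python
def average_precision(is_tp, num_gt):
    """AP from detections already sorted by descending score.

    is_tp[i] is True if detection i matched a not-yet-matched ground-truth
    box at the chosen IoU threshold; num_gt is the number of ground truths.
    """
    if num_gt == 0:
        return 0.0
    recalls, precisions = [], []
    tp = 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += int(hit)
        recalls.append(tp / num_gt)
        precisions.append(tp / rank)
    # Make precision non-increasing in recall (the usual interpolation step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

mAP is then the mean of this value over classes, and AP50 or AP75 correspond to fixing the IoU threshold used when labeling true positives.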

Detection loss in equations

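Object detection predicts a bounding box and a class at the same time. For each anchor or query $i$, the model predicts a class probability $\hat{p}_i$ and a box $\hat{\mathbf{b}}_i$, which are compared with the ground-truth class $y_i$ and box $\mathbf{b}_i$. The loss sums a classification term and a box-regression term weighted by $\lambda$: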

$$\mathcal{L}=\mathcal{L}_{\mathrm{cls}}(\hat{p}_i,y_i)+\lambda\,\mathcal{L}_{\mathrm{box}}(\hat{\mathbf{b}}_i,\mathbf{b}_i)$$

The class loss is typically cross-entropy or focal loss. Focal loss can be written as follows.

$$\mathcal{L}_{\mathrm{focal}}=-\alpha(1-p_t)^\gamma\log p_t$$

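Here $p_t$ is the predicted probability of the true class, $\alpha$ is a class-balancing weight, and $\gamma \ge 0$ is the focusing parameter. The intuition is to down-weight the large number of easily classified background examples so that training concentrates on hard examples.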
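A small numeric sketch of the binary focal loss may help show the down-weighting effect. The `focal_loss` function is a hypothetical helper; the defaults `alpha=0.25` and `gamma=2.0` follow the RetinaNet paper, and applying `alpha` to positives and `1 - alpha` to negatives is a common variant assumed here rather than something the formula above specifies.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    # Probability assigned to the true class, clamped to avoid log(0).
    p_t = max(p if y == 1 else 1.0 - p, eps)
    # Assumed variant: weight alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy; with `gamma=2`, a well-classified example with `p_t = 0.99` has its loss scaled by `0.01 ** 2` relative to that baseline.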

Box regression often uses IoU-based losses in addition to plain coordinate differences such as L1 or smooth L1.

$$\mathrm{IoU}(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

IoU measures the overlap between the predicted and ground-truth boxes directly, so optimizing it aligns the localization objective with the evaluation metric.
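As an illustration of IoU-based losses, the sketch below computes `1 - IoU` and, as one member of the family, the generalized IoU (GIoU) loss, which subtracts a penalty based on the smallest box enclosing both inputs. The function names and the corner format `(x1, y1, x2, y2)` are assumptions, and GIoU is shown only as an example variant, not as the loss used by any specific model discussed here.

```python
def box_area(b):
    """Area of a corner-format box (x1, y1, x2, y2); zero if degenerate."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box enclosing both a and b.
    enclose = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    c = box_area(enclose)
    giou = iou - (c - union) / c if c > 0 else iou
    return iou, giou


def iou_loss(pred, target):
    return 1.0 - iou_and_giou(pred, target)[0]


def giou_loss(pred, target):
    return 1.0 - iou_and_giou(pred, target)[1]
```

For two disjoint boxes the plain IoU loss stays at 1.0 no matter how far apart they are, while the GIoU loss still grows with the distance, which is one motivation for such variants.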

Main sources

Faster R-CNN: https://arxiv.org/abs/1506.01497
Mask R-CNN: https://arxiv.org/abs/1703.06870
RetinaNet (Focal Loss): https://arxiv.org/abs/1708.02002
COCO benchmark: https://cocodataset.org/
