Grounding DINO and Open-Vocabulary Detection

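Open-vocabulary object detection (OVD) is the task of detecting objects specified by arbitrary text queries rather than by a fixed set of classes. Representative models include Grounding DINO, OWL-ViT, and GLIP, which bring vision-language pre-training of the kind popularized by CLIP into detection.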

What makes it "open-vocabulary"?

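A conventional object detector is trained on a fixed label set, such as the 80 classes of COCO or the roughly 1,200 classes of LVIS. OVD aims to detect objects even when they are specified by:

Classes that were not included in training
Free-form text phrases (e.g., "a red mug on the table")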

Grounding DINO

Grounding DINO is an OVD model that couples the Transformer-based DINO detector (the DETR-family detector, not the self-supervised DINO method) with a BERT text encoder.

Its main characteristics are:

Deep fusion of text and image features (cross-attention in both the encoder and the decoder)
Detection from free-form phrases
Unified treatment of detection and grounding
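To make the "deep fusion" idea concrete, here is a minimal sketch of one bidirectional cross-attention step between image tokens and text tokens. It is an illustration only: it assumes single-head attention with no learned projections, and the names `cross_attend` and `fusion_layer` and the toy 2-d tokens are hypothetical, not part of any Grounding DINO code.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one row per token


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Each query row becomes itself plus an attention-weighted mix of context rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores of this query against every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        out.append([a + b for a, b in zip(q, mixed)])  # residual connection
    return out


def fusion_layer(image_tokens: Matrix, text_tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """One fusion step: image attends to text, and text attends to image."""
    new_image = cross_attend(image_tokens, text_tokens)
    new_text = cross_attend(text_tokens, image_tokens)
    return new_image, new_text


# Toy tokens (hypothetical values, no learned weights).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
for _ in range(2):  # stacking several such layers is what makes the fusion "deep"
    image_tokens, text_tokens = fusion_layer(image_tokens, text_tokens)
print(image_tokens)
print(text_tokens)
```

After each layer, every image token carries information from the text and every text token carries information from the image, while the number and dimension of tokens stay unchanged.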

Representative OVD models

Model	Key idea
OWL-ViT / OWLv2	Simple OVD built on a ViT with CLIP-style image-text alignment
GLIP	Grounded language-image pre-training
Grounding DINO	DINO detector with deep text-image fusion; strong accuracy
YOLO-World	YOLO-based, real-time OVD
OmDet / DetCLIP	Large-scale grounded pre-training

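Combining with SAM

Combining OVD with SAM (Segment Anything Model) yields a pipeline that goes from arbitrary text to bounding boxes to masks (e.g., Grounded SAM).

This pipeline is used in many applications, including dataset creation, 3D scene editing, and specifying target objects in robotics.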
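The text → bounding box → mask flow can be sketched as a small function that chains a detector and a segmenter. Everything below is hypothetical scaffolding: `text_to_masks`, `fake_detect`, `fake_segment`, the `min_score` threshold, and the period-separated prompt are assumptions for illustration, and the stand-in functions replace real models so the sketch runs without any weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with x2 and y2 exclusive
Mask = List[List[bool]]


@dataclass
class Detection:
    phrase: str
    box: Box
    score: float


@dataclass
class Segment:
    phrase: str
    box: Box
    score: float
    mask: Mask


def text_to_masks(
    image: Image,
    prompt: str,
    detect: Callable[[Image, str], List[Detection]],
    segment: Callable[[Image, Box], Mask],
    min_score: float = 0.3,
) -> List[Segment]:
    """Run text -> boxes -> masks, dropping low-confidence boxes before segmentation."""
    results = []
    for det in detect(image, prompt):
        if det.score < min_score:
            continue
        results.append(Segment(det.phrase, det.box, det.score, segment(image, det.box)))
    return results


# Stand-ins so the sketch runs without model weights (not real model APIs).
def fake_detect(image: Image, prompt: str) -> List[Detection]:
    """Hypothetical detector: returns fixed boxes for the phrases in the prompt."""
    known = {"mug": ((1, 1, 3, 3), 0.9), "table": ((0, 0, 4, 4), 0.2)}
    phrases = [p.strip() for p in prompt.split(".") if p.strip()]
    return [Detection(p, *known[p]) for p in phrases if p in known]


def fake_segment(image: Image, box: Box) -> Mask:
    """Hypothetical segmenter: marks bright pixels (> 0) that lie inside the box."""
    x1, y1, x2, y2 = box
    return [[y1 <= y < y2 and x1 <= x < x2 and px > 0 for x, px in enumerate(row)]
            for y, row in enumerate(image)]


image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 0, 0],
         [0, 0, 0, 0]]
segments = text_to_masks(image, "mug . table", fake_detect, fake_segment)
for seg in segments:
    print(seg.phrase, seg.box, sum(map(sum, seg.mask)), "mask pixels")
```

Passing the detector and segmenter as callables keeps the pipeline independent of any particular model pair, so either stage can be swapped without touching the glue code.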

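The matching in open-vocabulary detection, in equations

An open-vocabulary detector uses the inner product between an object query $\mathbf{q}_i$ and the embedding $\mathbf{t}_c$ of a word in the text query as the class logit.

$$s_{ic}=\mathbf{q}_i^\top\mathbf{t}_c$$

Instead of a classifier over a fixed vocabulary, the embeddings produced by the text encoder are used directly as class prototypes, so the detector can be applied to unseen classes simply by swapping the vocabulary.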
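As a small worked example of $s_{ic}=\mathbf{q}_i^\top\mathbf{t}_c$, the sketch below scores the same object queries against two different vocabularies. The helper names (`dot`, `class_logits`) and the 3-d embedding values are made up for illustration; a real model would obtain them from its decoder and text encoder.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def class_logits(queries: List[Vector], prototypes: Dict[str, Vector]) -> List[Dict[str, float]]:
    """Return s[i][c] = q_i . t_c for every object query i and every class name c."""
    return [{name: dot(q, t) for name, t in prototypes.items()} for q in queries]


# Toy 3-d embeddings (hypothetical values).
queries = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.0]]
vocab_a = {"mug": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
vocab_b = {"laptop": [0.0, 0.0, 2.0]}  # a class that is absent from vocab_a

logits_a = class_logits(queries, vocab_a)
logits_b = class_logits(queries, vocab_b)  # same queries, new vocabulary
print(logits_a)
print(logits_b)
```

Nothing about the queries changes between the two calls; only the dictionary of text prototypes does, which is the sense in which the vocabulary is "open".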

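During training, Hungarian matching assigns predicted queries to ground-truth objects, each annotated with a box and its text, and every matched pair is trained with a combination of localization and alignment losses.

$$\mathcal{L}=\mathcal{L}_{\mathrm{align}}(s_{ic},y_{ic}) +\lambda_{\mathrm{box}}\mathcal{L}_{\mathrm{box}}(\hat{\mathbf{b}}_i,\mathbf{b}_i) +\lambda_{\mathrm{GIoU}}\mathcal{L}_{\mathrm{GIoU}}(\hat{\mathbf{b}}_i,\mathbf{b}_i)$$

Here, $\mathcal{L}_{\mathrm{align}}$ is a binary cross-entropy or contrastive loss against the corresponding words, $y_{ic}$ is the binary target indicating whether query $i$ corresponds to word $c$, $\hat{\mathbf{b}}_i$ and $\mathbf{b}_i$ are the predicted and ground-truth boxes, and $\lambda_{\mathrm{box}}$ and $\lambda_{\mathrm{GIoU}}$ are weights. The intuition behind this equation is that the model learns a geometric loss that tightens the box and a linguistic loss that ties the box to the text at the same time.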
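The matching and the loss can be sketched together on a toy example. This is a simplified illustration under several assumptions: the optimal one-to-one assignment is found by brute force over permutations rather than by the Hungarian algorithm (production code typically uses a solver such as SciPy's `linear_sum_assignment`), the matching cost is the loss itself, unmatched queries are ignored, there are at least as many predictions as targets, boxes use corner format, and the weights `lam_box=5.0` and `lam_giou=2.0` are illustrative values. All function names are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed x1 < x2 and y1 < y2
Target = Tuple[int, Box]                 # (index of the matching word, ground-truth box)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(a: Box, b: Box) -> float:
    """Generalized IoU in (-1, 1]; equals 1 for identical boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    hull = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def l1(a: Box, b: Box) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def bce_with_logits(s: float, y: float) -> float:
    """Binary cross-entropy on a raw logit s with target y, in a stable form."""
    return max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))


def pair_loss(logits: Sequence[float], pred_box: Box, target: Target,
              lam_box: float, lam_giou: float) -> float:
    """L_align + lam_box * L_box + lam_giou * L_GIoU for one (prediction, target) pair."""
    word, box = target
    align = sum(bce_with_logits(s, 1.0 if c == word else 0.0) for c, s in enumerate(logits))
    return align + lam_box * l1(pred_box, box) + lam_giou * (1.0 - giou(pred_box, box))


def match_and_loss(pred_logits: List[Sequence[float]], pred_boxes: List[Box],
                   targets: List[Target], lam_box: float = 5.0,
                   lam_giou: float = 2.0) -> Tuple[Tuple[int, ...], float]:
    """Brute-force optimal matching; returns (assignment, summed loss of matched pairs)."""
    best_total, best_assign = math.inf, ()
    # assign[j] is the index of the prediction matched to target j.
    for assign in permutations(range(len(pred_boxes)), len(targets)):
        total = sum(pair_loss(pred_logits[i], pred_boxes[i], targets[j], lam_box, lam_giou)
                    for j, i in enumerate(assign))
        if total < best_total:
            best_total, best_assign = total, assign
    return best_assign, best_total


# Three predictions, two ground-truth objects, two words in the prompt (toy values).
pred_logits = [[4.0, -4.0], [-4.0, 4.0], [-4.0, -4.0]]
pred_boxes = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 7.0, 7.0), (1.0, 1.0, 3.0, 3.0)]
targets = [(1, (5.0, 5.0, 7.0, 7.0)), (0, (0.0, 0.0, 2.0, 2.0))]
assignment, loss = match_and_loss(pred_logits, pred_boxes, targets)
print(assignment, loss)
```

Brute force is only viable for a handful of boxes, but it makes the objective explicit: the assignment is chosen to minimize the same mix of alignment and box terms that the loss then penalizes.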

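At inference time, the free-form text prompt is tokenized, and the resulting word embeddings serve as class prototypes against which a similarity is computed for each box.

$$p(c\mid i)=\sigma(s_{ic})$$

This equation expresses that each word is treated as an independent "present / absent" decision rather than as one entry of a softmax over a fixed vocabulary. As a result, multiple phrases can be passed as detection queries at once without being mutually exclusive.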
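The difference between independent sigmoid scores and a softmax can be seen on a single box whose embedding agrees with two overlapping phrases. The logits, the 0.5 threshold, and the `detect` helper below are hypothetical choices for illustration.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def detect(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every phrase whose independent probability sigma(s_ic) reaches the threshold."""
    return [name for name, s in logits.items() if sigmoid(s) >= threshold]


# One box, three candidate phrases (toy logits).
box_logits = {"mug": 3.0, "red mug": 3.0, "table": -3.0}

independent = {name: sigmoid(s) for name, s in box_logits.items()}
competing = dict(zip(box_logits, softmax(list(box_logits.values()))))

print(independent)         # each phrase is scored on its own
print(competing)           # phrases share one unit of probability mass
print(detect(box_logits))  # phrases kept under the sigmoid rule
```

With the sigmoid, both "mug" and "red mug" clear the threshold for the same box, whereas under a softmax the two phrases split the probability mass between them and neither exceeds 0.5.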

Main sources

Grounding DINO: https://arxiv.org/abs/2303.05499
OWL-ViT: https://arxiv.org/abs/2205.06230
OWLv2: https://arxiv.org/abs/2306.09683
GLIP: https://arxiv.org/abs/2112.03857
YOLO-World: https://arxiv.org/abs/2401.17270
