Depth Anything V1

Depth Anything V1 is a foundation model that aims to provide a monocular depth model that works robustly on any image. Its central idea is not to design a new architecture but to exploit a large pool of unlabeled images (roughly 62 million, according to the paper).

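Problem statement

Monocular depth estimation is an inherently difficult task because it involves scale ambiguity. In addition, there are the following problems:

Labeled depth datasets are limited in size (capturing them requires a depth sensor)
Many datasets cover only a specific domain, such as indoor or driving scenes
As a result, models tend to be vulnerable to domain shift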
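
The scale ambiguity mentioned above can be illustrated with an ideal pinhole camera. The following is a minimal sketch and is not taken from the paper; the function name `project` and the focal length value are assumptions chosen for illustration. Scaling every 3D point by the same factor leaves the projected image unchanged, so a single image alone cannot reveal absolute scale.

```python
def project(point, focal_length=500.0):
    """Project a 3D point (X, Y, Z) in camera coordinates with an ideal pinhole camera.

    `focal_length` is in pixels; 500.0 is an illustrative value only.
    """
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# A small scene, and the same scene with every point 4x larger and 4x farther away.
scene = [(1.0, 0.5, 2.0), (-0.5, 0.25, 4.0), (0.25, -1.0, 8.0)]
scale = 4.0
scaled_scene = [(scale * x, scale * y, scale * z) for x, y, z in scene]

pixels = [project(p) for p in scene]
scaled_pixels = [project(p) for p in scaled_scene]

# Both scenes land on the same pixel coordinates (up to floating-point error).
same_image = all(
    abs(u1 - u2) < 1e-9 and abs(v1 - v2) < 1e-9
    for (u1, v1), (u2, v2) in zip(pixels, scaled_pixels)
)
print(same_image)
```

Because the two scenes are indistinguishable in the image, any depth estimate from one image has to rely on learned priors rather than geometry alone.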

V1's strategy

V1 counters this problem with data scale, in three steps:

Train a base teacher on labeled depth data
Use the teacher to generate pseudo depth labels for a large number of unlabeled images
Train a student model on the combined labeled and pseudo-labeled dataset

This applies the idea of self-training to depth estimation.
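
The three steps above can be sketched as a toy pipeline. This is only an illustration of the data flow, under strong assumptions: each "image" is reduced to a single number, "depth" is a noisy linear function of it, and both teacher and student are least-squares lines. The helper names `fit_line` and `predict` are hypothetical and have nothing to do with the actual Depth Anything code.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def predict(model, xs):
    """Apply a fitted (a, b) line to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]


rng = random.Random(0)

# Toy stand-in data: a small labeled set and a much larger unlabeled set
# drawn from a wider input range (a crude proxy for more diverse domains).
labeled_x = [rng.uniform(0.0, 1.0) for _ in range(20)]
labeled_y = [2.0 * x + 1.0 + rng.gauss(0.0, 0.05) for x in labeled_x]
unlabeled_x = [rng.uniform(0.0, 5.0) for _ in range(2000)]

# Step 1: train the teacher on labeled data only.
teacher = fit_line(labeled_x, labeled_y)

# Step 2: let the teacher generate pseudo labels for the unlabeled inputs.
pseudo_y = predict(teacher, unlabeled_x)

# Step 3: train the student on the combined labeled + pseudo-labeled set.
student = fit_line(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print("teacher:", teacher)
print("student:", student)
```

In this linear toy the student ends up reproducing the teacher, which is a reminder that pseudo labels carry no new label information by themselves. The sketch shows only how the data moves through the pipeline, not the accuracy gains reported for the real model.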

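Why large-scale data helps

Depth depends not only on pure geometry but also strongly on priors. Priors such as "a person who appears this large in the image is typically about this far away" or "the ceiling of an indoor scene is usually about this high" are acquired by seeing many images.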
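
The size prior above can be made concrete with the pinhole relation Z = f * H / h, where f is the focal length in pixels, H the real height of the object, and h its height in the image. The snippet below is a small worked example; the person height of 1.7 m and the focal length of 1000 px are hypothetical values, and `distance_from_size` is a name introduced here for illustration.

```python
def distance_from_size(focal_length_px, real_height_m, pixel_height):
    """Pinhole relation Z = f * H / h for an object of known real height."""
    return focal_length_px * real_height_m / pixel_height


# Hypothetical values: a 1.7 m tall person seen by a camera with f = 1000 px.
near = distance_from_size(1000.0, 1.7, 340.0)  # the person spans 340 px
far = distance_from_size(1000.0, 1.7, 85.0)    # the person spans 85 px
print(near, far)
```

The relation only yields a distance when the real height H is known, and that knowledge is exactly the kind of prior a model has to absorb from data.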

A large pool of unlabeled images covers the following, which in turn improves domain robustness:

Diverse scene layouts
Diverse cameras (focal length, aspect ratio)
Diverse lighting

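Key contributions of V1

In summary, V1's contributions are as follows.

Repositioned monocular depth from a "narrow-domain supervised problem" to a "large-scale self-training problem"
Showed that, with appropriate data engineering, the generality of a relative depth model improves substantially
Served as the starting point for the subsequent Depth Anything V2 / V3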

Relative depth or metric depth

V1 primarily outputs relative depth. When metric depth is required, the model is fine-tuned separately on a metric depth dataset. See Relative vs Metric Depth for details.
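
To make "relative" concrete: a relative prediction is meaningful only up to an unknown scale and shift. The sketch below fits that scale and shift by least squares against a few reference values, which is a common convention when evaluating relative depth models. It is not the fine-tuning procedure mentioned above, and the function name `align_scale_shift` and all numbers are assumptions for illustration.

```python
def align_scale_shift(relative, reference):
    """Least-squares scale s and shift t minimizing sum((s * r + t - g) ** 2)."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_g = sum(reference) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    cov = sum((r - mean_r) * (g - mean_g) for r, g in zip(relative, reference))
    s = cov / var_r
    return s, mean_g - s * mean_r


# Hypothetical values: relative predictions that are correct only up to an
# unknown scale and shift, and matching reference values in metric units.
relative = [0.1, 0.3, 0.5, 0.9]
reference = [1.5 + 4.0 * r for r in relative]  # hidden mapping: s = 4.0, t = 1.5

s, t = align_scale_shift(relative, reference)
aligned = [s * r + t for r in relative]
print(s, t)
```

Such an alignment needs reference measurements for every image, which is why a model that outputs metric depth directly is usually obtained by fine-tuning instead.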

Main sources

Depth Anything paper: https://arxiv.org/abs/2401.10891
