Depth Anything Overview

Depth Anything is a family of monocular depth foundation models that robustly estimate depth for arbitrary images. Depth Anything V1 and V2 focus on monocular depth estimation, while Depth Anything 3 extends the approach to estimating spatially consistent geometry from an arbitrary number of input views, not just a single image.

Depth Anything concept

Author's own conceptual diagram. The Depth Anything family combines labeled data with large-scale unlabeled data and pseudo labels, and extends to relative depth, metric depth, and any-view geometry.

Why "Anything"?

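Depth estimation has a long research history, but many models have followed the same pattern:

Trained on a specific dataset (e.g. NYU, KITTI)
Strong under conditions similar to that dataset (indoor, outdoor, in-vehicle)
Performance drops sharply in any other domain

Depth Anything is designed to address this, with the following goals:

Work on images from diverse domains
Accept any image as input
Supply relative depth as a foundation prior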

Evolution of the family

Generation	Main extension
Depth Anything V1	Pseudo-labels large-scale unlabeled data to train a monocular depth foundation model
Depth Anything V2	Synthetic labeled data + a large teacher + large-scale pseudo labels. Improves fine detail and robustness
Depth Anything 3	Spatially consistent geometry from any-view input. Supports multi-view as well as monocular input

Role in 3D reconstruction

Within 3D reconstruction, Depth Anything is used in the following ways:

Depth condition for image editing and diffusion control
Depth supervision for NeRF / 3DGS training
Dense prior for scenes where sparse SfM is unstable
A lightweight, easy-to-use prior that sits alongside geometry models such as VGGT
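
As an illustration of the "dense prior" use case, the sketch below fits a scale and shift on the few pixels where a sparse reference depth is available (for example from SfM points) and applies them to the whole predicted map. The function name `align_to_sparse` and its arguments are hypothetical, and the sketch assumes the prediction and the sparse reference are expressed in the same space (both depth or both inverse depth); converting between the two is left out.

```python
import numpy as np

def align_to_sparse(dense_pred, sparse_ref, valid):
    """Fit scale a and shift b on valid pixels, then apply them to the whole map.

    dense_pred: (H, W) relative depth prediction (hypothetical input).
    sparse_ref: (H, W) reference values, only meaningful where valid is True.
    valid:      (H, W) boolean mask of pixels that have a reference value.
    """
    dense_pred = np.asarray(dense_pred, dtype=np.float64)
    sparse_ref = np.asarray(sparse_ref, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)
    if valid.sum() < 2:
        raise ValueError("need at least two valid reference pixels")

    # Design matrix [pred, 1], restricted to pixels with a reference value.
    A = np.stack([dense_pred[valid], np.ones(valid.sum())], axis=1)
    # Least-squares solution of A @ [a, b] ≈ sparse_ref[valid].
    (a, b), *_ = np.linalg.lstsq(A, sparse_ref[valid], rcond=None)

    return a * dense_pred + b, float(a), float(b)
```

With only a handful of reference points the fit is sensitive to outliers, so a robust variant (for example RANSAC over the same two parameters) may be preferable in practice.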

Summary of the overall picture

Details are covered in the following pages.

Page	Contents
Depth Anything V1	Robust monocular depth from large-scale unlabeled data
Depth Anything V2	Fine-grained depth from synthetic data + pseudo labels
Depth Anything 3	Any-view geometry foundation model
Relative vs Metric Depth	Differences between relative depth and metric depth, and when to use each
Depth Anything in 3D Reconstruction Pipelines	Combining with NeRF / 3DGS / SfM

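Scale-and-shift invariant training in equations

In large-scale monocular depth models such as Depth Anything, the depth scale of the training data differs substantially from dataset to dataset. For this reason, a scale-and-shift invariant loss is commonly used. Let $\hat{d}_i$ be the predicted depth and $d_i$ the reference depth at pixel $i$. The optimal affine transform is solved for first, and the comparison is made afterwards.

$$(a^*,b^*)=\arg\min_{a,b}\sum_i(a\hat{d}_i+b-d_i)^2$$

The loss is the residual measured at this optimum, averaged over the $N$ pixels.

$$\mathcal{L}_{\mathrm{ssi}} =\frac{1}{N}\sum_i\left((a^*\hat{d}_i+b^*)-d_i\right)^2$$

The intuition behind this formula is: to train jointly on many datasets whose absolute scales do not match, absorb the scale and shift for each sample and compare only the shape of the depth.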
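
The two equations above can be sketched in NumPy as follows. This is a minimal illustration rather than the training code of any Depth Anything release: the function name `ssi_loss` and the optional `mask` argument are assumptions introduced here, and the closed-form solution used is the ordinary least-squares one, $a^* = \mathrm{cov}(\hat{d}, d)/\mathrm{var}(\hat{d})$ and $b^* = \bar{d} - a^*\bar{\hat{d}}$.

```python
import numpy as np

def ssi_loss(pred, ref, mask=None):
    """Scale-and-shift invariant MSE between a predicted and a reference depth map.

    Returns (loss, a, b), where a and b are the least-squares scale and shift.
    `mask` (hypothetical) selects the pixels that have a valid reference.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    ref = np.asarray(ref, dtype=np.float64).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        pred, ref = pred[keep], ref[keep]

    # Closed-form least squares for a * pred + b ≈ ref.
    p_mean, r_mean = pred.mean(), ref.mean()
    var = np.mean((pred - p_mean) ** 2)
    cov = np.mean((pred - p_mean) * (ref - r_mean))
    a = cov / var if var > 0 else 0.0  # constant prediction: only the shift is fitted
    b = r_mean - a * p_mean

    # Mean squared residual after alignment.
    residual = a * pred + b - ref
    return float(np.mean(residual ** 2)), float(a), float(b)
```

If `ref` is an exact affine function of `pred`, the loss is zero up to floating-point error, which is the invariance the loss is named for. Any remaining loss reflects a difference in shape, not in scale or offset.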

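In addition, a multi-scale gradient loss that matches the gradients of the two depth maps is sometimes added, where $s$ indexes the resolution level.

$$\mathcal{L}_{\mathrm{grad}} =\sum_s\sum_i\left|\nabla\hat{d}_i^{(s)}-\nabla d_i^{(s)}\right|$$

This term encourages the model to match not only absolute values but also edges and changes in shape.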
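
A minimal NumPy sketch of this gradient term is shown below. Several choices are assumptions made for illustration and are not taken from the source: gradients are forward differences along x and y, $|\cdot|$ is read as the sum of the absolute x and y components, coarser levels are obtained by subsampling every $2^s$-th pixel, and `num_scales=4` is an arbitrary default. The sketch also assumes `pred` has already been brought to the scale of `ref`, since an unaligned prediction would make the term depend on scale.

```python
import numpy as np

def gradient_l1(pred, ref):
    """Sum of |grad(pred) - grad(ref)| using forward differences along x and y."""
    # The difference operator is linear, so grad(pred) - grad(ref) = grad(pred - ref).
    diff = pred - ref
    gx = np.abs(diff[:, 1:] - diff[:, :-1])  # horizontal differences
    gy = np.abs(diff[1:, :] - diff[:-1, :])  # vertical differences
    return gx.sum() + gy.sum()

def multiscale_gradient_loss(pred, ref, num_scales=4):
    """Accumulate the gradient term over num_scales resolution levels."""
    pred = np.asarray(pred, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    total = 0.0
    for s in range(num_scales):
        step = 2 ** s  # level s keeps every 2**s-th pixel (assumed subsampling)
        total += gradient_l1(pred[::step, ::step], ref[::step, ::step])
    return float(total)
```

A constant offset between the two maps contributes nothing to this term, because it has zero gradient. The term therefore responds to mismatched edges and slopes, not to a global shift.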

Main sources

Depth Anything paper: https://arxiv.org/abs/2401.10891
Depth Anything V2 project page: https://depth-anything-v2.github.io/
Depth Anything V2 paper: https://arxiv.org/abs/2406.09414
Depth Anything 3 GitHub repository: https://github.com/ByteDance-Seed/Depth-Anything-3
Depth Anything 3 paper: https://arxiv.org/abs/2511.10647
