Depth Estimation

Depth estimation is the task of estimating, for each pixel of an image, the distance from the camera to the scene surface. It is important for 3D reconstruction, robotics, AR, autonomous driving, and image editing. Recently, foundation models such as Depth Anything have greatly improved the generality of monocular depth estimation.

Depth and disparity

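Depth $Z$ is the distance from the camera to a point, measured along the optical axis. With a rectified stereo camera, depth is computed from the disparity $d$, the horizontal pixel shift of the same point between the left and right images.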

$$Z = \frac{fB}{d}$$

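Here, $f$ is the focal length (in pixels when $d$ is in pixels) and $B$ is the baseline, the distance between the two camera centers. The larger the disparity, the closer the object is to the camera.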
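As a rough illustration of the relation above, the following sketch converts a disparity map to depth with NumPy. The function name `disparity_to_depth`, the `min_disparity` threshold, and the camera values (a 640-pixel focal length and a 0.1 m baseline) are assumptions made for this example, not values taken from any particular camera or library.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to depth (meters) via Z = f * B / d.

    Pixels whose disparity is at or below min_disparity are treated as
    invalid and returned as infinity (an assumption of this sketch).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: f = 640 px, B = 0.1 m, so f * B = 64.
disp = np.array([[64.0, 32.0],
                 [8.0, 0.0]])
depth = disparity_to_depth(disp, focal_px=640.0, baseline_m=0.1)
print(depth)
```

Halving the disparity doubles the depth, and a disparity of zero corresponds to a point at infinity, which is why it is masked out here rather than divided.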

Stereo depth

Stereo depth estimates depth from an image pair captured by two calibrated cameras. The basic procedure is to search for corresponding points along the epipolar line and then obtain depth by triangulation.

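Stereo depth readily provides metric scale, but it can be unreliable in textureless regions, on reflective surfaces, and around occlusions, where correspondences are ambiguous or missing.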
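The correspondence search can be sketched as a simple block matcher. This is a minimal illustration, assuming rectified grayscale images (so the epipolar line is the same image row) and a sum-of-absolute-differences cost. The function name `match_along_row` and its parameters are hypothetical, and practical stereo methods add cost aggregation, sub-pixel refinement, and left-right consistency checks.

```python
import numpy as np

def match_along_row(left, right, row, col, half_window=2, max_disparity=16):
    """Return the integer disparity with the lowest SAD cost for one left-image pixel.

    Assumes rectified images, so the match for left[row, col] lies on the same
    row of the right image at column col - d for some d >= 0. The caller must
    keep the window inside the image.
    """
    r0, r1 = row - half_window, row + half_window + 1
    c0, c1 = col - half_window, col + half_window + 1
    patch = left[r0:r1, c0:c1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disparity + 1):
        if c0 - d < 0:  # candidate window would leave the image
            break
        candidate = right[r0:r1, c0 - d:c1 - d]
        cost = np.abs(patch - candidate).sum()
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic pair: the left image is the right image shifted by 5 pixels.
rng = np.random.default_rng(0)
right = rng.random((20, 40))
true_d = 5
left = np.roll(right, true_d, axis=1)  # left[:, x] == right[:, x - true_d]
estimated = match_along_row(left, right, row=10, col=20)
print(estimated)
```

On a textureless image every candidate window would have nearly the same cost, which is one way to see why such regions are difficult for stereo.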

Monocular depth

Monocular depth estimation estimates depth from a single image. Because a single view is geometrically ambiguous in scale, priors learned by deep networks become important.

Monocular depth models produce two kinds of output.

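Type	Meaning
Relative depth	Estimates the relative near/far relationship between points. The scale is arbitrary.
Metric depth	Estimates depth in approximately real-world distance units, such as meters.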

MiDaS and DPT are representative relative depth models, while ZoeDepth is a model that combines relative depth and metric depth.

Multi-view depth

Multi-view depth uses multiple images together with their camera poses to estimate a depth map for each view. It searches for depths that are consistent across views by using photometric consistency, plane sweeping, and cost volumes.

Methods such as MVSNet use deep learning to learn the multi-view stereo cost volume.

Challenges in depth estimation

Depth estimation becomes difficult in situations such as the following.

Textureless surface
Reflective / transparent object
Thin structure
Occlusion boundary
Dynamic object
Strong illumination change
Monocular scale ambiguity

Depth uncertainty through the equations

The stereo depth equation

$$Z=\frac{fB}{d}$$

shows that depth is inversely proportional to disparity. Here, $Z$ is depth, $f$ is the focal length, $B$ is the baseline, and $d$ is the disparity.

When the disparity has a small error $\delta d$, a first-order approximation gives the following depth error.

$$\delta Z \approx -\frac{fB}{d^2}\delta d = -\frac{Z^2}{fB}\delta d$$

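The intuition is that the farther a point is, the more the same 1-pixel disparity error is amplified into depth error, and the amplification grows with the square of the depth. A larger baseline $B$ reduces this amplification, so in a stereo camera the baseline design directly affects depth accuracy.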
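To make the amplification concrete, the sketch below evaluates the magnitude of the first-order error, $Z^2/(fB)\,|\delta d|$, for two baselines. The function name `depth_error` and the camera values (a 700-pixel focal length, baselines of 0.1 m and 0.5 m) are illustrative assumptions.

```python
import numpy as np

def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order magnitude of the depth error: |dZ| ~= Z^2 / (f * B) * |dd|."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    return depth_m ** 2 / (focal_px * baseline_m) * abs(disparity_error_px)

depths = np.array([1.0, 5.0, 20.0])  # meters
narrow = depth_error(depths, focal_px=700.0, baseline_m=0.1)
wide = depth_error(depths, focal_px=700.0, baseline_m=0.5)

for z, e_narrow, e_wide in zip(depths, narrow, wide):
    print(f"Z = {z:5.1f} m  B = 0.1 m: {e_narrow:.3f} m  B = 0.5 m: {e_wide:.3f} m")
```

Under these assumed values, a 1-pixel disparity error at 20 m corresponds to roughly 5.7 m of depth error with the 0.1 m baseline and roughly 1.1 m with the 0.5 m baseline. Since this is a first-order approximation, it is only indicative when the error is that large relative to the depth.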

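For monocular depth, the absolute scale cannot be determined geometrically from a single image alone. The scale output by a metric depth model is therefore supported by camera metadata, priors from the training data, or additional sensors. When relative depth is used for reconstruction, an affine alignment is often applied.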

$$\min_{a,b} \sum_i \left(a\hat{Z}_i+b-Z_i^{\mathrm{ref}}\right)^2$$

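Here, $\hat{Z}_i$ is the depth predicted by the model, $Z_i^{\mathrm{ref}}$ is the reference depth obtained from SfM, LiDAR, stereo, or a similar source, and $a, b$ are the scale and shift. This optimization keeps the shape of the monocular depth while adjusting the global scale and offset to match the known depth.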
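The minimization above is an ordinary linear least-squares problem in $a$ and $b$, so it has a closed-form solution. The following is a minimal sketch using `numpy.linalg.lstsq`. The function name `align_scale_shift`, the optional `mask` argument, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def align_scale_shift(pred, ref, mask=None):
    """Solve min_{a,b} sum_i (a * pred_i + b - ref_i)^2 over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    ref = np.asarray(ref, dtype=np.float64).ravel()
    if mask is None:
        # Assumption of this sketch: every finite pixel has a valid reference.
        mask = np.isfinite(pred) & np.isfinite(ref)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    # Design matrix with columns [pred, 1], so that A @ [a, b] approximates ref.
    A = np.stack([pred[mask], np.ones(mask.sum())], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, ref[mask], rcond=None)
    return a, b

# Synthetic check: the reference is an exact affine transform of the prediction.
rng = np.random.default_rng(0)
pred = rng.uniform(0.1, 1.0, size=(8, 8))
ref = 3.0 * pred + 0.5
a, b = align_scale_shift(pred, ref)
aligned = a * pred + b
print(a, b)
```

In practice the reference depth is usually sparse (for example SfM points or LiDAR returns), so the mask would select only pixels with a measurement. Plain least squares is sensitive to outliers, so a robust variant may be preferable on real data.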

Main sources

Eigen et al., “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network”, 2014: https://arxiv.org/abs/1406.2283
MiDaS paper: https://arxiv.org/abs/1907.01341
DPT paper: https://arxiv.org/abs/2103.13413
ZoeDepth paper: https://arxiv.org/abs/2302.12288
MVSNet paper: https://arxiv.org/abs/1804.02505