DUSt3R

DUSt3R (Dense and Unconstrained Stereo 3D Reconstruction) is a geometry foundation model that takes an arbitrary pair of images and directly predicts dense 3D pointmaps for both views in a single shared coordinate frame. It does not assume known camera intrinsics or poses.

What it outputs

DUSt3R takes an image pair $(I_1, I_2)$ as input and outputs two pointmaps.

$$X^{1,1}, X^{2,1} \in \mathbb{R}^{H \times W \times 3}$$

$X^{1,1}$: for each pixel of image 1, the corresponding 3D point in the camera coordinate frame of image 1
$X^{2,1}$: for each pixel of image 2, the corresponding 3D point in the camera coordinate frame of image 1

In other words, the geometry of both views comes out in a common coordinate frame.
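To make the shapes concrete, here is a minimal NumPy sketch. The arrays are random stand-ins rather than real network outputs, and the names `X11`, `X21`, and `cloud` are chosen here for illustration only.

```python
import numpy as np

H, W = 4, 6  # tiny resolution, for illustration only

# Stand-ins for the two network outputs (random values, NOT real predictions).
rng = np.random.default_rng(0)
X11 = rng.normal(size=(H, W, 3))  # view-1 pixels, in camera-1 coordinates
X21 = rng.normal(size=(H, W, 3))  # view-2 pixels, also in camera-1 coordinates

# Both maps share one frame, so fusing them into a point cloud is concatenation.
cloud = np.concatenate([X11.reshape(-1, 3), X21.reshape(-1, 3)], axis=0)

# Pixel (row v, column u) of image 2 indexes one 3D point in camera-1 coordinates.
v, u = 1, 2
point = X21[v, u]
```

No pose or intrinsics are needed to merge the two views, because the network has already placed both in the frame of camera 1.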

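Why pointmaps

Predicting pointmaps instead of depth maps has the following advantages:

Camera intrinsics do not need to be known explicitly
Correspondences between views emerge naturally as distances between pointmap entries
Camera poses can be recovered from the pointmaps with PnP or least-squares alignment

A pointmap represents geometry more directly than a depth map paired with intrinsics, because it already stores the unprojected 3D points.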
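The relationship between the two representations can be sketched in NumPy. The example assumes a pinhole camera with square pixels and a known principal point (`cx`, `cy`); the function names are hypothetical. `estimate_focal` is a plain closed-form least-squares fit, shown only to illustrate that a pointmap in its own camera frame implicitly carries the focal length. It is not presented as the estimator DUSt3R itself uses.

```python
import numpy as np

def unproject(depth, f, cx, cy):
    """Turn a depth map plus pinhole intrinsics into a pointmap of shape (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel column / row indices
    X = (u - cx) / f * depth
    Y = (v - cy) / f * depth
    return np.stack([X, Y, depth], axis=-1)

def estimate_focal(pointmap, cx, cy):
    """Least-squares focal length from a pointmap expressed in its own camera frame."""
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    a = pointmap[..., 0] / pointmap[..., 2]  # X / Z
    b = pointmap[..., 1] / pointmap[..., 2]  # Y / Z
    # Minimise sum (u - cx - f*a)^2 + (v - cy - f*b)^2 over f (closed form).
    return float(((u - cx) * a + (v - cy) * b).sum() / (a**2 + b**2).sum())

# Synthetic sloped plane: depth grows slightly from left to right.
depth = np.full((48, 64), 2.0) + 0.01 * np.arange(64)
P = unproject(depth, f=80.0, cx=31.5, cy=23.5)
f_hat = estimate_focal(P, cx=31.5, cy=23.5)  # recovers the focal length used above
```

Going from depth to a pointmap requires intrinsics, while going from a pointmap back to depth and focal length does not. That asymmetry is the sense in which the pointmap is the more direct representation.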

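Architecture

DUSt3R consists of a ViT encoder and a transformer decoder. The encoder processes each image independently, and the decoder exchanges information between the views through cross-attention. Output heads then regress the pointmaps.

Another notable feature is that the model starts from a self-supervised representation learned with CroCo (Cross-view Completion) pretraining.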
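The data flow described above can be illustrated with a toy NumPy sketch. It is not the DUSt3R implementation: the "encoder" is a single hypothetical linear map `W_enc` applied with the same weights to both images, and the "decoder" is one single-head cross-attention step without learned query/key/value projections, residual connections, or MLP blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head attention: each query token mixes in information from context tokens."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (Nq, Nc), rows sum to 1
    return weights @ context, weights

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(16, 8))      # hypothetical shared encoder weights

patches1 = rng.normal(size=(12, 16))  # 12 flattened patches of image 1
patches2 = rng.normal(size=(12, 16))  # 12 flattened patches of image 2

# Encoder: the same weights are applied to each image independently.
tokens1 = patches1 @ W_enc
tokens2 = patches2 @ W_enc

# Decoder step: each view queries the other view's tokens.
out1, attn12 = cross_attention(tokens1, tokens2)
out2, attn21 = cross_attention(tokens2, tokens1)
```

The point of the sketch is only the wiring: no information crosses between the views in the encoder, and all cross-view reasoning happens in the attention step.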

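How it is used downstream

The following quantities can be derived directly from DUSt3R's pointmaps.

Camera pose: recovered from the pointmaps with PnP or Procrustes alignment
Depth: the Z component of a pointmap expressed in its own camera frame (for example, $X^{1,1}$ for view 1)
Pixel matching: pairs of pixels that map to the same 3D point
Sparse / dense 3D reconstruction
Focal length estimation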
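As an illustration of the depth and pixel-matching items, the sketch below reads depth from the Z channel and finds matches as mutual nearest neighbours in 3D. The function name is hypothetical, the data is synthetic, and the brute-force distance matrix is only suitable for small maps.

```python
import numpy as np

def mutual_nearest_neighbors(X11, X21):
    """Return index pairs (i in view 1, j in view 2) that are mutual 3D nearest neighbours."""
    A = X11.reshape(-1, 3)
    B = X21.reshape(-1, 3)
    # Pairwise squared distances; brute force, fine only for small maps.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    nn_ab = d2.argmin(axis=1)  # for each point of view 1, closest point of view 2
    nn_ba = d2.argmin(axis=0)  # for each point of view 2, closest point of view 1
    i = np.arange(len(A))
    keep = nn_ba[nn_ab] == i   # keep only reciprocal pairs
    return np.stack([i[keep], nn_ab[keep]], axis=1)

# Synthetic check: view 2 sees the same 3D points as view 1, in shuffled pixel order.
rng = np.random.default_rng(0)
X11 = rng.normal(size=(5, 6, 3))
perm = rng.permutation(30)
X21 = X11.reshape(-1, 3)[perm].reshape(5, 6, 3)

depth1 = X11[..., 2]  # depth of view 1: Z channel of its own-frame pointmap
matches = mutual_nearest_neighbors(X11, X21)
```

Each row of `matches` holds flattened pixel indices; `divmod(index, W)` converts an index back to (row, column).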

Extension to multi-view

DUSt3R itself is pairwise, but a multi-view reconstruction can be built by globally aligning the pairwise pointmaps of many images. This makes calibration-free multi-view 3D reconstruction practical.
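One way to write such a global alignment is sketched below. The symbols are introduced here for illustration: $\mathcal{E}$ is the set of image pairs, $\mathbf{P}^{v,e}$ is the pointmap of view $v$ predicted for pair $e$, $\chi^{v}$ is the pointmap of view $v$ in the global frame, and $T_e$ and $\sigma_e$ are a rigid transform and a scale per pair.

$$\min_{\chi,\,T,\,\sigma}\ \sum_{e\in\mathcal{E}}\ \sum_{v\in e}\ \sum_{\mathbf{x}}\left\|\chi^{v}(\mathbf{x})-\sigma_e\,T_e\,\mathbf{P}^{v,e}(\mathbf{x})\right\|\quad\text{s.t.}\quad\prod_{e}\sigma_e=1$$

The product constraint rules out the trivial solution $\sigma_e=0$. The DUSt3R paper additionally weights each term by a predicted confidence, which is omitted here for brevity.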

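Pointmap regression in equations

DUSt3R can be written as pointmap regression: from an image pair $(I_1,I_2)$ it predicts the 3D point corresponding to each pixel.

$$\mathbf{P}_{1\to 1}, \mathbf{P}_{2\to 1}=f_\theta(I_1,I_2)$$

Here $\mathbf{P}_{i\to 1}\in\mathbb{R}^{H\times W\times 3}$ holds the 3D coordinates of each pixel of view $i$, expressed in the coordinate frame of view 1. These are the same pointmaps written $X^{i,1}$ earlier.

During training, the difference from the ground-truth pointmap $\mathbf{P}^*$ is minimized in a scale-invariant form.

$$\mathcal{L} =\sum_{\mathbf{x}}\frac{1}{z}\left\|\mathbf{P}(\mathbf{x})-\mathbf{P}^*(\mathbf{x})\right\|$$

The sum runs over the pixels $\mathbf{x}$ of both views, and $z$ is a normalization constant representing the scene scale. The idea behind this formula is: rather than estimating camera pose, depth, and correspondences separately, directly predict per-pixel 3D coordinates in a common coordinate frame.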
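The loss above can be sketched in NumPy as follows. How $z$ is computed is an assumption made for this example: here it is the mean distance of the ground-truth points to the origin. The function names are hypothetical and the data is synthetic.

```python
import numpy as np

def scene_scale(P):
    """Mean distance of the points to the origin; used as the normaliser z (an assumption)."""
    return float(np.linalg.norm(P.reshape(-1, 3), axis=-1).mean())

def regression_loss(P_pred, P_gt):
    """Sum over pixels of ||P - P*|| / z, with z taken from the ground truth."""
    z = scene_scale(P_gt)
    per_pixel = np.linalg.norm(P_pred - P_gt, axis=-1)  # Euclidean error per pixel
    return float(per_pixel.sum() / z)

rng = np.random.default_rng(0)
P_gt = rng.normal(size=(8, 8, 3))
P_pred = P_gt + 0.05 * rng.normal(size=(8, 8, 3))

loss = regression_loss(P_pred, P_gt)
# Same scene expressed in units 10x larger: the loss should not change.
loss_scaled = regression_loss(10.0 * P_pred, 10.0 * P_gt)
```

Dividing by $z$ is what makes the loss insensitive to the overall unit of the scene: scaling both the prediction and the ground truth by the same factor scales the error and $z$ together.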

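Camera pose and depth can be recovered from the predicted pointmaps as post-processing. For example, the relative pose between the two cameras can be obtained by least-squares (Procrustes) alignment of two predictions for the same pixels: $\mathbf{P}_{1\to 1}$, and $\mathbf{P}_{1\to 2}$ obtained by running the network on the swapped pair $(I_2,I_1)$.

$$(\hat{\sigma},\hat{\mathbf{R}},\hat{\mathbf{t}})=\arg\min_{\sigma,\mathbf{R},\mathbf{t}}\sum_{\mathbf{x}}\left\|\sigma\mathbf{R}\,\mathbf{P}_{1\to 1}(\mathbf{x})+\mathbf{t}-\mathbf{P}_{1\to 2}(\mathbf{x})\right\|^2$$

The center of camera 2 in the view-1 frame then follows as $\hat{\mathbf{c}}=-\hat{\mathbf{R}}^\top\hat{\mathbf{t}}/\hat{\sigma}$.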
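A least-squares pose recovery of this kind can be sketched with a closed-form similarity alignment (the Umeyama / Procrustes solution). The setup is an assumption made for the example: `P11` holds view-1 pixels in frame 1, and `P12` holds the same pixels in frame 2, as one would obtain by running the network on the swapped pair. Both are synthetic here, and confidence weighting and outlier rejection are left out.

```python
import numpy as np

def procrustes(A, B):
    """Least-squares similarity (s, R, t) with s * R @ a + t ~= b for paired rows of A and B."""
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - mu_a, B - mu_b
    U, S, Vt = np.linalg.svd(B0.T @ A0)       # SVD of the 3x3 cross-covariance
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))  # guard against a reflection
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (A0 ** 2).sum()
    t = mu_b - s * R @ mu_a
    return s, R, t

# Synthetic ground truth: a random proper rotation, a scale, and a translation.
rng = np.random.default_rng(0)
P11 = rng.normal(size=(100, 3))            # view-1 pixels in frame 1 (flattened)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q * np.sign(np.linalg.det(Q))     # flip sign if needed so det = +1
t_true = np.array([0.3, -0.1, 1.0])
P12 = 2.0 * P11 @ R_true.T + t_true        # the same pixels in frame 2

s, R, t = procrustes(P11, P12)
# Camera 2 sits at the origin of frame 2; solve s * R @ c + t = 0 for c.
cam2_center_in_frame1 = -R.T @ t / s
```

On real predictions the two pointmaps are noisy, so a robust variant (for example RANSAC around this solver, or PnP on 2D-3D correspondences) is usually preferable to the plain closed form.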

Going through the pointmap as a common representation lets calibration-free multi-view reconstruction be handled within the same framework.

Main sources

DUSt3R paper: https://arxiv.org/abs/2312.14132
DUSt3R project page: https://dust3r.europe.naverlabs.com/
DUSt3R GitHub: https://github.com/naver/dust3r
