4D Reconstruction Overview

4D reconstruction is the field of recovering 3D scenes that change over time. Whereas conventional 3D reconstruction deals with static 3D geometry, 4D reconstruction handles geometry, appearance, motion, and time jointly.

4D Reconstruction task map

Figure: author's own conceptual diagram. From video input, 4D reconstruction learns a canonical space, a deformation field, and a time-varying representation, and then renders novel views at novel times.

What makes it 4D

Here, 4D means the three spatial dimensions $(x, y, z)$ plus time $t$.

$$\text{4D scene} = (x, y, z, t)$$

The output is not only the 3D shape at a given time $t$; it also includes deformation over time, changes in appearance, and object motion.

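Input types

The inputs to 4D reconstruction fall mainly into three types.

Input	Description
Multi-view synchronized video	Multiple synchronized cameras capture the same dynamic scene. This provides the most information.
Monocular video	The dynamic scene is reconstructed from a single camera's video. This is the most ill-posed setting.
RGB-D / LiDAR sequence	Depth or point cloud sequences are available directly. Common in robotics and autonomous driving.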
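To make the difference between the input types concrete, the sketch below writes each one as a NumPy array. The axis order, the sizes, and the helper `observations_per_time_step` are hypothetical choices made for illustration, not a standard data format.

```python
import numpy as np

# Hypothetical array layouts, chosen only for illustration.
V, T, H, W = 4, 10, 48, 64  # views, frames, image height, image width

# Multi-view synchronized video: every camera sees every time step.
multi_view = np.zeros((V, T, H, W, 3), dtype=np.float32)

# Monocular video: a single view per time step.
monocular = np.zeros((T, H, W, 3), dtype=np.float32)

# RGB-D sequence: color plus one per-pixel depth channel.
rgbd = np.zeros((T, H, W, 4), dtype=np.float32)


def observations_per_time_step(video: np.ndarray) -> int:
    """Number of images that observe the scene at one instant."""
    return video.shape[0] if video.ndim == 5 else 1
```

With this layout, the multi-view input gives `V` images per instant, while the monocular and RGB-D inputs give one, which is one way to see why the monocular case is the least constrained.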

Representative approaches

The following approaches are commonly used in 4D reconstruction.

Approach	Core idea
Dynamic NeRF	Represents a time-varying scene with a canonical NeRF and a deformation field.
4D Gaussian Splatting	Extends 3D Gaussians along the time axis, aiming for fast rendering.
Deformation field	Learns a warp from canonical space to each time step.
Human avatar	Uses human pose / shape priors to reconstruct moving people.
Scene flow	Estimates the motion of 3D points over time.

Why it is hard

4D reconstruction is harder than 3D reconstruction, for the following reasons.

Camera motion and object motion must be separated.
Occlusion and disocclusion occur over time.
In monocular video, the ambiguity between depth and motion is strong.
Non-rigid deformation (cloth, faces, human bodies) cannot be represented by rigid geometry.
The representation must be smooth over time yet still capture fast motion.
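The depth and motion ambiguity in the monocular case can be checked numerically. The sketch below assumes a static pinhole camera at the origin with a hypothetical focal length, and shows that a near point moving slowly and a far point moving proportionally faster produce identical pixel tracks. The function `project` and all numbers are illustrative assumptions.

```python
import numpy as np


def project(points: np.ndarray, focal: float = 500.0) -> np.ndarray:
    """Pinhole projection of camera-frame 3D points (N, 3) to pixels (N, 2)."""
    return focal * points[:, :2] / points[:, 2:3]


# A point at depth 2 moving with constant velocity, observed at t = 0, 1, 2.
position = np.array([[0.2, -0.1, 2.0]])
velocity = np.array([[0.05, 0.02, 0.1]])
times = [0.0, 1.0, 2.0]

scale = 3.0  # arbitrary; any positive factor yields the same pixels

# Track of the original point.
track_near = np.concatenate([project(position + t * velocity) for t in times])

# Track of a point that is `scale` times farther away and `scale` times faster.
track_far = np.concatenate(
    [project(scale * (position + t * velocity)) for t in times]
)

ambiguous = bool(np.allclose(track_near, track_far))
```

Because the scale factor cancels in the perspective division, the two tracks agree, so image evidence alone cannot separate the two explanations in this setup. Extra views, depth sensors, or priors are needed to break the tie.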

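4D reconstruction in equations

4D reconstruction is the problem of recovering scene geometry and appearance that depend on time $t$. Instead of a static 3D representation $\mathbf{S}$, we consider a time-dependent representation $\mathbf{S}(t)$.

$$I_{v,t}\approx R(\mathbf{S}(t),\mathbf{K}_{v,t},\mathbf{T}_{v,t})$$

Here, $I_{v,t}$ is the image observed at time $t$ from view $v$. Following the usual convention, $R$ is the rendering function, and $\mathbf{K}_{v,t}$ and $\mathbf{T}_{v,t}$ are the camera intrinsics and pose for that view and time. The intuition is that, rather than reconstructing the scene at each time independently, we want to explain the observations as a single, temporally connected dynamic scene.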
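As a minimal sketch of how the relation $I_{v,t}\approx R(\mathbf{S}(t),\mathbf{K}_{v,t},\mathbf{T}_{v,t})$ turns into a fitting objective, the code below sums squared residuals over all observations. It makes strong simplifying assumptions: the scene is a small point set, the renderer `render_points` is a toy stand-in that only projects points to pixel coordinates, and the observations are 2D points rather than images. All function names and values are hypothetical.

```python
import numpy as np


def render_points(S_t, K, T_cam):
    """Toy stand-in for R: project world-space points (N, 3) to pixels (N, 2).

    K is a 3x3 intrinsic matrix, T_cam a 4x4 world-to-camera transform.
    """
    homogeneous = np.concatenate([S_t, np.ones((len(S_t), 1))], axis=1)
    cam = (T_cam @ homogeneous.T).T[:, :3]
    pix = (K @ cam.T).T
    return pix[:, :2] / pix[:, 2:3]


def reconstruction_loss(scene_fn, observations):
    """Sum of squared residuals between observations and renders."""
    total = 0.0
    for obs in observations:
        predicted = render_points(scene_fn(obs["t"]), obs["K"], obs["T"])
        total += float(np.sum((obs["pixels"] - predicted) ** 2))
    return total


def scene(t):
    """Hypothetical ground-truth S(t): two points drifting along x."""
    base = np.array([[0.0, 0.0, 4.0], [1.0, 0.0, 5.0]])
    return base + t * np.array([0.1, 0.0, 0.0])


K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T_cam = np.eye(4)  # camera placed at the world origin

# Synthetic observations generated from the true dynamic scene.
observations = [
    {"t": t, "K": K, "T": T_cam, "pixels": render_points(scene(t), K, T_cam)}
    for t in (0.0, 1.0)
]

loss_true = reconstruction_loss(scene, observations)
loss_static = reconstruction_loss(lambda t: scene(0.0), observations)
```

The true time-dependent scene explains every observation, while a static scene frozen at $t = 0$ leaves a residual at $t = 1$. That gap is what a dynamic representation is optimized to close.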

Many methods map a point $\mathbf{x}_c$ in canonical space to its position at time $t$ with a deformation field.

$$\mathbf{x}_t=\mathbf{x}_c+\Delta_\theta(\mathbf{x}_c,t)$$
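The warp $\mathbf{x}_t=\mathbf{x}_c+\Delta_\theta(\mathbf{x}_c,t)$ can be sketched with a tiny two-layer MLP in NumPy. This is an illustration under assumptions, not any particular paper's architecture: the layer sizes, the `tanh` activation, and the random untrained weights are arbitrary, and real methods typically add positional encoding and train $\theta$ from images.

```python
import numpy as np


def init_deformation_mlp(hidden=16, seed=0):
    """Random, untrained parameters theta for a two-layer MLP."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.5, size=(4, hidden)),  # input is (x, y, z, t)
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, 3)),  # output is a 3D offset
        "b2": np.zeros(3),
    }


def delta(theta, x_c, t):
    """Offset Delta_theta(x_c, t) for canonical points x_c of shape (N, 3)."""
    t_col = np.full((len(x_c), 1), t)
    inputs = np.concatenate([x_c, t_col], axis=1)
    hidden = np.tanh(inputs @ theta["W1"] + theta["b1"])
    return hidden @ theta["W2"] + theta["b2"]


def warp(theta, x_c, t):
    """x_t = x_c + Delta_theta(x_c, t)."""
    return x_c + delta(theta, x_c, t)


theta = init_deformation_mlp()
x_c = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, -0.2]])

x_t0 = warp(theta, x_c, 0.0)
x_t1 = warp(theta, x_c, 1.0)

# Displacement of the same canonical points between two times.
scene_flow = x_t1 - x_t0
```

Because every time step is expressed relative to the same canonical points, the difference `x_t1 - x_t0` gives a per-point 3D motion between the two times. This is one simple way to read scene flow off a deformation field.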

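$\Delta_\theta$ is the displacement, with $\theta$ the parameters of the deformation field. When temporal regularization is used, abrupt changes over time are penalized as follows.

$$\mathcal{L}_{\mathrm{temp}}=\sum_t\|\mathbf{S}(t+1)-\mathbf{S}(t)\|^2$$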
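The temporal term can be computed directly when $\mathbf{S}(t)$ is stored as a sequence of point sets. The sketch below assumes an array of shape `(T, N, 3)` with one row of points per discrete time step. The toy trajectories and the name `temporal_loss` are hypothetical.

```python
import numpy as np


def temporal_loss(states):
    """L_temp = sum_t ||S(t+1) - S(t)||^2 for states of shape (T, N, 3)."""
    diffs = states[1:] - states[:-1]
    return float(np.sum(diffs ** 2))


times = np.arange(5, dtype=np.float64)
base = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
direction = np.array([0.1, 0.0, 0.0])

# Smooth trajectory: both points drift at constant velocity.
smooth = base[None] + times[:, None, None] * direction

# Same trajectory, but one frame jumps away from it.
jitter = smooth.copy()
jitter[2] += 0.5

loss_smooth = temporal_loss(smooth)
loss_jitter = temporal_loss(jitter)
```

The jittered sequence is penalized more heavily than the smooth one. Note that this term also penalizes legitimate fast motion, which is why it is usually weighted rather than applied as a hard constraint.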

In practice, however, topology changes and occlusions mean that simply enforcing smoothness is not enough.

Main sources

Deformable 3D Gaussians, CVPR 2024: https://openaccess.thecvf.com/content/CVPR2024/html/Yang_Deformable_3D_Gaussians_for_High-Fidelity_Monocular_Dynamic_Scene_Reconstruction_CVPR_2024_paper.html
4D Gaussian Splatting paper: https://arxiv.org/abs/2402.03307
Dynamic NeRF survey: https://arxiv.org/abs/2210.12272
