Open-Vocabulary 3D Perception

Open-vocabulary 3D perception is the field of finding and segmenting concepts in a 3D scene from text queries, without being tied to a fixed class set. It works by lifting the representations of CLIP and other vision-language models into 3D.

Limits of fixed classes

Conventional 3D semantic segmentation assumes a fixed label set, such as the 20 ScanNet classes or the SemanticKITTI classes. A real-world robot, however, has to understand unknown objects and affordances.

Examples:

“a place to sit”
“red mug”
“something fragile”
“openable cabinet”

From 2D foundation models to 3D

Methods such as OpenScene associate CLIP features from 2D images with 3D points or voxels, so that each 3D point carries a language-aligned feature.
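The 2D-to-3D association can be illustrated with a minimal sketch. It is not OpenScene's actual implementation: it assumes a pinhole camera with intrinsics `K` and a world-to-camera pose `T_cam_from_world`, and the function names `project_points` and `gather_pixel_features` are hypothetical. Occlusion is deliberately ignored here; a real pipeline would typically also compare against a depth map.

```python
import numpy as np

def project_points(points_world, K, T_cam_from_world, image_hw):
    """Project N x 3 world points into one view (pinhole model, assumed).

    Returns (pixels, valid). pixels is N x 2 integer (col, row); valid marks
    points in front of the camera and inside the image. No occlusion test.
    """
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])   # N x 4 homogeneous
    cam = (T_cam_from_world @ homo.T).T[:, :3]          # N x 3, camera frame
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)                 # avoid dividing by <= 0
    uv = (K @ cam.T).T                                  # N x 3
    cols = np.floor(uv[:, 0] / safe_z).astype(int)
    rows = np.floor(uv[:, 1] / safe_z).astype(int)
    h, w = image_hw
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    return np.stack([cols, rows], axis=1), in_front & inside

def gather_pixel_features(feature_map, pixels, valid):
    """Look up H x W x D per-pixel features at the projected locations.

    Points that are not valid in this view keep an all-zero feature.
    """
    out = np.zeros((pixels.shape[0], feature_map.shape[2]), dtype=feature_map.dtype)
    out[valid] = feature_map[pixels[valid, 1], pixels[valid, 0]]
    return out
```

Here `feature_map` stands for a per-pixel, language-aligned feature image produced by some 2D model. Running these two functions once per view yields the per-view observations of each 3D point.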

Value for 3D reconstruction

Open-vocabulary 3D perception turns a 3D map from mere geometry into a knowledge representation that can be operated on through language.

Instruct a robot to “go to the chair”
Select “all mugs” from a 3D scene
Edit a NeRF / 3DGS scene with a text query
Search for objects in AR using natural language

Matching 3D features to text, in equations

In open-vocabulary 3D perception, each location of the 3D representation (point, voxel, Gaussian, or surface feature) carries an embedding $\mathbf{f}_{3D}(\mathbf{x})\in\mathbb{R}^d$, and the class is determined by its similarity to a text embedding $\mathbf{f}_{\mathrm{text}}(c)\in\mathbb{R}^d$.

$$s(\mathbf{x},c)=\frac{\mathbf{f}_{3D}(\mathbf{x})^\top \mathbf{f}_{\mathrm{text}}(c)}{\|\mathbf{f}_{3D}(\mathbf{x})\|\,\|\mathbf{f}_{\mathrm{text}}(c)\|}$$

Given a set of queries $\{c_k\}$, the class distribution at point $\mathbf{x}$ can be written as a softmax.

$$p(c_k\mid \mathbf{x})=\frac{\exp(s(\mathbf{x},c_k)/\tau)}{\sum_l\exp(s(\mathbf{x},c_l)/\tau)}$$
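The cosine similarity and the softmax over queries can be sketched in NumPy as follows. This is an illustration under assumptions, not a reference implementation: the function names are hypothetical, the small `eps` guard against zero-norm features is an addition, and the default `tau=0.1` is an arbitrary illustrative value.

```python
import numpy as np

def cosine_similarity(point_feats, text_feats, eps=1e-8):
    """s(x, c) for every point/query pair: (N, d) and (K, d) -> (N, K)."""
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    t = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + eps)
    return p @ t.T

def query_probabilities(similarity, tau=0.1):
    """p(c_k | x): softmax over the query axis with temperature tau."""
    logits = similarity / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Toy usage: 2 points, 2 queries, d = 2 (values are made up).
point_feats = np.array([[1.0, 0.0], [0.0, 2.0]])
text_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = query_probabilities(cosine_similarity(point_feats, text_feats))
labels = probs.argmax(axis=1)  # index of the best-matching query per point
```

Because the softmax runs only over the supplied queries, the probabilities are relative to that query set: adding or removing a query changes every $p(c_k\mid\mathbf{x})$.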

When lifting 2D CLIP-like features into 3D, the observations from multiple views $\{\mathbf{f}_{2D}^{(i)}(\mathbf{u}_i(\mathbf{x}))\}$ are fused.

$$\mathbf{f}_{3D}(\mathbf{x})=\frac{\sum_i w_i(\mathbf{x})\,\mathbf{f}_{2D}^{(i)}(\mathbf{u}_i(\mathbf{x}))}{\sum_i w_i(\mathbf{x})}$$

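The terms have the following meanings.

$\mathbf{u}_i(\mathbf{x})$ is the pixel location obtained by projecting the 3D point $\mathbf{x}$ into view $i$.
$w_i(\mathbf{x})$ is a weight that depends on visibility, occlusion, and view angle.
$\tau$ is the softmax temperature; it controls how sharply the similarities are compared, with a smaller $\tau$ concentrating probability on the best-matching query.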
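The weighted multi-view fusion can be sketched as a weighted mean over a view axis. This is one simple reading of the fusion formula, under assumptions: the name `fuse_view_features` is hypothetical, the weights are taken as given (how to derive them from visibility or view angle is left open), and points with zero total weight are flagged rather than divided by zero.

```python
import numpy as np

def fuse_view_features(view_feats, weights, eps=1e-8):
    """Weighted mean of per-view features.

    view_feats: (V, N, d) feature of each of N points as seen from V views.
    weights:    (V, N) non-negative w_i(x); 0 means "not visible in view i".
    Returns (fused, observed). fused is (N, d); observed is (N,) and False
    for points with zero total weight, whose fused feature stays at zero.
    """
    total = weights.sum(axis=0)                              # (N,)
    observed = total > eps
    num = (weights[:, :, None] * view_feats).sum(axis=0)     # (N, d)
    fused = np.zeros(num.shape, dtype=float)
    fused[observed] = num[observed] / total[observed][:, None]
    return fused, observed

# Toy usage: 2 views, 2 points, d = 2 (values are made up).
view_feats = np.array([[[1.0, 0.0], [5.0, 5.0]],
                       [[0.0, 1.0], [7.0, 7.0]]])
weights = np.array([[3.0, 0.0],
                    [1.0, 0.0]])   # the second point is seen by no view
fused, observed = fuse_view_features(view_feats, weights)
```

The fused feature is generally not unit-norm, which is harmless because the cosine similarity normalizes both vectors. Unobserved points should be excluded from querying rather than scored against their zero vectors.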

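The intuition behind these equations is to perform 3D segmentation and detection as similarity against a text query, rather than against fixed class labels. Changing the query is enough to switch the prediction target. On the other hand, the fusion strategy used when 2D features disagree across views strongly affects quality.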

Main sources

OpenScene, CVPR 2023: https://openaccess.thecvf.com/content/CVPR2023/html/Peng_OpenScene_3D_Scene_Understanding_With_Open_Vocabularies_CVPR_2023_paper.html
OpenIns3D: https://arxiv.org/abs/2309.00616
OpenSUN3D challenge: https://arxiv.org/abs/2402.15321
