Triangulation, PnP, and ICP

Triangulation、Perspective-n-Point（PnP）、Iterative Closest Point（ICP）は、3D Reconstruction と pose estimation で頻繁に使われる基本 component です。SfM では、triangulation が sparse point を作り、PnP が新しい camera を登録し、Bundle Adjustment がその結果を全体として整えます。RGB-D や LiDAR の reconstruction では、ICP が point cloud や surface の位置合わせを担います。

Triangulation

Triangulation は、複数 view における対応点から 3D point を復元する処理です。二つ以上の camera pose と対応する image point が既知であるとき、それらの viewing ray の交点として 3D point を推定します。

理想的には ray は一点で交わりますが、実際には noise があるため、最小二乗問題として最も整合する 3D point を求めます。

Camera projection matrix を $\mathbf{P}_i = \mathbf{K}_i[\mathbf{R}_i \mid \mathbf{t}_i]$ とします。Homogeneous 3D point を $\tilde{\mathbf{X}}$ 、image $i$ 上の homogeneous pixel を $\tilde{\mathbf{x}}_i$ とすると、理想的には次の関係が成り立ちます。

\tilde{\mathbf{x}}_i \sim \mathbf{P}_i\tilde{\mathbf{X}}

$\sim$ は、homogeneous coordinate において scale を除いて等しいことを表します。Scale を消すために cross product を使うと、次の線形制約になります。

\tilde{\mathbf{x}}_i \times (\mathbf{P}_i\tilde{\mathbf{X}}) = \mathbf{0}

この式の気持ちは、「推定した 3D point を camera $i$ に投影した方向と、実際に観測された image point の方向が一致してほしい」ということです。複数 view の式を積み重ねると、 $\mathbf{A}\tilde{\mathbf{X}} = \mathbf{0}$ という形になります。Linear triangulation では、 $\|\mathbf{A}\tilde{\mathbf{X}}\|$ を最小にする解を SVD で求めます。

ただし、linear triangulation が最小化しているのは algebraic error であり、画像上の reprojection error そのものではありません。より精密には、次の nonlinear least squares を解きます。

\min_{\mathbf{X}} \sum_i \left\|\mathbf{x}_i - \pi(\mathbf{P}_i\tilde{\mathbf{X}})\right\|^2

この式の気持ちは、「すべての view に投影し直したときに、観測された pixel にできるだけ近くなる 3D point を選ぶ」というものです。SfM では、まず linear triangulation で初期値を作り、その後に Bundle Adjustment で camera と点を一緒に最適化することが多いです。

Triangulation angle と depth uncertainty

Triangulation は、二つの viewing ray の交差角に強く依存します。交差角が小さいと、ray がほぼ平行になり、画像上の小さな誤差が奥行き方向の大きな誤差になります。

Rectified stereo の場合、depth $Z$ は focal length $f$ 、baseline $B$ 、disparity $d$ によって次のように書けます。

Z = \frac{fB}{d}

Disparity error を $\sigma_d$ とすると、depth error はおおよそ次のようになります。

\sigma_Z \approx \frac{Z^2}{fB}\sigma_d

この式は、「遠い点ほど depth が不安定になり、baseline が長いほど depth が安定する」ことを表しています。SfM の image pair selection で、十分な baseline と十分な overlap の両方が必要になるのはこのためです。

Triangulation は、Structure from Motion や SLAM の map point 生成に使われます。

Perspective-n-Point（PnP）

PnP は、既知の 3D point と、それに対応する 2D image point から、camera pose を推定する問題です。

\mathbf{x}_i \sim \mathbf{K}[\mathbf{R} \mid \mathbf{t}]\tilde{\mathbf{X}}_i

より最適化問題として書くと、PnP は次の reprojection error を小さくする $\mathbf{R}, \mathbf{t}$ を探します。

\min_{\mathbf{R},\mathbf{t}} \sum_{i=1}^{n} \rho\left( \left\|\mathbf{x}_i - \pi\left(\mathbf{K}(\mathbf{R}\mathbf{X}_i + \mathbf{t})\right)\right\|^2 \right)

ここで、 $\mathbf{X}_i$ は world coordinate の 3D point、 $\mathbf{x}_i$ は image 上の対応点、 $\rho$ は robust loss です。この式の気持ちは、「camera pose を仮定したときに、既知の 3D point が観測された 2D point に重なるようにしたい」というものです。

PnP は、既存 map に対して新しい frame の camera pose を推定する場面で重要です。SLAM では tracking の中心的な処理の一つです。Incremental SfM では、すでに登録済みの 3D point と新しい image の feature match を使って、新しい camera を map に追加します。

実用上は、RANSAC と PnP solver を組み合わせて outlier に頑健に pose を推定します。RANSAC は、少数の対応から pose candidate を作り、その pose に対して reprojection error が小さい対応を inlier として数えます。最終的には、inlier だけを使って pose を nonlinear refinement することが多いです。

PnP の degeneracy

PnP は、3D point の配置が悪いと不安定になります。たとえば、点がほぼ一直線上にある場合や、ほぼ一枚の平面上にしかない場合には、pose の一部が曖昧になりやすいです。また、image 上で対応点が狭い範囲に固まっている場合も、rotation や translation が不安定になります。

この難しさは、objective の情報量が不足していることとして理解できます。Pose を安定に推定するためには、視野の広い範囲に分布した 2D-3D correspondence が必要です。

ICP

ICP、つまり Iterative Closest Point は、二つの point cloud を align するための iterative algorithm です。Source point cloud を $\{\mathbf{p}_i\}$ 、target point cloud を $\{\mathbf{q}_j\}$ とし、source を target に合わせる rigid transform $\mathbf{R}, \mathbf{t}$ を求めます。

大まかには、次の手順を繰り返します。

一方の point cloud の各点に対して、もう一方の point cloud の最近傍点を探します。
対応点間の距離が最小になる rigid transform を推定します。
Transform を適用し、収束するまで繰り返します。

Point-to-point ICP の objective は次のように書けます。

\min_{\mathbf{R},\mathbf{t}} \sum_i \left\|\mathbf{R}\mathbf{p}_i + \mathbf{t} - \mathbf{q}_{c(i)}\right\|^2

ここで、 $c(i)$ は source point $\mathbf{p}_i$ に対応する target point の index です。この式の気持ちは、「source の各点を回転・平行移動した後に、対応する target 点へできるだけ近づけたい」ということです。

Surface normal がある場合は、point-to-plane ICP がよく使われます。Target point $\mathbf{q}_{c(i)}$ の normal を $\mathbf{n}_{c(i)}$ とすると、objective は次のようになります。

\min_{\mathbf{R},\mathbf{t}} \sum_i \left( \mathbf{n}_{c(i)}^\top(\mathbf{R}\mathbf{p}_i + \mathbf{t} - \mathbf{q}_{c(i)}) \right)^2

この式は、点と点の三次元距離そのものではなく、target surface の normal 方向のずれを小さくします。つまり、「source の点を target surface の接平面へ近づける」ことを重視します。Surface が滑らかで初期位置がある程度近い場合、point-to-plane ICP は point-to-point ICP より速く収束することがあります。

ICP は、RGB-D SLAM、LiDAR odometry、point cloud registration、surface reconstruction でよく使われます。ただし、ICP は局所最適に落ちやすいため、良い初期値、外れ値除去、downsampling、normal estimation、multi-scale strategy が重要です。

三つの役割の違い

手法	入力	出力	主な用途
Triangulation	Camera pose と 2D-2D correspondence	3D point	Sparse map の生成
PnP	3D point と 2D observation	Camera pose	新しい camera / frame の登録
ICP	3D-3D correspondence または nearest neighbor	Rigid transform	Point cloud / surface の位置合わせ

この三つは、互いに逆向きの問題として見ると理解しやすいです。Triangulation は camera が分かっている状態で点を求めます。PnP は点が分かっている状態で camera を求めます。ICP は両方が三次元で表されている状態で、座標系のずれを求めます。

主なソース

OpenCV calib3d module documentation: https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html
Hartley and Zisserman, “Multiple View Geometry in Computer Vision”: https://www.robots.ox.ac.uk/~vgg/hzbook/
Szeliski, “Computer Vision: Algorithms and Applications”: https://szeliski.org/Book/
Besl and McKay, “A Method for Registration of 3-D Shapes”, 1992: https://ieeexplore.ieee.org/document/121791

Triangulation​

Triangulation angle と depth uncertainty​

Perspective-n-Point（PnP）​

PnP の degeneracy​

ICP​

三つの役割の違い​

関連ページ​

主なソース​