Vision-Language-Action Models

A Vision-Language-Action (VLA) model takes images or video together with a language instruction as input and outputs robot actions. It extends vision-language models to robot control, a direction that has drawn considerable attention at recent robotics and AI conferences.

Basic form

$$a_t = \pi(o_t, \ell)$$

Here, $o_t$ is the visual observation, $\ell$ is the language instruction, and $a_t$ is the robot action.
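As a rough illustration of this interface, the policy can be viewed as a function from an observation and an instruction to an action. The sketch below uses hypothetical names (`Observation`, `toy_policy`, `rollout`), and the hand-written rule is only a stand-in for a learned $\pi$; it shows the signature, not a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Action = List[float]  # a_t, e.g. end-effector deltas (assumed layout)


@dataclass
class Observation:
    """Hypothetical container for the visual observation o_t."""
    pixels: Sequence[float]  # flattened image, stand-in for a camera frame


Policy = Callable[[Observation, str], Action]


def toy_policy(obs: Observation, instruction: str) -> Action:
    """Stand-in for pi(o_t, l). Not a learned model."""
    brightness = sum(obs.pixels) / len(obs.pixels)
    direction = -1.0 if "left" in instruction.lower() else 1.0
    return [direction * brightness, 0.0, 0.0]


def rollout(policy: Policy, observations: Sequence[Observation],
            instruction: str) -> List[Action]:
    """Apply the policy once per observation with a fixed instruction."""
    return [policy(obs, instruction) for obs in observations]


frames = [Observation([0.2, 0.4]), Observation([0.5, 0.5])]
actions = rollout(toy_policy, frames, "Move left")
```

The instruction stays fixed over the episode while the observation changes at every step, which is why $\ell$ carries no time index in the formula.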

RT-2

RT-2 is a representative example of fine-tuning a vision-language model to emit robot actions as tokens. It aims to transfer knowledge acquired through web-scale vision-language pretraining to robot actions.
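To make "robot action tokens" concrete, here is a minimal sketch of one common way to turn a continuous action into discrete tokens: uniform binning per dimension. The bin count (`NUM_BINS = 256`), the normalized range, and the function names are assumptions for illustration, not a description of RT-2's exact tokenizer.

```python
from typing import List

NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def action_to_tokens(action: List[float]) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: List[int]) -> List[float]:
    """Map bin indices back to the center of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (token + 0.5) * width for token in tokens]


tokens = action_to_tokens([-1.0, 0.0, 1.0])
reconstructed = tokens_to_action(action_to_tokens([0.3, -0.7]))
```

Decoding returns bin centers, so the round trip loses at most half a bin width per dimension.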

OpenVLA

OpenVLA has drawn attention as an open-source VLA model. Trained on large-scale robot demonstration data, it generates actions from visual observations and language instructions.

π0 / generalist robot policy

Work in the π0 (pi-zero) line aims at more general-purpose robot policies. To build a policy that works across diverse robots, tasks, and scenes, it combines large-scale demonstration data with transformer-based and flow-matching-based action modeling.
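For readers unfamiliar with flow matching, the sketch below shows the generic idea under simple assumptions: a straight-line path from noise to a demonstrated action, a velocity target along that path, and Euler integration at sampling time. The names (`flow_matching_target`, `euler_sample`, `oracle_velocity`) are hypothetical, and the closed-form oracle velocity replaces a learned network; this is not π0's exact formulation.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def flow_matching_target(noise: Vector, action: Vector,
                         t: float) -> Tuple[Vector, Vector]:
    """Return (x_t, u): a point on the straight path from noise to action,
    and the constant velocity a network would be regressed onto."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    u = [a - n for n, a in zip(noise, action)]
    return x_t, u


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy case: the dataset holds a single demonstrated action, so the ideal
# velocity field has a closed form and stands in for a trained model.
TARGET = [1.0, 2.0]


def oracle_velocity(x: Vector, t: float) -> Vector:
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]


x_t, u = flow_matching_target([0.0, 0.0], TARGET, t=0.25)
sample = euler_sample(oracle_velocity, noise=[0.0, 0.0])
```

In a real system the velocity would also be conditioned on the image and the instruction, and the sampled vector would typically be a chunk of several future actions rather than a single one.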

Difference from world models

A VLA model outputs actions directly, whereas a world model predicts the future.

Model	Main output	Strength
VLA	Action	End-to-end control
World model	Future state	Planning / simulation
JEPA	Future representation	Abstract predictive representation

In practice, combining VLA models with world models is an important direction.
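One possible way to combine the two, sketched here purely as an assumption, is to let the VLA propose candidate actions and let the world model score them by predicted outcome. All functions below (`propose_actions`, `predict_next_state`, `select_action`) are hypothetical stand-ins on a one-dimensional toy problem, not components of any system named above.

```python
from typing import List

State = float
Action = float


def propose_actions(state: State) -> List[Action]:
    """Hypothetical stand-in for a VLA that returns several candidate actions."""
    return [-1.0, -0.5, 0.0, 0.5, 1.0]


def predict_next_state(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned world model (exact toy dynamics here)."""
    return state + action


def select_action(state: State, goal: State) -> Action:
    """Pick the candidate whose predicted next state is closest to the goal."""
    candidates = propose_actions(state)
    return min(candidates, key=lambda a: abs(predict_next_state(state, a) - goal))


state, goal = 0.0, 1.5
trajectory = [state]
for _ in range(3):
    state = predict_next_state(state, select_action(state, goal))
    trajectory.append(state)
```

The VLA narrows the search to plausible actions, and the world model supplies the lookahead that a purely reactive policy lacks.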

VLA policy in equations

A Vision-Language-Action model can be written as a policy that outputs an action from the image observation $I_t$, the language instruction $q$, and the history $h_t$.

$$a_t \sim \pi_\theta(a_t \mid I_t, q, h_t)$$

When actions are continuous, the policy is sometimes modeled as a Gaussian.

$$\pi_\theta(a_t \mid I_t, q, h_t) = \mathcal{N}\bigl(\boldsymbol{\mu}_\theta(I_t, q, h_t),\ \boldsymbol{\Sigma}_\theta(I_t, q, h_t)\bigr)$$
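As a small worked example of the Gaussian policy, the sketch below evaluates the log-density of an action and draws a sample, assuming a diagonal covariance so that $\boldsymbol{\Sigma}_\theta$ reduces to per-dimension standard deviations. The mean and standard deviation are passed in directly; in a real model they would come from the network heads. Function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def diag_gaussian_log_prob(action: Sequence[float], mean: Sequence[float],
                           std: Sequence[float]) -> float:
    """log N(action; mean, diag(std^2)), summed over action dimensions."""
    log_prob = 0.0
    for a, m, s in zip(action, mean, std):
        z = (a - m) / s
        log_prob += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return log_prob


def sample_action(mean: Sequence[float], std: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Draw one action from the diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


mean, std = [0.1, -0.2], [1.0, 1.0]
log_prob_at_mean = diag_gaussian_log_prob(mean, mean, std)
action = sample_action(mean, std, random.Random(0))
```

Minimizing the negative of this log-probability on demonstrated actions gives the continuous-action counterpart of behavior cloning.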

When actions are tokenized, the model can be trained with the same next-token prediction objective as an LLM.

$$\mathcal{L}_{BC} = -\sum_t \log p_\theta(a_t \mid a_{<t}, I, q)$$
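The loss above can be computed directly from per-step logits over the action vocabulary. The following is a minimal sketch in plain Python: the logits are given as lists, standing in for the output of a model conditioned on $a_{<t}$, $I$, and $q$, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bc_loss(logits_per_step: Sequence[Sequence[float]],
            target_tokens: Sequence[int]) -> float:
    """Negative log-likelihood of the demonstrated action tokens, summed over steps."""
    loss = 0.0
    for logits, target in zip(logits_per_step, target_tokens):
        loss -= log_softmax(logits)[target]
    return loss


uniform_loss = bc_loss([[0.0] * 4, [0.0] * 4], [1, 3])
confident_loss = bc_loss([[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]], [1, 3])
```

With uniform logits over four tokens, each step contributes $\log 4$, so `uniform_loss` equals $2\log 4$; logits that favor the demonstrated tokens give a smaller value.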

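Here $t$ indexes action tokens, and $a_{<t}$ denotes the tokens that precede $a_t$. The intuition behind $\mathcal{L}_{BC}$ is that the policy translates a task specified in language into an action sequence that fits the state visible in the image. The difficulty of VLA lies in integrating three things into a single policy: language grounding, visual perception, and low-level control.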

Main sources

RT-2: https://arxiv.org/abs/2307.15818
OpenVLA: https://arxiv.org/abs/2406.09246
π0: https://www.physicalintelligence.company/blog/pi0
