Embodied AI Overview

Embodied AI is the research area concerned with agents that observe, understand, plan, and act in the physical world. It sits at the intersection of Computer Vision, Robotics, Language Models, and World Models.

Basic loop
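As a minimal sketch of the observe, understand, plan, act cycle named in the introduction, the loop can be written against a toy one-dimensional world. Everything here (`ToyWorld`, `understand`, `plan`, `run_loop`) is a hypothetical illustration, not an API from any robotics library.

```python
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical 1D world: the agent sits at `position` and wants to reach `goal`."""
    position: int = 0
    goal: int = 5

    def observe(self) -> int:
        # Perception input: this toy sensor returns the position directly.
        return self.position

    def step(self, action: int) -> None:
        # Actuation: the action (-1, 0, or +1) changes the world state.
        self.position += action


def understand(observation: int, goal: int) -> int:
    # State representation: signed distance from the agent to the goal.
    return goal - observation


def plan(distance: int) -> int:
    # Planning: move one step toward the goal, or stay when already there.
    if distance > 0:
        return 1
    if distance < 0:
        return -1
    return 0


def run_loop(world: ToyWorld, max_steps: int = 20) -> int:
    """Run observe -> understand -> plan -> act; return the number of actions taken."""
    for step_count in range(max_steps):
        observation = world.observe()
        distance = understand(observation, world.goal)
        action = plan(distance)
        if action == 0:
            return step_count
        world.step(action)
    return max_steps


steps = run_loop(ToyWorld(position=0, goal=5))
print(steps)
```

A real system replaces each function with a learned or engineered module, but the control flow stays the same: every action is chosen from a fresh observation.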

Required components

Component	Examples
Perception	Depth Anything, SAM, VGGT, object detection
State representation	3D map, object state, language-grounded scene graph
World model	Future state prediction, latent dynamics, V-JEPA
Policy	Diffusion Policy, VLA model, RL policy
Actuation	Robot arm, mobile base, humanoid
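To make the "State representation" row more concrete, the following is a rough sketch of a language-grounded scene graph: objects carry a language label, a 3D position, and symbolic state, and relations are stored as triples. The class names, fields, and the example objects are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One node: a named object with a 3D position and symbolic state."""
    name: str
    position: tuple[float, float, float]  # metres, hypothetical world frame
    state: dict[str, str] = field(default_factory=dict)


@dataclass
class SceneGraph:
    """Objects plus (subject, predicate, target) relations between them."""
    objects: dict[str, SceneObject] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add(self, obj: SceneObject) -> None:
        self.objects[obj.name] = obj

    def relate(self, subject: str, predicate: str, target: str) -> None:
        # Only relate objects that are already in the graph.
        if subject not in self.objects or target not in self.objects:
            raise KeyError("both objects must be added before relating them")
        self.relations.append((subject, predicate, target))

    def query(self, predicate: str, target: str) -> list[str]:
        # Answer questions such as "what is on the table?".
        return [s for s, p, t in self.relations if p == predicate and t == target]


graph = SceneGraph()
graph.add(SceneObject("table", (1.0, 0.0, 0.7)))
graph.add(SceneObject("mug", (1.1, 0.1, 0.75), {"fill": "empty"}))
graph.relate("mug", "on", "table")
print(graph.query("on", "table"))
```

Keeping names as plain strings is what lets a language instruction such as "pick up the mug on the table" be matched against the graph with a simple query.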

Relationship to 3D / 4D

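An embodied agent cannot act on flat images alone. It needs to understand where objects are, their shape and pose, how they can move, where they can be contacted, and how the scene will change in the future. For that reason, 3D Reconstruction, 4D Reconstruction, Pose Estimation, and World Models form the foundation of Embodied AI.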
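One way to see why a flat image is not enough: under the standard pinhole camera model, a pixel only becomes a reachable 3D point once a depth value is attached to it. The sketch below back-projects the same pixel at two depths. The intrinsics (`FX`, `FY`, `CX`, `CY`) are made-up values for a hypothetical 640x480 camera.

```python
def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics: focal lengths and principal point in pixels.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# The same pixel maps to different 3D points depending on depth.
near = backproject(420.0, 240.0, 1.0, FX, FY, CX, CY)
far = backproject(420.0, 240.0, 3.0, FX, FY, CX, CY)
print(near, far)
```

Without the depth estimate, the agent cannot tell these two targets apart, even though they call for different arm motions.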

The embodied agent in equations

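Embodied AI can be written as a sequential decision-making problem in which an agent that integrates vision, language, and action operates in an environment. Given an observation $o_t$, a language goal $g$, and a history $h_t$, the policy selects an action.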

$$
a_t \sim \pi_\theta(a_t \mid o_t, g, h_t)
$$
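To illustrate what sampling $a_t$ from a stochastic policy means, here is a toy softmax policy over three discrete actions. The scoring rule is hand-written rather than learned, the observation and goal are plain numbers, and the history $h_t$ is omitted for brevity; all of these are simplifying assumptions.

```python
import math
import random

ACTIONS = ["left", "stay", "right"]
MOVES = (-1.0, 0.0, 1.0)  # displacement caused by each action


def policy_probs(observation: float, goal: float, temperature: float = 1.0) -> list[float]:
    """Toy pi(a | o, g): softmax over hand-written action scores."""
    error = goal - observation
    # An action scores higher the closer it would bring the agent to the goal.
    scores = [-abs(error - move) / temperature for move in MOVES]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(observation: float, goal: float, rng: random.Random) -> str:
    """Draw a_t ~ pi(. | o_t, g)."""
    probs = policy_probs(observation, goal)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


rng = random.Random(0)
print(policy_probs(0.0, 3.0))
print(sample_action(0.0, 3.0, rng))
```

Lowering `temperature` concentrates probability on the best-scoring action, while raising it makes the policy explore more.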

The environment transitions in response to the action.

$$
s_{t+1} \sim P(s_{t+1} \mid s_t, a_t), \qquad o_t \sim O(o_t \mid s_t)
$$
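The two distributions above can be sketched as a pair of sampling functions: one advances the hidden state, the other produces a noisy observation of it. The scalar state and the Gaussian noise model are assumptions made to keep the example small.

```python
import random


def transition(state: float, action: float, rng: random.Random,
               noise_std: float = 0.1) -> float:
    """Sample s_{t+1} ~ P(. | s_t, a_t): move by the action plus Gaussian noise."""
    return state + action + rng.gauss(0.0, noise_std)


def observe(state: float, rng: random.Random, noise_std: float = 0.2) -> float:
    """Sample o_t ~ O(. | s_t): a noisy reading of the hidden state."""
    return state + rng.gauss(0.0, noise_std)


def rollout(initial_state: float, actions: list[float],
            rng: random.Random) -> list[tuple[float, float, float]]:
    """Return (state, observation, action) triples for a fixed action sequence."""
    state = initial_state
    trajectory = []
    for action in actions:
        observation = observe(state, rng)
        trajectory.append((state, observation, action))
        state = transition(state, action, rng)
    return trajectory


for state, observation, action in rollout(0.0, [1.0, 1.0, -1.0], random.Random(0)):
    print(f"s={state:.2f} o={observation:.2f} a={action:+.0f}")
```

The agent only ever sees the `observation` column; the `state` column is what the environment actually tracks.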

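The objective is to achieve task success and safety together. Here both are taken to be encoded in the reward $r(s_t, a_t, g)$, and $\gamma$ is the discount factor.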

$$
\max_\pi \; \mathbb{E}\left[\sum_t \gamma^t \, r(s_t, a_t, g)\right]
$$
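For a single trajectory, the quantity inside the expectation is a discounted sum of rewards. The sketch below computes it and uses a hypothetical reward that adds a task-success bonus and subtracts a collision penalty; the specific bonus and penalty values are assumptions, not something the formulation prescribes.

```python
def reward(reached_goal: bool, collided: bool,
           success_bonus: float = 1.0, safety_penalty: float = 5.0) -> float:
    """Hypothetical r(s, a, g): reward success, penalize unsafe contact."""
    value = 0.0
    if reached_goal:
        value += success_bonus
    if collided:
        value -= safety_penalty
    return value


def discounted_return(rewards: list[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t by accumulating from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# One episode: a free step, a collision, then reaching the goal.
episode = [reward(False, False), reward(False, True), reward(True, False)]
print(discounted_return(episode, gamma=0.9))
```

Averaging this value over many sampled episodes gives a Monte Carlo estimate of the expectation that the policy is trained to maximize.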

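The intuition behind this formulation is closed-loop intelligence: the agent does not just understand an image, it interprets a goal, acts on the environment, observes the result, and decides its next action. In Embodied AI, perception errors translate directly into action errors, so evaluation in real environments matters, not only offline benchmarks.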
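A small simulation can show how a perception error turns into an action error. In this sketch, a proportional controller drives a 1D agent toward a goal using a position estimate that carries a constant bias; the controller, the gain, and the bias model are all illustrative assumptions.

```python
def final_error(perception_bias: float, goal: float = 5.0,
                steps: int = 50, gain: float = 0.5) -> float:
    """Run a closed loop with biased perception; return final position minus goal."""
    position = 0.0
    for _ in range(steps):
        perceived = position + perception_bias  # what the agent believes
        position += gain * (goal - perceived)   # act on the belief
    return position - goal


for bias in (0.0, 0.1, 0.3):
    print(f"bias={bias:.1f} final_error={final_error(bias):+.3f}")
```

The loop converges in every case, but it settles where the agent believes the goal is, so the final error matches the perception bias in magnitude. An offline perception benchmark would report the bias, while only a closed-loop run shows where the agent actually ends up.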
