Web and Computer-Use Agents

Web and computer-use agents operate a GUI the way a human does. They navigate a browser, or click and type in arbitrary OS applications, to carry out a task. Among AI agents, this is the area closest to embodiment.

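What they can do

Web search, form filling, product purchases
Sending email, adding calendar entries
Editing spreadsheets
Operating web apps
Operating the GUI of desktop applications
Reading screenshots and making decisions from them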

Web Agent

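A web agent takes the HTML DOM, the accessibility tree, or a screenshot as input and performs actions such as:

Selecting an element and clicking it
Typing into a field
Scrolling and following links
Submitting forms

Representative benchmarks: WebArena, VisualWebArena, Mind2Web.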
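The action set above can be sketched as a small dispatcher over a simplified page model. This is only an illustration: `Element`, `Page`, and `apply_action` are hypothetical names, and the page is an in-memory stand-in for a flattened accessibility tree, not a real browser.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One interactive node, as it might appear in a flattened accessibility tree."""
    element_id: int
    role: str        # e.g. "button", "textbox", "link"
    name: str        # accessible name shown to the agent
    value: str = ""  # current text, used for textboxes


@dataclass
class Page:
    """Hypothetical in-memory page; a real agent would talk to a browser instead."""
    url: str
    elements: list
    scroll_y: int = 0
    log: list = field(default_factory=list)

    def find(self, element_id: int) -> Element:
        for el in self.elements:
            if el.element_id == element_id:
                return el
        raise KeyError(f"no element with id {element_id}")


def apply_action(page: Page, action: dict) -> None:
    """Apply one click / type / scroll action and record it in the page log."""
    kind = action["kind"]
    if kind == "click":
        el = page.find(action["element_id"])
        page.log.append(f"click {el.role}:{el.name}")
    elif kind == "type":
        el = page.find(action["element_id"])
        if el.role != "textbox":
            raise ValueError("can only type into a textbox")
        el.value = action["text"]
        page.log.append(f"type {el.name}")
    elif kind == "scroll":
        page.scroll_y = max(0, page.scroll_y + action["dy"])
        page.log.append(f"scroll {action['dy']}")
    else:
        raise ValueError(f"unknown action kind: {kind}")


page = Page(
    url="https://example.com/search",
    elements=[Element(1, "textbox", "Search"), Element(2, "button", "Go")],
)
for action in [
    {"kind": "type", "element_id": 1, "text": "laptop"},
    {"kind": "scroll", "dy": 300},
    {"kind": "click", "element_id": 2},
]:
    apply_action(page, action)
```

Addressing elements by a stable id rather than by screen coordinates is one reason structured inputs tend to be more robust than raw screenshots.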

Computer-Use Agent

A computer-use agent operates the entire OS, not just the browser.

Understands the UI by passing screenshots to a vision-language model (VLM)
Issues mouse and keyboard events
Can handle arbitrary applications

Representative systems:

System	Characteristics
Claude Computer Use	Anthropic; Claude looks at the screen and clicks
OpenAI Operator	OpenAI's web automation agent
Browser-Use	Open-source browser agent framework
Google Project Mariner	Web agent that runs inside Chrome

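Choosing an input modality

Input	Advantages	Disadvantages
DOM / HTML	Structured, robust	App-dependent; cannot reach applications outside the browser
Accessibility tree	Clean abstraction, built for screen readers	Incomplete on some websites
Screenshot + VLM	Works on any GUI	OCR and element localization are error-prone
Hybrid	Combines the strengths of the inputs above	Complex to design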
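One simple way to combine these inputs is a fallback order: use the most structured input that is available and drop to screenshots only when nothing else exists. The sketch below assumes that ordering; it is one possible hybrid design, not the only one, and `choose_observation` is a hypothetical helper.

```python
def choose_observation(dom, ax_tree, screenshot):
    """Return (source_name, payload) for the most structured input available.

    Assumed preference order: DOM, then accessibility tree, then screenshot.
    """
    if dom:
        return ("dom", dom)
    if ax_tree:
        return ("accessibility_tree", ax_tree)
    if screenshot is not None:
        return ("screenshot", screenshot)
    raise ValueError("no observation available")


# A desktop application exposes no DOM, so the accessibility tree is used.
source, payload = choose_observation(None, ["button Go"], b"raw-image-bytes")
```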

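What makes it hard

Long action sequences: tens to hundreds of steps
Dynamic UI: loading states, modals, Ajax updates
Pop-ups / CAPTCHAs: unexpected events
GUI ambiguity: multiple similar-looking buttons
Side effects: irreversible operations such as purchases and sends
Security: prompt injection embedded in web pages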

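Safety and guardrails

A production deployment needs all of the following:

Human approval for critical operations
An allowlist that restricts the agent to specific domains
A virtual environment / sandbox
Isolation of credentials
Action logging and auditing
Injection countermeasures (never trust page content)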
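Three of these guardrails (domain allowlist, human approval, audit log) can be sketched in a single check that runs before every action. The domain names, the set of sensitive action kinds, and the `guard` function are all hypothetical; a real deployment would define its own policy and persist the log.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com"}                 # hypothetical allowlist
SENSITIVE_KINDS = {"purchase", "send_email"}      # hypothetical critical operations
audit_log = []                                    # in-memory here; persist it in practice


def is_allowed_url(url: str) -> bool:
    """True if the host is an allowlisted domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def guard(action: dict, approve) -> bool:
    """Decide whether an action may run; `approve` stands in for a human reviewer."""
    if not is_allowed_url(action["url"]):
        decision = "blocked: domain not on allowlist"
    elif action["kind"] in SENSITIVE_KINDS and not approve(action):
        decision = "blocked: approval denied"
    else:
        decision = "allowed"
    audit_log.append({"action": action, "decision": decision})
    return decision == "allowed"


deny = lambda action: False   # reviewer who rejects everything
allow = lambda action: True   # reviewer who accepts everything

ok_click = guard({"kind": "click", "url": "https://example.com/a"}, deny)
denied_buy = guard({"kind": "purchase", "url": "https://shop.example.com/cart"}, deny)
approved_buy = guard({"kind": "purchase", "url": "https://shop.example.com/cart"}, allow)
off_domain = guard({"kind": "click", "url": "https://example.com.evil.net/"}, allow)
```

Matching on the parsed hostname rather than on a substring of the URL keeps look-alike hosts such as `example.com.evil.net` off the allowlist.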

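UI agent observation and action in equations

A web or computer-use agent chooses an action such as click, type, or scroll from the screen observation $o_t$ and its internal history $h_t$.

$$a_t\sim\pi_\theta(a_t\mid h_t,o_t)$$

The action then changes the screen state.

$$o_{t+1}=E(o_t,a_t)$$

Here $E$ is the environment, which includes the browser and the OS. The intuition is that operating a UI is not a one-shot answer but sequential decision-making: observe, act, inspect the result, and act again.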
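The two equations correspond to a plain observe-act loop. The sketch below uses a toy environment with three screen states and a deterministic rule-based policy standing in for the stochastic $\pi_\theta$; the state names, `TRANSITIONS`, and `run_episode` are hypothetical.

```python
# Toy environment E: maps (observation, action) to the next observation.
TRANSITIONS = {
    ("login_form", "type_credentials"): "filled_form",
    ("filled_form", "click_submit"): "dashboard",
}


def env_step(o, a):
    """o_{t+1} = E(o_t, a_t); an unknown action leaves the screen unchanged."""
    return TRANSITIONS.get((o, a), o)


def policy(history, o):
    """Stand-in for pi_theta(a_t | h_t, o_t); deterministic for illustration."""
    if o == "login_form":
        return "type_credentials"
    if o == "filled_form":
        return "click_submit"
    return "stop"


def run_episode(policy, env_step, o0, max_steps=10):
    """Observe, act, observe the result, and repeat until the policy stops."""
    history = []  # h_t: list of (observation, action) pairs so far
    o = o0
    for _ in range(max_steps):
        a = policy(history, o)
        if a == "stop":
            break
        history.append((o, a))
        o = env_step(o, a)
    return o, history


final_obs, history = run_episode(policy, env_step, "login_form")
```

The `max_steps` cap is an added safeguard for the sketch, since real episodes can run for tens to hundreds of steps.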

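A safe UI agent may apply a penalty or an approval gate to destructive actions and to actions that send data externally. In the strictest case, a blocked set of actions is given zero probability.

$$\pi_\theta(a_t\mid h_t,o_t)=0\quad \text{if } a_t\in\mathcal{A}_{\text{blocked}}$$

This can be read as a hard constraint that removes specific dangerous actions from the policy's candidates.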
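In code, this hard constraint amounts to masking the action distribution. The sketch below zeroes the blocked actions and then renormalizes the rest so the result is still a probability distribution; the renormalization step is an assumption added here, and `mask_blocked` and the action names are hypothetical.

```python
def mask_blocked(probs: dict, blocked: set) -> dict:
    """Set pi(a) = 0 for blocked actions and renormalize the remaining mass."""
    kept = {a: p for a, p in probs.items() if a not in blocked}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("every action with nonzero probability is blocked")
    return {a: (kept[a] / total if a in kept else 0.0) for a in probs}


probs = {"click_buy": 0.5, "scroll": 0.25, "type": 0.25}
masked = mask_blocked(probs, blocked={"click_buy"})
```

A penalty, by contrast, would only lower the probability of a dangerous action, so the agent could still select it.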

Main sources

WebArena: https://arxiv.org/abs/2307.13854
VisualWebArena: https://arxiv.org/abs/2401.13649
Mind2Web: https://arxiv.org/abs/2306.06070
Claude Computer Use: https://www.anthropic.com/news/3-5-models-and-computer-use
Browser-Use: https://github.com/browser-use/browser-use
