Video Transformers

Video Transformers are a family of models that extend the Vision Transformer (ViT) to video. They have become a mainstream architecture for action recognition, video classification, and video pretraining.

What makes it hard

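Video produces far more tokens than a single image. Feeding every patch of a $T \times H \times W$ grid straight into attention is prohibitively expensive, and research on video transformers has centered on keeping that cost manageable.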

The main design axis is whether to attend across all tokens, to separate space from time, or to restrict attention to local windows.

Video transformers can handle different temporal scales flexibly through tokens and attention. For long videos (minutes to hours), however, memory becomes the bottleneck, so they are combined with memory modules or clip sampling.
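As a rough illustration of clip sampling, the sketch below picks a few fixed-length clips at evenly spaced positions in a long video and returns their frame indices. The helper name `sample_clip_indices` and its parameters (`num_clips`, `clip_len`, `stride`) are hypothetical, chosen only for this example; real pipelines differ in how they place and stride clips.

```python
def sample_clip_indices(num_frames, num_clips, clip_len, stride=1):
    """Return frame indices for `num_clips` evenly spaced clips.

    Hypothetical helper: each clip holds `clip_len` frames taken
    `stride` frames apart.
    """
    span = (clip_len - 1) * stride + 1  # frames covered by one clip
    if span > num_frames:
        raise ValueError("clip does not fit in the video")
    last_start = num_frames - span  # latest valid start index
    if num_clips == 1:
        starts = [last_start // 2]  # single centered clip
    else:
        # Spread clip starts evenly between the first and last valid position.
        starts = [i * last_start // (num_clips - 1) for i in range(num_clips)]
    return [[s + j * stride for j in range(clip_len)] for s in starts]


clips = sample_clip_indices(num_frames=100, num_clips=3, clip_len=4, stride=2)
for clip in clips:
    print(clip)
```

Each clip can then be tokenized and processed on its own, so the token count per forward pass depends on `clip_len` rather than on the full video length.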

A video transformer builds a single token sequence from frames and patches. With $T$ frames and $P$ patches per frame, the number of tokens is $N = TP$. The cost of full attention is then:

$$O\big((TP)^2 d\big)$$

Here $d$ is the hidden dimension. The intuition is that adding frames grows the token count linearly, while the attention cost grows quadratically.
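To make the quadratic growth concrete, here is a small sketch that evaluates $(TP)^2 d$ for a few frame counts. The values $P = 196$ (a 14 x 14 patch grid) and $d = 768$ are assumptions picked for illustration, and constant factors are ignored.

```python
def full_attention_cost(T, P, d):
    """Approximate cost of full attention over N = T * P tokens: (TP)^2 * d."""
    n_tokens = T * P
    return n_tokens ** 2 * d


P, d = 196, 768  # assumed: 14 x 14 patches per frame, hidden size 768
base = full_attention_cost(8, P, d)
for T in (8, 16, 32):
    cost = full_attention_cost(T, P, d)
    print(f"T={T:2d}  tokens={T * P:5d}  relative cost={cost / base:.0f}x")
```

Each doubling of $T$ doubles the token count but multiplies the attention cost by four.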

For this reason, factorized attention is widely used. Splitting attention into a spatial pass and a temporal pass gives the following conceptual cost:

$$O(TP^2 d) + O(PT^2 d)$$

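The first term covers spatial relations within each frame; the second covers temporal relations across the same patch position or token group. For long videos, this factorization keeps the cost below that of full attention while still supporting both spatial and temporal understanding.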
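The two cost expressions can be compared directly. The sketch below is a back-of-the-envelope calculation that ignores constant factors, not a benchmark; $P = 196$ and $d = 768$ are again assumed values.

```python
def full_cost(T, P, d):
    """Full space-time attention: (TP)^2 * d."""
    return (T * P) ** 2 * d


def factorized_cost(T, P, d):
    """Spatial pass plus temporal pass: T * P^2 * d + P * T^2 * d."""
    spatial = T * P ** 2 * d   # P patches attend to each other in each of T frames
    temporal = P * T ** 2 * d  # T frames attend to each other at each of P positions
    return spatial + temporal


P, d = 196, 768  # assumed values for illustration
for T in (8, 32, 128):
    ratio = full_cost(T, P, d) / factorized_cost(T, P, d)
    print(f"T={T:3d}  full / factorized = {ratio:.1f}")
```

Algebraically the ratio simplifies to $TP / (T + P)$, so the saving from factorization grows as either the number of frames or the number of patches per frame increases.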