Vision-Language Models | Haruk1y Wiki

📄️Vision-Language Models Overview

An overview of the foundation models that connect images and language, including CLIP, SigLIP, BLIP, LLaVA, and Grounding DINO.

Covers the architecture of CLIP, which aligns images and text through contrastive learning, and its zero-shot capabilities.

Covers the key features of SigLIP, which uses a sigmoid loss in place of softmax, and how it differs from CLIP.

Covers the structure of BLIP / BLIP-2, which are strong at captioning and VQA, and the Q-Former.

Traces the development of LLM-based VLMs such as LLaVA, Qwen-VL, InternVL, GPT-4V, and Gemini.

Traces the development of open-vocabulary object detection, including Grounding DINO, OWL-ViT, and GLIP.