[DL Reading Group] HyperSeg: Towards Universal Visual Segmentation with Large Language Model

February 13, 25

#Visual Segmentation #Large Language Model #Multimodal Learning #Video Segmentation #Computer Vision

Slide overview

Deep Learning JP

@DeepLearning2023

DL reading group materials

Text of each slide

DEEP LEARNING JP [DL Papers]
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
Hayashi, Kokusai Kogyo Co., Ltd.
http://deeplearning.jp/

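Bibliographic information
• Title: HyperSeg: Towards Universal Visual Segmentation with Large Language Model
• Venue: arXiv (late November 2024)
• Code: https://github.com/congvvc/HyperSeg
  - The training code has not been released yet (release is planned)
• Reasons for selecting this paper:
  - Interest in multimodal learning (especially the fusion of language and images)
  - A single network that handles a wide variety of tasks
  - High accuracy on a variety of benchmarks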

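Overview
• A single model handles many kinds of segmentation tasks, each driven by a different kind of prompt
  - Language knowledge is acquired from Visual Large Language Models (VLLMs)
  - A temporal adapter is proposed so that the model understands temporal information and can also handle video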

Prompt types supported by the method
• Text prompts
  - class names, reasoning questions, referring language
• Visual prompts
  - box, mask, point, etc.

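Related work
• VLLM
  - Usually outputs text describing the content of an image; cannot handle pixel-level perception
  - Representative methods: BLIP-2, Flamingo, MiniGPT-4, LLaVA, InstructBLIP, Qwen-VL, etc.
• Perception with VLLM
  - Grounding ability has been demonstrated by giving bounding boxes as prompts
  - Segmentation also becomes possible by attaching a mask decoder
  - PSALM was the first to introduce a VLLM, but it does not fully draw out the VLLM's capability
• Unified segmentation model
  - Mask2Former handles various segmentation tasks with a unified network, but has to be trained separately for each task
  - OpenSeeD adds a text encoder to support the open-set setting. UNINEXT uses a similar structure to support referring segmentation. However, handling complex text remains difficult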

Network overview of the proposed method
• Components: vision encoder, VLLM, segmentation predictor
• Input: vision-prompt pairs $(\mathcal{V}, \mathcal{P})$
• Output: segmentation masks, class scores, and (for video) instance embeddings, according to the input prompt

Prompt design
• Model input: vision-prompt pairs $(\mathcal{V}, \mathcal{P})$
• The prompt $\mathcal{P}$ is divided into text and visual prompts
  - $\mathcal{P}_{\mathcal{L}}$: which kind of task to perform (instruction)
  - $\mathcal{P}_{\mathcal{C}}$: the concrete task condition
  - Visual prompts are sampled from the CLIP visual features at their coordinates

Task families, specific tasks, and prompt examples
• class-based segmentation
  - Tasks: panoptic segmentation, open-vocabulary segmentation (OVS), video instance segmentation (VIS)
  - $\mathcal{P}_{\mathcal{L}}$: “Please segment all the positive objects according to the following potential categories.”
  - $\mathcal{P}_{\mathcal{C}}$: “[category 1, category 2, category 3, ...]”
• referring and reasoning segmentation
  - Tasks: referring expression segmentation (RES), reasoning segmentation, referring video object segmentation (R-VOS), ReasonVOS
  - $\mathcal{P}_{\mathcal{L}}$: “Can you perform referring or reasoning segmentation according to the language expression?”
  - $\mathcal{P}_{\mathcal{C}}$: “[referring / reasoning text]”
• visual-guided segmentation
  - Tasks: interactive segmentation, video object segmentation (VOS)
  - $\mathcal{P}_{\mathcal{L}}$: “Please segment according to the given visual region reference”
  - $\mathcal{P}_{\mathcal{C}}$: “[vision 1, vision 2, vision 3, ...]”
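The pairing of an instruction with a task condition can be sketched as a small prompt builder. This is only an illustration: the instruction strings are the ones quoted on the slide, while the `Prompt` class, the `build_prompt` function, and the dictionary keys are hypothetical names, not part of the HyperSeg code.

```python
from dataclasses import dataclass
from typing import List

# Instruction templates (P_L) quoted from the slide.
# The dictionary keys are hypothetical labels for the three task families.
INSTRUCTIONS = {
    "class_based": "Please segment all the positive objects according to "
                   "the following potential categories.",
    "referring_reasoning": "Can you perform referring or reasoning segmentation "
                           "according to the language expression?",
    "visual_guided": "Please segment according to the given visual region reference",
}


@dataclass
class Prompt:
    instruction: str  # P_L: which kind of task to perform
    condition: str    # P_C: the concrete task condition


def build_prompt(task: str, items: List[str]) -> Prompt:
    """Pair the task instruction with a bracketed, comma-separated condition."""
    if task not in INSTRUCTIONS:
        raise ValueError(f"unknown task family: {task}")
    return Prompt(INSTRUCTIONS[task], "[" + ", ".join(items) + "]")


prompt = build_prompt("class_based", ["person", "dog", "car"])
print(prompt.instruction)
print(prompt.condition)
```

For visual-guided tasks the items would stand for sampled visual features rather than words, so a real implementation would insert feature placeholders instead of plain strings.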

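Vision Encoder
• Prior methods often use only the features of the CLIP visual encoder
  - Issue: these carry too little information for fine-grained segmentation tasks
• A Fine-grained Visual Perceiver (FVP) is proposed to extract fine-grained visual information (used as input to the VLLM)
  - A pyramid vision encoder extracts features at different scales
  - The feature of each scale $f_{img}^{(i)}$ and the fine-grained tokens $P_j$ are aggregated through conditional weighted cross-attention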
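The idea of folding multi-scale features into a fixed set of tokens can be illustrated with plain cross-attention. This sketch is a simplification under stated assumptions: it has no learned projections and omits the conditional weighting mentioned on the slide, and the names `cross_attend` and `fuse_scales` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: Vector) -> Vector:
    peak = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(tokens: List[Vector], feats: List[Vector]) -> List[Vector]:
    """Each token (query) adds a softmax-weighted average of the features.

    Assumption: features act as both keys and values, with no learned projections.
    """
    dim = len(tokens[0])
    updated = []
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in feats])
        attended = [sum(w * key[i] for w, key in zip(weights, feats))
                    for i in range(dim)]
        updated.append([q + a for q, a in zip(query, attended)])  # residual update
    return updated


def fuse_scales(tokens: List[Vector], pyramid: List[List[Vector]]) -> List[Vector]:
    """Fold the features of every pyramid scale into the tokens, one scale at a time."""
    for feats in pyramid:
        tokens = cross_attend(tokens, feats)
    return tokens


# One token and a single scale holding one feature vector.
print(fuse_scales([[1.0, 0.0]], [[[0.0, 2.0]]]))
```

The number of tokens stays fixed regardless of image resolution, which is what makes this kind of token set convenient as LLM input.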

Visual Large Language Model
• An existing VLLM is used
  - It consists of a visual encoder (CLIP) and a lightweight LLM
• LLM inputs:
  - vision tokens $f_v$: taken from the CLIP encoder output (visual information about the whole image)
  - fine-grained tokens $P$: the output of the FVP
  - prompt tokens $\mathcal{P}$
• LLM outputs
  - prompt embedding
  - semantic recognition
  - mask tokens
  - fine-grained tokens
  → input to the segmentation predictor

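Hybrid Entity Recognition
• Segmentation through an LLM falls into three schools
  ① The LLM generates both the classes and the masks: tends to produce many misses and false detections
  ② A mask decoder estimates the classes and masks (the LLM only embeds the prompt tokens): does not exploit the strong semantic ability of the LLM
  ③ This paper proposes a hybrid scheme: the prompt embedding is decoded, and the classes of all objects in the input image and their mask tokens are generated separately
• Mask tokens are obtained together with the corresponding semantic information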

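Segmentation predictor
• The basic structure follows Mask2Former
  - Masks and classification scores are estimated from three inputs
    • task-specific prompt embeddings $\{E_{\mathcal{P}}^{k}\}_{k=1}^{K}$, where $K$ is the number of categories
    • semantically enhanced mask tokens $\{E_{\mathcal{Q}}^{j}\}_{j=1}^{N}$, where $N$ is the number of predicted masks
    • multi-scale visual features $f_{img}$
  - For video, an instance embedding $e$ is also estimated
    • Video is segmented frame by frame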
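How the three inputs could combine into masks and class scores can be sketched in the style of a Mask2Former head: a mask comes from the dot product between a mask token and each pixel feature, and class scores come from the similarity between the mask token and the prompt embeddings. The similarity-based scoring is an assumption for illustration, and `predict_mask` and `class_scores` are hypothetical names.

```python
from math import exp
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def predict_mask(mask_token: Vector,
                 pixel_feats: List[List[Vector]]) -> List[List[bool]]:
    """Binary mask from the dot product of one mask token with each pixel feature.

    A pixel is foreground when its logit is positive, which is equivalent
    to sigmoid(logit) > 0.5.
    """
    return [[dot(mask_token, px) > 0.0 for px in row] for row in pixel_feats]


def class_scores(mask_token: Vector, prompt_embeds: List[Vector]) -> Vector:
    """Softmax over similarities between one mask token and K prompt embeddings."""
    logits = [dot(mask_token, e) for e in prompt_embeds]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: a 1x2 "image" with 2-dimensional features and K = 2 categories.
token = [1.0, 0.0]
features = [[[2.0, 0.0], [-2.0, 0.0]]]
prompts = [[1.0, 0.0], [0.0, 1.0]]

mask = predict_mask(token, features)
scores = class_scores(token, prompts)
print(mask)
print(scores.index(max(scores)))
```

Because the categories enter only through the prompt embeddings, the same head can score an arbitrary category list, which fits the open-vocabulary tasks listed earlier.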

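Temporal Adapter
• When processing video, consistency between frames has to be maintained
• This paper proposes global prompt aggregation and local space-time information injection
  - Global: the prompt embeddings of all previous frames are pooled
  - Local: the fine-grained tokens are updated from those of the single previous frame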
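The global prompt aggregation step can be sketched as pooling over a running history of prompt embeddings. The sketch covers only the global step and rests on two assumptions the slide does not state: the pooling is a mean, and the pooled vector is added to the current embedding. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(embeds: List[Vector]) -> Vector:
    """Element-wise mean over a non-empty list of equally sized vectors."""
    count = len(embeds)
    return [sum(values) / count for values in zip(*embeds)]


def aggregate_global(history: List[Vector], current: Vector) -> Vector:
    """Add the pooled prompt embedding of all previous frames to the current one.

    Assumption: mean pooling followed by addition. The first frame has no
    history and is returned unchanged.
    """
    if not history:
        return list(current)
    pooled = mean_pool(history)
    return [c + p for c, p in zip(current, pooled)]


# Frame-by-frame loop: each frame sees the pooled embeddings of earlier frames.
history: List[Vector] = []
for frame_embed in ([1.0, 3.0], [3.0, 5.0], [1.0, 1.0]):
    fused = aggregate_global(history, frame_embed)
    history.append(frame_embed)  # store the raw embedding for later frames
    print(fused)
```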

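Training objective
• All tasks are trained with a common loss function
  - $\mathcal{L}_{text}$: autoregressive cross-entropy loss for text prediction
    • The paper gives no details. Judging from the supplementary material, tasks such as image captioning and VQA seem to be used to preserve visual-language understanding (?)
  - $\mathcal{L}_{mask}$: mask prediction loss
  - $\mathcal{L}_{cls}$: cross-entropy loss for category classification
  - $\mathcal{L}_{ins}$: contrastive loss for instance association (video only)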
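One way to read the list above is as a weighted sum of the four terms. The weights $\lambda$ are an assumption added for illustration; the slide does not say how the terms are balanced.

```latex
% Assumed form: a weighted sum; the \lambda weights are not given on the slide.
\mathcal{L}
  = \lambda_{text}\,\mathcal{L}_{text}
  + \lambda_{mask}\,\mathcal{L}_{mask}
  + \lambda_{cls}\,\mathcal{L}_{cls}
  + \lambda_{ins}\,\mathcal{L}_{ins}
```

For image-only batches the last term would drop out, since $\mathcal{L}_{ins}$ applies only to video.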

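Experimental setup
• A total of 10 tasks are trained jointly
  - Each task is trained for about 16k iterations
  - The VLLM is Mipha-3B; the LLM part is trained with LoRA
    • The LLM is Phi-2-2.7B; the visual encoder is SigLIP
  - The segmentation predictor is Mask2Former
  - Trained on 8 NVIDIA A100 GPUs (batch size = 32)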

Experimental results: Referring expression segmentation
• Achieves SOTA on RefCOCO/+/g
• Also effective on the harder generalized referring expression segmentation task
  - Evaluated in a zero-shot setting

Experimental results: Reasoning segmentation
• Achieves SOTA in both the video and image domains

Experimental results: Generic image segmentation
• Effective on both closed-set and open-vocabulary segmentation

Experimental results: Common video segmentation
• Evaluated on:
  - visual-prompted semi-supervised VOS (DAVIS17)
  - text-prompted referring video object segmentation (Ref-YouTube-VOS, Ref-DAVIS17)
  - video instance segmentation (YouTube-VIS 2019)

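Experimental results: Ablations
• Training multiple tasks jointly improves model performance
  - For video segmentation tasks in particular, also training on image segmentation has a large effect
• The proposed method is also effective with other LLMs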

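Experimental results: Ablations
• Confirms the effectiveness of the proposed FVP and HER
• For video, confirms the effect of global prompt aggregation (pooling information from all frames) and local space-time information injection (updating with information from the previous frame)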

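Summary
• Proposes a method that uses a VLLM to handle a variety of segmentation tasks with a single model
  - Uses visual information at different scales
  - Devises the output format of the VLLM
  - Handles both images and video; achieves SOTA on several image-related tasks in particular
• Impressions
  - Combines existing models well, handling a variety of tasks with a single, relatively small model
  - One could also argue that the LLM was fitted to the Mask2Former design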