[DL Reading Group] Imagine yourself: Tuning-Free Personalized Image Generation

November 07, 24

#PersonalizedImageGeneration #TuningFree #DiffusionModels #SyntheticData #MultiStageFineTuning

Slide overview

Deep Learning JP

@DeepLearning2023

DL reading group materials

Text of each page

DEEP LEARNING JP [DL Papers]
Imagine yourself: Tuning-Free Personalized Image Generation
Hiroto Osaka, Matsuo Iwasawa Lab, B4
http://deeplearning.jp/

Paper Info
- Title: Imagine yourself: Tuning-Free Personalized Image Generation
- Author: GenAI, Meta
  - Zecheng He, Bo Sun, Felix Juefei-Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Harihar Subramanyam, Alireza Zareian, Li Chen, Ankit Jain, Ning Zhang, Peizhao Zhang, Roshan Sumbaly, Peter Vajda, Animesh Sinha
- Abstract
  - Proposes a tuning-free method for personalized image generation
  - Relies on synthetic paired data generation and multi-stage fine-tuning
  - Outperforms state-of-the-art models and also scores well in human evaluation
[1] Imagine yourself: Tuning-free personalized image generation

Background ▍ Personalized Image Generation
- Diffusion models perform very well on image generation tasks
- The goal is to output a specific identity rendered so that it reflects the input context
- Since DreamBooth [3] and Textual Inversion [4], more than 100 studies have been published as papers
[2] A Survey on Personalized Content Synthesis with Diffusion Models
[3] Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Background ▍ Tuning-based Personalization Models
- Require training for each subject
  - Costly, and therefore inefficient
- DreamBooth [3], Textual Inversion [4]
  - Express the individual subject with a unique text token
- LoRA [5]
  - Tunes only lightweight low-rank adapters
[2] A Survey on Personalized Content Synthesis with Diffusion Models
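
To make "tunes only lightweight low-rank adapters" concrete, here is a minimal NumPy sketch of a LoRA-style linear layer as described in [5]. The layer sizes, rank, and scaling factor are hypothetical values chosen for illustration; they are not taken from the slides or from the Imagine yourself paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

# Frozen pretrained weight: never updated during personalization.
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors. B starts at zero, so the adapter is a no-op at init.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))


def lora_forward(x, W, A, B, alpha, rank):
    """Return x W^T + (alpha / rank) * x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
y0 = lora_forward(x, W, A, B, alpha, rank)

full_params = W.size           # parameters in the frozen layer
lora_params = A.size + B.size  # parameters that would actually be trained
```

With these sizes the adapter holds 512 trainable values against 4096 in the frozen weight, which is the source of the efficiency the slide refers to.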

Background ▍ Tuning-free Personalization Models
- Use a single shared model, with no per-subject tuning
- Main methods: PhotoMaker [6], InstantID [7], and others
- Strong copy-and-paste effect
  - Hard to follow instructions that differ substantially from a particular pose
  - Reduced diversity of output images
[8] IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models
[2] A Survey on Personalized Content Synthesis with Diffusion Models

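Method ▍ Design Goals
- Identity preservation
- Prompt alignment
- Visual appeal
Image Encoder: feeds the reference through a trainable encoder as an additional signal
Parallel Cross Attention: incorporates identity information and text information efficiently
LoRA Adapter: tunes efficiently while preserving the quality of the original model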

Method ▍ Approach
- Performance is improved mainly through the following three elements
Dataset / Model Architecture / Training Strategy

Method ▍ Dataset
- Creating synthetic data addresses the shortcomings of conventional datasets

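Method ▍ Synthetic Paired Data (SynPairs)
- Dataset creation flow
  1. Caption generation (VLM)
  2. Caption rewriting (Llama3)
  3. Image generation (Emu)
  4. Refinement of the generated images
  5. Similarity-based filtering
- Yields high-quality paired data of the same person under different expressions, poses, and lighting conditions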
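
The final step of the flow above, similarity-based filtering, can be sketched as follows. The slide does not say which similarity measure, embedding model, or threshold is used, so this sketch assumes cosine similarity between identity embeddings produced by some face encoder; `filter_pairs` and the threshold of 0.6 are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def filter_pairs(pairs, threshold=0.6):
    """Keep (reference, generated) embedding pairs whose similarity >= threshold.

    The threshold is an assumed value, not one reported in the slides.
    """
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= threshold]


# Toy embeddings standing in for the output of a face encoder.
pairs = [
    (np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])),  # nearly same direction
    (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])),  # orthogonal
]
kept = filter_pairs(pairs)
```

Only the first toy pair survives the filter; in a real pipeline the rejected pairs would be those where the generated face drifted too far from the reference identity.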

Method ▍ Model Architecture
- Proposes an architecture designed to preserve identity information better

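Method ▍ Fully Parallel Image-Text Fusion
- Text Encoders
  - CLIP: shares an embedding space with the Vision Encoder
  - UL2: understands long and complex prompts
  - ByT5: encodes the text (byte-level encoder)
- Fully Parallel Image-Text Fusion
  - Fuses the text and image conditions in parallel
- LoRA
  - The attention layers of the base U-Net are kept frozen
  - Speeds up convergence by up to 5x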
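
A rough sketch of what fusing conditions "in parallel" can look like: each condition gets its own cross-attention branch and the branch outputs are summed, instead of concatenating all condition tokens into a single sequence. The slides do not give the exact fusion rule, so the summation, the single attention head, and all shapes below are assumptions made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, ctx, Wk, Wv):
    """Single-head cross-attention: queries q attend to context tokens ctx."""
    k, v = ctx @ Wk, ctx @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def parallel_fusion(q, contexts, weights):
    """Run one cross-attention branch per condition and sum the outputs (assumed rule)."""
    return sum(
        cross_attention(q, ctx, Wk, Wv)
        for ctx, (Wk, Wv) in zip(contexts, weights)
    )


rng = np.random.default_rng(0)
d = 16
q = rng.normal(size=(10, d))        # 10 latent query tokens
text_ctx = rng.normal(size=(7, d))  # tokens from a text encoder
image_ctx = rng.normal(size=(4, d)) # tokens from the vision encoder
weights = [(rng.normal(size=(d, d)), rng.normal(size=(d, d))) for _ in range(2)]

fused = parallel_fusion(q, [text_ctx, image_ctx], weights)
```

Because each branch has its own key and value projections, a new image branch can be trained alongside an existing text branch without changing how the text tokens are attended to.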

Method ▍ Training Strategy
- Training is split into multiple stages so that each phase has its own role

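Method ▍ Multi-Stage Finetune
- Stage 1: pretraining on real data
  - Acquires the ability to generate images conditioned on a reference image
- Stage 2: pretraining on synthetic data
  - Aims to improve prompt alignment
  - Identity preservation does not improve much
- Stage 3: fine-tuning on high-quality real data
  - Improves identity preservation
  - Prompt alignment drops slightly
- Stage 4: fine-tuning on high-quality synthetic data
  - Optimizes the balance between identity preservation and prompt alignment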
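
The four stages above can be written down as a small schedule that a training driver walks through in order. This is only an organizational sketch: `Stage`, `STAGES`, `run_schedule`, and the placeholder trainer are hypothetical names, and no hyperparameters are implied because the slides give none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str     # "real" or "synthetic"
    quality: str  # "standard" or "high"
    goal: str


# Stage order and goals follow the slide; the names are illustrative.
STAGES = [
    Stage("stage1_pretrain", "real", "standard", "learn reference-conditioned generation"),
    Stage("stage2_pretrain", "synthetic", "standard", "improve prompt alignment"),
    Stage("stage3_finetune", "real", "high", "improve identity preservation"),
    Stage("stage4_finetune", "synthetic", "high", "balance identity and prompt alignment"),
]


def run_schedule(stages, train_fn):
    """Call train_fn once per stage, in order, threading the model state through."""
    state = None
    for stage in stages:
        state = train_fn(state, stage)
    return state


# Placeholder trainer: it only records which stages ran, in place of real training.
history = run_schedule(STAGES, lambda state, stage: (state or []) + [stage.name])
```

The alternation between real and synthetic data mirrors the trade-off on the slide: real data pulls toward identity preservation, synthetic data toward prompt alignment.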

Experiments ▍ Qualitative Evaluation
- Generates images that follow the prompt faithfully while preserving identity
[Figure panels: Single Subject, Multi Subject]

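Experiments ▍ Quantitative Evaluation (Human Evaluation)
- A dedicated evaluation dataset was created
  - 51 identities and a list of 65 prompts
- Compared against state-of-the-art adapter-based and control-based models
- Models are rated along three evaluation axes
  - Control-based models have a strong copy-and-paste effect, so their ratings come out higher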

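Experiments ▍ Ablation Study
- Multi-Stage Fine-tuning
  - Synthetic data: effective for prompt alignment
  - Real data: effective for identity preservation
- Effect of synthetic paired data
  - Contributes substantially to prompt alignment
  - Faces in synthetic pairs do not match exactly, so the identity preservation rating drops
- Parallel attention
  - Improves performance compared with standard token concatenation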

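Future Work & Conclusion
▍ Future Work
- Extension to video generation
- Better adherence to prompts that involve complex poses
▍ Conclusion
- A new synthetic paired data generation mechanism that promotes image diversity
- A fully parallel attention architecture with three text encoders and a trainable vision encoder
- A new coarse-to-fine multi-stage fine-tuning method that progressively improves visual quality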

References
1. He, Z., Sun, B., Juefei-Xu, F., Ma, H., Ramchandani, A., Cheung, V., ... & Sinha, A. (2024). Imagine yourself: Tuning-free personalized image generation. arXiv preprint arXiv:2409.13346.
2. Zhang, X., Wei, X., Zhang, W., Wu, J., Zhang, Z., Lei, Z., & Li, Q. (2024). A Survey on Personalized Content Synthesis with Diffusion Models. arXiv preprint arXiv:2405.05538.
3. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22500-22510).
4. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
5. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
6. Li, Z., Cao, M., Wang, X., Qi, Z., Cheng, M. M., & Shan, Y. (2024). Photomaker: Customizing realistic human photos via stacked id embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8640-8650).
7. Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A., Li, H., ... & Hu, Y. (2024). Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519.
8. Cui, S., et al. (2024). IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.