[DL Paper Reading Group] StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

March 08, 17

Slide overview

2017/3/8
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Text of each page

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks M1 Shota SUGIHARA

Bibliographic information
• StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
• arXiv (https://arxiv.org/abs/1612.03242)
• Submitted on 10 Dec 2016
• Authors: Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, Dimitris Metaxas
• Reason for selection: interest in generative models

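Overview
• By training a stacked (multi-stage) GAN, 256×256-pixel images are generated from a text description alone.
• The GAN is split into two stages.
  • Stage-I GAN: generates a low-resolution base image from the given description and noise.
  • Stage-II GAN: conditions on the description again and generates a high-resolution image that corrects the defects left by Stage-I.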

Implementation

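Stage-I GAN
• The description is converted into a text embedding $\varphi_t$.
  • The embedding is high-dimensional ($\varphi_t$ has more than 100 dimensions).
  • This makes the manifold of the latent variable discontinuous, which is undesirable for training.
• Conditioning Augmentation
  • The conditioning variable is sampled randomly from the Gaussian distribution $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$.
• Loss function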
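
The Conditioning Augmentation step can be illustrated with a minimal NumPy sketch. It assumes a diagonal covariance and uses plain linear maps (`w_mu`, `w_logvar`, both hypothetical) as stand-ins for the learned layers that produce the mean and variance; the 1024-dimensional embedding and 128-dimensional conditioning vector are illustrative sizes, not values taken from the slides. The KL term against a standard normal is the usual regularizer for this kind of sampling and is included here as an assumption about how the loss is completed.

```python
import numpy as np

def conditioning_augmentation(phi_t, w_mu, w_logvar, rng):
    """Sample c_hat ~ N(mu(phi_t), diag(sigma(phi_t)^2)) via the reparameterization trick."""
    mu = phi_t @ w_mu                    # mean of the conditioning Gaussian
    logvar = phi_t @ w_logvar            # log of the diagonal variance
    eps = rng.standard_normal(mu.shape)  # noise drawn from N(0, I)
    c_hat = mu + np.exp(0.5 * logvar) * eps
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the conditioning dimensions
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return c_hat, kl

rng = np.random.default_rng(0)
phi_t = rng.standard_normal((4, 1024))            # batch of 4 hypothetical text embeddings
w_mu = rng.standard_normal((1024, 128)) * 0.01    # hypothetical weights for the mean
w_logvar = rng.standard_normal((1024, 128)) * 0.01  # hypothetical weights for the log-variance
c_hat, kl = conditioning_augmentation(phi_t, w_mu, w_logvar, rng)
print(c_hat.shape, kl.shape)
```

Because a fresh `eps` is drawn on every call, the same sentence yields slightly different conditioning vectors, which is what smooths the conditioning manifold.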

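Stage-II GAN
• Generates a high-resolution image from the low-resolution Stage-I image.
• The text embedding $\varphi_t$ is added as a condition again, to correct the distortions and missing information in the Stage-I image.
• Loss function
  • $s_0$ is the image generated by Stage-I.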
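
The slide text does not carry the equation itself. As a sketch that follows the notation of the arXiv paper rather than this transcript, the Stage-II objective can be written as below. Here $s_0 = G_0(z, \hat{c}_0)$ is the Stage-I output, $\hat{c}$ is the conditioning variable sampled by Conditioning Augmentation, and $\lambda$ weights the KL regularizer. The discriminator $D$ is trained to maximize $\mathcal{L}_D$ and the generator $G$ to minimize $\mathcal{L}_G$. This is a reconstruction and should be checked against the paper.

```latex
\begin{aligned}
\mathcal{L}_{D} &= \mathbb{E}_{(I,t)\sim p_{\text{data}}}\bigl[\log D(I,\varphi_t)\bigr]
  + \mathbb{E}_{s_0\sim p_{G_0},\,t\sim p_{\text{data}}}\bigl[\log\bigl(1 - D(G(s_0,\hat{c}),\varphi_t)\bigr)\bigr] \\
\mathcal{L}_{G} &= \mathbb{E}_{s_0\sim p_{G_0},\,t\sim p_{\text{data}}}\bigl[\log\bigl(1 - D(G(s_0,\hat{c}),\varphi_t)\bigr)\bigr]
  + \lambda\, D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu(\varphi_t),\Sigma(\varphi_t))\,\|\,\mathcal{N}(0,I)\bigr)
\end{aligned}
```

The Stage-I objective has the same shape, with the noise $z$ taking the place of $s_0$ and the Stage-I networks $G_0$, $D_0$ taking the place of $G$, $D$.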

Experiments
• Two datasets are used for evaluation.
  • Caltech-UCSD Birds (CUB): 11,788 images of 200 bird species
  • Oxford-102: 8,189 images of 102 flower categories
• Baselines: GAN-INT-CLS, GAWWN
• Quantitative evaluation: inception score, human rank (10 evaluators)
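
For reference, the inception score is commonly defined as $\exp\bigl(\mathbb{E}_x\, D_{\mathrm{KL}}(p(y \mid x)\,\|\,p(y))\bigr)$, where $p(y \mid x)$ is a classifier's class distribution for a generated image $x$. The sketch below computes it from a matrix of class probabilities; the classifier itself is not shown, and the toy inputs are hand-made illustrations rather than real model outputs.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array of class probabilities p(y|x) for N generated images."""
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    # Per-image KL divergence between p(y|x) and p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy inputs (hypothetical, not real classifier outputs)
uniform = np.full((6, 3), 1.0 / 3.0)     # every image is ambiguous between all classes
one_hot = np.eye(3)[[0, 1, 2, 0, 1, 2]]  # confident predictions spread evenly over classes
print(inception_score(uniform))
print(inception_score(one_hot))
```

The ambiguous case gives a score close to 1 (the minimum), while confident and diverse predictions over 3 classes give a score close to 3 (the maximum for 3 classes), so a higher score indicates images that are both recognizable and varied.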

Comparison results: CUB

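Comparison results: CUB
• GAN-INT-CLS captures only rough characteristics; its images are neither realistic nor of sufficient resolution.
• GAWWN produces better results by adding extra conditioning variables, but cannot generate plausible images when conditioned on the text description alone.
• StackGAN succeeds in generating realistic 256×256-pixel images from the text description alone.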

Comparison results: Oxford-102

Comparison results
• StackGAN achieved the best result on both inception score and human rank.

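Results: Stage-I vs. Stage-II
• Stage-I captures roughly plausible colors and shapes, but details are missing or wrong.
• Stage-II corrects the details and generates images that reflect the description more closely.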

Results: comparison with training data
• For each generated image, the closest training images were retrieved by L2 distance and compared with it.
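
The retrieval step can be sketched as a brute-force nearest-neighbor search. The slide does not say which representation the L2 distance is computed on, so the vectors here are hypothetical stand-ins (they could be flattened pixels or extracted features).

```python
import numpy as np

def nearest_training_indices(generated, training, k=1):
    """Return indices of the k training vectors closest in L2 distance to each generated vector."""
    g = np.asarray(generated, dtype=np.float64)
    t = np.asarray(training, dtype=np.float64)
    # Pairwise squared L2 distances, shape (num_generated, num_training)
    d2 = ((g[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1)
    # Sorting by squared distance gives the same order as sorting by distance
    return np.argsort(d2, axis=1)[:, :k]

# Toy 2-D vectors standing in for image representations (hypothetical)
training = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
generated = np.array([[0.9, 1.2], [4.0, 4.5]])
print(nearest_training_indices(generated, training, k=2))
```

If the retrieved training images look clearly different from the generated one, that is evidence the generator is not simply memorizing training samples.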

Validation: Component analysis
• Validation of the proposed components
• Conditioning Augmentation

Validation: Sentence embedding interpolation
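
A minimal sketch of sentence embedding interpolation, assuming simple linear interpolation between two embeddings; the 3-dimensional vectors are toy values, and the generator that would turn each interpolated embedding into an image is not shown.

```python
import numpy as np

def interpolate_embeddings(phi_a, phi_b, steps=5):
    """Linearly interpolate between two sentence embeddings, endpoints included."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]  # interpolation weights in [0, 1]
    return (1.0 - alphas) * phi_a[None, :] + alphas * phi_b[None, :]

# Toy embeddings for two descriptions (hypothetical values)
phi_a = np.array([1.0, 0.0, 2.0])
phi_b = np.array([0.0, 4.0, 2.0])
path = interpolate_embeddings(phi_a, phi_b, steps=5)
print(path)
```

Each row of `path` would be fed to the generator in turn. Holding the noise input fixed across rows is a natural choice, since any change in the image then comes from the text condition alone.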

Failure cases
• The authors attribute these to Stage-I failing to capture the features.

Failure cases (continued)
• The authors attribute these to Stage-I failing to capture the features.

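Summary
• StackGAN was proposed for photo-realistic image generation.
• By splitting generation into two stages, Stage-I captures the rough features from the description and Stage-II refines them, which yields sharp images.
• The proposed method was shown to outperform existing methods both qualitatively and quantitatively.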