[DL Reading Group] MoCoGAN: Decomposing Motion and Content for Video Generation


September 11, 2017


Slide summary

2017/9/11
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each slide

DEEP LEARNING JP [DL Papers]
MoCoGAN: Decomposing Motion and Content for Video Generation
Kei Akuzawa, Matsuo Lab M1
http://deeplearning.jp/

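Bibliographic information
• arXiv, 2017/07
• Authors: Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
• Why this paper was chosen:
– The generated videos look far more realistic than those of prior work
– The idea is elegant
– I happened to be implementing it at the time (I want to automate in-betweening for animation)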

MoCoGAN: https://github.com/sergeytulyakov/mocogan
VGAN: http://carlvondrick.com/tinyvideo/

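Abstract
• A video can be thought of as a combination of motion and content
• The input noise to the generator is split into a motion part and a content part (the paper's novelty)
• As a result, the generated videos are cleaner, and it becomes possible to fix the content and change only the motion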

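Introduction
• Why video generation is considered harder than image generation:
– The model must learn not only (2D) appearance but also (3D) physical structure
– Time introduces many variations of motion; for example, a squat performed slowly differs from one performed quickly
– The human eye is sensitive to motion
• The key is how to incorporate "time (motion)" into the model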

Related work
• Future frame prediction:
– Predict future frames conditioned on past frames
– This family splits further into two lines of work
  • Predict raw pixels from past frames
    – Decomposing Motion and Content for Natural Video Sequence Prediction (ICLR 2017), among others
  • Compose future frames by reshuffling the pixels of past frames
    – Unsupervised Learning for Physical Interaction through Video Prediction (NIPS 2016), among others
• GAN-based:
– Generating Videos with Scene Dynamics (NIPS 2016)
– Temporal Generative Adversarial Nets with Singular Value Clipping (ICCV 2017)
• Each paper devises its own way of modeling time

Decomposing Motion and Content for Natural Video Sequence Prediction [Villegas 2017] (MCnet)
• The method is entirely different from MoCoGAN, but it shares the idea of separating motion and content
• Predicts the frame at time t+1 from the frames up to time t
– x_t is treated as the content
– x_t - x_{t-1} is treated as the motion
• Demo:
– https://sites.google.com/a/umich.edu/rubenevillegas/iclr2017
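As a rough illustration of the content/motion split described for MCnet, the sketch below treats the current frame as content and the frame difference as motion. The function name `split_content_motion` and the tiny grayscale frames are hypothetical; MCnet itself feeds these signals into learned encoders, which are not shown here.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def split_content_motion(prev_frame: Frame, curr_frame: Frame) -> Tuple[Frame, Frame]:
    """Return (content, motion): content is x_t, motion is x_t - x_{t-1}."""
    content = [row[:] for row in curr_frame]
    motion = [
        [c - p for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(curr_frame, prev_frame)
    ]
    return content, motion


# Toy example: a single bright pixel moves one step to the right.
x_prev = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x_curr = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
content, motion = split_content_motion(x_prev, x_curr)
```

The motion signal is non-zero only where the pixel left and where it arrived, while the content is simply the latest frame.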


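Unsupervised Learning for Physical Interaction through Video Prediction [Finn 2016]
• Creates a new frame by shuffling the pixels of past frames
• Convolves the image with a convolutional LSTM to produce filters, then applies those filters to the original image to reconstruct the pixels (my understanding here is shallow)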

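Generating Videos with Scene Dynamics [Vondrick 2016] (VGAN)
• Splits a video into a foreground (moving) and a background (static)
– "The background is fixed" is a strong assumption (e.g., camera shake)
• Generates both from the same noise via deconvolution and takes a weighted average
• Can also be conditioned on an image to perform future prediction
• Personal view
– Judging from the lower-left figure, foreground generation does not work well. Does handling content and motion with a single noise vector make the model more complex?
– Adding images together after generating them may be a bad idea (it seems sensitive to misalignment)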

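Temporal Generative Adversarial Nets with Singular Value Clipping [M. Saito, Matsumoto, S. Saito 2017] (Temporal GAN)
• Criticizes 3D convolution (the differing characteristics of time and space should be taken into account)
– Apparently the same point is made in video recognition research
– However, the discriminator here still uses 3D convolution; only the generator has a special design
• A temporal generator produces one latent variable per frame, and an image generator produces each frame from them
• Intermediate frames between two generated frames can also be generated easily
• Improves WGAN (Singular Value Clipping) to stabilize training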

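Proposed Model: Abstract
• Criticism of VGAN and Temporal GAN
– Mapping a whole video to a single point in latent space goes too far
  • When the same action is performed at different speeds, the videos are mapped to different points in latent space
  • The generated videos have a fixed length
• Proposed method
– Generate an image from each point in latent space and join the images into a video
– Split the latent space into a motion subspace and a content subspace
  • The content variable is fixed within a video
  • The motion variable changes (sequentially) within a video
– Results
  • The same action at different speeds can be handled by changing how fast the motion variable changes
  • Videos of arbitrary length can be generated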

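Proposed Model: Architecture
Generator
- The latent variable z is the concatenation of content (z_C) and motion (z_M)
- z_C is fixed within a single video
- z_M is generated by a GRU
- One image is generated from each z^k (2D CNN)
Discriminator
- D_I discriminates images (2D CNN)
- D_V discriminates videos (3D CNN)
- Prior work (VGAN, Temporal GAN) uses only D_V. Delegating image realism to D_I lets D_V focus on the realism of the dynamics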
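The generator's latent layout can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `TinyGRUCell` class uses random, untrained weights and no biases, feeding fresh Gaussian noise to the GRU at every step is an assumption, and the 50/10 split of content and motion dimensions is a hypothetical choice. The image generator (2D CNN) that would map each z^k to a frame is omitted.

```python
import math
import random
from typing import List

Vector = List[float]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRUCell:
    """Minimal GRU cell with random, untrained weights (illustration only)."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random):
        def matrix(rows: int, cols: int) -> List[Vector]:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        cat = input_dim + hidden_dim
        self.w_update = matrix(hidden_dim, cat)
        self.w_reset = matrix(hidden_dim, cat)
        self.w_cand = matrix(hidden_dim, cat)

    @staticmethod
    def _matvec(w: List[Vector], v: Vector) -> Vector:
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

    def step(self, x: Vector, h: Vector) -> Vector:
        xh = x + h  # list concatenation of input and hidden state
        z = [_sigmoid(a) for a in self._matvec(self.w_update, xh)]
        r = [_sigmoid(a) for a in self._matvec(self.w_reset, xh)]
        x_rh = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in self._matvec(self.w_cand, x_rh)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def sample_latent_sequence(num_frames: int, content_dim: int, motion_dim: int,
                           rng: random.Random) -> List[Vector]:
    """Return one latent vector z^k = [z_C, z_M^k] per frame."""
    z_c = [rng.gauss(0.0, 1.0) for _ in range(content_dim)]  # sampled once per video
    gru = TinyGRUCell(motion_dim, motion_dim, rng)           # would be learned in practice
    h = [rng.gauss(0.0, 1.0) for _ in range(motion_dim)]     # initial hidden state
    latents = []
    for _ in range(num_frames):
        eps = [rng.gauss(0.0, 1.0) for _ in range(motion_dim)]  # per-step noise (assumption)
        h = gru.step(eps, h)                                    # z_M^k is the GRU state
        latents.append(z_c + h)
    return latents


latents = sample_latent_sequence(num_frames=16, content_dim=50, motion_dim=10,
                                 rng=random.Random(0))
```

Because the loop can run for any number of steps, the same code yields latent sequences of any length, which matches the slide's point about arbitrary-length videos.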

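Proposed Model: Training
(Slide figures: loss function, Update)
• The loss sums the terms for D_V and D_I
• One-sided label smoothing trick [Salimans 2016], [Szegedy 2015]
• Technique for producing variable-length videos
– Build the empirical distribution of video lengths
– Sample a video length from that distribution
– Crop a fixed-length clip from the generated variable-length video and pass it to D_V
  • Note that D_V is a 3D CNN and therefore accepts only fixed-length input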
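The variable-length procedure above can be sketched in a few lines. All names here (`build_length_distribution`, `sample_video_length`, `crop_fixed_length`), the example training lengths, and the clip length of 16 are hypothetical; the list of strings stands in for frames produced by the generator.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def build_length_distribution(train_lengths: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Empirical distribution of video lengths as (lengths, counts)."""
    counts = Counter(train_lengths)
    lengths = sorted(counts)
    weights = [counts[n] for n in lengths]
    return lengths, weights


def sample_video_length(lengths: List[int], weights: List[int], rng: random.Random) -> int:
    """Draw one video length in proportion to how often it occurs in training data."""
    return rng.choices(lengths, weights=weights, k=1)[0]


def crop_fixed_length(video: Sequence, clip_len: int, rng: random.Random) -> List:
    """Cut a random run of clip_len consecutive frames for the video discriminator."""
    if len(video) < clip_len:
        raise ValueError("video is shorter than the clip length")
    start = rng.randint(0, len(video) - clip_len)  # inclusive on both ends
    return list(video[start:start + clip_len])


rng = random.Random(0)
lengths, weights = build_length_distribution([20, 24, 24, 32, 40])
n_frames = sample_video_length(lengths, weights, rng)
fake_video = [f"frame_{k}" for k in range(n_frames)]  # stand-in for generator output
clip = crop_fixed_length(fake_video, clip_len=16, rng=rng)
```

Only the cropped clip goes to D_V; individual frames could still be passed to D_I regardless of video length.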

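Supplement: One-sided label smoothing trick [Salimans 2016], [Szegedy 2015]
• If the predicted label D(x) takes extreme values, the model overfits easily, which is undesirable.
• The optimal discriminator for a fixed generator is smoothed as follows (equation figure in the original slide)
• However, having p_{model} in the numerator is a problem
– Where p_{data} is close to 0 and p_{model} assigns high probability, the generator counts as fooling the discriminator well, which removes the generator's incentive to move
• In the end, it is done as follows (equation figure in the original slide)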
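For reference, Salimans et al. (2016) give the optimal discriminator under label smoothing as D*(x) = (α p_data(x) + β p_model(x)) / (p_data(x) + p_model(x)), where α is the target for real samples and β the target for fake samples; the one-sided variant sets β = 0. The sketch below evaluates this expression and a matching discriminator loss. The function names and the values α = 0.9, β = 0.1 are illustrative assumptions, not taken from the slides.

```python
import math


def optimal_discriminator(p_data: float, p_model: float,
                          alpha: float = 1.0, beta: float = 0.0) -> float:
    """Optimal D(x) when real targets are alpha and fake targets are beta."""
    return (alpha * p_data + beta * p_model) / (p_data + p_model)


def discriminator_loss(d_real: float, d_fake: float, alpha: float = 0.9) -> float:
    """Binary cross-entropy with real targets smoothed to alpha, fake targets kept at 0."""
    real_term = -(alpha * math.log(d_real) + (1.0 - alpha) * math.log(1.0 - d_real))
    fake_term = -math.log(1.0 - d_fake)
    return real_term + fake_term


# A region where real data is absent but the generator puts mass.
two_sided = optimal_discriminator(p_data=0.0, p_model=1.0, alpha=0.9, beta=0.1)
one_sided = optimal_discriminator(p_data=0.0, p_model=1.0, alpha=0.9, beta=0.0)
```

With two-sided smoothing the discriminator is pushed toward a non-zero score on purely generated regions, whereas the one-sided version keeps that score at zero, so bad samples still receive a signal to move toward the data.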

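Proposed Model: Action Conditioned
• Following text-to-image [Reed 2016], the model can be extended so that it is conditioned on an action
– Is the embedded label (z_A) concatenated with the input noise? (my guess)
• The action is considered to affect both motion and content (discussed later)
– Example: basketball and hockey have different uniforms
• The discriminator distinguishes real/fake and the action label at the same time
– Auxiliary classifier GAN [Odena 2016]??
– Improved Techniques for Training GANs [Salimans 2016]??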

Supplement: Conditioning GANs
Figures are from Sricharan 2017 ( https://arxiv.org/abs/1708.05789 )
• D receives the label as input
• D predicts the label
– Auxiliary classifier GAN [Odena 2016]: D outputs real/fake and the label separately
– Improved Techniques for Training GANs [Salimans 2016]: D outputs K+1 dimensions (K labels + fake)
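The two "D predicts the label" variants differ only in how the discriminator's output layer is laid out. The sketch below contrasts them on raw logits; the function names `acgan_head` and `k_plus_one_head` are hypothetical, and placing the fake class last is an assumed convention.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def acgan_head(real_logit: float, class_logits: List[float]) -> Tuple[float, List[float]]:
    """AC-GAN style: one real/fake probability plus a separate K-way label distribution."""
    p_real = 1.0 / (1.0 + math.exp(-real_logit))
    return p_real, softmax(class_logits)


def k_plus_one_head(logits: List[float]) -> Tuple[List[float], float]:
    """K+1 style: a single softmax over K labels and one extra 'fake' class (last entry)."""
    probs = softmax(logits)
    return probs[:-1], probs[-1]


p_real, class_probs = acgan_head(0.0, [1.0, 1.0, 1.0])
label_probs, p_fake = k_plus_one_head([0.0, 0.0, 0.0, 0.0])
```

In the first layout the label distribution always sums to 1 regardless of the real/fake decision; in the second, probability assigned to "fake" is taken away from the K labels.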


Experiments: Datasets and Metrics
• Datasets
– synthetic, facial expression, Tai Chi, human action
• Performance Metrics
1. Average Content Distance: content should stay consistent within a single video
  • Normally, color consistency is checked
  • For facial expressions, features are extracted with OpenFace and the consistency of the person's identity is checked
2. Motion Control Score: whether action conditioning works (checked with a pre-trained action classifier)
3. Content Control Score: when the action label and motion variable are fixed and only the content variable is varied, the content should change
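A content-consistency score of this kind can be sketched as below, assuming ACD is computed as the average pairwise L2 distance between per-frame feature vectors (average color per frame, or OpenFace features for faces). The function name and the toy feature vectors are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def average_content_distance(frame_features: Sequence[Sequence[float]]) -> float:
    """Mean L2 distance over all pairs of per-frame feature vectors (lower is more consistent)."""
    pairs = list(combinations(frame_features, 2))
    if not pairs:
        return 0.0  # a single frame is trivially consistent with itself
    total = 0.0
    for a, b in pairs:
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(pairs)


# Per-frame average RGB color of the generated object (toy values).
consistent_video: List[List[float]] = [[1.0, 0.0, 0.0]] * 3
drifting_video: List[List[float]] = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]

acd_consistent = average_content_distance(consistent_video)
acd_drifting = average_content_distance(drifting_video)
```

A video whose content never changes scores 0, and the score grows as the per-frame features drift apart.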

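Experiments: Comparison with VGAN
• Comparison of VGAN and MoCoGAN
• ACD: measures how consistent the content is within a video
– Consistency of color
– Consistency of facial features extracted with OpenFace
• MoCoGAN outperforms VGAN on both datasets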

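Experiments: various MoCoGAN settings
• Examining the model structure
– Using only D_V as the discriminator
– How to incorporate the action label
  • Choose one of the two options
• Results:
– Using D_I as well appears to be better
– ε′ = [ε, z_A] appears to be better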

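Experiments: Motion and Content Subspace Dimensions
• With the dimension of z fixed at 60, various dimensions of z_M and z_C are tried
• One would expect MCS to rise as the dimension of z_M grows, but in practice MCS fell. If the dimension of z_C is too low, faces cannot be generated well in the first place, so expression recognition also fails.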

Experiments: User Study
Overwhelming…!!

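Conclusion
• Splits the generator's latent space into content and motion
• The motion latent variable is generated by an RNN
• Accuracy is better than prior methods, and it is now possible to manipulate only one of motion and content.
Thoughts
• Many studies struggle with how to model time
• Separating motion and content in latent space is more elegant than VGAN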

References
• Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz. MoCoGAN: Decomposing Motion and Content for Video Generation. arXiv preprint arXiv:1707.04993, 2017.
• R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations, 2017.
• C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.
• C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 2016.
• M. Saito, E. Matsumoto, S. Saito. Temporal Generative Adversarial Nets with Singular Value Clipping. In ICCV, 2017.
• S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, 2016.
• Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
• T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
• C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints, December 2015.
• Kumar Sricharan, Raja Bala, Matthew Shreve, Hui Ding, Kumar Saketh, Jin Sun. Semi-supervised Conditional GANs. arXiv preprint arXiv:1708.05789, 2017.
Unless otherwise noted, images are taken from the papers cited in the slides.