ICLR 2016 VAE Summary


September 20, 2016

Slide overview

2016/9/16
Deep Learning JP:
http://deeplearning.jp/workshop


Text of each slide
1.

ICLR 2016 VAE Summary / Masahiro Suzuki

2.

About this talk
¤ Today's content: this talk focuses on the VAE-related papers presented at ICLR.
¤ ICLR 2016: May 2-4, 2016, San Juan, Puerto Rico.
¤ Number of presentations: conference track 80, workshop track 55.

3.

Trends at ICLR 2016 (from http://www.computervisionblog.com): Unsupervised Learning, Backprop Tricks, Deep Metric Learning, Computer Vision Applications, Reinforcement Learning, Incorporating Structure, Attention, Training-Free Methods, Geometric Methods, Gaussian Processes and Auto Encoders, Visualizing Networks, Initializing Networks, Compressing Networks, Do Deep Convolutional Nets Really Need to be Deep?, ResNet

4.

VAE papers at ICLR
¤ Five VAE (or VAE-related) papers were accepted at ICLR:
¤ Importance Weighted Autoencoders
¤ The Variational Fair Autoencoder
¤ Generating Images from Captions with Attention
¤ The Variational Gaussian Process
¤ Variationally Auto-Encoded Deep Gaussian Processes
¤ This talk explains these papers while also covering the basics of VAEs.

5.

Discriminative and generative models
¤ A discriminative model (or discriminative function) models the classification (posterior) probability, p(C_k|x) = p(x, C_k) / p(x); a generative model models the distribution of the data (the joint distribution).
¤ Discriminative models are only interested in separating the data; deep neural networks and SVMs are discriminative models (strictly speaking, discriminative functions).
¤ Generative models also consider the source that generates the data.

6.

Variational Inference and VAE

7.

Background: variational inference
¤ Learning a generative model means modeling the distribution p(x) from data, which is obtained by maximizing the likelihood p(x).
¤ When a latent variable z is also modeled, as in p(x) = ∫ p(x, z) dz, the likelihood cannot be maximized directly.
¤ Instead, we maximize a lower bound that always bounds the log-likelihood from below: introduce a distribution q(z|x) that approximates p(z|x).
¤ The log-likelihood then decomposes as log p(x) = L(x) + KL(q(z|x) || p(z|x)), where L(x) is the lower bound and the KL term, the gap between the true posterior and the approximation, is always non-negative.
(Background figure: Bishop, PRML Figs. 9.12-9.13, the E and M steps of EM viewed as raising a lower bound on ln p(X|θ).)
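For reference, the decomposition quoted on this slide can be written out in full (a standard identity, using the slide's notation):

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z|x)}\!\left[\log\frac{p(x,z)}{q(z|x)}\right]}_{L(x)\ \text{(lower bound)}}
  + \underbrace{\mathrm{KL}\!\left(q(z|x)\,\|\,p(z|x)\right)}_{\ge\,0}
  \;\ge\; L(x)
```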

8.

Variational Autoencoder
¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14]
¤ A generative model in which the probability distributions are represented by deep neural networks.
¤ For simplicity, assume a single latent variable z: z ~ p(z), x ~ p(x|z); the approximate posterior q(z|x) is viewed as the encoder and p(x|z) as the decoder.
¤ The lower bound to maximize is
L(x) = E_{q(z|x)}[log p_θ(x, z) / q_φ(z|x)] ≈ (1/T) Σ_{t=1}^{T} log p_θ(x, z^(t)) / q_φ(z^(t)|x) (reparameterization trick),
where z^(t) = μ + diag(σ) ⊙ ε, ε ~ N(0, I).
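A minimal numerical sketch of the estimator above with the reparameterization trick. This is illustrative only: the `encode` / `decode_logits` callables and the Bernoulli decoder are assumptions, not the exact networks used on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, encode, decode_logits, n_samples=1):
    """Monte Carlo estimate of the VAE lower bound L(x) for one data point x.

    encode(x) -> (mu, log_var) of a Gaussian q(z|x)
    decode_logits(z) -> Bernoulli logits of p(x|z)
    """
    mu, log_var = encode(x)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps        # reparameterization trick
        logits = decode_logits(z)
        # log p(x|z) for a Bernoulli decoder: sum_i x_i * l_i - log(1 + e^{l_i})
        recon += np.sum(x * logits - np.logaddexp(0.0, logits))
    # Analytic KL(q(z|x) || N(0, I))
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon / n_samples - kl
```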

9.

Modeling the VAE
¤ Both distributions are modeled with neural networks: the inference model (encoder) q(z|x), and the generative model (decoder) z ~ p(z), x ~ p(x|z).
¤ In practice one network outputs the mean and (log-)variance of a Gaussian q(z|x), and another network outputs the parameters of p(x|z) (e.g. Bernoulli probabilities or a Gaussian mean and variance, depending on the data).
(The remaining text on this slide is background excerpted from a paper describing such encoder/decoder network parameterizations, together with a network-architecture figure.)

10.

How do we learn a good model?
¤ What we maximize is the lower bound, but what we really want to maximize is the (log-)likelihood.
¤ This works if the lower bound approximates the log-likelihood well, which depends entirely on how well the approximate posterior can match the true one:
L(x) = log p(x) − KL(q(z|x) || p(z|x)),
so if q(z|x) matches the true posterior, the KL term becomes 0 and the bound equals the log-likelihood.
¤ In practice, however, the approximate posterior is constrained by the VAE bound: if a posterior sample fails to explain x even slightly, it is penalized heavily.
¤ Solution: consider a new lower bound that approximates the log-likelihood more closely.

11.

Importance Weighted AE
¤ Importance Weighted Autoencoders [Burda+ 15; ICLR 2016]
¤ Proposes a new lower bound: an importance-weighted estimator with k samples,
L_k(x) = E_{z^(1),…,z^(k) ~ q(z|x)}[log (1/k) Σ_{i=1}^{k} p_θ(x, z^(i)) / q_φ(z^(i)|x)].
¤ This bound is proven to satisfy log p(x) ≥ L_{k+1}(x) ≥ L_k(x) ≥ L_1(x) = L(x).
¤ Simply increasing the number of samples relaxes the constraint and brings the bound closer to the true log-likelihood.
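A rough sketch of the importance-weighted bound L_k(x) as a log-mean-exp of k importance weights (one Monte Carlo estimate; `encode` and `log_joint` are hypothetical placeholders for q(z|x) and log p(x, z)):

```python
import numpy as np
from scipy.special import logsumexp

def iwae_bound(x, encode, log_joint, k=50, seed=0):
    """Estimate L_k(x) = E[log (1/k) sum_i p(x, z_i) / q(z_i|x)]."""
    rng = np.random.default_rng(seed)
    mu, log_var = encode(x)                       # Gaussian q(z|x)
    std = np.exp(0.5 * log_var)
    log_w = []
    for _ in range(k):
        eps = rng.standard_normal(mu.shape)
        z = mu + std * eps                        # reparameterized sample
        log_q = -0.5 * np.sum(np.log(2 * np.pi) + log_var + eps**2)
        log_w.append(log_joint(x, z) - log_q)     # log p(x, z) - log q(z|x)
    return logsumexp(log_w) - np.log(k)           # log-mean-exp of the weights
```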

12.

IWAE: experimental results
¤ The test likelihood can be seen to improve.
(Table 1 of the paper: density estimation results and the number of active latent dimensions; for models with two latent layers, "k1+k2" denotes k1 active units in the first layer and k2 in the second.)

# stoch. layers | k  | MNIST VAE NLL (active units) | MNIST IWAE NLL (active units) | OMNIGLOT VAE NLL (active units) | OMNIGLOT IWAE NLL (active units)
1 | 1  | 86.76 (19)   | 86.76 (19)   | 108.11 (28)   | 108.11 (28)
1 | 5  | 86.47 (20)   | 85.54 (22)   | 107.62 (28)   | 106.12 (34)
1 | 50 | 86.35 (20)   | 84.78 (25)   | 107.80 (28)   | 104.67 (41)
2 | 1  | 85.33 (16+5) | 85.33 (16+5) | 107.58 (28+4) | 107.56 (30+5)
2 | 5  | 85.01 (17+5) | 83.89 (21+5) | 106.31 (30+5) | 104.79 (38+6)
2 | 50 | 84.78 (17+5) | 82.90 (26+7) | 106.30 (30+5) | 103.38 (44+7)

The generative performance of IWAEs improved with increasing k, while that of VAEs benefited only slightly; two-layer models achieved better generative performance than one-layer models.

13.

Conditional VAE and Semi-Supervised Learning

14.

Semi-supervised learning with VAEs
¤ Semi-Supervised Learning with Deep Generative Models [Kingma+ 2014; NIPS 2014]
¤ Semi-supervised learning with a conditional VAE (CVAE): label y, latent variable z, observation x.
¤ The conditional VAE lower bound is L(x|y) = E_{q(z|x,y)}[log p_θ(x, z|y) / q_φ(z|x, y)].
¤ The overall objective therefore combines L(x), L(x|y), and a classification term α E[−log q_φ(y|x)].
¤ The last term is the model that predicts the label.
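The sketch below spells out how labeled and unlabeled data contribute to the objective, following the cited Kingma+ 2014 (M2) formulation; the callables `elbo_xy`, `log_q_y_x`, and `q_y_x` are hypothetical placeholders, not code from the paper.

```python
import numpy as np

def labeled_term(x, y, elbo_xy, log_q_y_x, alpha=0.1):
    """Labeled data point: conditional lower bound plus a weighted classification term."""
    return elbo_xy(x, y) + alpha * log_q_y_x(x, y)

def unlabeled_term(x, elbo_xy, q_y_x, n_classes=10):
    """Unlabeled data point: marginalize the label out of the conditional bound,
    U(x) = sum_y q(y|x) L(x, y) + H(q(y|x))."""
    probs = q_y_x(x)                               # class probabilities, shape (n_classes,)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return sum(probs[y] * elbo_xy(x, y) for y in range(n_classes)) + entropy
```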

15.

Problems with CVAE semi-supervised learning
¤ In the model, the label and the latent variable are independent.
¤ However, because the approximate posterior is q_φ(z|x, y), a dependency between y and z arises.
¤ We want a latent representation that keeps the information in the data while being independent of the label.
¤ If y is regarded as a domain, this should yield a representation with the domain removed.

16.

The Variational Fair Autoencoder
¤ The Variational Fair Autoencoder [Louizos+ 15; ICLR 2016]
¤ To make x and s independent (s is the sensitive variable, the "label" of the previous slide), the maximum mean discrepancy (MMD) between q(z1|s=0) and q(z1|s=1) is made small, i.e. the latents for s=0 and s=1 should not differ.
¤ The MMD penalty is added to the VAE lower bound:
ℓ_MMD(Z1|s=0, Z1|s=1) = || E_{p̃(x|s=0)}[E_{q(z1|x,s=0)}[ψ(z1)]] − E_{p̃(x|s=1)}[E_{q(z1|x,s=1)}[ψ(z1)]] ||²,
F_VFAE = F_VAE − ℓ_MMD(Z1|s=0, Z1|s=1) (with a weighting coefficient).
¤ MMD is normally computed through a kernel, but computing the M×M Gram matrix for every minibatch in SGD is expensive, so the feature map is approximated with random Fourier features (random kitchen sinks):
ψ_W(x) = sqrt(2/D) cos( sqrt(2/γ) x W + b ),
where W is a K×D matrix with standard Gaussian entries, b is a D-dimensional vector with entries uniform on [0, 2π], and D = 500 in the experiments.
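A sketch of the MMD estimate with random Fourier features as described above (the kernel bandwidth `gamma` and the helper names are assumptions made for illustration):

```python
import numpy as np

def rff_mmd(z0, z1, d=500, gamma=1.0, seed=0):
    """Approximate squared MMD between latent codes for s=0 and s=1 using
    random Fourier features (random kitchen sinks).

    z0, z1 : arrays of shape (n0, k) and (n1, k).
    """
    rng = np.random.default_rng(seed)
    k = z0.shape[1]
    w = rng.standard_normal((k, d))                # random projection W
    b = rng.uniform(0.0, 2 * np.pi, size=d)        # random phases b

    def features(z):
        return np.sqrt(2.0 / d) * np.cos(np.sqrt(2.0 / gamma) * z @ w + b)

    diff = features(z0).mean(axis=0) - features(z1).mean(axis=0)
    return float(diff @ diff)                      # squared norm of the mean difference
```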

17.

Experiment: verifying fairness
¤ Check whether the information about s has been removed from z.
¤ Evaluate by the accuracy of classifying s from z.
(Figure 3 of the paper: fair classification results on the Adult, German, and Health datasets; the columns show Random/RF/LR accuracy on s, discrimination against s, and model accuracy on y.)
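A sketch of this evaluation protocol: train simple classifiers to predict s from the learned latent codes z and report cross-validated accuracy (scikit-learn is assumed here purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def nuisance_accuracy(z, s):
    """How well can the sensitive variable s be recovered from the latent z?
    Accuracy close to chance means the representation has removed s."""
    scores = {}
    for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                      ("RF", RandomForestClassifier(n_estimators=100))]:
        scores[name] = cross_val_score(clf, z, s, cv=5).mean()
    return scores
```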

18.

Experiment: domain adaptation
¤ Domain adaptation between different domains, evaluated in a semi-supervised setting (no labels in the target domain), on the Amazon reviews dataset.
¤ y is the sentiment (positive or negative).
¤ Result: VFAE beats the existing method (DANN [Ganin+ 15]) on 9 of the 12 domain adaptation tasks and is comparable on the remaining 3.
(Table 1 of the paper, accuracy on y; the DANN column is taken from Ganin et al. (2015), using the original representation as input.)

Source - Target | RF | LR | VFAE | DANN
books - dvd | 0.535 | 0.564 | 0.799 | 0.784
books - electronics | 0.541 | 0.562 | 0.792 | 0.733
books - kitchen | 0.537 | 0.583 | 0.816 | 0.779
dvd - books | 0.537 | 0.563 | 0.755 | 0.723
dvd - electronics | 0.538 | 0.566 | 0.786 | 0.754
dvd - kitchen | 0.543 | 0.589 | 0.822 | 0.783
electronics - books | 0.562 | 0.590 | 0.727 | 0.713
electronics - dvd | 0.556 | 0.586 | 0.765 | 0.738
electronics - kitchen | 0.536 | 0.570 | 0.850 | 0.854
kitchen - books | 0.560 | 0.593 | 0.720 | 0.709
kitchen - dvd | 0.561 | 0.599 | 0.733 | 0.740
kitchen - electronics | 0.533 | 0.565 | 0.838 | 0.843

(Also visible on the slide: the paper's Figures 1-2, the unsupervised and semi-supervised graphical models, with generative process z ~ p(z), x ~ p_θ(x|z, s).)

19.

Using the CVAE
¤ A conditional VAE can generate images conditioned on a label or other information, including data that do not exist in the training samples.
¤ Conditioning on the digit label [Kingma+ 2014; NIPS 2014].
(Figure 1 of the paper: (a) handwriting styles for MNIST obtained by fixing the class label and varying the 2D latent variable z; (b, c) MNIST and SVHN analogies, where the latent z inferred from a test image is reused for the other columns.)

20.

Conditional alignDRAW
¤ Generating Images from Captions with Attention [Mansimov+ 16; ICLR 2016]
¤ A model that conditions DRAW on a bidirectional RNN over the caption.
¤ DRAW [Gregor+ 15]: brings RNNs into the VAE framework; the image canvas is overwritten at each time step, and attention is modeled by looking at the difference from the previous step.
(Figure residue from the DRAW paper: the conventional variational auto-encoder versus the DRAW encoder/decoder RNN architecture, and MNIST generation sequences for DRAW without attention.)

21.

Conditional alignDRAW
¤ Overall structure of conditional alignDRAW: DRAW conditioned on a bidirectional RNN over the caption, where the condition is a weighted sum (alignment) of the bidirectional RNN outputs.
¤ The model is trained to maximize a variational lower bound on the marginal likelihood of the correct image x given the input caption y:
L = Σ_Z Q(Z|x, y) log P(x|y, Z) − D_KL(Q(Z|x, y) || P(Z|y)) ≤ log P(x|y).
(Figure 2 of the paper: the alignDRAW model; the caption is encoded by a bidirectional RNN, the generative RNN takes the latent sequence z_{1:T} and the dynamic caption representation s_{1:T} to produce the canvas, and an inference RNN computes the approximate posterior Q over the latent sequence.)

22.

Experiment: MNIST with captions
¤ Trained on MNIST with artificial captions that specify where the digits are placed (e.g. "the digit three is at the top of the digit one", "the digit seven is at the bottom left of the image").
¤ Left column: captions that were part of the training set; right column: digit/position configurations hidden during training.
¤ Even with multiple digits, the images are generated appropriately.
(Figure 6 of the paper: generated 60×60 MNIST images for the respective captions.)

23.

Experiment: MS COCO dataset
¤ Changing only part of the caption (the underlined part on the slide), e.g. "A yellow/red/green/blue school bus parked in a parking lot", "The decadent chocolate desert is on the table" vs. "A bowl of bananas is on the table", "A vintage photo of a cat" vs. "A vintage photo of a dog".
¤ Generation from captions that do not exist, describing novel scene compositions highly unlikely to occur in real life, e.g. "A stop sign is flying in blue skies", "A herd of elephants flying in the blue skies", "A toilet seat sits open in the grass field", "A person skiing on sand clad vast desert".
(At generation time the inference network is discarded and latents are sampled from the prior; because DRAW samples are blurry, an adversarial network trained on residuals of a Laplacian pyramid is used as a post-processing step. The paper's code is available at https://github.com/emansim/text2image.)

24.

Gaussian Processes and VAE

25.

Gaussian processes
¤ What is a Gaussian process? A probability distribution over functions.
¤ For a dataset of n input vectors X = {x_1, ..., x_n}, each of dimension D, the vector of function values f = f(X) = [f(x_1), ..., f(x_n)]^T always has a joint Gaussian distribution, f ~ N(µ, K_{X,X}).
¤ The process is completely described by the mean vector, µ_i = µ(x_i), and the covariance matrix, (K_{X,X})_{ij} = k(x_i, x_j).
¤ With additive Gaussian noise, y(x)|f(x) ~ N(y(x); f(x), σ²), the predictive distribution at n_* test points X_* is f_*|X_*, X, y ~ N(E[f_*], cov(f_*)), where
E[f_*] = µ_{X_*} + K_{X_*,X} [K_{X,X} + σ²I]^{-1} y,
cov(f_*) = K_{X_*,X_*} − K_{X_*,X} [K_{X,X} + σ²I]^{-1} K_{X,X_*}.
(Background excerpts on this slide are from a standard GP review, e.g. Rasmussen and Williams (2006); figure residue: samples from a GP prior with an SE kernel and from the posterior after conditioning on 5 noise-free observations, with the shaded area showing E[f(x)] ± 2 std(f(x)).)

26.

Deep Gaussian processes
¤ To express more complex samples, stack layers by process composition [Lawrence & Moore, 07] ➡ the deep Gaussian process (deep GP).
¤ Consider the following multi-layer graphical model, where Y is the data and the X_l are latent variables:
Y = f_1(X_1) + ε_1, ε_1 ~ N(0, σ_1² I)
X_{l-1} = f_l(X_l) + ε_l, ε_l ~ N(0, σ_l² I), l = 2, ..., L,
where the functions f_l are drawn from Gaussian processes with covariance functions k_l, i.e. f_l(x) ~ GP(0, k_l(x, x')); in the unsupervised case the top hidden layer is given a unit Gaussian prior, X_L ~ N(0, I).
¤ A deep GP is significantly more expressive than a standard GP: the successive warping of latent variables through the hierarchy allows non-stationarities and sophisticated, non-parametric functional features; just as a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition.
(Figure 1 of the paper: a deep Gaussian process with two hidden layers.)

27.

Sampling from a deep Gaussian process

28.

VAE-DGP
¤ A variational inference framework for DGPs has already been proposed [Damianou & Lawrence 13], but it could only be trained on small datasets, because of the covariance-matrix inversion and the huge number of parameters.
¤ Treat the DGP inference as the recognition model (encoder) of a VAE.
¤ This adds a constraint, reduces the number of parameters, and makes inference faster.
¤ It also suppresses overfitting better than the conventional DGP.
➡ VAE-DGP: Variationally Auto-Encoded Deep Gaussian Processes [Dai+ 15; ICLR 2016]
(Figure 3 of the paper: a deep Gaussian process with three hidden layers and back-constraints.)

29.

Experiment: imputing missing data
¤ Imputation of missing regions in test data.
¤ The rightmost image in each example is the original.
(Figures 5-6 of the paper: samples generated from VAE-DGP trained on the combined Frey and Yale faces, and imputation on the Frey-Yale, MNIST, and SVHN test sets; the gray color indicates the missing area, the first column shows the input images, the second the imputed images, and the third the original full images.)

30.

Experiment: quantitative evaluation
¤ Log-likelihood on the MNIST test data (Table 1 of the paper; baselines: DBN and Stacked CAE from Bengio et al. 2013, Deep GSN from Bengio et al. 2014, Adversarial nets from Goodfellow et al. 2014, GMMN+AE from Li et al. 2015):

Model | MNIST
DBN | 138 ± 2
Stacked CAE | 121 ± 1.6
Deep GSN | 214 ± 1.1
Adversarial nets | 225 ± 2
GMMN+AE | 282 ± 2
VAE-DGP (5) | 301.67
VAE-DGP (10-50) | 674.86
VAE-DGP (5-20-50) | 723.65

¤ Supervised learning (regression): MSE on the Abalone and Creep datasets (Table 2 of the paper):

Model | Abalone | Creep
VAE-DGP | 825.31 ± 64.35 | 575.39 ± 29.10
GP | 888.96 ± 78.22 | 602.11 ± 29.59
Lin. Reg. | 917.31 ± 53.76 | 1865.76 ± 23.36

31.

Variational models: the mean-field approximation in variational inference
¤ We want to compute the posterior p(z|x) (z: latent variables, x: data).
¤ In the VAE the approximate distribution has been q(z|x), represented by a neural network; variational inference minimizes KL(q(z; λ) || p(z|x)) for a family q(z; λ), equivalently maximizing the evidence lower bound (ELBO) E_{q(z;λ)}[log p(x|z)] − KL(q(z; λ) || p(z)) on log p(x).
¤ Commonly the approximate distribution is a mean-field approximation, q(z; λ) = Π_i q(z_i; λ_i).
¤ A richer approximate distribution is also possible: treat the parameters λ as random variables with a prior (a hierarchical variational model), i.e. interpret the family as a variational model for the posterior latent variables z, introducing new latent variables.
(Reference: Lawrence, N. (2000). Variational Inference in Probabilistic Models. PhD thesis.)

32.

The variational Gaussian process
¤ The Variational Gaussian Process [Tran+ 15; ICLR 2016]
¤ Proposes a very powerful variational model.
¤ Taking D as variational data (parameters), consider the following generative process for z: draw a latent input ξ, draw a non-linear mapping from a Gaussian process conditioned on D, and then generate the latent variable z.

33.
[beta]
variational distribution. (This idea appears in a different context in Blei & Lafferty (2006).) The
VGP specifies the following generative process for posterior latent variables z:
1. Draw latent input ⇠ 2 Rc : ⇠ ⇠ N (0, I).

変分ガウス過程の尤度

Qd

Draw mappings
non-linear mapping
f :R !R
conditioned
on latent
D: f ⇠variable
quence of 2.
domain
during inference,
from
variational
spaceK⇠⇠ ) | D.
i=1 GP(0,
Qdthe
r latent variable space Q to data space P. We perform variational inference in
3. Draw approximate posterior samples z 2 supp(p): z = (z1 , . . . , zd ) ⇠ i=1 q(fi (⇠)).
e and auxiliary inference in the variational space.
¤ 全ページの⽣成過程から,潜在変数𝑧の周辺尤度は
c

d

Figure 1 displays a graphical model for the VGP. Marginalizing over all non-linear mappings and
latent inputs, the VGP is
resses the task of posterior inference
by learning
f ⇤ : conditional on variational
data
"
#
"
#
Z
Z
d
ameters to learn, the distributionY
of
the GP learns Y
tod concentrate around this optimal
✓, D) = provides
q(zintuition
GP(f
D N (⇠; 0, I) df d⇠,
(4)
VGP (z;perspective
i | fi (⇠)) behind
i ; 0, K⇠⇠ ) |result.
ng inference.qThis
the following
i=1

i=1

Universal approximation). Let q(z; ✓, D) denote the variational Gaussian process. For
which isp(z
parameterized
kernel
hyperparameters
✓ and variational
data.
distribution
| x) with aby
finite
number
of latent variables
and continuous
quantile
rse CDF),
there existmodel,
a set the
of parameters
(✓,
D) such
that of mean-field distributions. A mean-field
As a variational
VGP forms an
infinite
ensemble
¤ このようにしてモデル化した近似分布は,𝑝(𝑧|𝑥)がどんな分布であろ
distribution
is specified
conditional
on a| fixed
function
f (·) and input ⇠; the d outputs fi (⇠) = i are
KL(q(z;
✓, D) k p(z
x)) =
0.とするパラメータが存在する
うと,
the mean-field’s parameters. The VGP is a form of a hierarchical variational model (Ranganath et al.,
(Universal
Approximation)
2015); it places
a continuous
Bayesian nonparametric prior over mean-field parameters.
x B for a proof.
Theorem 1 states that any posterior distribution with strictly posi¤ つまり,これまでのどの⼿法よりも限りなく柔軟なモデルとなる.
that the VGP
the d
a GP at the
samefor
latent
input posterior
⇠, which induces coran beNote
represented
by aevaluates
VGP . Thus
thedraws
VGPfrom
is a flexible
model
learning
relation between their outputs, the mean-field parameters. In再掲
turn, this induces correlation between
latent variables of the variational model, correlations that arelog
not𝑝 captured
in classical mean-field.
𝑥
Finally, the complex non-linear mappings drawn from the GP
make the VGP a flexible model for
= 𝐿 𝑥 + 𝐾𝐿(𝑞(𝑧|𝑥)||𝑝(𝑧|𝑥))
complex
discrete and continuous posteriors.
BOX
INFERENCE

We emphasize that the VGP needs variational data because—unlike typical GP regression—there is
IONAL
no OBJECTIVE
observed data available to learn a distribution over non-linear mappings. The variational data

34.

derive a tractable variational objective inspired by auto-encoders. B LACK BOX INFERENCE Under review as a conference paper at ICLR 2016 Specifically, a tractable lower bound to the model evidence log p(x) can be derived by subtracting KL divergence term from the ELBO: 3.1 anVexpected ARIATIONAL conference paper atOBJECTIVE ICLR 2016 h i EqVGPfor [log p(x | z)] black KL(qVGP EqVGPa wide KL(q(⇠, f |of z)kr(⇠, f | z))models. , We derivelog anp(x) algorithm performing box(z)kp(z)) inference over class generative 3 w as a 下界 The original ELBO (Eq.1) is analytically intractable due to the log density log qVGP (z) (Eq.4). We where r(⇠, f | variational z) is an auxiliary model. Such an has been considered independently by Salderive a tractable objective inspired byobjective auto-encoders. 3.2 and AUTO - ENCODING VARIATIONAL MODELS imans et al. (2015) Ranganath et al. (2015). Variational inference is performed in the posterior Specifically, a tractable bound KL(qkp) to the model evidence log p(x) model; can be derived byoccur subtracting latent variable space,lower minimizing to learn the variational for this to auxil¤ 学習では次の下界を最⼤化する iary inference is performed in from the variational variable space, minimizing KL(qkr) to learn an an expected KL divergence term theprovide ELBOlatent : a flexible Inference networks parameterization of approximating ENCODINGauxiliary VARIATIONAL MODELS model. See Figure 2. h 1994), deep Boltzmann i in Helmholtz machines (Hinton & Zemel, machin log p(x) E [log p(x | z)] KL(q (z)kp(z)) E KL(q(⇠, f | z)kr(⇠, f | z)) , qVGPapproaches, we rewrite VGP VGP Unlike previous this variational qobjective to connect to auto-encoders: Larochelle, 2010), and variational auto-encoders distributions (Kingma & Welling, 2014; R tworks provide a flexible parameterization of approximating as used h parameters with global parameters coming i replaces local variational from where r(⇠, f | z) is an auxiliary model. Such an objective has been considered independently e=& z machines (Hinton Boltzmann machines & a ne L Eq Zemel, [log p(x |1994), z)] Eq deep KL(q(z | f (⇠))kp(z)) + KL(q(⇠, f(Salakhutdinov )kr(⇠, f | z)) , by Sal(5) ically, for latent(Kingma znVariational which correspond a data point xposterior imans et al. (2015) and Ranganath etvariables al. (2015). inference isto performed in the n , anItinfere 010), and variational auto-encoders & Welling, 2014; Rezende et al., 2014). the KL are now takento over tractable distributions (seelocal Appendix In auxilautolatentwhere variable space, minimizing KL(qkp) learn variational model; for this to C). occur a divergences neural network which takes xthe input and its variational parameter n as 補助モデル l variational parameters with global parameters coming from a neural network. Specifwe maximize the expected negative reconstruction error, regularized an exiary encoder inferenceparlance, is amortizes performed in the variational latent variable space, minimizing KL(qkr) tobylearn an 再構成誤差 正規化項 inference by only defining a set of global parameters. pected divergence between original model’s network prior, and an expected ent variables znmodel. which to a datamodel pointandxnthe, an inference specifies auxiliary Seecorrespond Figure 2. the variational divergence between the auxiliary model and the variational model’s prior. This is simply a nested auto-encode thelocal VGP we specify objective inference toauto-encoders: parameterize work which takes xn To as input and variational parameters This bo n astoaoutput. 
Unlike previous approaches, we its rewrite this variational tonetworks connect instantiation of the variational auto-encoder bound (Kingma & Welling, 2014): KL divergence イメージとしては次のような感じ auxiliary models. fromas other auto-encoder we erence by¤only defining a setmodel of global parameters. between the inference and a Unique prior regularizers on both theapproaches, posterior and variational h is taken i let the aux VGP VGP spaces. interpretation proposed bound for zvariational point xKL(q(z and variational data point as input: = This EqVGPobserved [log p(x |data z)] justifies EqVGP the | f (⇠))kp(z)) + KL(q(⇠, fn )kr(⇠, f models; | z)) , as we (5) n previously ¤Le近似モデルでxからzを⽣成 shall see, it also enables lower variance gradients during stochastic optimization. VGP we specify inference networks to parameterize both the variational and ode the xn over 7! q(z ✓ ), x(see 7! r(⇠nC). , fnIn| xauton | xn ;distributions n , zn n , zn ; n where from the KLother divergences are now taken tractable Appendix dels. Unique auto-encoder approaches, we letn the auxiliary model take both encoder parlance, we maximize the expected negative reconstruction regularized by an ex-Dn , a where q has local variational parameters given error, by the variational data ¤ 補助モデルでxとzから写像と潜在変数を⽣成 5 a point xpected and variational data point z as input: n n divergence between the variational model the original model’s prior, and an expected fully factorized Gaussian withand local variational parameters (µn 2 R n = divergence between the auxiliary model and the variational model’s prior. This is simply a nested xn 7! q(zn | xNote ), by letting xn , znr’s 7!inference r(⇠ n , fnnetwork | xn , zntake ; n ), n ; ✓ nthat both xn aand n as input, w instantiation of the variational auto-encoder bound (Kingma & Welling, 2014): KL zdivergence explicit specification r(✏,asf regularizers | z). This suggested butas nota imple local variational given variational dataidea Dboth , and r is specified nwas between theparameters inference model and a by priorthe isoftaken on thefirst posterior and variational 1 et al.variational (2015). spaces.with This local interpretation justifiesparameters the previously proposed variational ed Gaussian Rc+d , 2n 2models; Rc+das ). we n = (µbound n 2 for shallinference see, it also enables lower variance stochastic letting r’s network take bothgradients x andduring z as input,optimization. we avoid the restrictive

35.

Experiment: log-likelihood
¤ The first results to break into the 70s (negative log-likelihood below 80 nats on binarized MNIST).
¤ The best model uses DRAW as the generative part and the VGP as the approximate posterior.
(Table 1 of the paper: negative predictive log-likelihood for binarized MNIST; previous best results are [1] Burda et al. 2016, [2] Salimans et al. 2015, [3] Rezende & Mohamed 2015, [4] Raiko et al. 2014, [5] Murray & Salakhutdinov 2009, [6] Gregor et al. 2014, [7] Gregor et al. 2015.)

Model | − log p(x)
DLGM + VAE [1] | 86.76
DLGM + HVI (8 leapfrog steps) [2] | 85.51 (bound 88.30)
DLGM + NF (k = 80) [3] | 85.10
EoNADE-5 2hl (128 orderings) [4] | 84.68
DBN 2hl [5] | 84.55
DARN 1hl [6] | 84.13
Convolutional VAE + HVI [2] | 81.94 (bound 83.49)
DLGM 2hl + IWAE (k = 50) [1] | 82.90
DRAW [7] | 80.97
DLGM 1hl + VGP | 84.79
DLGM 2hl + VGP | 81.32
DRAW + VGP | 79.88

36.

Summary
¤ This talk summarized the VAE work at ICLR 2016, covering variational inference and VAE, conditional VAEs and semi-supervised learning, and Gaussian processes and VAE.
¤ Impressions: I got a rough sense of the trends at ICLR; putting this summary together was difficult.