[DL Reading Group] Bayesian Dark Knowledge

August 19, 2016

#Bayesian Dark Knowledge #SGLD #Distillation #Machine Learning #Knowledge Transfer

Slide overview

2016/8/19
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023

Text of each page

Bayesian Dark Knowledge August 18, 2016

Contents
1. Introduction
2. Background Knowledge
3. Bayesian Dark Knowledge
4. How to improve the original Bayesian Dark Knowledge

Introduction
"Bayesian Dark Knowledge" is a method that unifies SGLD with distillation.
SGLD is a method for learning large-scale Bayesian models such as Bayesian neural networks. Because it samples from the posterior instead of fitting a single point estimate, SGLD helps avoid overfitting.
Distillation is a method for training student networks on soft labels produced by teacher networks.

Background Knowledge: SGLD: MLA (Metropolis-adjusted Langevin dynamics, commonly abbreviated MALA)
The objective is to sample from p(θ), which is often the posterior p(θ|Data).
The method is based on a Langevin diffusion with stationary distribution p(θ), defined by
$$d\theta(t) = \frac{1}{2} \nabla_\theta L(\theta(t))\,dt + N(0, I\,dt),$$
where L(θ) = log p(θ).
Isotropic diffusion is inefficient, however, so a pre-conditioning matrix M is introduced. M is usually set to the inverse of the Fisher information matrix:
$$d\theta(t) = \frac{1}{2} M \nabla_\theta L(\theta(t))\,dt + N(0, M\,dt).$$
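The Langevin diffusion above can be simulated with a simple Euler discretization. The following is a minimal sketch, not the slides' code: the 1-D Gaussian target (`MU`, `SIGMA`), the step size `dt`, and the scalar preconditioner `m` are all assumptions chosen for illustration, and the Metropolis accept/reject step is left out, so this is the unadjusted form of the dynamics.

```python
import math
import random

MU, SIGMA = 1.0, 2.0  # hypothetical 1-D Gaussian target p(theta) = N(MU, SIGMA^2)

def grad_L(theta):
    """Gradient of L(theta) = log p(theta) for the Gaussian target."""
    return -(theta - MU) / SIGMA**2

def langevin_chain(n_steps, dt=0.2, m=1.0, theta0=0.0, seed=0):
    """Euler discretization of d(theta) = (1/2) m grad L dt + N(0, m dt)."""
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, math.sqrt(m * dt))  # draw from N(0, m dt)
        theta += 0.5 * m * grad_L(theta) * dt + noise
        chain.append(theta)
    return chain

def mean_and_variance(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

chain = langevin_chain(100_000)[1_000:]  # drop a short burn-in
mean, var = mean_and_variance(chain)
print(f"sample mean {mean:.2f} (target {MU}), "
      f"sample variance {var:.2f} (target {SIGMA**2})")
```

Passing `m=4.0` (the inverse curvature of this toy target) leaves the stationary distribution unchanged up to discretization error, while the chain moves in larger steps. This is the scalar analogue of the pre-conditioning matrix M.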

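Background Knowledge: SGLD: SGLD
SGLD combines SGD with MLA. The update is
$$\Delta\theta_t = \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{ti} \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim N(0, \epsilon_t),$$
where N is the dataset size and n is the mini-batch size.
Plain SGD is the same update with the noise term η_t removed.
As the step size ε_t decays, the rejection rate goes to zero asymptotically, so the accept/reject step can be skipped.
In the initial phase the update behaves like SGD, because the stochastic-gradient noise dominates. In the later phase it behaves like MLA, because the injected noise η_t dominates.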
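To make the SGLD update concrete, here is a hedged sketch on a toy problem that is not from the slides: inferring the mean θ of Gaussian data with unit variance under a N(0, `PRIOR_VAR`) prior. The data set, the mini-batch size, and the polynomially decaying step-size schedule are assumptions made only for this example. The exact posterior is known in this case, which makes the samples easy to check.

```python
import math
import random

PRIOR_VAR = 100.0  # assumed prior: theta ~ N(0, PRIOR_VAR)

def sgld_gaussian_mean(data, n_steps=20_000, batch=10, eps0=1e-3, seed=0):
    """SGLD samples of theta for the model x_i ~ N(theta, 1)."""
    rng = random.Random(seed)
    N = len(data)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        eps = eps0 / (1.0 + t / 1000.0) ** 0.55       # decaying step size (assumed schedule)
        minibatch = rng.sample(data, batch)
        grad_prior = -theta / PRIOR_VAR               # grad of log p(theta)
        grad_lik = sum(x - theta for x in minibatch)  # sum of grad log p(x_i | theta)
        drift = 0.5 * eps * (grad_prior + (N / batch) * grad_lik)
        theta += drift + rng.gauss(0.0, math.sqrt(eps))  # eta_t ~ N(0, eps_t)
        samples.append(theta)
    return samples

data_rng = random.Random(42)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]

samples = sgld_gaussian_mean(data)[5_000:]  # discard burn-in
est_mean = sum(samples) / len(samples)
est_var = sum((s - est_mean) ** 2 for s in samples) / len(samples)

# Exact posterior for this conjugate model, for comparison.
post_precision = len(data) + 1.0 / PRIOR_VAR
post_mean = sum(data) / post_precision
print(f"SGLD mean {est_mean:.3f} vs exact {post_mean:.3f}")
print(f"SGLD variance {est_var:.4f} vs exact {1.0 / post_precision:.4f}")
```

Removing the `rng.gauss` term turns the loop into plain mini-batch SGD on the log posterior, which converges to a point estimate instead of producing samples.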

Background Knowledge: Distillation
Training an ensemble of networks or a single large network gives good accuracy. Such networks are called teacher networks. However, their model size is large.
After training the teacher networks, we want to transfer the function they have learned into a single smaller model.
When transferring the knowledge, it is better to use soft targets, which are produced by the teacher networks, instead of the original labels, i.e., hard targets.
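The difference between hard and soft targets can be illustrated with a few lines of code. This is a sketch with made-up logits. The temperature `T` comes from Hinton's distillation paper and is not mentioned on the slide, so treat it as an assumption of this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, probs):
    """Cross-entropy H(targets, probs); zero-probability targets are skipped."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)

teacher_logits = [5.0, 2.0, 0.5]  # hypothetical teacher output for one input
student_logits = [3.0, 1.5, 0.2]  # hypothetical student output for the same input
T = 3.0                           # assumed temperature

hard_target = [1.0, 0.0, 0.0]               # original label: class 0
soft_target = softmax(teacher_logits, T)    # teacher's softened distribution
student_probs = softmax(student_logits, T)

hard_loss = cross_entropy(hard_target, student_probs)
soft_loss = cross_entropy(soft_target, student_probs)
print("soft target:", [round(p, 3) for p in soft_target])
print(f"loss vs hard target {hard_loss:.3f}, loss vs soft target {soft_loss:.3f}")
```

The soft target keeps non-zero mass on the wrong classes, so the student also learns how the teacher ranks them. A one-hot label carries no such information.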

Bayesian Dark Knowledge: Overview
Bayesian Dark Knowledge combines SGLD with the concept of distillation.
SGLD is a useful method for learning Bayesian deep networks. The problem is that SGLD needs to store many copies of the parameters, one per posterior sample.
The motivation is to replace this set of neural networks with a single deep network.

Bayesian Dark Knowledge: Methods

Bayesian Dark Knowledge: Algorithm

Bayesian Dark Knowledge: Points
The method does not need to store the sampled weights.
In the distillation phase, θ is updated online: each new SGLD sample of θ is used for a student update right away and is then discarded.
The variance of the teacher networks' prior is smaller than the variance of the student networks' prior.
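The online nature of the method can be sketched on a toy problem. Everything concrete below is an assumption for illustration and not the setup from the paper or the slides: a one-parameter logistic-regression teacher sampled with SGLD, a student of the same form trained by SGD on the teacher's predicted probabilities, constant step sizes, and distillation inputs drawn from the training inputs. The student's own prior term is omitted for brevity.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def distilled_sgld(xs, ys, n_steps=6_000, batch=10, eps=1e-3, lr=0.02,
                   prior_var=1.0, burn_in=500, seed=0):
    """Alternate one SGLD step on the teacher with one SGD step on the student."""
    rng = random.Random(seed)
    N = len(xs)
    indices = list(range(N))
    theta = 0.0  # teacher weight: overwritten at every step, never archived
    w = 0.0      # student weight
    for t in range(n_steps):
        # Teacher: SGLD step on the log posterior of theta.
        S = rng.sample(indices, batch)
        grad_prior = -theta / prior_var
        grad_lik = sum((ys[i] - sigmoid(theta * xs[i])) * xs[i] for i in S)
        theta += (0.5 * eps * (grad_prior + (N / batch) * grad_lik)
                  + rng.gauss(0.0, math.sqrt(eps)))
        if t < burn_in:
            continue
        # Student: SGD step on cross-entropy against the current teacher's predictions.
        S2 = rng.sample(indices, batch)
        grad_w = sum((sigmoid(w * xs[i]) - sigmoid(theta * xs[i])) * xs[i]
                     for i in S2) / batch
        w -= lr * grad_w
    return theta, w

data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1.0 if data_rng.random() < sigmoid(1.5 * x) else 0.0 for x in xs]

theta, w = distilled_sgld(xs, ys)
print(f"last teacher sample theta = {theta:.2f}, distilled student w = {w:.2f}")
```

Only the current `theta` and the student `w` are held in memory, which is the point made above: the posterior samples are consumed as they are generated, so no archive of weights is needed.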

Bayesian Dark Knowledge: Results

How to improve the original Bayesian Dark Knowledge: How to improve?
SGLD phase: the mixing rate is slow. The method above does not take the local geometric structure into account.
Distillation phase: we have no knowledge of p(x), so we sample inputs from the actual data only.

How to improve the original Bayesian Dark Knowledge: Preconditioned SGLD (p-SGLD)
p-SGLD combines RMSprop with Riemannian SGLD.
RMSprop is an adaptive learning-rate method: it rescales each coordinate by a running average of squared gradients, which serves as a cheap proxy for the local curvature.
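A minimal sketch of a p-SGLD style update follows, based on my reading of Li et al. and not taken from the slides. The assumptions are: a toy target made of two independent Gaussians with very different scales (`SIGMAS`), full gradients instead of mini-batches, a constant step size, a running average initialized at the first squared gradient, and the correction term usually written Γ(θ) dropped. The samples are therefore only approximately distributed according to the target.

```python
import math
import random

SIGMAS = (1.0, 0.1)  # hypothetical target: independent zero-mean Gaussians

def grad_log_post(theta):
    """Gradient of the log density of the toy target."""
    return [-t / s**2 for t, s in zip(theta, SIGMAS)]

def psgld(theta0, n_steps=40_000, eps=0.05, alpha=0.999, lam=1e-3, seed=0):
    """RMSprop-preconditioned Langevin updates with a diagonal preconditioner."""
    rng = random.Random(seed)
    theta = list(theta0)
    # Running average of squared gradients, started at g^2 (illustrative choice).
    v = [g * g for g in grad_log_post(theta)]
    chain = []
    for _ in range(n_steps):
        g = grad_log_post(theta)
        for k in range(len(theta)):
            v[k] = alpha * v[k] + (1.0 - alpha) * g[k] * g[k]
            precond = 1.0 / (lam + math.sqrt(v[k]))           # diagonal entry of G
            noise = rng.gauss(0.0, math.sqrt(eps * precond))  # N(0, eps * G)
            theta[k] += 0.5 * eps * precond * g[k] + noise
        chain.append(tuple(theta))
    return chain

chain = psgld(theta0=(1.0, 0.1))[2_000:]  # drop burn-in
means, variances = [], []
for k in range(len(SIGMAS)):
    column = [state[k] for state in chain]
    mean_k = sum(column) / len(column)
    means.append(mean_k)
    variances.append(sum((c - mean_k) ** 2 for c in column) / len(column))
print("sample variances:", [round(x, 4) for x in variances],
      "target variances:", [s**2 for s in SIGMAS])
```

With a single global step size, plain SGLD would have to pick `eps` small enough for the narrow coordinate and would then move slowly along the wide one. The per-coordinate `precond` is what lets both coordinates use steps matched to their own scale.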

References
A. Korattikara et al., "Bayesian Dark Knowledge"
G. Hinton et al., "Distilling the Knowledge in a Neural Network"
C. Li et al., "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks"