[DL輪読会]Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

July 29, 2016

Slide overview

2016/7/29
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Text of each slide
1.

Paper introduction: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research). arXiv:1502.01852, ICCV 2015. Reasons for choosing this paper: written by the same authors (all of them) as ResNet; 380+ citations; the content concerns CNNs on the ImageNet 2012 dataset. Two topics: 1. PReLU; 2. weight initialization (also used in ResNet). (The reported accuracy has since been surpassed by the authors' own ResNet, Dec. 2015.)

2.

PReLU. The coefficient a is learned by back-propagation; this paper uses the initial value a = 0.25 and imposes no constraint such as a > 0. Channel-wise: f(y_i) = max(0, y_i) + a_i min(0, y_i). Channel-shared: f(y_i) = max(0, y_i) + a min(0, y_i).
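To make the two variants concrete, here is a minimal NumPy sketch of the PReLU forward pass defined above; the NCHW tensor layout and the array shapes are my own assumptions, not something specified on the slide.

```python
import numpy as np

def prelu_channel_wise(y, a):
    """PReLU with one learnable coefficient per channel.
    y: activations of shape (N, C, H, W); a: coefficients of shape (C,)."""
    return np.maximum(0, y) + a[None, :, None, None] * np.minimum(0, y)

def prelu_channel_shared(y, a):
    """PReLU with a single learnable coefficient shared by all channels."""
    return np.maximum(0, y) + a * np.minimum(0, y)

# Example: initial value a = 0.25, as used in the paper.
y = np.random.randn(2, 3, 4, 4)
out_cw = prelu_channel_wise(y, np.full(3, 0.25))
out_cs = prelu_channel_shared(y, 0.25)
```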

3.

Learned values of a

Table 1. A small but deep 14-layer model [10]. The filter size and filter number of each layer are listed; /s indicates the stride s. The learned PReLU coefficients are also shown; for the channel-wise case, the average of {a_i} over the channels is shown for each layer.

layer    filter          channel-shared   channel-wise (avg.)
conv1    7x7, 64, /2     0.681            0.596
pool1    3x3, /3
conv21   2x2, 128        0.103            0.321
conv22   2x2, 128        0.099            0.204
conv23   2x2, 128        0.228            0.294
conv24   2x2, 128        0.561            0.464
pool2    2x2, /2
conv31   2x2, 256        0.126            0.196
conv32   2x2, 256        0.089            0.152
conv33   2x2, 256        0.124            0.145
conv34   2x2, 256        0.062            0.124
conv35   2x2, 256        0.008            0.134
conv36   2x2, 256        0.210            0.198
spp      {6, 3, 2, 1}
fc1      4096            0.063            0.074
fc2      4096            0.031            0.075
fc3      1000

Two observations from the paper: first, conv1 learns coefficients significantly greater than 0, which is a more economical way of exploiting low-level information given the limited number of filters (e.g., 64); second, in the channel-wise version the deeper layers in general have smaller coefficients, i.e. the activations gradually become "more nonlinear" at increasing depths, keeping more information in earlier stages and becoming more discriminative in deeper stages.

Table 2. Comparisons between ReLU and PReLU on the small model. The error rates are for ImageNet 2012 using 10-view testing. The images are resized so that the shorter side is 256 during both training and testing; each view is 224x224. All models are trained for 75 epochs.

                          top-1   top-5
ReLU                      33.82   13.34
PReLU, channel-shared     32.71   12.87
PReLU, channel-wise       32.64   12.75

Training the same architecture from scratch with all ReLUs replaced by PReLUs reduces the top-1 error to 32.64%, a 1.2% gain over the ReLU baseline. Channel-wise and channel-shared PReLUs perform comparably; the channel-shared version introduces only 13 extra free parameters compared with the ReLU counterpart, yet still gives a 1.1% gain over the baseline, which implies the importance of adaptively learning the shapes of the activation functions. In short: on the ImageNet 2012 small model, PReLU is better than ReLU, and channel-wise is slightly better.

The slide also shows the start of Sec. 2.2 (Initialization of Filter Weights for Rectifiers): rectifier networks are easier to train than sigmoid-like networks, but a bad initialization can still hamper learning of a highly non-linear system. Recent deep CNNs are mostly initialized with zero-mean Gaussian weights of a fixed standard deviation (e.g., 0.01), and extremely deep models (e.g., >8 conv layers) then have difficulties converging, as reported by the VGG team [25]; their workaround is to pre-train a model with 8 conv layers to initialize deeper ones, which can lead to a poorer local optimum. The "Xavier" initialization [7] is derived under a linearity assumption that is invalid for ReLU/PReLU, which motivates the initialization proposed in this paper.
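As a quick check of the "13 extra free parameters" remark, here is a tiny tally over the layers of Table 1; the channel-wise total is my own calculation from the table, not a number given on the slide.

```python
# Layers of the 14-layer model that get a PReLU (the final fc3/softmax layer does not).
channels = [64] + [128] * 4 + [256] * 6 + [4096, 4096]

channel_shared = len(channels)   # one coefficient per layer  -> 13 extra parameters
channel_wise = sum(channels)     # one coefficient per channel (my own tally)

print(channel_shared, channel_wise)   # 13, 10304
```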

4.

Weight initialization (forward case). Notation from the slide: x_0 is the input vector and y_l is the output vector of layer l; the basic policy is to keep Var(y_L) ~ Var(x_0), and if this holds, back-propagation is not a problem either.

From the paper: note that E[x_l^2] != Var[x_l] unless x_l has zero mean. For the ReLU activation, x_l = max(0, y_{l-1}) does not have zero mean, which leads to a conclusion different from [7]. If w_{l-1} has a symmetric distribution around zero and b_{l-1} = 0, then y_{l-1} has zero mean and a symmetric distribution around zero, so E[x_l^2] = (1/2) Var[y_{l-1}] when f is ReLU. Putting this into Eqn.(7):

Var[y_l] = (1/2) n_l Var[w_l] Var[y_{l-1}].   (8)

With L layers put together:

Var[y_L] = Var[y_1] * prod_{l=2..L} (1/2) n_l Var[w_l].   (9)

This product is the key to the initialization design: a proper initialization should avoid reducing or magnifying the magnitudes of the input signals exponentially, so the product should take a proper scalar (e.g., 1). A sufficient condition is:

(1/2) n_l Var[w_l] = 1, for all l,   (10)

where n_l is the fan-in (the "d.o.f." of x). Intuitively, with ReLU this says Var(x) ~ Var(W ReLU(x)). Eqn.(10) leads to a zero-mean Gaussian distribution whose standard deviation (std) is sqrt(2/n_l); biases are initialized to 0. For the first layer (l = 1) one should have n_1 Var[w_1] = 1 because no ReLU is applied to the input signal, but the factor 1/2 on a single layer does not matter, so Eqn.(10) is adopted in the first layer as well for simplicity. (Same for CNNs and MLPs? In the paper n_l = k^2 c for a conv layer with k x k kernels and c input channels.)
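A minimal sketch of the resulting rule, a zero-mean Gaussian with std sqrt(2/n_l); the function and layer shapes below are illustrative, not taken from the paper.

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """Zero-mean Gaussian with std = sqrt(2 / fan_in), i.e. (1/2) * n_l * Var[w_l] = 1."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# For a conv layer the fan-in is k * k * c_in (e.g. a 3x3 kernel on 64 input channels).
W_fc = he_init(4096, 4096)          # fully-connected layer
W_conv = he_init(3 * 3 * 64, 128)   # conv weights, flattened to (fan_in, fan_out)
b = np.zeros(128)                   # biases are initialized to 0
```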

5.
Comparison with the "Xavier" initialization [7]

[Figure 2 in the paper: the convergence of a 22-layer large model (B in Table 3). The x-axis is the number of training epochs; the y-axis is the top-1 error of 3,000 random val samples, evaluated on the center crop. ReLU is used as the activation for both cases. Both our initialization (red) and "Xavier" (blue) [7] lead to convergence, but ours starts reducing error earlier.]

[Figure 3 in the paper: the convergence of a 30-layer small model (see the main text). ReLU is used as the activation for both cases. Our initialization (red) is able to make it converge; "Xavier" (blue) [7] completely stalls, and its gradients are verified to be all diminishing. It does not converge even given more epochs.]

Excerpts from the paper shown on this slide:

With a fixed initialization std such as 0.01, the std of the gradient propagated from conv10 to conv2 is 1/(5.9 x 4.2^2 x 2.9^2 x 2.1^4) = 1/(1.7 x 10^4) of what the derivation gives, which may explain why diminishing gradients were observed in experiments.

The variance of the input signal can be roughly preserved from the first layer to the last. When the input is not normalized (e.g., it is in the range [-128, 128]), its magnitude can be so large that the softmax operator overflows. One solution is to normalize the input signal, but this may impact other hyper-parameters; another is to include a small factor on the weights among all or some layers, e.g., (1/128)^(1/L) over L layers. In practice a std of 0.01 is used for the first two fc layers and 0.001 for the last; these are smaller than they should be (e.g., sqrt(2/4096)) and address the normalization issue of images whose range is about [-128, 128].

If the signal is scaled by a factor beta in each layer, it is rescaled by beta^L after L layers (L representing some or all layers). When L is large, beta > 1 leads to extremely amplified signals and an output of infinity, while beta < 1 leads to diminishing signals; in either case the algorithm does not converge: it diverges in the former case and stalls in the latter.

For the initialization in the PReLU case, it is easy to show that Eqn.(10) becomes

(1/2)(1 + a^2) n_l Var[w_l] = 1, for all l,   (15)

where a is the initialized value of the coefficients; a = 0 recovers the ReLU case.

The main difference from the "Xavier" initialization [7] is that the rectifier nonlinearity is taken into account: the derivation in [7] only considers the linear case, and its forward result can be implemented as a zero-mean Gaussian whose std is sqrt(1/n_l).
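A small numerical illustration (my own, not from the slide) of the beta^L argument: pushing a random signal through a deep stack of ReLU layers, the std = sqrt(2/n_l) rule keeps the output variance roughly constant, while the linear-case std = sqrt(1/n_l) shrinks it by roughly (1/2)^L.

```python
import numpy as np

def forward_variance(num_layers, width, std_fn, rng):
    """Propagate a random batch through num_layers ReLU layers and return the output variance."""
    x = rng.normal(size=(1024, width))
    for _ in range(num_layers):
        W = rng.normal(0.0, std_fn(width), size=(width, width))
        x = np.maximum(0.0, x @ W)   # pre-activation y = x W, then ReLU
    return x.var()

rng = np.random.default_rng(0)
he     = lambda n: np.sqrt(2.0 / n)   # (1/2) n Var[w] = 1
xavier = lambda n: np.sqrt(1.0 / n)   # n Var[w] = 1 (linear-case derivation)

for depth in (10, 20, 30):
    print(depth, forward_variance(depth, 256, he, rng), forward_variance(depth, 256, xavier, rng))
# With the sqrt(2/n) rule the output variance stays O(1); with the sqrt(1/n) rule it
# decays roughly like (1/2)^L, i.e. the beta < 1 "stalling" regime described above.
```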

6.

Summary and comments. PReLU: these days the same authors' ResNet is the reference point. The initialization rule (1/2) n_l Var(W) = 1: is it the same for CNNs and MLPs? Uniform vs. Gaussian??
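On the "Uniform vs. Gaussian??" question: Eqn.(10) only fixes the variance, so either distribution can satisfy it as long as the scale is matched. A small sketch of this reading (the equivalence is my interpretation, not something stated on the slide):

```python
import numpy as np

n_l = 1024                  # fan-in
target_var = 2.0 / n_l      # from (1/2) * n_l * Var[w] = 1

rng = np.random.default_rng(0)
w_gauss = rng.normal(0.0, np.sqrt(target_var), size=100_000)

# A uniform distribution on [-b, b] has variance b^2 / 3, so choose b = sqrt(3 * target_var).
b = np.sqrt(3.0 * target_var)
w_unif = rng.uniform(-b, b, size=100_000)

print(w_gauss.var(), w_unif.var(), target_var)   # all approximately equal
```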