[DL Seminar] Deep Reinforcement Learning that Matters


Slide overview

2017/12/8
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each page
1.

DEEP LEARNING JP [DL Papers] Deep Reinforcement Learning that Matters Reiji Hatsugai http://deeplearning.jp/ 1

2.

k˜"āĢ! • õ²»Ž©ę:¬Ãý36& – €ûÄŖxÄ"  – IE@«Ęħ 6—Ř#"3 6¢ 6 /" • õ²»Ž©ę16j# ; ò6€ûÄŖ xÄ" :¥ "ñĻ¬ř:ġà4 !ijÜ • »Ž©ę#IE@s¨ ¦ "!IE@!ō6Ĵij#+57  difficulty#"3!ó+6" !"ijÜ • k˜#Û· #° .¬Ĭă ijÜ • +5++  Ĵij"cŠ! 7$ 2

3.

How this has become a problem (figure) 3

4.

How this has become a problem (figure) 4

5.

A commonly used task: HalfCheetah 5

6.

A commonly used task: Hopper 6

7.

(figure only) 7

8.

(figure only) 8

9.

(figure only) 9

10.

(figure only) 10

11.

      !" ~$(&|(" )  ("*+ ~,(- . |(" , !" )   0"*+ = 0((" , !" , ("*+ )   11

12.

     !" ~$(&|(" )    $ ("*+ ~,(- . |(" , !" )   0"*+ = 0((" , !" , ("*+ )   ∞ π = arg max Eπ [∑ γ rτ ] ∗ π τ τ =0 12

13.

äļ"õ²»Ž©ę Soft Q TRPO PCL UNREAL ACER DDQN DQN SAC A3C Q-Prop D4PG ACKTR PPO IPG DDPG 13

14.

äļ"õ²»Ž©ę Soft Q TRPO PCL UNREAL ACER DDQN 『深層』強DQN 化学 習 に な っ て か ら SAC A3C D4PG た くさん の手法が開ACKTR 発された Q-Prop PPO IPG DDPG 14

15.

A common pattern when working with deep RL
1. A new method comes out
2. You run a reproduction experiment
3. It does not work (the implementation usually gets the blame)
4. Even then, the performance reported in the paper is not reached (go back to 1) 15

16.

Deep Reinforcement Learning that Matters
• Extends "Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control" from the ICML 2017 reproducibility workshop
• Accepted at AAAI 2018
• Investigates the reproducibility of deep RL:
– extrinsic factors (hyperparameters, implementation)
– intrinsic factors (environment variance, random seeds)
• Compares performance while varying each of these experimental settings
• Proposes that significance testing be performed when reporting new methods 16

17.

Deep Reinforcement Learning that Matters
• The methods compared in the experiments are:
– ACKTR (Wu et al. 2017)
– PPO (Schulman et al. 2017)
– DDPG (Lillicrap et al. 2015)
– TRPO (Schulman et al. 2015)
• ACKTR and PPO are recent methods
• DDPG and TRPO are commonly used baselines
• Concretely, comparisons are made by varying the following factors 17

18.

Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics 18

19.

Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
(extrinsic factors) 19

20.

Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
(intrinsic factors) 20

21.

Network Architecture
• Deep RL commonly uses a few standard hidden-layer configurations:
– (64, 64) (rllab)
– (100, 50, 25) (Q-Prop)
– (400, 300) (DDPG)
• These are compared across each algorithm
• Activation functions are also varied and compared 21
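The architecture sweep above is easy to reproduce in outline. Below is a minimal sketch, assuming PyTorch (the original experiments build on other codebases); the observation/action sizes and the activation set are illustrative placeholders, not values from the paper.

```python
# Build the three commonly used policy shapes, crossed with activations.
import torch.nn as nn

def make_policy_net(obs_dim, act_dim, hidden, activation):
    """MLP policy trunk: obs -> hidden layers -> action output."""
    layers, in_dim = [], obs_dim
    for h in hidden:
        layers += [nn.Linear(in_dim, h), activation()]
        in_dim = h
    layers.append(nn.Linear(in_dim, act_dim))
    return nn.Sequential(*layers)

# The three architectures mentioned on this slide:
configs = {
    "rllab":  (64, 64),
    "Q-Prop": (100, 50, 25),
    "DDPG":   (400, 300),
}
for name, hidden in configs.items():
    for act in (nn.Tanh, nn.ReLU, nn.LeakyReLU):
        net = make_policy_net(obs_dim=17, act_dim=6,
                              hidden=hidden, activation=act)
```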

22.

Policy Architecture 22

23.

Activation Function 23

24.

Network Architecture
• PPO does poorly with large networks
• Tanh generally looks best
• Judging from the PPO results, hyperparameters have a very large effect on reproducibility
• "This also suggests a possible need for hyper parameter agnostic algorithms" 24

25.

Reward Scale
• A technique applied when learning the Q-function (DQN uses clipping)
• Rescaling r̂ = rσ with σ = 0.1 is commonly used
• When rewards are large and sparse, learning can become unstable
• (LeCun et al. 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and Bouthillier 2015)
• Clipping discards reward information 25
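To make the two tricks on this slide concrete, here is a minimal sketch of a Gym-style wrapper applying the rescaling r̂ = rσ and optional DQN-style clipping; the wrapper class and the (obs, reward, done, info) step interface are my own illustration, not code from the paper.

```python
import numpy as np

class RewardScaleWrapper:
    """Rescale (and optionally clip) rewards from a Gym-style env."""
    def __init__(self, env, sigma=0.1, clip=None):
        self.env, self.sigma, self.clip = env, sigma, clip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        reward = self.sigma * reward           # rescale: r_hat = sigma * r
        if self.clip is not None:              # DQN-style clipping
            reward = float(np.clip(reward, -self.clip, self.clip))
        return obs, reward, done, info
```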

26.

Reward Scale 26

27.

Reward Scale
• Reward scale has a large effect
• It appears to matter especially for methods that learn a value function
• The best reward scale also differs from environment to environment
• Layer norm mitigates the effect
• Learning values across many orders of magnitude (Hado van Hasselt et al. 2016)
– adaptively normalizes the learned values, adapting to each environment
• HumanoidStandup-v1:
– rewards reach the order of millions even without solving the task, so the reward scale must be set carefully 27
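The van Hasselt et al. idea can be sketched as a running normalizer for value-learning targets. The class below is a heavily simplified illustration of that adaptive-normalization idea, not the paper's actual algorithm (Pop-Art additionally rescales the network's output layer so its predictions are preserved).

```python
class AdaptiveTargetNormalizer:
    """Keep value-learning targets roughly unit-scale across environments."""
    def __init__(self, beta=3e-4, eps=1e-8):
        self.mean, self.sq_mean = 0.0, 1.0
        self.beta, self.eps = beta, eps

    def update(self, target):
        # Exponential moving estimates of the first two moments.
        self.mean += self.beta * (target - self.mean)
        self.sq_mean += self.beta * (target ** 2 - self.sq_mean)

    def normalize(self, target):
        std = max((self.sq_mean - self.mean ** 2) ** 0.5, self.eps)
        return (target - self.mean) / std
```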

28.

Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics
(intrinsic factors) 28

29.

Random Seeds and Trials
• Run 10 trials with the same hyperparameters, changing only the seed
• Split the 10 runs into two groups of 5
• Plot them 29

30.

Random Seeds and Trials 30

31.

Random Seeds and Trials 31

32.

Random Seeds and Trials (figure: the two groups differ, p < 0.05) 32

33.

Random Seeds and Trials
• Splitting the runs randomly into two groups:
– each group uses exactly the same algorithm and hyperparameters
– yet the distributions differ enough to be statistically significant, which is surprising
• This suggests that the improvements new methods report over prior work may be artifacts of the seeds
• The number of trials should arguably be estimated with a power analysis
• This kind of scrutiny is needed in many places 33
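The seed experiment on these slides fits in a few lines. A minimal sketch, where `train` is a hypothetical stand-in for a full training run and Welch's t-test plays the role of the significance test:

```python
import numpy as np
from scipy import stats

def train(seed: int) -> float:
    """Placeholder: run training with this seed, return final average return."""
    rng = np.random.default_rng(seed)
    return 1500.0 + 400.0 * rng.standard_normal()  # toy stand-in for run-to-run variance

# 10 runs, identical except for the seed; split into two groups of 5.
returns = np.array([train(seed) for seed in range(10)])
group_a, group_b = returns[:5], returns[5:]

# The paper's point: this can come out p < 0.05 even though both
# groups use the same algorithm and hyperparameters.
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```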

34.

Environment
• Each method is applied to Hopper, HalfCheetah, Swimmer, and Walker2D
• Compares how each method's performance changes from environment to environment 34

35.

HalfCheetah 35

36.

Hopper 36

37.

HalfCheetah
• On environments with stable dynamics like HalfCheetah, DDPG is strong
• On environments with unstable dynamics like Hopper, DDPG is weak
• (For stable/unstable, see Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control)
• Environment instability hits methods that, like DDPG, learn a Q-function
• Conversely, HalfCheetah suits DDPG, so evaluating DDPG-based algorithms on HalfCheetah alone is unfair 37

38.

Swimmer 38

39.

Swimmer
• The one environment where TRPO wins outright
• The other policies fall into a local optimum
• The actual behavior, not just the reward magnitude, was also inspected
• (Is the reward design itself the problem?) 39

40.

Code base
• TRPO and DDPG each have the authors' own implementation plus the rllab and baselines implementations
• Different implementations are compared with the same hyperparameters 40

41.

Code base 41

42.

Code base
• Same hyperparameters, yet the results differ
• "dramatic impacts on performance"
• Implementation details not written in the papers have a large effect 42

43.

Reporting Evaluation Metrics
• Deep RL evaluation so far: average over a few trials, plotted and compared
• As we have seen, deep RL has high variance
• Proposed evaluation methods:
– report confidence intervals
– determine the number of trials with a power analysis
– test statistically whether differences between algorithms are significant 43
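As one concrete example of the proposed reporting, here is a sketch of a bootstrap confidence interval over final returns; the sample values below are made up for illustration.

```python
import numpy as np

def bootstrap_ci(returns, n_boot=10_000, alpha=0.05, seed=0):
    """Mean return plus a (1 - alpha) bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    means = np.array([
        rng.choice(returns, size=len(returns), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return returns.mean(), (lo, hi)

mean, (lo, hi) = bootstrap_ci([1612.0, 980.5, 2103.3, 1477.8, 1320.1])
print(f"mean return {mean:.0f}, 95% CI [{lo:.0f}, {hi:.0f}]")
```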

44.
Deep Reinforcement Learning that Matters: summary
• Deep RL reproducibility is affected by both intrinsic and extrinsic factors
• Evaluation practice for deep RL going forward:
– evaluate over n random seeds
– use statistical significance tests
– publish the hyperparameters used
– publish implementation details and experimental settings
• What we would like from deep RL going forward:
– hyperparameter-agnostic algorithms
• "There is often no clear winner among all benchmark environments."

44

45.

What are we actually looking at?
• The paper only looks at HalfCheetah and Hopper with DDPG, and rates the environments stable or unstable with respect to learning
• Is task difficulty, in the end, something that can only be discussed together with the algorithm?
• Simple Nearest Neighbor Policy Method for Continuous Control Tasks
– represents the policy with nearest neighbors
– sorts tasks by difficulty according to whether such a trivial policy can solve them
– if a nearest-neighbor policy suffices, the task can be considered easy 45

46.
Method
• Proposes two variants, NN-1 and NN-2
• Store the trajectories whose episode reward exceeded a fixed threshold
• NN-1
1. Among the stored trajectories, pick the one whose initial state is closest to the current episode's initial state
2. Replay that trajectory's actions to the end, exactly as recorded
• NN-2
1. Among all stored states, pick the one closest to the current state
2. Execute the action recorded for that state for one step, then return to 1
• The reward is given at the end of the episode (sparse reward); a code sketch follows below

46
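A minimal sketch of NN-1 and NN-2 as described on this slide; the trajectory storage format and the Euclidean distance are my assumptions, not details from the paper.

```python
import numpy as np

class NNPolicy:
    def __init__(self, trajectories):
        # Each trajectory: dict with "states" (T, obs_dim) and "actions"
        # (T, act_dim), kept only if its episode reward passed the threshold.
        self.trajectories = trajectories
        self.all_states = np.concatenate([t["states"] for t in trajectories])
        self.all_actions = np.concatenate([t["actions"] for t in trajectories])

    def nn1_rollout(self, initial_state):
        """NN-1: pick the trajectory with the closest initial state,
        then replay its actions open loop."""
        first_states = np.stack([t["states"][0] for t in self.trajectories])
        idx = np.linalg.norm(first_states - initial_state, axis=1).argmin()
        return self.trajectories[idx]["actions"]  # replayed to the end, ignoring feedback

    def nn2_action(self, state):
        """NN-2: each step, act like the single closest stored state."""
        idx = np.linalg.norm(self.all_states - state, axis=1).argmin()
        return self.all_actions[idx]
```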

47.

NN results (figure) 47

48.

Simple Nearest Neighbor
• Performs well enough to solve Sparse Mountain Car
• As in the paper above, HalfCheetah also works
• HalfCheetah really does seem to be on the easy side
• The criteria for classifying task difficulty remain vague
• ICLR reviews: 3, 4, 4
• The NN policy itself had already been proposed, so the novelty as a method is thin, but the fact that such a simple policy moves these tasks is suggestive 48

49.
Are the problems we can solve actually easy?
• Some of what is learned may come down to a simple periodic motion (is HalfCheetah like this?)
• Is the representational power of deep learning actually needed, and at what level of problem does it start to matter?
– presumably needed if the policy maps observations directly to low-level control
– what about raw sensor inputs??
• In the first place, only MLPs of about 3 layers are used
• Towards Generalization and Simplicity in Continuous Control
– parameterizes the policy as a linear or RBF representation
– trains it with natural gradient, introducing no new algorithm
– matches neural-net results (humanoid being the exception?)
– co-authors include Todorov of MuJoCo and Kakade of natural gradient fame
• A sketch of such a simple policy follows below 49
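To illustrate how simple these policies are, here is a sketch of a linear policy over random Fourier (RBF-like) features of the observation, in the spirit of Rajeswaran et al.; the exact feature construction and the natural-gradient training differ in their paper, and everything here is illustrative.

```python
import numpy as np

class RBFLinearPolicy:
    """Linear policy on fixed random Fourier features of the observation."""
    def __init__(self, obs_dim, act_dim, n_features=500, bandwidth=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.standard_normal((n_features, obs_dim)) / bandwidth
        self.phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        # The only learned parameters: one linear map, trained e.g. by
        # natural gradient policy search (not shown here).
        self.W = np.zeros((act_dim, n_features))

    def features(self, obs):
        return np.sin(self.P @ obs + self.phase)

    def act(self, obs):
        return self.W @ self.features(obs)
```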

50.

Towards Generalization and Simplicity in Continuous Control 50

51.
Impressions from reading papers that raise problems with deep RL
• The reproducibility/comparability problems of deep RL are serious
• Complex environments with raw sensor inputs are exactly where we want the representational benefits of deep learning
• If anything, the optimization side may deserve more of the attention (needs further study?)
• Dressing simple problems up as hard ones seems like the wrong direction
– rather than pretending a readily solvable task is a sparse-reward one, we should tackle problems that genuinely have to be solved from sparse rewards
– defining a task indirectly through a reward function feels like a roundabout path
• IL, IRL??
– if the reward-function scale affects learning this much, should we be choosing reward functions to fit a preferred scale, or normalizing it away like other quantities??

51