January 16, 23
Slide overview
2023/1/13
Deep Learning JP
http://deeplearning.jp/seminar-2/
DL reading group material
Mastering Diverse Domains through World Models Shohei Taniguchi, Matsuo Lab
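Bibliographic information
Mastering Diverse Domains through World Models
https://arxiv.org/abs/2301.04104
• Authors
  • Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap
• Summary
  • An improved version (ver. 3) of Dreamer, a reinforcement learning method that uses a world model
  • The first from-scratch reinforcement learning method to succeed in obtaining a diamond in Minecraft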
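Minecraft ObtainDiamond
• The task of obtaining a diamond in Minecraft
• Rewards are given only when an intermediate item or the diamond is obtained
• A competition has been held at NeurIPS since 2019, and the task is one of the milestones of RL research
• Until now, no from-scratch RL method had succeeded in obtaining a diamond
• There are successful examples among methods that use human demonstrations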
Outline
• Background
  • World models x reinforcement learning
  • PlaNet, Dreamer, DreamerV2
• DreamerV3
• Summary
Some of the slides are reused from the following deck:
https://www.slideshare.net/ShoheiTaniguchi2/ss-238325780
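A challenge of reinforcement learning: sample efficiency
• Training takes a large amount of time
• For robots and similar systems, training that frequently on real hardware is too costly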
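World models x reinforcement learning
If a model of the environment can be acquired with deep learning, the environment can be simulated with that model, and the policy should be learnable from those simulations.
➡ World model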
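World models x reinforcement learning
Training procedure
1. Collect data $\mathcal{D}$ from the environment with the policy $\pi$: $\mathcal{D} = \{x_1, a_1, r_1, \ldots, x_T, a_T, r_T\}$
2. Train the world model $p_\psi$ on $\mathcal{D}$: $p_\psi(x_{1:T}, r_{1:T} \mid a_{1:T})$
3. Update the policy $\pi$ using the world model
• Repeat steps 1 to 3
https://arxiv.org/abs/1903.00374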
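The three-step loop above can be sketched on a toy problem. This is only an illustration under loud assumptions: the 1-D linear environment, the known reward function, the least-squares "world model", and the grid search over a feedback gain are all hypothetical stand-ins, not the neural models or policy optimizers used by PlaNet or Dreamer.

```python
import random

def true_env_step(x, a):
    # Hypothetical real environment: linear dynamics, reward penalizes distance from 0.
    x_next = 0.9 * x + 0.5 * a
    return x_next, -x_next ** 2

def collect(gain, steps, rng):
    # Step 1: run the policy a = -gain * x (plus exploration noise) in the real environment.
    data, x = [], rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        a = -gain * x + rng.gauss(0.0, 0.3)
        x_next, r = true_env_step(x, a)
        data.append((x, a, r, x_next))
        x = x_next
    return data

def fit_world_model(data):
    # Step 2: least-squares fit of x' ~ w_x * x + w_a * a (2x2 normal equations).
    # The reward function is assumed known here to keep the sketch short.
    sxx = sum(x * x for x, a, r, y in data)
    sxa = sum(x * a for x, a, r, y in data)
    saa = sum(a * a for x, a, r, y in data)
    sxy = sum(x * y for x, a, r, y in data)
    say = sum(a * y for x, a, r, y in data)
    det = sxx * saa - sxa * sxa
    w_x = (sxy * saa - say * sxa) / det
    w_a = (say * sxx - sxy * sxa) / det
    return w_x, w_a

def imagined_return(model, gain, horizon=10, x0=1.0):
    # Roll out the policy inside the learned model only (no real environment calls).
    w_x, w_a = model
    x, total = x0, 0.0
    for _ in range(horizon):
        x = w_x * x + w_a * (-gain * x)
        total += -x ** 2
    return total

def improve_policy(model, candidates):
    # Step 3: pick the gain with the best imagined return (grid search as a stand-in).
    return max(candidates, key=lambda k: imagined_return(model, k))

def train(iterations=3, seed=0):
    rng = random.Random(seed)
    gain, data = 0.0, []
    candidates = [i / 10 for i in range(31)]
    for _ in range(iterations):      # repeat steps 1 to 3
        data += collect(gain, 50, rng)
        model = fit_world_model(data)
        gain = improve_policy(model, candidates)
    return model, gain

model, gain = train()
print(model, gain)
```

The point of the structure is that `improve_policy` never touches `true_env_step`: all policy evaluation happens in imagination, so real-environment samples are only spent on fitting the model.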
World Models [Ha and Schmidhuber, 2018]
https://worldmodels.github.io/
• A paper that can be regarded as the pioneer of world-model research
• World model training: VAE + MDN-RNN
• Policy training: CMA-ES
• Details are omitted here (see, for example, the slides below)
https://www.slideshare.net/masa_s/ss-97848402
https://arxiv.org/abs/1803.10122
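PlaNet [Hafner, et al., 2019]
https://planetrl.github.io/
• World model training: Recurrent State Space Model
• Policy training: CEM
• Performance roughly on par with model-free methods
[Figure: top, rollout in the real environment; bottom, simulation by the world model]
[Figure: experimental results on DM Control Suite]
https://arxiv.org/abs/1811.04551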
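Gaussian State Space Model
• A model that uses a normal distribution for the state transition probability:
  $p_\psi(s_{t+1} \mid s_t, a_t) = \mathrm{Normal}\left(\mu_\psi(s_t, a_t), \mathrm{diag}\left(\sigma_\psi^2(s_t, a_t)\right)\right)$
• The functions $\mu_\psi$ and $\sigma_\psi^2$ are implemented with DNNs or similar
• In practice this alone does not work well (vanishing gradients, etc.)
[Figure: graphical model over $o_t$, $s_t$, $a_t$, $r_t$ and their successors at $t+1$]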
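Recurrent State Space Model (RSSM)
• The state is modeled in two parts: $h$, which transitions deterministically, and $z$, which transitions stochastically (the stochastic part is written $s_t$ in the equations below):
  $h_{t+1} = f_\psi(h_t, s_t, a_t)$
  $p_\psi(s_t \mid h_t) = \mathrm{Normal}\left(\mu_\psi(h_t), \mathrm{diag}\left(\sigma_\psi^2(h_t)\right)\right)$
• $f_\psi$ is an RNN-type function such as an LSTM
[Figure: graphical model over $x_t$, $s_t$, $h_t$, $a_t$, $r_t$ and their successors at $t+1$]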
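A single RSSM transition can be sketched as follows. This is a minimal illustration, not the PlaNet implementation: the single tanh layer standing in for the LSTM-style $f_\psi$, the linear heads for $\mu_\psi$ and $\sigma_\psi$, and the names `init_params` and `rssm_step` are all assumptions made for the example.

```python
import math
import random

def init_params(h_dim, s_dim, a_dim, rng):
    # Hypothetical random weights; a real model would learn these.
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_h": matrix(h_dim, h_dim + s_dim + a_dim),  # deterministic transition f_psi
        "W_mu": matrix(s_dim, h_dim),                 # mean head mu_psi
        "W_sigma": matrix(s_dim, h_dim),              # std-dev head sigma_psi
    }

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rssm_step(h, s, a, params, rng):
    # Deterministic path: h_{t+1} = f_psi(h_t, s_t, a_t), here one tanh layer.
    h_next = [math.tanh(v) for v in matvec(params["W_h"], h + s + a)]
    # Stochastic path: s_{t+1} ~ Normal(mu_psi(h_{t+1}), diag(sigma_psi^2(h_{t+1}))).
    mu = matvec(params["W_mu"], h_next)
    # softplus keeps the standard deviation positive; 0.1 is an arbitrary floor.
    sigma = [math.log1p(math.exp(v)) + 0.1 for v in matvec(params["W_sigma"], h_next)]
    s_next = [rng.gauss(m, sd) for m, sd in zip(mu, sigma)]
    return h_next, s_next

rng = random.Random(0)
params = init_params(h_dim=4, s_dim=2, a_dim=1, rng=rng)
h, s = [0.0] * 4, [0.0] * 2
for _ in range(3):
    h, s = rssm_step(h, s, [1.0], params, rng)
print(h, s)
```

Because `h` is carried forward without sampling noise, information can persist across many steps even when the sampled `s` is noisy, which is the motivation the slide gives for splitting the state.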
Recurrent State Space Model (RSSM)
Using an RSSM improves performance considerably.
Dreamer [Hafner, et al., 2019]
https://ai.googleblog.com/2020/03/introducing-dreamer-scalable.html
• Builds on PlaNet and changes policy learning to an actor-critic scheme
• Uses the λ-return for the value function
• Large performance improvement over PlaNet
https://arxiv.org/abs/1912.01603
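Estimating the value function
Bellman equation:
$V^\pi(s_t) = \mathbb{E}_\pi\left[r(s_t, a_t) + V^\pi(s_{t+1})\right]$
Extending this to $n$ steps gives:
$V_n^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{k=1}^{n-1} r(s_{t+k}, a_{t+k}) + V^\pi(s_{t+n})\right]$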
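Estimating the value function
$V_n^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{k=1}^{n-1} r(s_{t+k}, a_{t+k}) + V^\pi(s_{t+n})\right]$
Taking an exponentially weighted average over $n = 1, \ldots, \infty$ gives:
$\bar{V}^\pi(s_t, \lambda) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_n^\pi(s_t)$
This is called the λ-return.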
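Estimating the value function
Dreamer uses the λ-return as the target for the value function:
$\theta \leftarrow \theta - \eta_\theta \nabla_\theta \mathbb{E}_{p_\psi, \pi_\phi}\left[\left(V_\theta^{\pi_\phi}(s_t) - \bar{V}^\pi(s_t, \lambda)\right)^2\right]$
The exponential average is truncated at a suitable horizon (denoted $H$):
$\bar{V}^\pi(s_t, \lambda) \approx (1 - \lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_n^\pi(s_t) + \lambda^{H-1} V_H^\pi(s_t)$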
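The truncated λ-return can be computed either directly from the weighted sum of n-step returns or with an equivalent backward recursion. The sketch below is an illustration under stated assumptions: rewards are indexed from the current step `t` (the usual convention, so the n-step return sums `n` rewards), a discount `gamma` is included with default 1.0 because the slides omit it, and expectations are replaced by a single imagined trajectory.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    # Sum of n rewards from step t, bootstrapped with V(s_{t+n}).
    # rewards[k] is the reward at step k; values[k] is the estimate V(s_k).
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    return ret + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, horizon, lam, gamma=1.0):
    # Direct form: (1 - lam) * sum_{n=1}^{H-1} lam^(n-1) V_n + lam^(H-1) V_H.
    total = sum(
        (1.0 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, horizon)
    )
    return total + lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)

def lambda_return_recursive(rewards, values, t, horizon, lam, gamma=1.0):
    # Equivalent backward recursion: G_k = r_k + gamma * ((1 - lam) * V(s_{k+1}) + lam * G_{k+1}).
    g = values[t + horizon]
    for k in reversed(range(t, t + horizon)):
        g = rewards[k] + gamma * ((1.0 - lam) * values[k + 1] + lam * g)
    return g

rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 1.0, 0.0, 2.0, 1.5]
direct = lambda_return(rewards, values, t=0, horizon=4, lam=0.95)
recursive = lambda_return_recursive(rewards, values, t=0, horizon=4, lam=0.95)
print(direct, recursive)
```

Setting `lam=0` recovers the 1-step bootstrapped target and `lam=1` recovers the full H-step return, so λ interpolates between low-variance and low-bias targets.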
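Effect of the λ-return
"No value" is the result obtained when training with a policy gradient method.
Using the λ-return improves performance regardless of $H$.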
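DreamerV2 [Hafner, et al., 2020]
An improved version of Dreamer
1. Uses a discrete categorical distribution for the latent variables
2. Adjusts the learning rate of the KL term so that the encoder is not over-regularized
• Achieves human-level performance on Atari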
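Discrete latent variables
• PlaNet and DreamerV1 use continuous latent variables modeled with a normal distribution
• DreamerV2 switches to a discrete categorical distribution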
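Discrete latent variables
• With discrete latents, the reparameterization trick can no longer be used for gradient estimation
• Gradients are estimated with the straight-through estimator instead
• The estimator is biased, but it is simple to implement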
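The straight-through estimator can be illustrated without an autograd library by writing the forward and backward passes by hand. This is a hedged sketch, not DreamerV2's code: the function names are hypothetical, and the backward pass simply treats the one-hot sample as if it were the softmax probabilities, which is the source of the bias mentioned above.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_one_hot(probs, rng):
    # Forward pass: draw a category and return it as a one-hot vector (not differentiable).
    idx = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(probs))]

def straight_through_backward(probs, grad_output):
    # Backward pass: pretend the output was `probs` and apply the softmax Jacobian,
    # grad_logits[j] = probs[j] * (grad_output[j] - sum_i probs[i] * grad_output[i]).
    dot = sum(p * g for p, g in zip(probs, grad_output))
    return [p * (g - dot) for p, g in zip(probs, grad_output)]

rng = random.Random(0)
probs = softmax([1.0, 2.0, 3.0])
sample = sample_one_hot(probs, rng)
# grad_output would come from the downstream loss; an arbitrary vector is used here.
grad_logits = straight_through_backward(probs, [1.0, 0.0, -1.0])
print(sample, grad_logits)
```

In an autograd framework the same effect is usually obtained in one line by adding the probabilities to the sample and subtracting a gradient-stopped copy of them, so the forward value stays one-hot while gradients flow through the probabilities.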
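KL Balancing
• In the world model loss, the KL term acts as a regularizer that pulls the encoder and the prior of the transition model toward each other
• However, especially early in training when the transition model is not yet well trained, this KL regularization becomes too strong and hinders learning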
KL Balancing
• The problem is mitigated by adjusting the learning rates of the KL term separately for the encoder and the transition model
• α is set to 0.8
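KL balancing does not change the value of the KL term; it only rescales the gradient that reaches each side. The sketch below assumes the DreamerV2 formulation, in which α weights the term that trains the prior (transition model) and 1 − α weights the term that trains the posterior (encoder). Gradients for categorical distributions are written analytically, and the function name is hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_balancing_grads(post_logits, prior_logits, alpha=0.8):
    # q: encoder posterior, p: transition-model prior, both categorical.
    q, p = softmax(post_logits), softmax(prior_logits)
    log_ratio = [math.log(qi) - math.log(pi) for qi, pi in zip(q, p)]
    kl = sum(qi * lr for qi, lr in zip(q, log_ratio))
    # Term alpha * KL(stop_grad(q) || p): gradient only w.r.t. the prior logits.
    grad_prior = [alpha * (pi - qi) for qi, pi in zip(q, p)]
    # Term (1 - alpha) * KL(q || stop_grad(p)): gradient only w.r.t. the posterior logits.
    grad_post = [(1.0 - alpha) * qi * (lr - kl) for qi, lr in zip(q, log_ratio)]
    return kl, grad_post, grad_prior

kl, grad_post, grad_prior = kl_balancing_grads([2.0, 0.0], [0.0, 0.0])
print(kl, grad_post, grad_prior)
```

With α = 0.8 the prior is pulled toward the posterior four times as strongly as the posterior is pulled toward the prior, which matches the goal of not over-regularizing the encoder while the transition model is still poor.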
Experiments
• Exceeds human performance on Atari
• Stronger than model-free methods such as DQN and Rainbow
Experiments: ablation
• The effect of categorical variables and KL balancing is quite large
DreamerV3
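DreamerV3
• Adds several tricks to turn DreamerV2 into a more general-purpose method
• The goal is to train with the same hyperparameters at all times, even when the domain changes
1. Transform observation and reward values with the symlog function
2. Normalize the λ-return values in the actor objective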
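Symlog Prediction
• When the domain changes, the scale of observation and reward values changes, so hyperparameters would have to be retuned each time
• To avoid this, the symlog function is applied to bring the values to a roughly common scale
• The function is invertible, so applying its inverse recovers the original values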
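For reference, symlog and its inverse symexp as defined in the DreamerV3 paper are $\mathrm{symlog}(x) = \mathrm{sign}(x)\ln(|x| + 1)$ and $\mathrm{symexp}(x) = \mathrm{sign}(x)(\exp(|x|) - 1)$. A minimal scalar sketch follows; the example inputs are arbitrary.

```python
import math

def symlog(x):
    # sign(x) * ln(|x| + 1): roughly linear near 0, logarithmic for large |x|.
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x):
    # Inverse of symlog: sign(x) * (exp(|x|) - 1).
    return math.copysign(math.expm1(abs(x)), x)

for value in [-1000.0, -1.0, 0.0, 0.5, 1000.0]:
    squashed = symlog(value)
    print(value, squashed, symexp(squashed))
```

Unlike a plain logarithm, the function is defined for negative values and for zero, so it can be applied to rewards of either sign; a predictor trained on symlog targets is read out by applying symexp.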
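Normalizing the λ-return
• When the actor is trained with entropy regularization, tuning its coefficient is difficult because it depends on the scale and sparsity of the rewards
• If the reward values can be normalized well, the coefficient of the entropy term should be fixable regardless of the domain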
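Normalizing the λ-return
• Returns are normalized by the range between the 5th and 95th percentiles
• Normalizing naively by the variance over-estimates returns when rewards are sparse, so this form is used to make the scale robust to outliers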
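A hedged sketch of percentile-based return normalization follows. It assumes the form given in the DreamerV3 paper, where returns are divided by max(1, S) with S the difference between the 95th and 5th percentiles; the paper additionally smooths the percentiles with an exponential moving average, which is omitted here. The `percentile` helper uses linear interpolation and is written out only to keep the example dependency-free.

```python
import math

def percentile(values, q):
    # q-th percentile (0..100) with linear interpolation between neighboring ranks.
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac

def normalize_returns(returns, low=5.0, high=95.0):
    # Scale S = Per(R, 95) - Per(R, 5); dividing by max(1, S) never scales small returns up.
    scale = percentile(returns, high) - percentile(returns, low)
    return [r / max(1.0, scale) for r in returns]

dense = [float(i) for i in range(101)]   # returns spread over 0..100
sparse = [0.0] * 99 + [100.0]            # a single rare large return
print(normalize_returns(dense)[-1])
print(normalize_returns(sparse)[-1])
```

In the sparse case the percentile range is zero, so the returns are left unchanged instead of being divided by a tiny number; in the dense case they are scaled down by the range of the bulk of the distribution.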
Experiments
• High performance is achieved with the same hyperparameters across all domains and tasks
Experiments
• Performance is also confirmed to scale with model size
Experiments
Future prediction by the world model
Experiments
• For the first time, an RL agent succeeds in obtaining a diamond in Minecraft
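Summary
• Reviewed the development of Dreamer, a representative world-model method
• Honestly, V3 gives the impression of being a pile of heuristics
• The results are impressive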