Slide overview
2018/2/2
Deep Learning JP:
http://deeplearning.jp/seminar-2/
DL輪読会 (DL paper reading group) material
1 DEEP LEARNING JP [DL Papers] "Zero-Shot Visual Imitation" (ICLR 2018)
Presenter: Iori Yanokura, JSK Lab
http://deeplearning.jp/
2 Paper overview
• Accepted at ICLR 2018
• Project page: https://sites.google.com/view/zero-shot-visual-imitation/home
• Reviews: 8 (confidence: 4), 8 (confidence: 3), 7 (confidence: 5); rank 986 / 1000, i.e. top 2%
• Authors: Deepak Pathak, Parsa Mahmoudieh, Michael Luo, Pulkit Agrawal, Dian Chen, Fred Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, Trevor Darrell
3 Background: reinforcement learning
• In reinforcement learning, the agent learns by optimizing a reward signal (Sutton & Barto, 1998)
• Designing that reward by hand (reward engineering) is difficult for many real-world tasks
4 Imitation Learning
• Learn a policy from demonstrations instead of a hand-designed reward
• Two broad approaches are commonly distinguished (e.g. behavior cloning and inverse reinforcement learning)
6 Related work
• One-shot imitation learning (Duan et al., 2017): imitates a new task from a single demonstration
7 Learning from visual demonstrations
• Combining self-supervision and imitation for rope manipulation (Nair et al., 2017)
• Unsupervised perceptual rewards for imitation (Sermanet et al., 2016)
• Imitation from observation via context translation (Liu et al., 2017)
9 Goal-conditioned value functions and policies
• Universal value function approximators (Schaul et al., 2015): condition the value function on a goal g, i.e. V(s, g; θ)
• Hindsight experience replay (Andrychowicz et al., 2017): relabel states the agent actually reached as goals, so unsuccessful trajectories still provide learning signal
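As a concrete illustration of the hindsight relabeling idea, a minimal Python sketch (the function name, tuple format, and equality-based goal check are illustrative, not the paper's or HER's actual code):

```python
import random

def her_relabel(episode, k=4):
    """Hindsight relabeling (Andrychowicz et al., 2017), minimal sketch.

    episode: list of (state, action, next_state) tuples.
    Replaces the intended goal with states actually reached later in the
    same episode, so even a failed episode yields transitions that do
    satisfy their (relabeled) goal.
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(episode):
        # Sample up to k future states from this episode as new goals.
        future = [step[2] for step in episode[t:]]
        for g in random.sample(future, min(k, len(future))):
            reached = (s_next == g)  # sparse goal-satisfaction signal
            relabeled.append((s, a, s_next, g, reached))
    return relabeled
```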
10 3. LEARNING TO IMITATE WITHOUT EXPERT SUPERVISION
11 Key components
• Imitation learning: no expert action labels are required at training time
• Visual demonstration: the demonstration is given only as a sequence of images
• Forward and inverse dynamics: learned dynamics models let the policy chain multiple steps toward a goal
• Goal conditioning: conditioning the policy on a goal image is what enables zero-shot imitation
12 Forward dynamics
13 3.1 LEARNING THE PARAMETRIC SKILL FUNCTION
• The parametric skill function (PSF) takes the current observation and a goal observation and outputs the action(s) expected to reach the goal
14 3.1 LEARNING THE PARAMETRIC SKILL FUNCTION
• Training data are one-step transitions (s_t, a_t, s_{t+1}) collected by the agent's own exploration; the executed action a_t serves as the ground-truth label
• The PSF is trained to predict a_t from (s_t, s_{t+1}) with a cross-entropy loss, optimized by SGD
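A minimal sketch of this one-step training step, assuming discrete actions and vector observations; the network shapes and names below are illustrative, not the paper's architecture (which works from convolutional image features):

```python
import torch
import torch.nn as nn

class SkillFunction(nn.Module):
    """Goal-conditioned skill function: (s_t, s_goal) -> action logits."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s_t, s_goal):
        return self.net(torch.cat([s_t, s_goal], dim=-1))

obs_dim, n_actions = 32, 4
psf = SkillFunction(obs_dim, n_actions)
opt = torch.optim.SGD(psf.parameters(), lr=1e-2)

# (s_t, a_t, s_{t+1}) from random exploration; random tensors stand in here.
s_t = torch.randn(64, obs_dim)
s_next = torch.randn(64, obs_dim)
a_t = torch.randint(0, n_actions, (64,))   # executed action = ground truth

logits = psf(s_t, s_next)                  # one-step: the goal is s_{t+1}
loss = nn.functional.cross_entropy(logits, a_t)
opt.zero_grad(); loss.backward(); opt.step()
```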
15 3.2 MODELING ACTION DISTRIBUTION VIA FORWARD CONSISTENCY
• Multiple actions can lead from the same state to the same next state, so regressing directly onto the executed action is ill-posed
• Instead of matching actions, match outcomes: an action is acceptable if a learned forward dynamics model predicts it reaches the same next state
• The multi-step skill policy is implemented with an RNN
16 3.2 MODELING ACTION DISTRIBUTION VIA FORWARD CONSISTENCY
• The skill policy h (parameters θ_h) and the forward dynamics model f (parameters θ_f) are trained jointly
• The forward-consistency term penalizes the distance between the true next state and the state f predicts for the policy's action, rather than the distance between actions
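A minimal sketch of the forward-consistency idea, under the simplifying assumption of continuous feature observations and a differentiable action output (the paper's exact parameterization differs; this only illustrates the loss structure):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 4
f = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                  nn.Linear(64, obs_dim))       # forward model, params θ_f
h = nn.Sequential(nn.Linear(2 * obs_dim, 64), nn.ReLU(),
                  nn.Linear(64, act_dim))       # skill policy, params θ_h

s_t = torch.randn(8, obs_dim)
s_next = torch.randn(8, obs_dim)
a_taken = torch.randn(8, act_dim)               # action actually executed

a_hat = h(torch.cat([s_t, s_next], dim=-1))     # policy's proposed action

# Match outcomes, not actions: both the executed and the proposed action
# should drive the forward model toward the true next state.
loss_fwd = ((f(torch.cat([s_t, a_taken], dim=-1)) - s_next) ** 2).mean()
loss_con = ((f(torch.cat([s_t, a_hat], dim=-1)) - s_next) ** 2).mean()
loss = loss_fwd + loss_con   # optimized jointly over θ_f and θ_h
loss.backward()
```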
17 3.3 GOAL SATISFACTION
• A goal recognizer decides whether the current observation already satisfies the current goal
• The demonstration is given as a sequence of checkpoint images; once a checkpoint is judged reached, the policy is conditioned on the next one (see the sketch below)
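Putting the pieces together, a minimal sketch of the zero-shot imitation loop; env, policy, and goal_reached are assumed interfaces, not the paper's code:

```python
def imitate(env, policy, checkpoints, goal_reached, max_steps=100):
    """Step through the demonstration's checkpoint images one by one,
    querying the goal-conditioned policy until the recognizer fires."""
    obs = env.reset()                    # assumed: returns an observation
    for goal_img in checkpoints:         # demonstration = image sequence
        for _ in range(max_steps):
            if goal_reached(obs, goal_img):
                break                    # advance to the next checkpoint
            action = policy(obs, goal_img)
            obs = env.step(action)       # assumed: returns next observation
    return obs
```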
18 4 Results
• Navigation in an indoor office environment
• Vision-based rope manipulation
• Comparison of PSF variants
19 4.1 NAVIGATION IN INDOOR OFFICE ENVIRONMENT
20 4.1 NAVIGATION IN INDOOR OFFICE ENVIRONMENT (Result) https://www.youtube.com/watch?v=ynfVRM27YFU https://www.youtube.com/watch?time_continue=3&v=OwvnqjgUqc8
21 4.2 VISION-BASED ROPE MANIPULATION
• Task: arrange a rope to match a visual demonstration
• State: RGB image
• Action: pick-and-place primitive (where to grasp and where to release), following the setup of Nair et al. (2017)
22 4.2 VISION-BASED ROPE MANIPULATION (Result) https://www.youtube.com/watch?v=YlaojVXHagM
23 5 CONCLUSION
• A goal-conditioned skill policy can be learned from the agent's own exploration, with no rewards and no expert actions
• At test time the agent imitates a visual demonstration zero-shot, given only a sequence of images
• The goal-relabeling perspective is closely related to hindsight experience replay (Andrychowicz et al., 2017)
25 References
Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In ICRA, 2017.
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.
Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In ICML, 2015.
Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.
Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.
Bradly C. Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv preprint arXiv:1703.01703, 2017.
YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.
Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In NIPS, 1989.
Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.