Slide Summary
DL paper reading group (DL輪読会) material
DEEP LEARNING JP [DL Papers]
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
Jeremy Siburian, Matsuo-Iwasawa Lab, M1
http://deeplearning.jp/
Paper Overview
• Paper Title: Subtask-Aware Visual Reward Learning from Segmented Demonstrations
• Authors: Changyeon Kim¹, Minho Heo¹, Doohyun Lee¹, Honglak Lee²,³, Jinwoo Shin¹, Joseph J. Lim¹, Kimin Lee¹ (¹KAIST, ²University of Michigan, ³LG AI Research)
• International Conference on Learning Representations (ICLR) 2025
• Links:
  – ArXiv: https://arxiv.org/abs/2502.20630
  – Project Page: https://changyeon.site/reds/
Disclaimer: All credits for images, figures, tables, and other contents belong to the original authors.
Introduction
• Reinforcement Learning (RL) methods have shown strong potential for robotic control. However, RL still depends heavily on handcrafted reward functions, which require substantial trial-and-error, extensive task knowledge, and instrumentation.
• Learning reward functions from action-free videos has emerged as a promising alternative.
• Existing methods often struggle with long-horizon robotic tasks that involve multiple subtasks.
  – They focus on short temporal windows or final goal frames.
  – They ignore contextual subtask progression (e.g., pick → insert → tighten).
• Designing an effective visual reward function for real-world, long-horizon tasks remains an open problem.
Example: In a "One Leg" furniture assembly task, prior methods reward picking up the leg but fail to guide later steps (inserting and tightening).
Related Work

Reward Learning from Videos
• Prior works have focused on temporal alignment methods, progress-based rewards, or the likelihood from a pre-trained video prediction model as a reward signal.
• Limitations: These focus mainly on short-term temporal structure and often fail to capture multi-stage subtasks.

Inverse Reinforcement Learning
• Inverse Reinforcement Learning (IRL) aims to estimate the underlying reward function from expert demonstrations.
• Adversarial Imitation Learning (AIL) reformulates IRL using a discriminator that distinguishes expert from agent transitions; the discriminator's output is used as a learned reward.
• Prior work (Mu et al., 2024) also utilizes subtask information for multi-stage tasks, but assumes that information about the ongoing subtask can be obtained from the environment during online interaction.

Summary: Prior works in video-based reward learning and IRL/AIL either lack subtask awareness or require explicit supervision.
Preliminaries (1/2)

(1) Problem Formulation

Markov Decision Process (MDP)
An MDP is defined as a tuple M = (S, A, p, R, ρ₀, γ):
• S = state space consisting of K consecutive images
• A = action space
• p(s′ | s, a) = transition function
• R = sparse reward function
• ρ₀ = initial state distribution
• γ = discount factor

The policy π : S → Δ(A) is trained to maximize the expected sum of discounted rewards.

Our Goal: Find a dense reward function R(s′) conditioned only on visual observations, such that training with it yields an optimal policy π* for M.
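For concreteness, the objective implied by this setup can be written as follows; this is the standard discounted-return formulation restated in the slide's notation, not an equation copied from the paper:

```latex
\pi^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{\,s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)}
\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_{t+1})\right]
```

where the learned dense reward R(s′), computed from visual observations only, stands in for the sparse environment reward during policy optimization.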
Preliminaries (2/2)

(2) Equivalent-Policy Invariant Comparison (EPIC)
A pseudometric that measures the similarity between two reward functions, ensuring they induce the same optimal behaviors (policies) even if the raw reward values differ.

In REDS, EPIC is used to train the reward model so that its predicted rewards are policy-equivalent to ground-truth subtask rewards.

For more details: https://arxiv.org/abs/2006.13900 (Gleave et al., 2021)
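As a reference, paraphrasing the definition in Gleave et al. (2021) rather than quoting the slide: EPIC first canonicalizes each reward to remove potential-shaping differences, then compares the canonicalized rewards with the Pearson distance over samples S, S′ ~ D_S and A ~ D_A:

```latex
C_{D_S, D_A}(R)(s, a, s') \;=\; R(s, a, s')
  \;+\; \mathbb{E}\!\big[\gamma R(s', A, S') - R(s, A, S') - \gamma R(S, A, S')\big]
```
```latex
D_{\mathrm{EPIC}}(R_1, R_2) \;=\; D_{\rho}\!\big(C(R_1)(S, A, S'),\, C(R_2)(S, A, S')\big),
\qquad D_{\rho}(X, Y) \;=\; \sqrt{\tfrac{1 - \rho(X, Y)}{2}}
```

where ρ denotes the Pearson correlation. A small EPIC distance between a learned reward and the subtask-derived reward implies they induce (near-)equivalent optimal policies.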
Method
REDS: REward learning from Demonstration with Segmentations

This work proposes REDS, a visual reward learning framework that leverages action-free videos with minimal supervision by treating segmented video demonstrations as ground-truth reward signals.
Method (1) Subtask Segmentation

The main point is to leverage expert demonstrations annotated with the ongoing subtask as the source of implicit reward signals.
• A task is decomposed into m object-centric subtasks, denoted as U = {U₁, …, Uₘ}.
• Each subtask represents a distinct step in the task sequence and is based on the coordinate frame of a single target object.
• Additionally, text instructions that describe how to solve each subtask are provided.
• In this work, experiments use the predefined codes in Meta-World and human annotators in FurnitureBench to collect subtask segmentations.
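To make the data format concrete, here is a minimal Python sketch of what a segmented demonstration could look like. All class, field, and function names are hypothetical illustrations, not the authors' actual data schema:

```python
from dataclasses import dataclass

@dataclass
class SubtaskSegment:
    name: str          # e.g. "insert the leg", hypothetical label
    instruction: str   # natural-language description fed to the text encoder
    start_frame: int   # inclusive frame index into the video
    end_frame: int     # exclusive frame index

@dataclass
class SegmentedDemo:
    frames: list       # list of image frames (e.g. HxWx3 arrays)
    segments: list     # ordered list of SubtaskSegment, one per subtask

def subtask_of_frame(demo: SegmentedDemo, t: int) -> SubtaskSegment:
    """Return the annotated subtask that frame t belongs to."""
    for seg in demo.segments:
        if seg.start_frame <= t < seg.end_frame:
            return seg
    raise ValueError(f"frame {t} is not covered by any subtask segment")
```

The key property is simply that every demonstration frame is associated with exactly one ongoing subtask and its text instruction.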
Method (2) Architecture & Reward Learning

We train a reward model conditioned on video segments and their corresponding subtasks, with 1) a contrastive loss that pulls together video segments and their corresponding subtask embeddings, and 2) an EPIC loss so that the generated rewards are equivalent to the subtask segmentations.
• REDS predicts rewards from videos and subtask text.
• Images are turned into visual features using a pre-trained vision model.
• A causal transformer processes frame sequences to capture temporal dependencies.
• Subtask instructions are converted into text embeddings using a language encoder.
• The model combines both to predict a reward for each step.

REDS uses three key objectives: EPIC loss + regularization loss + contrastive loss
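Below is a minimal PyTorch sketch of this kind of architecture (pre-trained visual features → causal transformer → fusion with a subtask text embedding → per-step reward). The dimensions, layer counts, and module names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SubtaskRewardModel(nn.Module):
    """Illustrative video+subtask reward model (not the official REDS code)."""
    def __init__(self, vis_dim=512, txt_dim=512, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.causal_tf = nn.TransformerEncoder(layer, n_layers)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.reward_head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, vis_feats, txt_emb):
        # vis_feats: (B, T, vis_dim) features from a frozen pre-trained vision model
        # txt_emb:   (B, txt_dim)    subtask instruction embedding from a language encoder
        B, T, _ = vis_feats.shape
        x = self.vis_proj(vis_feats)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.causal_tf(x, mask=causal_mask)            # (B, T, d_model) temporal features
        z = self.txt_proj(txt_emb).unsqueeze(1).expand(-1, T, -1)
        r = self.reward_head(torch.cat([h, z], dim=-1))    # (B, T, 1) per-step reward
        return r.squeeze(-1), h, z
```

In a setup like this, the per-frame features h and projected subtask embeddings z could feed the contrastive alignment term, while the per-step reward r is shaped by the EPIC and regularization losses.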
Method (3) Training & Inference

During online RL, REDS infers the ongoing subtask from video segments alone at each timestep and computes the reward accordingly.
• Initial Training: Collect expert demonstrations with subtask segmentations and train the initial reward model on this data.
• Problem: Rewards learned only from expert data may suffer from reward misspecification (Pan et al., 2022).

Fine-Tuning Steps
1. Collect Suboptimal Data: Train an RL agent with the current reward model and gather new trajectories.
2. Infer Subtasks: Estimate which subtask each frame belongs to using similarity scores; detect failed subtasks when the similarity is too low.
3. Fine-Tune: Combine expert and suboptimal data and retrain the model.
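A rough sketch of the subtask-inference step described above, using cosine similarity between a video-segment embedding and each subtask embedding; the threshold value and function names are made up for illustration:

```python
import torch
import torch.nn.functional as F

def infer_subtask(video_emb, subtask_embs, fail_threshold=0.5):
    """Pick the ongoing subtask by embedding similarity (illustrative sketch).

    video_emb:    (d,) embedding of the current video segment
    subtask_embs: (m, d) embeddings of the m subtask instructions
    Returns (best_index, is_failure); fail_threshold is a hypothetical value.
    """
    sims = F.cosine_similarity(video_emb.unsqueeze(0), subtask_embs, dim=-1)  # (m,)
    best = int(torch.argmax(sims))
    # Low similarity to even the best-matching subtask suggests a failed attempt.
    is_failure = sims[best].item() < fail_threshold
    return best, is_failure
```

The inferred subtask index selects which instruction embedding conditions the reward model, and frames flagged as failures can be treated accordingly when fine-tuning on suboptimal data.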
Experiments
Tasks

8 different visual robotic manipulation tasks from Meta-World in simulation, and the One Leg assembly task from FurnitureBench in the real world.

Benchmarks: Meta-World (Yu et al., 2020), FurnitureBench (Heo et al., 2023)
Experiments
Baselines
1. Human-engineered reward functions
2. ORIL: AIL method trained only with offline demonstrations
3. Rank2Reward (R2R): AIL method that trains a discriminator weighted by the temporal ranking of video frames to reflect task progress
4. VIPER: reward model that uses the likelihood from a pre-trained video prediction model as a reward signal
5. DrS: AIL method that assumes subtask information from the environment and trains a separate discriminator for each subtask

(AIL = Adversarial Imitation Learning)
Results: Meta-World
• REDS consistently outperformed all baselines.
• In several tasks (Drawer Open, Push, Coffee Pull), REDS even surpassed the human-engineered dense rewards.
Results: FurnitureBench
• REDS nearly doubled performance over VIPER and DrS after online fine-tuning.
• REDS outperforms IQL trained with 500 expert demos, despite using fewer (300) demos.
Results: Generalization Capabilities (1/3)

(1) Transfer to Unseen Tasks
• Training on 3 tasks (Drawer Open/Close, Door Open)
• Evaluation on two unseen tasks (Door Close, Window Close)
  – Door Close: Can REDS provide informative signals for a new task involving a previously seen object and behaviors?
  – Window Close: Can REDS provide suitable reward signals for familiar behaviors with an unseen object?

REDS provides effective reward signals on unseen tasks and achieves comparable or even better performance than REDS trained on the target task.
Results: Generalization Capabilities (2/3)

(2) Robustness to Visual Distractions
• Visual distractions → varying lighting and table positions, following Xie et al. (2024)
• REDS can generate robust reward signals despite visual distractions and train RL agents to solve the task effectively.
Results: Generalization Capabilities (3/3)

(3) Transfer to Unseen Embodiments
• Hypothesis: Since REDS uses action-free video data, it should generalize across robots with similar degrees of freedom (DoFs).
• Train REDS using demos from a Franka Panda arm, evaluate on unseen demos from a Sawyer arm.
• Target task → Take Umbrella Out of Stand from RLBench (James et al., 2020)
• REDS produced meaningful reward signals even for the new robot.
Ablation Studies (1/4)
Effect of Training Objectives

Compared variants:
1. Without EPIC loss (replaced with simple regression to subtask segmentations)
2. Without subtask embeddings (uses only video representations)
3. Without regularization loss

Findings: RL performance significantly degrades without each component, implying the losses synergistically improve reward quality.
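Schematically, the objective being ablated here can be thought of as a weighted sum of the three terms introduced in the method section; the weights λ are hypothetical placeholders, not values given on the slides:

```latex
\mathcal{L}_{\mathrm{total}} \;=\; \mathcal{L}_{\mathrm{EPIC}}
\;+\; \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{reg}}
\;+\; \lambda_{\mathrm{con}}\,\mathcal{L}_{\mathrm{con}}
```

Each ablation removes (or replaces) one of these terms and measures the resulting drop in downstream RL performance.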
Ablation Studies (2/4)
Effect of Architecture

Compared variants:
1. CNN-based image encoder
2. No causal transformer

Findings: Both variants show worse performance compared to REDS. Performance is worst without the causal transformer, highlighting the importance of temporal information for providing suitable reward signals.
Ablation Studies (3/4)
Effect of Fine-tuning

Compared variants:
1. REDS without fine-tuning (trained only with expert demos)
2. REDS fine-tuned (with additional suboptimal demos)

Findings: REDS shows improved RL performance when trained with additional suboptimal demonstrations, indicating that the coverage of the state distribution impacts reward quality.
Ablation Studies (4/4)
Effect of Expert Demonstrations

Compared variants: REDS trained with different numbers of expert demos (10, 20, and 50 demos)

Findings: The agents' RL performance positively correlates with the number of expert demonstrations used for reward learning.
Limitations and Future Directions

Limitation #1: Assumption of Known Subtasks
• Assumes knowledge of the object-centric subtasks in a task.
• Future Improvement: Automate subtask definition and segmentation using MLLMs.

Limitation #2: Dependence on Pre-trained Representations
• Relies on pre-trained visual and textual encoders. While effective, these struggle with subtle motion distinctions.
• Future Improvement: Use large-scale robotic pretraining and incorporate affordance-aware representations.

Limitation #3: Generalization & Robustness
• May fail in out-of-distribution (OOD) cases (e.g., drastic background or camera changes).
• Future Improvement: Apply data augmentation or domain adaptation techniques. Improve contrastive learning to reduce confusion between visually similar subtasks.

Limitation #4: Data Efficiency & Fine-tuning
• Depends on empirically chosen demo counts and fine-tuning steps.
• Future Improvement: Investigate how to collect failure demonstrations to mitigate reward misspecification.
Summary of Contributions
REDS: REward learning from Demonstration with Segmentations
• This paper presents REDS, a novel visual reward learning framework that produces suitable, subtask-aware reward signals for long-horizon, complex robotic manipulation tasks.
• REDS significantly outperforms baselines in training RL agents for robotic manipulation tasks in Meta-World, and even surpasses dense reward functions in some tasks.
• REDS can train real-world RL agents to perform long-horizon, complex furniture assembly tasks from FurnitureBench.
• REDS shows strong generalization across various unseen tasks, embodiments, and visual variations.