【DL輪読会】PlayWorld: Learning Robot World Models from Autonomous Play

April 23, 2026

1.

DEEP LEARNING JP [DL Papers]
PlayWorld: Learning Robot World Models from Autonomous Play
Jeremy Siburian, Matsuo-Iwasawa Lab, M2
http://deeplearning.jp/

2.

Paper Overview
PlayWorld: Learning Robot World Models from Autonomous Play

Paper Details
• Authors: Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, Anirudha Majumdar (Princeton University)
• arXiv preprint, 2026
• Links:
  – arXiv: https://arxiv.org/abs/2603.09030
  – Project Page: https://robot-playworld.github.io/
  – Code: https://github.com/irom-princeton/open-world

Disclaimer: All credits for images, figures, tables, and other contents belong to the original authors.

3.

Introduction
• Action-conditioned video models fine-tune video generation backbones with robotics datasets, but they remain vulnerable to hallucinations when simulating contact-rich interactions.
• Existing models are typically trained on human-collected demonstrations, leading to systematic biases and poor coverage of interaction dynamics.
Cosmos Policy [Kim et al. 2026] / DreamGen [Jang et al. 2025]

4.

Introduction
Motivation: How can we obtain the most useful data for learning high-quality world models?
→ Training robot world models based on autonomous play
• Broad coverage of diverse contact events
• Efficient & diverse scaling with minimal supervision

5.

Related Work
What Is Autonomous Play Data in Robotics?
• Play data consists of unstructured, task-agnostic robot interactions that cover a broader range of behaviors than narrow expert demonstrations.
• Recent work also uses play data to train world models for policy improvement, but still relies on human-collected play data or supervised policy rollouts, limiting both scalability and interaction diversity.
From Play to Policy [Cui et al. 2023] / SOAR [Zhou et al. 2024]

6.

Related Work
World Models and Generalizable Dynamics
• World modeling can be viewed as learning generalizable dynamics under distributional mismatch: the goal is to train an action-conditioned model that remains accurate beyond the training distribution.
• In robotics, video world models enable synthetic trajectory generation, policy evaluation, and policy improvement, but prior work often trains them on expert demonstrations, manual rollouts, or offline datasets, making it hard to achieve the broad state-action coverage needed for generalizable dynamics.
Ctrl-World [Guo et al. 2026] / WorldGym [Quevedo et al. 2026]
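As a toy illustration of this "generalizable dynamics" framing, the sketch below fits a linear action-conditioned model s' ≈ [s, a]·W by least squares and evaluates it on a state outside the training range. This is purely an illustrative stand-in for the idea, not the paper's video world model; all function names here are made up for this example.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Fit an action-conditioned model s' = [s, a] @ W by least squares.

    A toy analogue of a world model: with the right structure, the fitted
    dynamics stay accurate even on states beyond the training distribution.
    """
    X = np.hstack([states, actions])           # (N, state_dim + action_dim)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def predict_next_state(W, state, action):
    """Roll the fitted dynamics forward one step."""
    return np.hstack([state, action]) @ W

# Train on 1-D states in [0, 1] generated by the true rule s' = 0.9 s + 0.1 a.
states = np.array([[0.0], [0.5], [1.0], [0.2]])
actions = np.array([[1.0], [0.0], [-1.0], [0.5]])
next_states = 0.9 * states + 0.1 * actions

W = fit_linear_dynamics(states, actions, next_states)
# Evaluate on an out-of-distribution state (s = 5 was never seen in training).
ood_prediction = predict_next_state(W, np.array([5.0]), np.array([2.0]))
```

Because the model class matches the true dynamics, the OOD prediction recovers 0.9·5 + 0.1·2 = 4.7; a video world model faces the same question at vastly larger scale, which is why state-action coverage matters.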

7.

PlayWorld Framework
PlayWorld is a framework for training action-conditioned video models that can predict diverse contact dynamics with fine-grained precision.
Learning accurate world models requires broad, unstructured interaction data from autonomous play.

8.

PlayWorld Framework
Autonomous Robot Play Data Collection
How to design πplay for reliable, autonomous play data collection?
Key Design Requirements
1. The robot needs to engage in diverse interactions with objects in the scene (to obtain meaningful coverage as data collection expands).
2. The system must reliably prevent and recover from potential failures when human supervision is unavailable.
3. The system should accommodate diverse objects and generalize to diverse language instructions ℓ without requiring manual engineering.
→ VLM Task Proposer + VLA Task Executor

9.

PlayWorld Framework
Play Data Collection System Design
(1) Task Proposer
• A VLM observes the current scene and generates a natural-language task for the robot.
• It introduces small variations in the instruction to increase diversity while keeping tasks executable. This creates task-level exploration without needing hand-designed rewards or extra exploration training.
(2) Task Executor
• A VLA policy executes the language instruction produced by the VLM.
• Small changes in wording can lead to noticeably different robot behaviors, leading to more diverse contact dynamics under different object configurations.
(3) Safety Filter & Resets
• A lightweight safety filter enforces conservative workspace and joint limits for reliable unsupervised execution.
• The VLM detects when objects drift out of reach, and the VLA performs a scene reset by bringing them back. This allows long-duration autonomous collection, including overnight runs.
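The proposer/executor/safety-filter loop described above can be sketched as follows. This is a minimal mock, not the authors' implementation: `propose_task`, `execute`, and the clamp-style safety filter are illustrative stand-ins for the VLM proposer, VLA executor, and workspace-limit filter named on the slide.

```python
import random

class PlayDataCollector:
    """Minimal sketch of a PlayWorld-style autonomous play collection loop.

    The VLM task proposer, VLA task executor, and reset logic are replaced
    by simple stubs; only the control flow mirrors the slide's description.
    """

    def __init__(self, workspace_limits=(-0.5, 0.5)):
        self.workspace_limits = workspace_limits
        self.dataset = []

    def propose_task(self, scene):
        # VLM proposer stub: pick a feasible task and vary its wording.
        base = random.choice(scene["feasible_tasks"])
        return f"{base} (variant {random.randint(0, 9)})"

    def safety_filter(self, action):
        # Clamp actions to conservative workspace limits.
        lo, hi = self.workspace_limits
        return [min(max(a, lo), hi) for a in action]

    def execute(self, instruction, scene):
        # VLA executor stub: return a short trajectory of (obs, action) pairs.
        trajectory = []
        for _ in range(5):
            action = self.safety_filter(
                [random.uniform(-1, 1) for _ in range(3)])
            trajectory.append((dict(scene), action))
        return trajectory

    def collect(self, scene, num_episodes=10):
        for _ in range(num_episodes):
            instruction = self.propose_task(scene)
            self.dataset.append({
                "instruction": instruction,
                "trajectory": self.execute(instruction, scene),
            })
            if scene.get("object_out_of_reach", False):
                # In the real system the VLA brings the object back.
                scene["object_out_of_reach"] = False
        return self.dataset

collector = PlayDataCollector()
data = collector.collect(
    {"feasible_tasks": ["put the carrot in the bowl"]}, num_episodes=3)
```

The key property this loop preserves is that every executed action passes through the safety filter, which is what makes unattended (overnight) collection plausible.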

10.

PlayWorld Framework
Video Model Architecture / Training
• Uses a Stable Video Diffusion (SVD) backbone as the world model, initialized from large-scale robot data and fine-tuned on PlayWorld data.
• The model is action-conditioned and jointly predicts multi-view observations, allowing it to model scene evolution under manipulation.
• Training targets both common motions and contact-rich interaction dynamics.
Curriculum Learning
• Play data is long-tailed: it contains many easy, repetitive transitions and fewer rare, diverse interactions.
• PlayWorld builds a curriculum by ranking play samples by their distance to successful demonstrations in feature space.
• Training starts with more success-like samples and gradually shifts toward harder, more exploratory ones, improving learning of rare contact dynamics.
→ Curriculum learning helps the model focus on rarer, harder interactions instead of overfitting to easy ones.
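The curriculum ranking step can be sketched as below: each play sample is scored by its distance to the nearest successful demonstration in feature space, and training order follows that score (demo-like first, exploratory last). The feature extractor and the Euclidean-distance choice are assumptions for illustration; the paper may use a different recipe.

```python
import numpy as np

def curriculum_order(play_feats, demo_feats):
    """Order play samples from most success-like to most exploratory.

    play_feats: (num_play, d) feature vectors of play samples.
    demo_feats: (num_demo, d) feature vectors of successful demonstrations.
    Returns indices sorted by distance to the nearest demonstration
    (ascending), so training can start easy and shift toward the tail.
    """
    # Pairwise Euclidean distances, shape (num_play, num_demo).
    diffs = play_feats[:, None, :] - demo_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    nearest = dists.min(axis=1)    # distance to the closest demo
    return np.argsort(nearest)     # success-like first, exploratory last

# Toy check: one demo at the origin, three play samples at known distances.
demo = np.array([[0.0, 0.0]])
play = np.array([[3.0, 0.0], [0.1, 0.0], [1.0, 0.0]])
order = curriculum_order(play, demo)
```

With this toy data, the nearest sample (index 1) comes first and the farthest (index 0) last, matching the "success-like → exploratory" schedule.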

11.

Experiment Setup
Research Questions
1. Does PlayWorld induce more diverse object interactions compared to human-collected data?
2. Can PlayWorld improve video prediction accuracy for diverse object interactions compared to models trained on human demonstration data?
3. Can PlayWorld enable fine-grained policy evaluation by reliably predicting outcomes across a broad range of policies and tasks?
4. Can PlayWorld enable policy fine-tuning through interactive roll-outs in the video model?
5. Does PlayWorld result in improved accuracy and generalization with data scale compared to human-collected data?
Baselines
Two types of teleoperation data as baselines:
1. Human Demo: Task-specific expert demonstrations, 6 hours of human demo data in total (demonstration-only fine-tuning paradigm).
2. Human Play: Play data generated by a human operator instructed to freely interact with given objects in a task-agnostic manner.

12.

Experiment Setup
Collecting Play Data
A total of 30 hours of task-agnostic autonomous play data containing diverse robot–object interactions across 3 distinct object sets:
• Set 1: Bowl, carrot, polar bear (put the carrot/polar bear into/out of the bowl)
• Set 2: Rectangular block and cube (stack/unstack the block on top of the cube)
• Set 3: Towel (fold/unfold the towel)

13.

Results (Q1)
Q1: Does PlayWorld Induce Diverse Interactions?
t-SNE Visualization of Training Samples
Robot play data exhibits markedly broader behavioral coverage than human-collected trajectories.
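The slide's t-SNE plot makes the coverage claim qualitatively. One simple way to make "broader behavioral coverage" quantitative is to count how much of a discretized state space each dataset visits; the sketch below is an illustrative toy metric, not anything the paper computes.

```python
import numpy as np

def coverage_fraction(states, bins=10, lo=-1.0, hi=1.0):
    """Fraction of cells in a bounded 2-D state grid visited at least once.

    states: (N, 2) array of 2-D state samples in [lo, hi].
    A crude stand-in for the coverage that the t-SNE figure shows visually.
    """
    grid = np.zeros((bins, bins), dtype=bool)
    idx = np.clip(((states - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)
    grid[idx[:, 0], idx[:, 1]] = True
    return float(grid.mean())

# Three samples landing in three distinct cells of a 10x10 grid.
samples = np.array([[-1.0, -1.0], [0.0, 0.0], [0.99, 0.99]])
frac = coverage_fraction(samples)
```

A "diverse" play dataset would drive this fraction up much faster per hour of data than repetitive demonstrations, which is the quantitative flavor of the Q1 result.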

14.

Results (Q2)
Q2: Can PlayWorld Accurately Predict Contact Dynamics?
• Interaction-centric benchmark: 500+ clips sampled from roll-outs generated by 20+ robot policies, categorized into 6 behavior modes.
• Prediction quality for successful interactions is similar across models, but PlayWorld provides consistent improvements for other dynamic interactions.
• Curriculum learning offered a substantial advantage in improving prediction quality for more dynamic interactions.

15.

Results (Q2)
Q2: Can PlayWorld Accurately Predict Contact Dynamics?
• PlayWorld closely tracks ground-truth outcomes across interaction types.
• Baseline models exhibit degraded object fidelity and unrealistic physics.
• "Hallucinated success" is the most common failure mode.

16.

Results (Q3)
Q3: Can PlayWorld Predict the Performance of Different Policies?
Success Rate Correlation between Real and World Models / Real and Predicted Failure-Mode Distributions
• A suite of 18 diverse policies (Diffusion Policy, fine-tuned π0); 50 simulation + 20 real-world experiments.
• Across diverse policies (architectures, training mixtures, and tasks), PlayWorld's predictions closely match real-world success rates and outcome distributions.
• Baseline models capture only a narrow set of failure modes, with human play data yielding the worst performance and often producing blurry predictions (OOD diversity hurts performance!).
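The agreement behind a "success rate correlation" plot boils down to correlating, across the policy suite, each policy's real-robot success rate with its success rate predicted in the world model. A minimal sketch (the paper's exact statistic may differ):

```python
import numpy as np

def success_rate_correlation(real_rates, predicted_rates):
    """Pearson correlation between real and world-model success rates.

    real_rates, predicted_rates: one success rate per evaluated policy.
    A value near 1.0 means the world model ranks and scores policies the
    way the real robot does.
    """
    real = np.asarray(real_rates, dtype=float)
    pred = np.asarray(predicted_rates, dtype=float)
    return float(np.corrcoef(real, pred)[0, 1])

# Toy example: predictions are a consistent linear shift of reality,
# so the correlation is perfect even though the rates are not identical.
r = success_rate_correlation([0.1, 0.5, 0.9], [0.2, 0.6, 1.0])
```

A high correlation is what licenses using the world model for policy selection without running every candidate on hardware.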

17.

Results (Q3)
Q3: Can PlayWorld Predict the Performance of Different Policies?
When policies induce underrepresented interactions, baseline models often regress to familiar outcomes or unrealistic dynamics, causing large gaps between predicted and real success rates.

18.

Results (Q4)
Q4: Policy Fine-Tuning in the World Model
Implementation Details
• Latent-steered diffusion policy (DSRL)
• Simple progress-based dense reward
Key Results
• Starting from <10 demonstrations, fine-tuning inside PlayWorld improves real-world success by up to 65%.
• The fine-tuned policy is more robust to OOD initializations and learns subtle recovery behaviors.
• In contrast, fine-tuning in a weaker baseline world model is less stable and can even hurt real-world performance by exploiting model errors.
→ If the world model is accurate enough, it can support in-model RL fine-tuning, making policy improvement more practical than running RL directly on real robots.
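The slide only says "simple progress-based dense reward"; one common form of such a reward is the per-step reduction in distance to the goal, sketched below. The distance-delta formulation is an assumption for illustration, not necessarily the paper's exact reward.

```python
import numpy as np

def progress_reward(state, goal, prev_state):
    """Dense reward equal to the progress made toward the goal this step.

    Positive when the tracked point (e.g. object or end-effector position)
    moved closer to the goal, negative when it moved away. Assumed form;
    the slides do not specify the exact reward.
    """
    prev_dist = np.linalg.norm(np.asarray(prev_state) - np.asarray(goal))
    cur_dist = np.linalg.norm(np.asarray(state) - np.asarray(goal))
    return float(prev_dist - cur_dist)

# Moving from distance 1.0 to distance 0.5 yields reward 0.5.
r = progress_reward(state=[0.5, 0.0], goal=[0.0, 0.0], prev_state=[1.0, 0.0])
```

Inside an accurate world model, such a dense signal lets RL fine-tuning improve a policy from very few demonstrations, since every imagined step carries gradient-useful feedback.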

19.

Results (Q5)
Q5: Scaling & Generalization of PlayWorld
• Data Scaling: Models trained on larger play datasets achieve steadily better prediction accuracy.
• Object Generalization: As object diversity increases, PlayWorld learns shared interaction patterns and predicts unseen objects more accurately.

20.

Summary & Takeaways
Key Takeaways
1. PlayWorld shows that large-scale, interaction-rich play data is a promising and scalable supervision source for learning video world models.
2. While scaling data improves performance, data diversity/coverage matters just as much, especially for contact-rich interactions and failure modes.
3. Real-world data is still more effective than simulation or human videos, but teleoperation is costly and often lacks diversity. This makes scalable, autonomous real-world data generation an important direction. It is complementary to teleop scaling, not a replacement; as base models and VLAs improve, autonomous play data should improve as well.
Concurrent / Future Work
• Concurrent work such as Tether [Liang et al. 2026] also explores autonomous play for policy learning, reinforcing PlayWorld's results.
• A key next step is to unify play and policy learning: combine robust exploration in the low-data regime with consistent policy improvement over time. How can we continually collect data and improve the policy in a closed-loop manner? (continual / test-time robot learning)