---
title: "[DL Paper Reading] PlayWorld: Learning Robot World Models from Autonomous Play"
tags: 
author: [Deep Learning JP](https://image.docswell.com/user/DeepLearning2023)
site: [Docswell](https://www.docswell.com/)
thumbnail: https://bcdn.docswell.com/page/VJPKPLWVE8.jpg?width=480
description: "[DL Paper Reading] PlayWorld: Learning Robot World Models from Autonomous Play by Deep Learning JP"
published: April 23, 2026
canonical: https://image.docswell.com/s/DeepLearning2023/K8NV6P-2026-04-24-140644
---
# Page. 1

![Page Image](https://bcdn.docswell.com/page/VJPKPLWVE8.jpg)

DEEP LEARNING JP
[DL Papers]
PlayWorld: Learning Robot World Models from Autonomous Play
Jeremy Siburian, Matsuo-Iwasawa Lab, M2
http://deeplearning.jp/


# Page. 2

![Page Image](https://bcdn.docswell.com/page/2EVV2Q8REQ.jpg)

Paper Overview
PlayWorld: Learning Robot World Models from Autonomous Play
Paper Details
• Authors: Tenny Yin¹, Zhiting Mei¹, Zhonghe Zheng¹, Miyu Yamane¹, David Wang¹, Jade Sceats¹, Samuel M. Bateman¹, Lihan Zha¹, Apurva Badithela¹, Ola Shorinwa¹, Anirudha Majumdar¹ (¹Princeton University)
• ArXiv Preprint, 2026
• Links:
  – ArXiv: https://arxiv.org/abs/2603.09030
  – Project Page: https://robot-playworld.github.io/
  – Code: https://github.com/irom-princeton/open-world

Disclaimer: All credits for images, figures, tables, and other contents belong to the original authors.


# Page. 3

![Page Image](https://bcdn.docswell.com/page/57GLRW56EL.jpg)

Introduction
• Action-conditioned video models fine-tune video generation backbones with robotics datasets, but they remain vulnerable to hallucinations when simulating contact-rich interactions.
• Existing models are typically trained on human-collected demonstrations, leading to systematic biases and poor coverage of interaction dynamics.

Cosmos Policy [Kim et al. 2026]
DreamGen [Jang et al. 2025]


# Page. 4

![Page Image](https://bcdn.docswell.com/page/4EQYV3Z2JP.jpg)

Introduction
Motivation:
How can we obtain the most useful data for learning high-quality world models?
Training robot world models based on autonomous play
Broad coverage of diverse contact events
Efficient & diverse scaling with minimal supervision


# Page. 5

![Page Image](https://bcdn.docswell.com/page/KJ4WM13P71.jpg)

Related Work
What Is Autonomous Play Data In Robotics?
• Play data consists of unstructured, task-agnostic robot interactions that cover a broader range of behaviors than narrow expert demonstrations.
• Recent work also uses play data to train world models for policy improvement, but still relies on human-collected play data or supervised policy rollouts, limiting both scalability and interaction diversity.

From Play to Policy [Cui et al. 2023]
SOAR [Zhou et al. 2024]


# Page. 6

![Page Image](https://bcdn.docswell.com/page/LE1Y8G1X7G.jpg)

Related Work
World Models and Generalizable Dynamics
• World modeling can be viewed as learning generalizable dynamics under distributional mismatch: the goal is to train an action-conditioned model that remains accurate beyond the training distribution.
• In robotics, video world models enable synthetic trajectory generation, policy evaluation, and policy improvement, but prior work often trains them on expert demonstrations, manual rollouts, or offline datasets, making it hard to achieve the broad state-action coverage needed for generalizable dynamics.

Ctrl-World [Guo et al. 2026]
WorldGym [Quevedo et al. 2026]


# Page. 7

![Page Image](https://bcdn.docswell.com/page/GEWGZK8KJ2.jpg)

PlayWorld Framework
PlayWorld is a framework for training action-conditioned video models that can
predict diverse contact dynamics with fine-grained precision.
Learning accurate world models requires
broad, unstructured interaction data from autonomous play


# Page. 8

![Page Image](https://bcdn.docswell.com/page/47ZL1Z8NJ3.jpg)

PlayWorld Framework
Autonomous Robot Play Data Collection
How to design π_play for reliable & autonomous play data collection?

Key Design Requirements
1. The robot needs to engage in diverse interactions with objects in the scene (in order to obtain meaningful coverage as data collection expands).
2. The system must reliably prevent and recover from potential failures when human supervision is unavailable.
3. The system should be amenable to diverse objects and generalize to diverse language instructions ℓ without requiring manual engineering.

VLM Task Proposer + VLA Task Executor


# Page. 9

![Page Image](https://bcdn.docswell.com/page/YJ6WLZP9JV.jpg)

PlayWorld Framework
Play Data Collection System Design
(1) Task Proposer
• A VLM observes the current scene and generates a natural-language task for the robot. It introduces small variations in the instruction to increase diversity while keeping tasks executable.
• This creates task-level exploration without needing hand-designed rewards or extra exploration training.

(2) Task Executor
• A VLA policy executes the language instruction produced by the VLM.
• Small changes in wording can lead to noticeably different robot behaviors, leading to more diverse contact dynamics under different object configurations.

(3) Safety Filter & Resets
• A lightweight safety filter enforces conservative workspace and joint limits for reliable unsupervised execution.
• The VLM detects when objects drift out of reach, and the VLA performs a scene reset by bringing them back. This allows long-duration autonomous collection, including overnight runs.
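Read together, the proposer, executor, and safety filter form one autonomous loop. The sketch below is a toy 1-D illustration, not the authors' implementation: `propose_task`, `execute_task`, and `safety_filter` are invented stand-ins for the VLM proposer, VLA executor, and workspace-limit filter.

```python
import random

# Toy sketch of the autonomous play loop (all names are hypothetical
# stand-ins, not the paper's API). Objects live on a 1-D line; the
# "VLM" proposes a target, the "VLA" moves toward it, and the safety
# filter clips every commanded position to conservative limits.

WORKSPACE = (-1.0, 1.0)  # conservative workspace limits

def propose_task(objects):
    """'VLM' proposer: pick an object and a reachable target position."""
    name = random.choice(list(objects))
    return name, random.uniform(*WORKSPACE)

def safety_filter(position):
    """Clip a commanded position to the workspace limits."""
    lo, hi = WORKSPACE
    return min(max(position, lo), hi)

def execute_task(objects, task, step=0.2, max_steps=50):
    """'VLA' executor: move the chosen object toward the target, logging
    (object, old position, new position) transitions as play data."""
    name, target = task
    episode = []
    for _ in range(max_steps):
        pos = objects[name]
        delta = max(-step, min(step, target - pos))  # bounded action
        new_pos = safety_filter(pos + delta)
        episode.append((name, pos, new_pos))
        objects[name] = new_pos
        if abs(new_pos - target) < 1e-6:             # task done, propose next
            break
    return episode

random.seed(0)
objects = {"carrot": 0.0, "bowl": 0.5}
task = propose_task(objects)
episode = execute_task(objects, task)  # logged transitions for training
```

A real system would run this loop continuously (including overnight), with the VLM additionally checking for drifted objects and triggering the reset behavior described above.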


# Page. 10

![Page Image](https://bcdn.docswell.com/page/GJ5M1WKDJ4.jpg)

PlayWorld Framework
Video Model Architecture / Training
• Uses a Stable Video Diffusion (SVD) backbone as the world model, initialized from large-scale robot data and fine-tuned on PlayWorld data.
• The model is action-conditioned and jointly predicts multi-view observations, allowing it to model scene evolution under manipulation.
• Training targets both common motions and contact-rich interaction dynamics.

Curriculum Learning
• Play data is long-tailed: it contains many easy, repetitive transitions and fewer rare, diverse interactions.
• PlayWorld builds a curriculum by ranking play samples by their distance to successful demonstrations in feature space.
• Training starts with more success-like samples and gradually shifts toward harder, more exploratory ones, improving learning of rare contact dynamics.

Curriculum learning helps the model focus on rarer, harder interactions instead of overfitting to easy ones.
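The ranking step of the curriculum can be sketched in a few lines. This assumes a feature embedding of clips already exists; the distance-and-sort logic is the point, not the paper's exact metric.

```python
import numpy as np

# Curriculum ordering sketch: rank play samples by distance to the
# nearest successful demonstration in feature space (feature extraction
# is abstracted away; the metric here is plain Euclidean distance).

def curriculum_order(play_feats, demo_feats):
    """Return play-sample indices sorted from most to least demo-like.

    play_feats: (N, d) array of play-sample features.
    demo_feats: (M, d) array of successful-demonstration features.
    """
    # (N, M) pairwise distances, then the nearest demo per play sample.
    dists = np.linalg.norm(
        play_feats[:, None, :] - demo_feats[None, :, :], axis=-1
    ).min(axis=1)
    # Ascending distance: success-like samples first, exploratory last.
    return np.argsort(dists)

# Toy example: demos cluster near the origin; sample 2 is far out.
play = np.array([[0.1, 0.0], [2.0, 2.0], [5.0, 5.0]])
demos = np.array([[0.0, 0.0], [0.2, 0.1]])
order = curriculum_order(play, demos)  # → [0, 1, 2]
```

Training would then draw batches from the front of `order` early on and shift toward the tail as training progresses.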


# Page. 11

![Page Image](https://bcdn.docswell.com/page/9E291Q6M7R.jpg)

Experiment Setup
Research Questions
1. Does PlayWorld induce more diverse object interactions compared to human-collected data?
2. Can PlayWorld improve video prediction accuracy for diverse object interactions compared to models trained on human demonstration data?
3. Can PlayWorld enable fine-grained policy evaluation by reliably predicting outcomes across a broad range of policies and tasks?
4. Can PlayWorld enable policy fine-tuning through interactive roll-outs in the video model?
5. Does PlayWorld result in improved accuracy and generalization with data scale compared to human-collected data?

Baselines
Two types of teleoperation data serve as baselines:
1. Human Demo: Task-specific expert demonstrations, 6 hours of human demo data in total (demonstration-only fine-tuning paradigm).
2. Human Play: Play data generated by a human operator instructed to freely interact with given objects in a task-agnostic manner.


# Page. 12

![Page Image](https://bcdn.docswell.com/page/D7Y4ZW9PEM.jpg)

Experiment Setup
Collecting Play Data
A total of 30 hours of task-agnostic autonomous play data containing diverse robot-object interactions across 3 distinct object sets:
• Set 1: Bowl, carrot, polar bear (put the carrot/polar bear into/out of the bowl)
• Set 2: Rectangular block and cube (stack/unstack the block on top of the cube)
• Set 3: Towel (fold/unfold towel)


# Page. 13

![Page Image](https://bcdn.docswell.com/page/VENY39LMJ8.jpg)

Results (Q1)
Q1: Does PlayWorld Induce Diverse Interactions?
t-SNE Visualization of Training Samples
Robot play data exhibits markedly broader behavioral coverage than human-collected trajectories.
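t-SNE is a visualization, not a metric; as a lightweight quantitative proxy (my own illustration, not the paper's analysis), one can compare the spread of trajectory features directly:

```python
import numpy as np

# Coverage proxy (an assumption for illustration, not the paper's
# method): mean pairwise distance between trajectory feature vectors.
# Broader behavioral coverage shows up as a larger spread.

def coverage_proxy(feats):
    """Mean pairwise Euclidean distance over an (N, d) feature array."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    n = len(feats)
    return d.sum() / (n * (n - 1))  # exclude the zero self-distances

rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 0.1, size=(100, 8))  # stand-in: narrow human demos
broad = rng.normal(0.0, 1.0, size=(100, 8))   # stand-in: diverse play data
assert coverage_proxy(broad) > coverage_proxy(narrow)
```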


# Page. 14

![Page Image](https://bcdn.docswell.com/page/Y79P924WE3.jpg)

Results (Q2)
Q2: Can PlayWorld Accurately Predict Contact Dynamics?
• Interaction-centric benchmark: 500+ clips sampled from roll-outs generated by 20+ robot policies, categorized into 6 behavior modes.
• Prediction quality for successful interactions is similar across models, but PlayWorld provides consistent improvements for other dynamic interactions.
• Curriculum learning offered a substantial advantage in improving prediction quality for more dynamic interactions.


# Page. 15

![Page Image](https://bcdn.docswell.com/page/G78D95QR7D.jpg)

Results (Q2)
Q2: Can PlayWorld Accurately Predict Contact Dynamics?
• PlayWorld closely tracks ground-truth outcomes across interaction types.
• Baseline models exhibit degraded object fidelity and unrealistic physics.
• "Hallucinated success" is the most common failure mode.


# Page. 16

![Page Image](https://bcdn.docswell.com/page/L7LMWYX2JR.jpg)

Results (Q3)
Q3: Can PlayWorld Predict the Performance of Different Policies?
Success Rate Correlation between Real and World Models
Real and Predicted Failure-Mode Distributions
• Evaluation suite: 18 diverse policies (Diffusion Policy, fine-tuned π0), 50 simulation + 20 real-world experiments.
• Across diverse policies (architectures, training mixtures, and tasks), PlayWorld's predictions closely match real-world success rates and outcome distributions.
• Baseline models capture only a narrow set of failure modes, with human play data yielding the worst performance and often producing blurry predictions (OOD diversity hurts performance!).
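The headline result here is how well world-model success rates track real-robot success rates across policies. A minimal sketch of that evaluation follows; the success-rate values are made up for illustration, not the paper's numbers.

```python
import numpy as np

# Pearson correlation between per-policy success rates predicted by
# roll-outs in the world model and success rates measured on the real
# robot. The rates below are illustrative placeholders.

def success_rate_correlation(predicted, real):
    """Pearson correlation between two equal-length rate vectors."""
    return np.corrcoef(np.asarray(predicted), np.asarray(real))[0, 1]

predicted = [0.9, 0.6, 0.3, 0.1]  # per-policy success in the world model
real = [0.85, 0.55, 0.35, 0.15]   # per-policy success on the real robot
corr = success_rate_correlation(predicted, real)  # close to 1.0
```

A high correlation over a diverse policy suite is what licenses using the world model for policy evaluation in place of real-robot trials.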


# Page. 17

![Page Image](https://bcdn.docswell.com/page/4EMY9NL9EW.jpg)

Results (Q3)
Q3: Can PlayWorld Predict the Performance of Different Policies?
When policies induce underrepresented interactions, baseline models often regress to familiar outcomes or unrealistic
dynamics, causing large gaps between predicted and real success rates.


# Page. 18

![Page Image](https://bcdn.docswell.com/page/PER9GDK9J9.jpg)

Results (Q4)
Q4: Policy Finetuning in the World Model
Implementation Details
• Latent-steered diffusion policy (DSRL)
• Simple progress-based dense reward

Key Results
• Starting from <10 demonstrations, fine-tuning inside PlayWorld improves real-world success by up to 65%.
• The fine-tuned policy is more robust to OOD initializations and learns subtle recovery behaviors.
• In contrast, fine-tuning in a weaker baseline world model is less stable and can even hurt real-world performance by exploiting model errors.

If the world model is accurate enough, it can support in-model RL fine-tuning, making policy improvement more practical than running RL directly on real robots.
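The "simple progress-based dense reward" is not spelled out on the slide; one common form (an assumption, not the paper's exact definition) rewards the per-step reduction in distance to the goal:

```python
# Progress-based dense reward sketch (assumed form, not the paper's
# exact definition): reward each step by how much closer to the goal
# the roll-out got, so the rewards telescope to total progress.

def progress_reward(prev_dist, curr_dist):
    """Positive when the policy moved closer to the goal this step."""
    return prev_dist - curr_dist

dists = [1.0, 0.8, 0.5, 0.5, 0.2]  # distance-to-goal along a roll-out
rewards = [progress_reward(a, b) for a, b in zip(dists, dists[1:])]
total = sum(rewards)  # equals dists[0] - dists[-1] (telescoping sum)
```

Because the reward telescopes, it cannot be farmed by oscillating back and forth, which makes it a reasonable dense signal for in-model RL.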


# Page. 19

![Page Image](https://bcdn.docswell.com/page/P7XQX1L3EX.jpg)

Results (Q5)
Q5: Scaling & Generalization of PlayWorld
Data Scaling
Models trained on larger play datasets achieve
steadily better prediction accuracy.
Object Generalization
As object diversity increases, PlayWorld learns shared interaction
patterns and predicts unseen objects more accurately.


# Page. 20

![Page Image](https://bcdn.docswell.com/page/37K9W2LN7D.jpg)

Summary & Takeaways
Key Takeaways
1. PlayWorld shows that large-scale, interaction-rich play data is a promising and scalable supervision source for learning video world models.
2. While scaling data improves performance, data diversity / coverage matters just as much, especially for contact-rich interactions and failure modes.
3. Real-world data is still more effective than simulation or human videos, but teleoperation is costly and often lacks diversity. This makes scalable, autonomous real-world data generation an important direction, complementary to teleop scaling rather than a replacement. As base models and VLAs improve, autonomous play data should improve as well.

Concurrent / Future Work
• Concurrent work such as Tether [Liang et al. 2026] also explores autonomous play for policy learning, reinforcing PlayWorld's results.
• A key next step is to unify play and policy learning: combine robust exploration in the low-data regime with consistent policy improvement over time.

How can we continually collect data and improve policy in a closed-loop manner? (continual / test-time robot learning)


