February 12, 2026
Slide Overview
DL Paper Reading Group (DL輪読会) Material
DEEP LEARNING JP [DL Papers]
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Shu MORIKUNI, Matsuo-Iwasawa Lab
http://deeplearning.jp/
Bibliography Information
• Title: Ctrl-World: A Controllable Generative World Model for Robot Manipulation
• Authors: Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn
• Affiliations: Stanford University, Tsinghua University
• Publication: arXiv preprint (2025/10/15)
• arXiv: https://arxiv.org/abs/2510.10125
• Project Page: https://ctrl-world.github.io/
• GitHub: https://github.com/Robert-gyj/Ctrl-World
• Summary: A controllable multi-view synthesis world model used for generalist robot policy evaluation and performance improvement.
When Video Generation Models Meet Robotic Models
[Timeline figure, 2012-2026: milestones in video generation models and robotic models, including AlexNet, GAN, AlphaGo, Transformer, GPT-1, StyleGAN, DALL-E, Runway, Stable Diffusion, ChatGPT, RT-1, RT-2, Sora, Gemini, Cosmos, OpenVLA, π0.5]
Related Works & Trend
Peer-reviewed Works
• [ICLR2025 Paper] https://leobarcellona.github.io/DreamToManipulate/
• [ICCV2025 Paper] https://gen-irasim.github.io/
• [CoRL2025 Workshop] https://research.nvidia.com/labs/gear/dreamgen/
• [CoRL2025 Workshop] https://research.nvidia.com/labs/gear/flare/
Preprint Works
• [arXiv Preprint 2025-12] https://dream2flow.github.io/
• [arXiv Preprint 2025-09] https://world-model-eval.github.io/abstract
• [arXiv Preprint 2026-01] https://point-world.github.io/
• [arXiv Preprint 2026-01] https://research.nvidia.com/labs/dir/plenopticdreamer/
• [arXiv Preprint 2026-02] https://dreamdojo-world.github.io/
New Topic Added to a Top-tier Robotics Conference This Year
"Video models and latent world models for robot learning"
https://www.corl.org/contributions/call-for-papers
Quick Demo
URL: https://ctrl-world.github.io/
Overview
A controllable multi-view synthesis world model that can be used for (1) robot policy evaluation and (2) downstream policy performance improvement.
Background
The Difficulty of Effectively Scaling Real-world Evaluation and Improvement of Generalist Robot Policies
Generalist robot policies can perform a wide range of manipulation tasks, but evaluating and improving their ability with novel objects and instructions remains a significant challenge:
• Systematic improvement demands additional expert labels.
• Rigorous evaluation requires a large number of real-world rollouts.
Challenge
The challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies.
• Controllable world models are not new, so what is the challenge?
• Most approaches cannot actively interact with generalist robot policies, i.e., support policy-in-the-loop rollouts (see the sketch after this list).
• Most models typically simulate only a single third-person camera view.
• Single-view input is also incompatible with many modern VLA policies.
• Existing models lack the fine-grained control required for high-frequency actions.
• They struggle to maintain temporal consistency across long-horizon video generation.
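To make "policy-in-the-loop rollouts" concrete, here is a minimal sketch of the interaction loop, written against hypothetical `policy` and `world_model` interfaces (not the paper's actual API):

```python
# Minimal sketch of a policy-in-the-loop rollout. `policy` and `world_model`
# are hypothetical stand-in objects, not the released implementation.

def policy_in_the_loop_rollout(policy, world_model, obs, instruction, horizon=20):
    """Alternate between a VLA policy and the world model.

    The world model stands in for the real robot: it consumes the policy's
    action chunk and imagines the next multi-view observation, closing the loop.
    """
    trajectory = [obs]
    for _ in range(horizon):
        # Policy proposes an action chunk from the current (imagined) observation.
        action_chunk = policy.predict(obs, instruction)
        # World model predicts the next observation conditioned on the history
        # and the proposed actions.
        obs = world_model.predict(history=trajectory, actions=action_chunk)
        trajectory.append(obs)
    return trajectory
```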
Contributions
1. Long-horizon Consistency: Ctrl-World generalizes to novel scenes and camera placements, sustaining coherent rollouts for over 20 seconds.
2. Accurate Policy Evaluation: Ctrl-World reflects policies' real-world "instruction-following" ability.
3. Downstream Robot Policy Improvement: Ctrl-World can be used to improve the performance of π0.5-DROID (Intelligence et al., 2025) on downstream tasks with unseen objects and novel instructions by 44.7%.
Proposed Approach
1. Multi-view Joint Prediction: Captures a comprehensive visual representation of the scene, matching the views consumed by VLA policies, and reduces hallucinations during contact-rich object interactions.
2. Frame-level Action Conditioning: Aligns visual dynamics with control signals, ensuring that generated rollouts reflect the causal effect of each action.
3. Pose-conditioned Memory Retrieval: Adds sparse history frames to the context and projects the corresponding pose information into each frame. This stabilizes long-horizon rollouts and preserves temporal consistency.
(A schematic sketch of how these pieces fit together follows.)
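As a rough illustration (not the released implementation), the sketch below shows how the three components above could come together when assembling the world model's conditioning for one prediction step; all names and the `memory_interval` value are hypothetical stand-ins.

```python
# Illustrative sketch of assembling the world model's conditioning.
# Hypothetical data layout; not the paper's actual code.

def build_context(current_views, history, action_chunk,
                  memory_interval=30, num_memory_frames=7):
    """history: list of dicts {"views": [...], "pose": ...} for past steps.
    action_chunk: per-frame actions/poses for the frames to be generated."""
    # (3) Pose-conditioned memory retrieval: keep a sparse set of past frames
    #     (e.g. one every 1-2 seconds) with their recorded poses attached,
    #     so long-horizon rollouts stay anchored to what was already generated.
    memory = history[::-memory_interval][:num_memory_frames][::-1]

    # (2) Frame-level action conditioning: every future frame gets its own
    #     action/pose token instead of one token for the whole chunk, so the
    #     generated video reflects the causal effect of each action.
    per_frame_conditions = [{"action": a} for a in action_chunk]

    # (1) Multi-view joint prediction: all camera views enter the context
    #     together and are predicted jointly, matching what VLA policies see.
    return {"views": current_views, "memory": memory,
            "conditions": per_frame_conditions}
```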
Proposed Approach
Problem Formulation
• Modern Generalist Robot Policy (π)
• The Goal of the Ctrl-World Model (W)
• The Training Objective of the Ctrl-World Model (W)
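The slide's equations did not survive extraction. As a hedged reconstruction, the generic formulation behind these three headings can be written as follows (notation chosen here; it may differ from the paper's exact symbols):

```latex
% Hedged reconstruction of the formulation; notation is chosen here.

% Generalist robot policy: maps the multi-view observation o_t, proprioceptive
% state s_t, and language instruction l to an action chunk.
\[
  a_{t:t+k} \sim \pi\!\left(\cdot \mid o_t, s_t, \ell\right)
\]

% Goal of the world model W: predict the next multi-view observations given
% the interaction history and the policy's action chunk.
\[
  \hat{o}_{t+1:t+k} = W\!\left(o_{\le t}, s_{\le t}, a_{t:t+k}\right)
\]

% Training objective (schematically): make predicted futures match the data,
% e.g. via a denoising / reconstruction loss over video frames.
\[
  \min_{W} \; \mathbb{E}_{(o, s, a) \sim \mathcal{D}}
  \left[ \mathcal{L}\!\left( W(o_{\le t}, s_{\le t}, a_{t:t+k}),\; o_{t+1:t+k} \right) \right]
\]
```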
Experimental Setup
1. DROID Platform
   • Franka robot arm with a wrist-mounted camera and two third-person cameras.
   • DROID platform used in this work [URL]
2. DROID Dataset
   • 95k trajectories in total (76k successful + 19k failed).
   • 564 scenes.
   • DROID Dataset [URL]
3. Training Details
   • 192x320 resolution.
   • History window of 7 frames with an interval of 1-2 seconds.
   • Action chunk size of 15 steps, roughly a 1-second action chunk in DROID.
   • 2x8 H100 GPUs, batch size 64, for 2-3 days.
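For quick reference, the training details listed above can be collected into a single configuration sketch (values are taken from the slide; the key names are mine, not the paper's):

```python
# Training setup from the slide, as a plain config dict (key names are mine).
training_config = {
    "dataset": "DROID",
    "num_trajectories": 95_000,   # 76k successful + 19k failed
    "num_scenes": 564,
    "resolution": (192, 320),     # height x width
    "history_frames": 7,          # sparse history, 1-2 s apart
    "action_chunk_size": 15,      # ~1 second of actions in DROID
    "gpus": "2 x 8 H100",
    "batch_size": 64,
    "training_time": "2-3 days",
}
```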
Ctrl-World Quantitative Result
Quantitative results for interactive long-trajectory generation.
* PSNR (Peak Signal-to-Noise Ratio)
* SSIM (Structural Similarity Index)
* LPIPS (Learned Perceptual Image Patch Similarity)
* FID (Fréchet Inception Distance)
* FVD (Fréchet Video Distance)
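For reference, the two lowest-level metrics above have standard closed forms (textbook definitions, not specific to this paper):

```latex
% Standard definitions of PSNR and SSIM (not specific to this paper).
\[
  \mathrm{PSNR}(x, \hat{x}) = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, \hat{x})},
  \qquad
  \mathrm{MSE}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2
\]
\[
  \mathrm{SSIM}(x, \hat{x}) =
  \frac{(2\mu_x \mu_{\hat{x}} + c_1)(2\sigma_{x\hat{x}} + c_2)}
       {(\mu_x^2 + \mu_{\hat{x}}^2 + c_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + c_2)}
\]
```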
Ctrl-World Ablation Study Result
Ablation study without (1) memory retrieval, (2) frame-level conditioning, (3) multi-view prediction.
Ctrl-World Qualitative Result
Controllability contributed by memory retrieval & frame-level pose conditioning.
Ctrl-World Qualitative Result
Consistency contributed by multi-view & memory retrieval.
Correlation Result with Real-world Evaluation
Quantitative correlations between real-world and world-model rollouts.
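As a minimal sketch of what such a correlation measurement involves, the same policies/tasks are scored both inside the world model and on the real robot, and the two sets of success rates are compared (e.g., via a Pearson correlation). The numbers below are made up for illustration, not the paper's results:

```python
# Sketch of comparing world-model and real-world success rates.
import numpy as np

def pearson_corr(world_model_scores, real_world_scores):
    """Pearson correlation between per-task (or per-policy) success rates."""
    x = np.asarray(world_model_scores, dtype=float)
    y = np.asarray(real_world_scores, dtype=float)
    return np.corrcoef(x, y)[0, 1]

# Example with made-up numbers (NOT the paper's reported values):
print(pearson_corr([0.2, 0.5, 0.7, 0.9], [0.25, 0.45, 0.75, 0.85]))
```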
Real-world Correlation Result
Example of an inconsistent task: "close the laptop" (ground-truth vs. prediction).
Policy Improvement Algorithm
Policy Improvement Result
Post-training results of the VLA model tested on unfamiliar objects and novel instructions:
1. Rollout diversity is encouraged by (a) rephrasing task instructions and (b) resetting with new initial states.
2. 400 trajectories per task are generated.
3. 25-50 successful trajectories are retained based on human preference judgments.
4. This translates to roughly 6%-12% sampling efficiency (see the sketch after this list).
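The improvement recipe described above can be sketched as follows. The callables (`rephrase_fn`, `reset_fn`, `rollout_fn`, `judge_fn`, `finetune_fn`) are hypothetical placeholders; the actual pipeline is in the paper and released code.

```python
# Sketch of the policy-improvement recipe (hypothetical callables).
def improve_policy(policy, world_model, tasks,
                   rephrase_fn, reset_fn, rollout_fn, judge_fn, finetune_fn,
                   rollouts_per_task=400, keep_per_task=50):
    curated = []
    for task in tasks:
        candidates = []
        for _ in range(rollouts_per_task):
            # Diversity: rephrase the instruction and resample the initial state.
            instruction = rephrase_fn(task)
            obs = reset_fn(task)
            # Policy-in-the-loop rollout inside the world model (no real robot).
            candidates.append(rollout_fn(policy, world_model, obs, instruction))
        # Keep only trajectories judged successful (human preference on the
        # slide; ~25-50 per task, i.e. roughly 6-12% of the samples).
        curated.extend(judge_fn(candidates)[:keep_per_task])
    # Supervised fine-tuning of the policy on the curated imagined data.
    return finetune_fn(policy, curated)
```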
Policy Improvement Result
Baseline Policy vs. Post-trained Policy
1. Pick the object on the top left side and place it in the box.
2. Pick the glove and place it in the box.
Summary
• CTRL-WORLD provides a multi-view, controllable generative world model designed specifically for closed-loop interaction with modern VLA policies, showing promising potential for both policy evaluation and policy improvement.
Limitations
• Complex physics (e.g., collisions, objects sliding away, rotations).
• Retrying scenarios.
Final Thoughts
• Expectation that more known "priors" will be leveraged to address physical consistency.
• Encouragement to leverage failed trajectories, with an emphasis on recovery.