June 26, 2025
Slide overview
DL paper reading group (輪読会) material
Learning from Video: How to Leverage Video Data without Action Labels. Gang Yan, 2025-06-25, paper reading presentation.
Background
1. Scaling laws work in other fields.
2. Robot learning needs to scale up data collection.
3. Collecting real-world observation-action pairs is labor-intensive.
4. We already have large-scale human video data. Why not try to leverage this data source?
Challenges and problems
1. Video data is observation-only, not observation-action pairs.
2. How do we extract a representation of "action" from video?
3. How do we get the robot to learn this "action"?
4. Which learning formulation fits better: model-based or model-free?
Action in model-based & model-free learning
Model-free: does not explicitly learn a representation of the environment/world; the input is an observation and the output is the desired action, usually trained end-to-end in a single stage. Examples: BC, PPO. Model-free methods need actions the robot can execute directly, such as joint or end-effector positions.
Model-based: explicitly learns a representation of the dynamic environment and leverages this representation to help the robot move. Examples: a pre-trained video world model / LLM / VLM combined with robot motion learning. These methods usually have several training stages or multiple training targets. The representation and execution of action are decoupled; a typical usage is to learn a high-level "semantic action" and train a robot policy that maps the semantic action to robot actions.
We first walk through some examples for intuition, then go into the details of two model-based papers: AMPLIFY (Actionless Motion Priors for Robot Learning from Videos) and LAPA (Latent Action Pretraining from Videos).
EgoMimic: first-person-view human/robot data
Mask out the human arm, estimate the hand pose, and combine the result with robot data in a model-free way. Similar work: EgoZero.
Various types of semantic action: interaction [1], affordances [2], traces [3]
[1] Learning Manipulation by Predicting Interaction
[2] Affordances from Human Videos as a Versatile Representation for Robotics
[3] Any-point Trajectory Modeling for Policy Learning
How I will introduce the two papers
• AMPLIFY: Actionless Motion Priors for Robot Learning from Videos (AMPLIFY)
• Latent Action Pretraining from Videos (LAPA)
I will try to answer the following questions:
1. How do they define the semantic action?
2. How do they learn this semantic action with a world model?
3. How do they connect the action learned from video to robot actions?
4. How do they build a convincing evaluation?
5. Interesting conclusions and inspiration
How AMPLIFY defines the semantic action
Tracking points: track the movement of keypoints.
How they acquire these keypoints: points are initialized on a uniform grid; since the background is static, they filter points by velocity to find the keypoints.
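A minimal sketch of the velocity-based filtering idea, assuming an off-the-shelf point tracker that returns a (T, N, 2) array of tracks for a uniform grid of points; the function name, array shape, and top_k value are illustrative assumptions, not AMPLIFY's exact interface:

```python
import numpy as np

def select_moving_keypoints(tracks: np.ndarray, top_k: int = 64) -> np.ndarray:
    """Keep the top-k tracked grid points that move the most over the clip.

    tracks: (T, N, 2) array of 2D positions for N uniformly sampled grid
            points tracked over T frames (e.g. from an off-the-shelf point
            tracker). Shapes and the selection rule are assumptions here.
    """
    # Per-step displacement of every point, then total path length.
    step = np.linalg.norm(np.diff(tracks, axis=0), axis=-1)  # (T-1, N)
    path_length = step.sum(axis=0)                           # (N,)

    # Static-background points barely move; keep the most dynamic ones.
    keep = np.argsort(-path_length)[:top_k]
    return tracks[:, keep, :]                                # (T, top_k, 2)
```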
How does AMPLIFY learn this semantic action with a world model?
World model: given the instruction, the current image, and the semantic-action history, predict the next semantic action.
Tokenization is done by discretization/compression/reconstruction, trained with a cross-entropy loss.
FSQ (Finite Scalar Quantization): a continuous range of scalar (single-dimensional) values is mapped onto a finite set of discrete values.
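A minimal sketch of the FSQ idea, using odd level counts per dimension to keep the sketch simple (the FSQ paper adds a half-step offset for even counts); the level choice and helper names are illustrative, not AMPLIFY's code:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(7, 5, 5, 5, 5)) -> torch.Tensor:
    """Finite Scalar Quantization: bound each latent dimension and round it to
    a small number of integer levels. A straight-through estimator keeps the
    operation differentiable. z: (..., D) with D == len(levels)."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half            # each dim lies in [-(L-1)/2, (L-1)/2]
    quantized = torch.round(bounded)          # snap to the nearest integer level
    return bounded + (quantized - bounded).detach()

def fsq_token_id(q: torch.Tensor, levels=(7, 5, 5, 5, 5)) -> torch.Tensor:
    """Flatten the per-dimension levels into a single discrete token id, so a
    world model can be trained with cross-entropy over a finite vocabulary."""
    L = torch.tensor(levels, device=q.device)
    digits = (q + (L - 1) / 2).round().long()                 # 0..L-1 per dim
    base = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long, device=q.device), L[:-1]]), dim=0)
    return (digits * base).sum(dim=-1)
```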
How does AMPLIFY connect the action learned from video to robot actions?
Input: the predicted semantic actions.
Cross-attend with: image tokens, proprioception tokens, and the semantic-action history.
Output: robot actions (in chunk format).
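A minimal sketch of this decode stage, assuming learned query tokens for the action chunk and a single cross-attention layer; the dimensions and module layout are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TrackConditionedPolicy(nn.Module):
    """Robot-action queries cross-attend to image, proprioception, and
    predicted semantic-action (track) tokens, then emit an action chunk."""
    def __init__(self, d_model=256, chunk_len=8, action_dim=7, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(chunk_len, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, image_tok, proprio_tok, track_tok):
        # image_tok: (B, Ni, D), proprio_tok: (B, Np, D), track_tok: (B, Nt, D)
        context = torch.cat([image_tok, proprio_tok, track_tok], dim=1)
        B = context.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, chunk_len, D)
        out, _ = self.cross_attn(q, context, context)     # cross-attention
        return self.head(out)                             # (B, chunk_len, action_dim)
```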
How AMPLIFY builds a convincing evaluation
1. Video-prediction test (compared against other video generation models as baselines).
2. Robot experiments: simulation (LIBERO), a real dataset (Bridge), and real-robot experiments on 3 tasks: Place Cube / Stack Cups / Open the Box and Move the Eggplant into the Bowl.
3. Categorized comparison study: in-distribution / few-shot / cross-embodiment transfer / generalization.
Of course, AMPLIFY performs better than the SOTA baselines.
AMPLIFY: interesting conclusions and inspiration
1. Given the same amount of data (observations and actions), IL (BC/Diffusion) achieves the best in-distribution performance but generalizes poorly.
2. A large amount of action-free video data improves generalization, even for embodiment transfer.
3. They argue that keypoint features work better than pixel-wise features; it seems some hand-crafted features still bring benefits over learning from raw data.
How LAPA defines the semantic action
Latent action codebook: something similar to a word-embedding space; we only define the vocabulary size and the length of each code, and the VQ-VAE updates the codebook during training. It resembles an unsupervised clustering method with EMA updates.
Some interesting analysis (figure): learned latent actions (color) plotted against the actual robot action in 2D (x, y); increasing the vocabulary size brings more improvement than increasing the action length.
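To make the "clustering with EMA" intuition concrete, here is a sketch of the standard VQ-VAE EMA codebook update (online k-means style); this is the generic recipe, not LAPA's exact implementation, and all tensor names are illustrative:

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, codes,
                        decay=0.99, eps=1e-5):
    """One EMA update of a VQ codebook.

    codebook:     (K, D) current code vectors
    cluster_size: (K,)   running count of assignments per code
    embed_sum:    (K, D) running sum of encoder outputs per code
    z_e:          (B, D) encoder outputs in this batch
    codes:        (B,)   index of the nearest code for each z_e
    """
    K, _ = codebook.shape
    onehot = torch.zeros(z_e.shape[0], K, device=z_e.device)
    onehot.scatter_(1, codes.unsqueeze(1), 1.0)

    # Exponential moving averages of assignment counts and summed features.
    cluster_size.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)
    embed_sum.mul_(decay).add_(onehot.t() @ z_e, alpha=1 - decay)

    # Laplace-smoothed cluster means become the new code vectors.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    codebook.copy_(embed_sum / smoothed.unsqueeze(1))
    return codebook
```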
How LAPA learns the semantic action
My impression: the latent carries semantic meaning such as "move the eef left/downwards".
Input: the image before the motion and the image after the motion.
Output: reconstruct the after-motion image from the latent action and the before-motion image.
VQ-VAE: hard to explain briefly; please ask GPT or refer to the VQ-VAE paper.
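A minimal sketch of this training objective, with the encoder, quantizer, and decoder left as placeholder modules (LAPA's actual model is transformer-based, so the internals here are assumptions):

```python
import torch
import torch.nn as nn

class LatentActionAutoencoder(nn.Module):
    """Encode (frame_t, frame_t+1) into a discrete latent 'action', then
    reconstruct frame_t+1 from frame_t plus that latent action."""
    def __init__(self, encoder: nn.Module, quantizer: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # (frame_t, frame_t1) -> continuous latent
        self.quantizer = quantizer  # continuous latent -> (codebook entry, vq loss)
        self.decoder = decoder      # (frame_t, latent action) -> predicted frame_t1

    def forward(self, frame_t, frame_t1):
        z_e = self.encoder(frame_t, frame_t1)      # what changed between frames
        z_q, vq_loss = self.quantizer(z_e)         # discrete latent action
        recon = self.decoder(frame_t, z_q)         # rebuild the next frame
        recon_loss = torch.mean((recon - frame_t1) ** 2)
        return recon_loss + vq_loss
```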
How LAPA learns the semantic action with a world model
Given the learned codebook, the input is the instruction and the before-motion image; the instruction provides the target of the desired motion.
Fine-tune a VLM to predict the action (from the codebook) that maximizes the probability of realizing the target instruction.
The VLM backbone is the 7B Large World Model (LWM-Chat-1M).
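A short sketch of the pretraining loss, assuming the latent-action codes are treated as extra vocabulary entries the VLM must predict from (instruction, image); the shapes are assumptions for illustration:

```python
import torch.nn.functional as F

def latent_pretraining_loss(vlm_logits, latent_action_ids):
    """Cross-entropy between the VLM's logits over the extended vocabulary
    and the target latent-action token ids.

    vlm_logits:        (B, L, V) logits for L latent-action positions
    latent_action_ids: (B, L)    target latent-action token ids
    """
    return F.cross_entropy(vlm_logits.reshape(-1, vlm_logits.shape[-1]),
                           latent_action_ids.reshape(-1))
```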
How LAPA connects to robot actions
Action tuning. Input: latent code. Output: robot action.
The loss should be cross-entropy (not regression), if my understanding is correct.
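A minimal sketch of such an action-tuning head under that reading (classification over discretized robot-action bins rather than regression); the codebook size, bin count, and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ActionTuningHead(nn.Module):
    """Map a predicted latent-action code to logits over discretized
    robot-action bins, trained with cross-entropy per action dimension."""
    def __init__(self, codebook_size=256, d_model=256, action_dims=7, n_bins=256):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, action_dims * n_bins))
        self.action_dims, self.n_bins = action_dims, n_bins

    def forward(self, latent_code_ids):
        # latent_code_ids: (B,) indices of the predicted latent actions
        h = self.embed(latent_code_ids)
        logits = self.mlp(h).view(-1, self.action_dims, self.n_bins)
        return logits  # compare with binned ground-truth robot actions via cross-entropy
```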
LAPA: interesting conclusions and inspiration
1. Again, if all data comes with observation-action pairs, IL achieves the best in-distribution performance.
2. When it comes to cross-task/cross-environment settings, IL with less data generalizes worse than pretraining on a large amount of action-free data.
3. IL tends to overfit to a specific embodiment.
4. In action-free pretraining, human video data performs worse than robot video data, due to the natural difference in distribution.
5. It needs a large amount of training resources: LAPA uses 8 H100 GPUs for 34 hours with a batch size of 128 (272 H100-hours in total), and pretraining a baseline such as OpenVLA would take much more data.