ORV: 4D Occupancy-centric Robot Video Generation

1BAAI, 2THU, 3SJTU, 4EIT(Ningbo), 5CUHK(SZ), 6NTU, 7ByteDance, 8Megvii, 8GigaAI, *Equal Contribution

TL;DR: We propose ORV, a robot video generation framework guided by the geometry of 4D occupancy. It achieves higher control precision, supports multiview generation, and enables simulation-to-real visual transfer.

Abstract

Embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D Occupancy-centric framework for Robot Video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8% lower FVD than the state of the art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning. Beyond single-view generation, ORV natively supports multiview-consistent synthesis and enables simulation-to-real transfer despite significant domain gaps.
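To make the action-conditioning idea in the abstract more concrete, here is a minimal sketch of how chunked 7-DoF actions could modulate video latents through adaptive layer normalization (AdaLN). It is an illustration only, not the ORV implementation: the function names `chunk_actions` and `adaln_modulate`, the chunk size of 4 frames per latent, the tensor shapes, and the single linear projection standing in for the action expert are all assumptions.

```python
import numpy as np

def chunk_actions(actions, chunk):
    """Group per-frame actions (T, 7) into per-latent chunks (T // chunk, chunk * 7).

    `chunk` is an assumed temporal compression factor of the video latents.
    Trailing frames that do not fill a whole chunk are dropped.
    """
    num_frames, dof = actions.shape
    num_chunks = num_frames // chunk
    return actions[: num_chunks * chunk].reshape(num_chunks, chunk * dof)

def adaln_modulate(latents, cond, weight, bias, eps=1e-6):
    """Normalize latents over the feature axis, then scale and shift them by
    parameters predicted from the conditioning vector.

    latents: (N, L, D) tokens for N latent frames
    cond:    (N, C) one conditioning vector per latent frame
    weight:  (C, 2 * D) and bias: (2 * D,), a hypothetical linear projection
    """
    dim = latents.shape[-1]
    mean = latents.mean(axis=-1, keepdims=True)
    var = latents.var(axis=-1, keepdims=True)
    normed = (latents - mean) / np.sqrt(var + eps)
    params = cond @ weight + bias                  # (N, 2 * D)
    scale, shift = params[:, :dim], params[:, dim:]
    # Broadcast the per-frame scale and shift over the L tokens of each frame.
    return normed * (1.0 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
actions = rng.normal(size=(16, 7))                 # 16 frames of 7-DoF actions
cond = chunk_actions(actions, chunk=4)             # shape (4, 28)
latents = rng.normal(size=(4, 6, 8))               # 4 latent frames, 6 tokens, 8 features
weight = rng.normal(size=(28, 16)) * 0.02
bias = np.zeros(16)
out = adaln_modulate(latents, cond, weight, bias)
print(out.shape)                                   # (4, 6, 8)
```

With a zero projection the scale and shift vanish and the operation reduces to plain normalization, so the action signal only ever perturbs an otherwise standard normalization step.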


4D Occupancy Samples

All occupancy samples shown have a resolution of T×400×400×400.
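As a rough illustration of what a T×400×400×400 grid implies and how a 2D rendering can be derived from it, the sketch below counts voxels per frame and projects a semantic occupancy grid to a 2D label map. It is a simplification under stated assumptions: label 0 is taken to mean free space, the projection is orthographic along one grid axis rather than through a real camera model, and the names `voxels_per_frame`, `render_first_hit`, and `render_sequence` are hypothetical.

```python
import numpy as np

EMPTY = 0  # assumed label for free space

def voxels_per_frame(side=400):
    """Number of voxels in one side x side x side occupancy frame."""
    return side ** 3

def render_first_hit(occ, axis=0, empty=EMPTY):
    """Orthographic projection of a 3D label grid along `axis`.

    Each output pixel takes the label of the first non-empty voxel along
    the ray, or `empty` if the ray hits nothing.
    """
    occupied = occ != empty
    first = occupied.argmax(axis=axis)             # index of the first hit (0 if none)
    hit = occupied.any(axis=axis)
    labels = np.take_along_axis(occ, np.expand_dims(first, axis), axis=axis)
    return np.where(hit, labels.squeeze(axis=axis), empty)

def render_sequence(occ4d, axis=0):
    """Apply the projection to every frame of a (T, X, Y, Z) grid."""
    return np.stack([render_first_hit(frame, axis=axis) for frame in occ4d])

print(voxels_per_frame())                          # 64000000

# Tiny stand-in grid; a real sample would be T x 400 x 400 x 400.
occ4d = np.zeros((2, 4, 4, 4), dtype=np.uint8)
occ4d[0, 2, 1, 1] = 3                              # nearer voxel along axis 0
occ4d[0, 3, 1, 1] = 5                              # occluded by the voxel above
occ4d[1, 1, 2, 2] = 7
maps = render_sequence(occ4d)
print(maps.shape)                                  # (2, 4, 4)
print(maps[0, 1, 1], maps[1, 2, 2])                # 3 7
```

At 64,000,000 voxels per frame, a dense one-byte-per-voxel array would take about 64 MB per frame, which is one reason 2D renderings are a convenient form for passing occupancy to a video model.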



Bridge samples, each shown as an occupancy video paired with its RGB video: #667, #1995, #504, #534, #792, #1122, #1055, #656, #1229, #1545.

Video Comparison

Examples of prediction-GT video pairs

All videos are displayed at a resolution of 320×480 regardless of their original resolution, so visual quality may vary.



Each example pairs a predicted video with its ground-truth video:

Bridge: 10 prediction/ground-truth pairs
DROID: 10 prediction/ground-truth pairs
RT-1: 20 prediction/ground-truth pairs

BibTeX

@article{yang2025orv,
    title={ORV: 4D Occupancy-centric Robot Video Generation},
    author={Yang, Xiuyu and Li, Bohan and Xu, Shaocong and Wang, Nan and Ye, Chongjie and Chen, Zhaoxi and Qin, Minghan and Ding, Yikang and Jin, Xin and Zhao, Hang and Zhao, Hao},
    journal={arXiv preprint arXiv:2506.03079},
    year={2025}
}