ORV: 4D Occupancy-centric Robot Video Generation

¹BAAI, ²THU, ³SJTU, ⁴EIT(Ningbo), ⁵CUHK(SZ), ⁶NTU, ⁷ByteDance, ⁸Megvii, ⁹GigaAI, *Equal Contribution

Abstract

Embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. We therefore introduce ORV, a 4D Occupancy-centric framework for Robot Video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. A central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8% lower FVD than the state of the art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning. Beyond single-view generation, ORV natively supports multiview-consistent synthesis and enables simulation-to-real transfer despite significant domain gaps.
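To make the action-conditioning idea concrete, here is a minimal NumPy sketch of AdaLN-style modulation driven by chunked 7-DoF actions. It is an illustration under stated assumptions, not the ORV implementation: the function names (`chunk_actions`, `adaln_modulate`), the tensor shapes, the chunk size, and the use of a single linear projection for scale and shift are all hypothetical choices made here for clarity.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last (channel) axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def chunk_actions(actions, chunk_size):
    """Group per-step actions (T, 7) into chunks (T // chunk_size, chunk_size * 7).

    Assumption: one chunk of actions corresponds to one latent video frame.
    Trailing steps that do not fill a whole chunk are dropped.
    """
    num_steps, dof = actions.shape
    num_chunks = num_steps // chunk_size
    trimmed = actions[: num_chunks * chunk_size]
    return trimmed.reshape(num_chunks, chunk_size * dof)


def adaln_modulate(latents, action_chunks, w_scale, w_shift):
    """Apply adaptive layer norm to video latents.

    latents:       (F, N, D) frames, tokens per frame, channels
    action_chunks: (F, C)    one chunk feature vector per latent frame
    w_scale/shift: (C, D)    hypothetical linear projections
    """
    scale = action_chunks @ w_scale  # (F, D)
    shift = action_chunks @ w_shift  # (F, D)
    normed = layer_norm(latents)
    # Broadcast the per-frame scale and shift over all tokens of that frame.
    return normed * (1.0 + scale[:, None, :]) + shift[:, None, :]


rng = np.random.default_rng(0)
actions = rng.normal(size=(16, 7))             # 16 steps of 7-DoF actions
chunks = chunk_actions(actions, chunk_size=4)  # shape (4, 28)
latents = rng.normal(size=(4, 6, 8))           # 4 latent frames, 6 tokens, 8 channels

# Zero-initialized projections make the modulation an identity on the
# normalized latents, a common starting point for AdaLN conditioning.
w_scale = np.zeros((28, 8))
w_shift = np.zeros((28, 8))
out = adaln_modulate(latents, chunks, w_scale, w_shift)
```

With zero-initialized projections the output equals the plain layer-normalized latents, so the action signal only starts to influence generation once the projections are trained.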

BibTeX

@article{yang2025orv,
  title={ORV: 4D Occupancy-centric Robot Video Generation},
  author={Yang, Xiuyu and Li, Bohan and Xu, Shaocong and Wang, Nan and Ye, Chongjie and Chen, Zhaoxi and Qin, Minghan and Ding, Yikang and Jin, Xin and Zhao, Hang and Zhao, Hao},
  journal={arXiv preprint arXiv:2506.03079},
  year={2025}
}


Table of Contents

TL;DR We propose ORV, a novel framework for robot video generation guided by 4D occupancy geometry. It achieves higher control precision, supports multiview generation, and enables simulation-to-real visual transfer.
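As a rough illustration of how 4D semantic occupancy can be turned into 2D guidance maps, the sketch below renders a voxel sequence into per-frame depth and semantic images. It assumes a simple orthographic projection along the z axis and a label convention of 0 for free space; the function names and this projection are assumptions for illustration, not the rendering pipeline used by ORV.

```python
import numpy as np


def render_occupancy(occ):
    """Render one occupancy grid (X, Y, Z) into depth and semantic maps (X, Y).

    occ holds integer class ids, with 0 meaning free space (an assumed convention).
    Each pixel looks along +z and records the first occupied voxel it meets.
    """
    occupied = occ > 0
    hit = occupied.any(axis=2)
    first = occupied.argmax(axis=2)  # index of first occupied voxel, 0 if none
    depth = np.where(hit, first.astype(float), np.inf)
    semantic = np.take_along_axis(occ, first[..., None], axis=2)[..., 0]
    semantic = np.where(hit, semantic, 0)
    return depth, semantic


def render_sequence(occ_4d):
    """Render a 4D occupancy sequence (T, X, Y, Z) frame by frame."""
    frames = [render_occupancy(grid) for grid in occ_4d]
    depth = np.stack([d for d, _ in frames])
    semantic = np.stack([s for _, s in frames])
    return depth, semantic


# Toy sequence: two time steps on a 4 x 4 x 5 grid.
occ = np.zeros((2, 4, 4, 5), dtype=int)
occ[0, 1, 1, 3] = 2  # class 2 at depth 3, in front of ...
occ[0, 1, 1, 4] = 1  # ... class 1 at depth 4, which is therefore hidden
occ[1, 2, 2, 0] = 3  # class 3 directly at the image plane in frame 1
depth, semantic = render_sequence(occ)
```

The resulting `depth` and `semantic` arrays have shape `(T, X, Y)` and could serve as per-frame conditioning images; a perspective camera model would replace the axis-aligned ray march in a realistic setup.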

Abstract

4D Occupancy Samples

Video Comparison

Examples of predicted and ground-truth (GT) video pairs

Video Generation Gallery 1

Examples of single-view video generation

Video Generation Gallery 2

Examples of multi-view video generation

Video Generation Gallery 3

Examples of appearance-augmented manipulation videos

BibTeX