Embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. We therefore introduce ORV, a 4D Occupancy-centric framework for Robot Video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. A central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8% lower FVD than the state of the art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning. Beyond single-view generation, ORV natively supports multiview-consistent synthesis and enables simulation-to-real transfer despite significant domain gaps.
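To make the action-conditioning idea more concrete, here is a minimal sketch of generic adaptive layer normalization (AdaLN), in which a chunk of 7-DoF actions produces a per-feature scale and shift for the tokens of one latent frame. It is an illustration of the general technique only, not the ORV implementation: the function names, the chunk size, the hidden width, and the single linear projection are all assumptions chosen for readability.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Compute y = W x + b, with the weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def adaln_modulate(tokens, action_chunk, w_scale, b_scale, w_shift, b_shift):
    """Apply action-conditioned AdaLN to the tokens of one latent frame.

    tokens:       list of feature vectors, one per latent token
    action_chunk: flattened actions aligned to this latent frame (hypothetical layout)
    """
    scale = linear(action_chunk, w_scale, b_scale)
    shift = linear(action_chunk, w_shift, b_shift)
    out = []
    for tok in tokens:
        normed = layer_norm(tok)
        # Scale is applied as (1 + scale) so zero weights leave the normalized token unchanged.
        out.append([n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)])
    return out


random.seed(0)
ACTION_DIM = 7           # 7-DoF actions, as stated in the abstract
CHUNK, HIDDEN = 4, 8     # illustrative values, not taken from the paper
cond_dim = ACTION_DIM * CHUNK


def rand_matrix(rows, cols, std=0.02):
    """Small random weights standing in for learned parameters."""
    return [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


tokens = [[random.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(3)]
action_chunk = [random.uniform(-1.0, 1.0) for _ in range(cond_dim)]
w_scale, w_shift = rand_matrix(HIDDEN, cond_dim), rand_matrix(HIDDEN, cond_dim)
zeros = [0.0] * HIDDEN

modulated = adaln_modulate(tokens, action_chunk, w_scale, zeros, w_shift, zeros)
print(len(modulated), len(modulated[0]))
```

Writing the scale as `1 + scale` means that zero-initialized projections reduce the layer to plain normalization, a common choice when adding a conditioning path to a pretrained generator. Whether ORV initializes its modulation this way is not stated here.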
All occupancy grids shown have a resolution of T×400×400×400.
All videos are displayed at a resolution of 320×480 regardless of their original resolutions, which may result in varying visual quality.
@article{yang2025orv,
title={ORV: 4D Occupancy-centric Robot Video Generation},
author={Yang, Xiuyu and Li, Bohan and Xu, Shaocong and Wang, Nan and Ye, Chongjie and Chen, Zhaoxi and Qin, Minghan and Ding, Yikang and Jin, Xin and Zhao, Hang and Zhao, Hao},
journal={arXiv preprint arXiv:2506.03079},
year={2025}
}