The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over the shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based approach, which converts 3D semantic layout into multi-view 2D proxy maps. Furthermore, we design a semantic and depth conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. We will open-source our code and processed dataset.
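Below is a minimal, hypothetical sketch of the layout-to-scene pipeline described above, for readers who want a structural picture of the approach. All names here (`LayoutBox`, `render_proxy_maps`, `generate_scene`, the `diffusion_model` and `fit_nerf` callables) are illustrative placeholders and not the released SceneCraft API; the real system renders the semantic layout into per-view proxy maps, conditions a diffusion model on them, and distills the resulting multi-view images into a NeRF.

```python
# Hypothetical sketch of the SceneCraft-style pipeline: 3D semantic layout ->
# multi-view 2D proxy maps -> layout-conditioned diffusion -> NeRF distillation.
# Function and class names are placeholders, not the authors' released code.

from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

import numpy as np


@dataclass
class LayoutBox:
    """One semantic bounding box in the 3D layout (category + axis-aligned extent)."""
    category: int
    min_xyz: np.ndarray  # shape (3,)
    max_xyz: np.ndarray  # shape (3,)


def render_proxy_maps(
    boxes: Sequence[LayoutBox],
    pose: np.ndarray,        # 4x4 camera-to-world matrix
    intrinsics: np.ndarray,  # 3x3 camera intrinsics
    hw: Tuple[int, int] = (256, 256),
) -> Tuple[np.ndarray, np.ndarray]:
    """Rasterize the 3D semantic layout into per-view 2D proxy maps:
    a semantic map (category id per pixel) and a depth map.
    The body is a stub; a real implementation would project each box
    under (pose, intrinsics) and resolve visibility."""
    h, w = hw
    semantic = np.zeros((h, w), dtype=np.int32)
    depth = np.full((h, w), np.inf, dtype=np.float32)
    return semantic, depth


def generate_scene(
    boxes: Sequence[LayoutBox],
    trajectory: Sequence[np.ndarray],  # arbitrary camera poses, 4x4 each
    intrinsics: np.ndarray,
    diffusion_model: Callable,         # (semantic, depth, prompt) -> RGB image
    fit_nerf: Callable,                # (images, poses) -> scene representation
    prompt: str,
):
    """High-level flow: layout -> proxy maps -> conditioned diffusion -> NeRF."""
    images = []
    for pose in trajectory:
        semantic, depth = render_proxy_maps(boxes, pose, intrinsics)
        # Generate an image conditioned on both semantic and depth proxies.
        images.append(diffusion_model(semantic, depth, prompt))
    # Distill the multi-view images into a single 3D scene representation.
    return fit_nerf(images, trajectory)
```

Because the camera trajectory is just a list of poses, this structure also makes clear why free-form (non-rectangular) room layouts are handled naturally: the proxy maps are re-rendered per view along whatever path the user provides.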
We demonstrate SceneCraft's ability to generate more complex indoor scenes by leveraging arbitrary camera trajectories. Such irregular room shapes cannot be naturally achieved by previous works.
For each sample, we show the 3D bounding-box scene (BBS), the bounding-box image (BBI) semantic map, the generated RGB images of the scene, and the rendered depth map. Our method is able to generate complex, free-form scenes from challenging room layouts.
For more analysis and examples of our method, please refer to our paper.
@inproceedings{yang2024scenecraft,
  title={SceneCraft: Layout-Guided 3D Scene Generation},
  author={Yang, Xiuyu and Man, Yunze and Chen, Jun-Kun and Wang, Yu-Xiong},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}