Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

ICCV 2025

Xiuyu Yang^*, Shuhan Tan^*, Philipp Krähenbühl

UT Austin

TL;DR InfGen performs interleaved long-term closed-loop motion simulation and scene generation with unified next-token prediction.

Abstract

An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for the initial agents in a scene. This is problematic for long-term simulation, because agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation mode and scene generation mode, which enables stable long-term rollout simulation. InfGen performs at the state of the art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation.

InfGen

A simulator should provide a realistic model of the environment, the ego-vehicle, and all other traffic agents throughout the trip. Existing simulators easily handle an expansive static environment and intricate ego-vehicle dynamics. However, they often lack a stable long-term simulation of non-ego traffic agents.

Comparison: In prior works, the ego agent runs into an EMPTY map region as the surrounding agents disappear, while InfGen keeps the spatial layout realistic (with the ego agent, initially placed agents, and dynamically generated agents):

Prior work 1

Prior work 2

InfGen

InfGen is a unified autoregressive transformer with interleaved token prediction. It handles temporal motion simulation and spatial scene generation in a unified model:

Overview of InfGen interleaved next-token-prediction process.

We use multiple tokenizers to convert the task-specific behaviors of motion simulation and scenario generation into discrete tokens. We then add control tokens (<BEGIN MOTION>, <ADD AGENT>, <REMOVE AGENT>, <KEEP AGENT>) to mark the switches between the two tasks, indicating what the current task is and when to switch. This design allows us to convert each real log into a single ordered sequence of tokens containing interleaved data of both tasks. We train InfGen end-to-end with the next-token prediction objective on short-term driving logs, and it produces stable long-term rollouts.
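To make the interleaving concrete, here is a minimal sketch of how one log could be flattened into a single token sequence. Only the four control token names come from the text above. The `Step` container, the placeholder token strings such as `pose_3` and `m_1`, and the per-timestep ordering (insert agents, then `<BEGIN MOTION>`, then one keep/remove decision per active agent) are assumptions made for illustration and may differ from the actual InfGen sequence layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Control token names taken from the text; everything else is hypothetical.
BEGIN_MOTION = "<BEGIN MOTION>"
ADD_AGENT = "<ADD AGENT>"
REMOVE_AGENT = "<REMOVE AGENT>"
KEEP_AGENT = "<KEEP AGENT>"


@dataclass
class Step:
    """One timestep of a hypothetical, already tokenized driving log."""
    entering: Dict[str, str] = field(default_factory=dict)  # agent id -> pose token
    motion: Dict[str, str] = field(default_factory=dict)    # agent id -> motion token
    exiting: Set[str] = field(default_factory=set)          # agents leaving at this step


def interleave(steps: List[Step]) -> List[str]:
    """Flatten a log into one ordered sequence (assumed ordering, see lead-in)."""
    tokens: List[str] = []
    active: List[str] = []  # agents currently in the scene, in insertion order
    for step in steps:
        # Scene generation: one control token plus one pose token per new agent.
        for agent_id, pose_token in step.entering.items():
            tokens += [ADD_AGENT, pose_token]
            active.append(agent_id)
        # Switch to motion simulation.
        tokens.append(BEGIN_MOTION)
        survivors: List[str] = []
        for agent_id in active:
            if agent_id in step.exiting:
                tokens.append(REMOVE_AGENT)
            else:
                tokens += [KEEP_AGENT, step.motion[agent_id]]
                survivors.append(agent_id)
        active = survivors
    return tokens


log = [
    Step(entering={"a": "pose_3", "b": "pose_7"}, motion={"a": "m_1", "b": "m_2"}),
    Step(entering={"c": "pose_1"}, motion={"a": "m_4", "c": "m_5"}, exiting={"b"}),
]
sequence = interleave(log)
```

Because both tasks end up in one sequence, a single next-token objective covers motion tokens, pose tokens, and the control tokens that decide when to switch.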

Architecture

The InfGen pipeline is built upon SMART (thanks to the authors for their open-source work!). We formulate a dynamic agent matrix and then continuously update (edit) this matrix along two orthogonal axes:

Overview of InfGen Architecture.

Dynamic Agent Matrix: The horizontal axis represents the temporal lifecycle of each agent: being inserted, actively moving, and finally exiting the scenario. The length of the temporal axis equals the rollout horizon. The vertical axis represents the spatial agent layout at each timestep; its extent is the number of active agents at that step.

Temporal Motion Simulation: shown as the blue flow. Similar to existing closed-loop simulation works, it predicts motion tokens for all existing agents, and additionally predicts control tokens to decide whether each agent executes its motion or is removed. It adds/deletes columns of the agent matrix each time.

Spatial Scene Generation: shown as the green flow. It predicts control tokens and pose tokens to decide whether to insert an agent and how to place each new agent (its initial position, heading, etc.). It adds/deletes rows of the agent matrix each time.
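The two kinds of edits on the agent matrix can be sketched with a small data structure. This is an illustrative toy, not the InfGen implementation: the `AgentMatrix` class, its method names, the `(x, y)` state tuple, and the use of `None` for "agent not in the scene" are all assumptions. Rows are agents and columns are timesteps, matching the description above.

```python
from typing import Dict, List, Optional, Tuple

State = Tuple[float, float]  # hypothetical (x, y); real states also carry heading etc.


class AgentMatrix:
    """Rows are agents, columns are timesteps; None marks 'not in the scene'."""

    def __init__(self) -> None:
        self.rows: Dict[str, List[Optional[State]]] = {}
        self.active: List[str] = []
        self.t = 0  # index of the latest column

    def insert_agent(self, agent_id: str, pose: State) -> None:
        """Scene generation: add a row whose first valid entry is column t."""
        self.rows[agent_id] = [None] * self.t + [pose]
        self.active.append(agent_id)

    def step(self, next_states: Dict[str, Optional[State]]) -> None:
        """Motion simulation: append column t + 1.

        An active agent mapped to None (or missing from next_states) is removed.
        """
        for agent_id, row in self.rows.items():
            is_active = agent_id in self.active
            row.append(next_states.get(agent_id) if is_active else None)
        self.active = [a for a in self.active if next_states.get(a) is not None]
        self.t += 1

    def num_active(self, t: int) -> int:
        """Number of agents present in column t."""
        return sum(row[t] is not None for row in self.rows.values())


matrix = AgentMatrix()
matrix.insert_agent("ego", (0.0, 0.0))
matrix.insert_agent("a", (5.0, 0.0))
matrix.step({"ego": (1.0, 0.0), "a": None})          # agent "a" leaves the scene
matrix.insert_agent("b", (9.0, 9.0))                 # agent "b" enters at t = 1
matrix.step({"ego": (2.0, 0.0), "b": (9.0, 8.0)})
```

In this toy, each agent's row records its lifecycle (absent, active, exited), and the set of non-empty entries in a column is the spatial layout at that timestep.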

More Results

For more analysis, please refer to our paper.

BibTeX

@article{yang2025infgen,
  title={Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation},
  author={Yang, Xiuyu and Tan, Shuhan and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2506.17213},
  year={2025}
}