InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

Abstract

Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos for tasks like perception, tracking, and planning. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) an Instance Flow Guider module, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time; and (2) a Spatial Geometric Aligner module, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare yet safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems.

[Figures: InstaDrive overview; InstaDrive method]

1. Multimodal Condition Controllability

1.1 Layout Controllability

InstaDrive responds precisely to control conditions such as box projection, map projection, and instance flow.
We overlay the 3D bounding box projections onto the generated videos.
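
As a rough illustration of what a 3D bounding box projection involves, the sketch below computes the pixel coordinates of a box's eight corners with a standard pinhole camera model. It is not taken from InstaDrive: the function names, the `cam_from_ego` transform, and the intrinsic matrix `K` are assumptions made for this example, and drawing the box edges onto a frame is left out.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Eight corners of a 3D box given its center (x, y, z),
    size (length, width, height) and yaw about the z axis."""
    length, width, height = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * length / 2
    y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * width / 2
    z = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * height / 2
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (rot @ np.vstack([x, y, z])).T + np.asarray(center, dtype=float)

def project_points(points_cam, K):
    """Pinhole projection of Nx3 camera-frame points (z pointing forward).
    Returns Nx2 pixel coordinates and a mask of points in front of the camera."""
    depth = points_cam[:, 2]
    in_front = depth > 1e-6
    uvw = (K @ points_cam.T).T
    # Avoid dividing by non-positive depth; such points are flagged by the mask.
    uv = uvw[:, :2] / np.where(in_front, depth, 1.0)[:, None]
    return uv, in_front

def project_box(center, size, yaw, cam_from_ego, K):
    """Project a box defined in the ego frame into the image.
    cam_from_ego is an assumed 4x4 homogeneous transform (ego -> camera)."""
    corners = box_corners(center, size, yaw)
    corners_h = np.hstack([corners, np.ones((8, 1))])
    corners_cam = (cam_from_ego @ corners_h.T).T[:, :3]
    return project_points(corners_cam, K)
```

Connecting the projected corners with line segments, and skipping boxes whose mask is not entirely true, would give an overlay of the kind described above.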

2. Qualitative Comparison

2.1 Comparison with Baseline

2.2 More Results

3. Scenario Simulation Using CARLA-Generated Layouts

InstaDrive can generate driving videos from layout conditions provided by the CARLA simulator.

3.1 Corner Case in Autonomous Driving

InstaDrive can generate rare but critical driving scenarios from CARLA-generated conditions.
We leverage CARLA's waypoint mechanism to randomly generate various events across different maps and regions.
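
The idea of waypoint-driven random event generation can be sketched without a simulator. The toy example below is an assumption-laden illustration, not the InstaDrive pipeline and not CARLA code: `ToyWaypoint`, the three-lane road graph, and the event names are all hypothetical. It only mimics the behavior of a call such as CARLA's `Waypoint.next(distance)`, which returns the candidate waypoints a given distance ahead (more than one at a junction).

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyWaypoint:
    """Hypothetical stand-in for a simulator waypoint."""
    lane: str
    s: float  # metres travelled along the lane

# Hypothetical road graph: lane lengths and which lanes follow each lane.
LANE_LENGTH = {"A": 30.0, "B": 20.0, "C": 25.0}
SUCCESSORS = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

def next_waypoints(wp, step):
    """Candidate waypoints `step` metres ahead; several at a junction.
    Assumes `step` is shorter than every lane."""
    s = wp.s + step
    if s < LANE_LENGTH[wp.lane]:
        return [ToyWaypoint(wp.lane, s)]
    overshoot = s - LANE_LENGTH[wp.lane]
    return [ToyWaypoint(nxt, overshoot) for nxt in SUCCESSORS[wp.lane]]

def sample_route(start, step, n_steps, rng):
    """Random walk over the road graph, choosing a branch at each junction."""
    route = [start]
    for _ in range(n_steps):
        route.append(rng.choice(next_waypoints(route[-1], step)))
    return route

def sample_event(route, rng, kinds=("cut_in", "sudden_brake", "pedestrian_crossing")):
    """Attach a hypothetical event type to a random waypoint on the route."""
    idx = rng.randrange(1, len(route))
    return {"kind": rng.choice(kinds), "at": route[idx]}
```

Seeding the generator, for example `rng = random.Random(0)`, makes a sampled scenario reproducible, which is useful when a generated corner case needs to be replayed.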

3.2 Long-term Generation (2x speed)

InstaDrive can simulate long-duration driving scenarios from CARLA-generated layout conditions.
It generates new plausible content on the fly and maintains a consistent world for up to 1 minute.
This duration significantly exceeds that of the videos in the nuScenes dataset, whose scenes last about 20 seconds each.