ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Abstract

Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
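The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `p_mask` and `fg_weight` parameters, and the choice to let background tokens attend to no instance feature are all assumptions made for illustration. Per-token instance ids are assumed to have already been derived from the instance and trajectory masks.

```python
import math
import random


def instance_attention_mask(token_instance_ids, condition_instance_ids):
    """mask[q][k] is True when visual token q may attend to instance feature k.

    Assumption: background tokens carry the id None and attend to no instance.
    """
    return [
        [tid is not None and tid == cid for cid in condition_instance_ids]
        for tid in token_instance_ids
    ]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, restricted to allowed keys."""
    kept = [s for s, a in zip(scores, allowed) if a]
    if not kept:
        # No permitted key: the token receives no instance information.
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def instance_masked_loss(pred, target, foreground, p_mask=0.5, fg_weight=2.0, rng=None):
    """Weighted mean squared error with probabilistic foreground emphasis.

    With probability p_mask, foreground elements are up-weighted by fg_weight;
    otherwise the loss is a plain MSE, so the background is still supervised.
    Both hyperparameters are hypothetical.
    """
    rng = rng or random.Random()
    emphasize = rng.random() < p_mask
    weights = [fg_weight if (emphasize and fg) else 1.0 for fg in foreground]
    weighted_error = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return weighted_error / sum(weights)
```

In this sketch, a token labeled with instance 1 only receives attention weight from the feature of instance 1, which is one simple way to keep attributes from leaking between objects.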

ConsisDrive Overview

1. Stochastic Diversity of Generation

ConsisDrive can generate diverse videos from the same control conditions by varying the stochastic noise input.
Each group of videos below shows the control condition in the first row and two videos sampled with different noise inputs in the second and third rows.
Both sampled videos adhere to the constraints defined by the control condition.

2. Instance Identity Preservation

2.1 Comparison with Baseline

2.2 Instance Attribute Binding and Propagation

2.3 Foreground Small Objects Emphasis

ConsisDrive enhances the fidelity of small and challenging objects (e.g., pedestrians and bicycles).
We overlay the 3D bounding box projections onto the generated videos.
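For readers unfamiliar with this kind of overlay, the sketch below shows one common way to project a 3D box into an image with a pinhole camera model. It is an illustration only, not the page's actual visualization code. It assumes the box is already expressed in camera coordinates (x right, y down, z forward), that yaw is a rotation about the vertical y axis, and that the intrinsics `fx`, `fy`, `u0`, `v0` are known; all function names are hypothetical.

```python
import math
from itertools import product


def box_corners(center, size, yaw):
    """Return the 8 corners of a 3D box in camera coordinates."""
    cx, cy, cz = center
    w, h, l = size  # extents along x, y, z before rotation
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        dx, dy, dz = sx * w, sy * h, sz * l
        # Rotate the offset about the y axis, then translate to the center.
        corners.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return corners


def project_point(point, fx, fy, u0, v0):
    """Pinhole projection to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + u0, fy * y / z + v0)


def box_edges():
    """Index pairs of corners that differ in exactly one axis sign (12 edges)."""
    return [
        (i, j)
        for i in range(8)
        for j in range(i + 1, 8)
        if bin(i ^ j).count("1") == 1
    ]


def projected_edges(center, size, yaw, fx, fy, u0, v0):
    """2D line segments to draw; edges with a corner behind the camera are skipped."""
    pts = [project_point(p, fx, fy, u0, v0) for p in box_corners(center, size, yaw)]
    return [
        (pts[i], pts[j])
        for i, j in box_edges()
        if pts[i] is not None and pts[j] is not None
    ]
```

The returned segments can be drawn on each frame with any image library. Skipping edges that cross behind the camera is a simplification; a fuller version would clip them against the near plane.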