DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

¹The Chinese University of Hong Kong, Shenzhen ²Voyager Research, Didi Chuxing
*Equal Contribution †Corresponding Author

Abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

Method

Unified Video-Action Policy

Adapts a pretrained video diffusion transformer into an end-to-end driving policy via joint flow-matching over video and action tokens.
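A joint flow-matching loss over one video-action sequence can be sketched as follows. This is a minimal illustration, not DriveWAM's implementation: it assumes the common linear-interpolation (rectified-flow) path, a single time value shared by both modalities, and scalar tokens. `predict_velocity` is a hypothetical stand-in for the diffusion transformer.

```python
import random

def flow_matching_loss(video_tokens, action_tokens, predict_velocity, rng=random):
    """Mean squared velocity error over one concatenated video+action sequence.

    Assumes the linear path x_t = (1 - t) * noise + t * data, whose target
    velocity is (data - noise). Tokens are plain floats to keep the sketch small.
    """
    data = list(video_tokens) + list(action_tokens)    # one unified sequence
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()                                   # shared time (assumption)
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target = [d - n for n, d in zip(noise, data)]
    pred = predict_velocity(x_t, t)                    # hypothetical model call
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)
```

Because both modalities pass through the same call and the same loss, the network itself needs no action-specific head in this sketch. Whether DriveWAM weights the video and action terms differently is not stated in the source.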

Scene-Evolving Guidance

A frozen VLM generates chunk-specific semantic intent injected via temporally localized cross-attention, enabling scene-level reasoning.
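One way to read "temporally localized cross-attention" is as a mask that lets each sequence token see only the guidance text written for its own chunk. The sketch below assumes that exact-match rule; the source does not say whether earlier chunks' guidance is also visible. The function name and the chunk-id inputs are illustrative.

```python
def localized_cross_attention_mask(token_chunk_ids, guidance_chunk_ids):
    """mask[i][j] is True when sequence token i may attend to guidance token j.

    A token only sees guidance tokens that carry the same chunk id.
    """
    return [[tc == gc for gc in guidance_chunk_ids] for tc in token_chunk_ids]
```

For example, with sequence tokens in chunks `[0, 0, 1]` and guidance tokens in chunks `[0, 1, 1]`, the first two rows expose only the first guidance token and the last row exposes only the other two.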

Selective KV Memory

Training-free, modality-aware cache selection retains salient tokens and discards redundant ones, giving a 12× memory reduction for 300s rollouts.
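Relevance-redundancy selection can be sketched as a greedy pick that rewards similarity to the current query and penalizes similarity to entries already kept. This is an assumption modelled on maximal-marginal-relevance selection, not the paper's exact scoring rule. `redundancy_weight`, the cosine measure, and the per-modality budgets are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_kv(keys, query, budget, redundancy_weight=0.5):
    """Greedily pick up to `budget` cache indices.

    Score = similarity to the query (relevance) minus a penalty for the
    highest similarity to any key already kept (redundancy).
    """
    kept = []
    candidates = list(range(len(keys)))
    while candidates and len(kept) < budget:
        def score(i):
            redundancy = max((cosine(keys[i], keys[j]) for j in kept), default=0.0)
            return cosine(keys[i], query) - redundancy_weight * redundancy
        best = max(candidates, key=score)
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)  # restore temporal order

def select_per_modality(video_keys, action_keys, query, video_budget, action_budget):
    """Keep separate bounded pools for video and action entries."""
    return (select_kv(video_keys, query, video_budget),
            select_kv(action_keys, query, action_budget))
```

With keys `[[1, 0], [1, 0], [0, 1]]`, query `[1.0, 0.5]`, and a budget of 2, the sketch keeps indices 0 and 2: the duplicate key at index 1 is relevant but fully redundant with index 0.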

DriveWAM Pipeline

Overview of DriveWAM. Video and action tokens are organized into a unified temporal sequence within a pretrained video diffusion transformer. A frozen VLM provides scene-evolving driving guidance via chunk-specific cross-attention, and selective KV memory maintains bounded context for long-horizon autoregressive rollout.

Attention Mask Design

Training attention mask. Block-diagonal design ensures causal dependencies while preventing future information leakage.
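The exact mask layout is not given here, so the following is only a sketch of one mask with the two stated properties (causal dependencies, no future leakage). It assumes a block-causal rule over chunks and a `V_0 A_0 V_1 A_1 ...` ordering of the unified sequence; both are assumptions, and the helper names are illustrative.

```python
def build_sequence(num_chunks, video_per_chunk, action_per_chunk):
    """Return (modality, chunk) labels for a sequence laid out as V_0 A_0 V_1 A_1 ..."""
    labels = []
    for c in range(num_chunks):
        labels += [("video", c)] * video_per_chunk
        labels += [("action", c)] * action_per_chunk
    return labels

def block_causal_mask(labels):
    """mask[i][j] is True when token i may attend to token j.

    Tokens attend freely within their own chunk and to all earlier chunks,
    never to later chunks.
    """
    return [[cj <= ci for (_, cj) in labels] for (_, ci) in labels]
```

With two chunks of two video tokens and one action token each, the first three rows of the mask expose only the first three columns, while the last three rows expose all six.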

KV Cache Visualization

Selective KV memory. Naturally filters static background while retaining moving objects and safety-critical cues (12× memory reduction).

Results

NAVSIM v1

90.1 PDMS with a single front-view camera and a simple regression head — no anchors, no reinforcement learning.

PhysicalAI-AV

0.83m ADE@4s, substantially outperforming NVIDIA's Alpamayo-1.5 (1.44m).

Data Scaling

Consistent improvements from 4k to 100k clips, confirming scalability of video generative priors.

NAVSIM v1 Benchmark

MV: multi-view cameras; SV: single-view camera; L: LiDAR.

Method Sensors NC ↑ DAC ↑ TTC ↑ Comf. ↑ EP ↑ PDMS ↑
End-to-End Methods
UniAD MV 97.8 91.9 92.9 100.0 78.8 83.4
TransFuser MV & L 97.7 92.8 92.8 100.0 79.2 84.0
PARA-Drive MV 97.9 92.4 93.0 99.8 79.3 84.0
LAW SV 96.4 95.4 88.7 99.9 81.7 84.6
DiffusionDrive MV & L 98.2 96.2 94.7 100.0 82.2 88.1
WoTE MV & L 98.5 96.8 94.4 99.9 81.9 88.3
VLA-based Methods
ReCogDrive MV 98.1 94.7 94.2 100.0 80.9 86.5
DriveVLA-W0 SV 98.7 96.2 95.5 100.0 82.2 88.4
AutoVLA MV 98.4 95.6 98.0 99.9 81.9 89.1
DriveDreamer-Policy MV 98.4 97.1 95.1 100.0 83.5 89.2
WA-based Methods
Epona SV 97.9 95.1 93.8 99.9 80.4 86.2
WorldDrive SV 98.4 95.8 95.2 99.8 83.3 89.0
DriveWAM (Ours) SV 98.3 98.1 95.2 100.0 84.3 90.1
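For readers unfamiliar with the columns, the sketch below shows how the sub-metrics combine into a PDM score for a single scenario. It assumes the standard NAVSIM v1 definition: NC and DAC act as multiplicative penalties, and EP, TTC, and comfort are averaged with weights 5, 5, and 2. The function name is illustrative.

```python
def pdm_score(nc, dac, ttc, comfort, ep):
    """Per-scenario PDM score; all inputs are expected in [0, 1]."""
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted
```

Applying the formula to the averaged columns does not reproduce the PDMS column. The DriveWAM row, for instance, gives about 88.2 rather than 90.1, which is expected if the score is computed per scenario and then averaged.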

PhysicalAI-Autonomous-Vehicles Benchmark

Method Source # Params ADE@3s ↓ FDE@3s ↓ ADE@4s ↓ FDE@4s ↓
VaVAM Valeo 1.3B 2.31 4.32
Alpamayo-1.5 NVIDIA 10B 0.80 2.31 1.44 4.18
DriveWAM (Ours) 5B + 8B 0.47 1.35 0.83 2.47
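The displacement metrics in the table can be computed as below, assuming the usual definitions: ADE is the mean Euclidean distance between predicted and ground-truth waypoints over the horizon, and FDE is the distance at the final waypoint. The benchmark's waypoint rate and any multi-hypothesis handling are not stated in the source, so this sketch covers a single trajectory only.

```python
import math

def ade_fde(pred, gt):
    """pred and gt are equal-length lists of (x, y) waypoints in metres."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and the same length")
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]
```

ADE@4s and FDE@4s would then correspond to calling this on the waypoints covering the first four seconds of the prediction.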

Data Scaling

Data scaling performance. DriveWAM shows consistent improvements when scaling from 4k to 100k driving clips, demonstrating effective utilization of pretrained video generation priors.

Ablation Studies

Scene-Evolving Guidance

# Clips SE Guidance ADE@4s ↓ FDE@4s ↓
4k ✗ 1.21 3.65
4k ✓ 1.01 2.95
20k ✗ 0.95 2.94
20k ✓ 0.94 2.65
100k ✗ 0.92 2.75
100k ✓ 0.83 2.47

Scene-evolving guidance consistently improves performance across all data scales.

Video Foundation Model Adaptation

Pretrained Init. Video Sup. ADE@4s ↓ FDE@4s ↓
1.10 3.26
1.23 3.79
0.83 2.47

Both pretrained initialization and joint video supervision are essential: removing either one raises ADE@4s from 0.83 m to 1.10 m or more.

Selective KV Memory Strategy

KV Memory ADE@4s ↓ FDE@4s ↓ Mem. (GB) ↓ GFLOPs ↓
Full 0.83 2.47 3.07 17.37
FIFO 1.40 3.47 0.25 1.05
Selective (Ours) 0.89 2.52 0.25 1.44

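Selective KV memory stays within 0.06 m ADE@4s of the full cache (0.89 vs. 0.83) while cutting memory from 3.07 GB to 0.25 GB, a roughly 12× reduction for 300s rollouts. The FIFO baseline, at the same 0.25 GB budget, degrades substantially (1.40 ADE@4s).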
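The headline ratios follow directly from the table entries. The short check below uses only the reported numbers; the variable names are illustrative.

```python
full_mem_gb, selective_mem_gb = 3.07, 0.25
full_gflops, selective_gflops = 17.37, 1.44

mem_ratio = full_mem_gb / selective_mem_gb      # about 12.3x less memory
flops_ratio = full_gflops / selective_gflops    # about 12.1x fewer GFLOPs
ade_gap_m = 0.89 - 0.83                         # accuracy cost vs. the full cache
fifo_gap_m = 1.40 - 0.83                        # accuracy cost of FIFO eviction
```

At the same memory budget, the selective strategy loses about 0.06 m of ADE@4s where FIFO loses about 0.57 m, at the cost of slightly more compute (1.44 vs. 1.05 GFLOPs).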

Qualitative Results

Qualitative results on NAVSIM (left) and PhysicalAI-Autonomous-Vehicles (right). The predicted ego trajectories are consistent with the jointly generated future scenes.

More Qualitative Results

Additional qualitative results showcasing DriveWAM's joint video-action generation across diverse driving scenarios.

BibTeX

@article{shi2025drivewam,
  title   = {DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
  author  = {Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
  journal = {arXiv preprint arXiv:2605.28544},
  year    = {2025}
}