DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

Method

Unified Video-Action Policy

Adapts a pretrained video diffusion transformer into an end-to-end driving policy via joint flow-matching over video and action tokens.
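
The joint objective can be illustrated with a minimal NumPy sketch. It assumes a rectified-flow style linear path between noise and data and a plain concatenation of video and action tokens along the sequence axis; the page does not state the exact parameterization or token layout, so `joint_flow_matching_loss` and `velocity_fn` are hypothetical names, not the released implementation.

```python
import numpy as np

def joint_flow_matching_loss(velocity_fn, video_latents, action_tokens, rng):
    """Flow-matching loss over one concatenated video+action sequence.

    velocity_fn(x_t, t) -> predicted velocity with the same shape as x_t.
    video_latents: (B, Nv, D); action_tokens: (B, Na, D). Both are assumed
    to be already projected to a shared width D (an assumption of this sketch).
    """
    x1 = np.concatenate([video_latents, action_tokens], axis=1)  # clean tokens
    x0 = rng.standard_normal(x1.shape)                           # Gaussian noise
    t = rng.uniform(size=(x1.shape[0], 1, 1))                    # one time per sample
    x_t = (1.0 - t) * x0 + t * x1                                # linear interpolation
    target = x1 - x0                                             # velocity along the path
    pred = velocity_fn(x_t, t)
    # A single mean-squared error covers both modalities.
    return float(np.mean((pred - target) ** 2))
```

Because one loss spans the whole sequence, video and action tokens are supervised by the same network under the same noise schedule, which is the property the unified design relies on.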

Scene-Evolving Guidance

A frozen VLM generates chunk-specific semantic intent injected via temporally localized cross-attention, enabling scene-level reasoning.
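
One way to read "temporally localized cross-attention" is that tokens of a given chunk attend only to the guidance tokens produced for that chunk. The sketch below illustrates that reading with a single head and no learned projections; the function name and the chunk-index arrays are assumptions made for illustration.

```python
import numpy as np

def chunk_localized_cross_attention(queries, guidance, q_chunk, g_chunk):
    """Cross-attention restricted to same-chunk guidance tokens.

    queries: (Nq, D) video/action tokens; guidance: (Ng, D) VLM intent tokens.
    q_chunk: (Nq,) and g_chunk: (Ng,) integer chunk indices.
    Assumes every chunk that has queries also has at least one guidance token.
    """
    d = queries.shape[-1]
    scores = queries @ guidance.T / np.sqrt(d)             # (Nq, Ng)
    allowed = q_chunk[:, None] == g_chunk[None, :]         # same-chunk pairs only
    scores = np.where(allowed, scores, -np.inf)            # mask other chunks
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ guidance                              # (Nq, D)
```

With this mask, changing the intent text of a later chunk leaves the outputs of earlier chunks untouched, so guidance can evolve with the scene without rewriting past context.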

Selective KV Memory

Training-free, modality-aware cache selection retains salient tokens and discards redundant ones, giving a 12× memory reduction for 300 s rollouts.
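
The page does not give the scoring rule behind relevance-redundancy selection. As a stand-in, the sketch below uses a greedy, maximal-marginal-relevance style rule: a cached key scores high if it is similar to the current query and low if it resembles a key that was already kept. The names `select_kv`, `select_modality_aware`, and `redundancy_weight` are hypothetical.

```python
import numpy as np

def select_kv(keys, query, budget, redundancy_weight=0.5):
    """Greedily keep `budget` cache entries: relevant to `query`, mutually dissimilar.

    keys: (N, D) cached keys; query: (D,) current query summary.
    Returns the kept indices in ascending (temporal) order.
    """
    keys_n = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    relevance = keys_n @ (query / np.linalg.norm(query))   # cosine relevance, (N,)
    similarity = keys_n @ keys_n.T                         # pairwise cosine, (N, N)
    max_sim = np.zeros(len(keys))     # highest similarity to any kept key so far
    available = np.ones(len(keys), dtype=bool)
    selected = []
    for _ in range(min(budget, len(keys))):
        score = relevance - redundancy_weight * max_sim
        score[~available] = -np.inf                        # never pick an entry twice
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        max_sim = np.maximum(max_sim, similarity[best])
    return sorted(selected)

def select_modality_aware(video_keys, action_keys, query, video_budget, action_budget):
    """Run the selection separately per modality so each pool stays bounded."""
    return (select_kv(video_keys, query, video_budget),
            select_kv(action_keys, query, action_budget))
```

Keeping separate budgets for the video and action pools prevents the far more numerous video tokens from crowding out action history, and the fixed budgets are what keep memory bounded as the rollout grows.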

Overview of DriveWAM. Video and action tokens are organized into a unified temporal sequence within a pretrained video diffusion transformer. A frozen VLM provides scene-evolving driving guidance via chunk-specific cross-attention, and selective KV memory maintains bounded context for long-horizon autoregressive rollout.

Training attention mask. The block-diagonal design enforces causal dependencies between chunks and prevents leakage of future information.
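
Since the figure itself is not reproduced here, the exact mask layout is not recoverable from the caption. A common construction that matches the description is a chunk-wise causal mask: tokens attend freely within their own chunk and to all earlier chunks, never to later ones. The helper below is a sketch under that assumption.

```python
import numpy as np

def block_causal_mask(chunk_ids):
    """Boolean (N, N) mask; entry [i, j] is True when token i may attend to token j.

    chunk_ids: length-N sequence giving the temporal chunk of each token.
    A token sees its own chunk and every earlier chunk, but no later chunk.
    """
    chunk_ids = np.asarray(chunk_ids)
    return chunk_ids[:, None] >= chunk_ids[None, :]
```

For chunk indices `[0, 0, 1, 1, 2]`, the two tokens of chunk 0 see each other but nothing else, while the single token of chunk 2 sees the whole sequence.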

Selective KV memory. The selection naturally filters out static background while retaining moving objects and safety-critical cues (12× memory reduction).

Results

NAVSIM v1

90.1 PDMS with a single front-view camera and a simple regression head, without anchors or reinforcement learning.

PhysicalAI-AV

0.83 m ADE@4s, about 42% lower than NVIDIA's Alpamayo-1.5 (1.44 m).

Data Scaling

Consistent improvements from 4k to 100k clips, confirming scalability of video generative priors.

NAVSIM v1 Benchmark

MV: multi-view cameras; SV: single-view camera; L: LiDAR.

Method	Sensors	NC ↑	DAC ↑	TTC ↑	Comf. ↑	EP ↑	PDMS ↑
End-to-End Methods
UniAD	MV	97.8	91.9	92.9	100.0	78.8	83.4
TransFuser	MV & L	97.7	92.8	92.8	100.0	79.2	84.0
PARA-Drive	MV	97.9	92.4	93.0	99.8	79.3	84.0
LAW	SV	96.4	95.4	88.7	99.9	81.7	84.6
DiffusionDrive	MV & L	98.2	96.2	94.7	100.0	82.2	88.1
WoTE	MV & L	98.5	96.8	94.4	99.9	81.9	88.3
VLA-based Methods
ReCogDrive	MV	98.1	94.7	94.2	100.0	80.9	86.5
DriveVLA-W0	SV	98.7	96.2	95.5	100.0	82.2	88.4
AutoVLA	MV	98.4	95.6	98.0	99.9	81.9	89.1
DriveDreamer-Policy	MV	98.4	97.1	95.1	100.0	83.5	89.2
WA-based Methods
Epona	SV	97.9	95.1	93.8	99.9	80.4	86.2
WorldDrive	SV	98.4	95.8	95.2	99.8	83.3	89.0
DriveWAM (Ours)	SV	98.3	98.1	95.2	100.0	84.3	90.1
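
For reference, the NAVSIM v1 benchmark, as we understand its definition (it is not restated on this page), combines the five sub-scores so that NC (no at-fault collision) and DAC (drivable-area compliance) act as multiplicative penalties, while EP (ego progress), TTC (time-to-collision), and Comf. (comfort) enter as a weighted average:

```latex
\mathrm{PDMS} \;=\; \mathrm{NC} \times \mathrm{DAC} \times
\frac{5\,\mathrm{EP} + 5\,\mathrm{TTC} + 2\,\mathrm{Comf.}}{12}
```

The score is computed per scenario and then averaged, so plugging the column averages from the table into this formula need not reproduce the reported PDMS.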

PhysicalAI-Autonomous-Vehicles Benchmark

Method	Source	# Params	ADE@3s (m) ↓	FDE@3s (m) ↓	ADE@4s (m) ↓	FDE@4s (m) ↓
VaVAM	Valeo	1.3B	2.31	4.32	—	—
Alpamayo-1.5	NVIDIA	10B	0.80	2.31	1.44	4.18
DriveWAM (Ours)	—	5B + 8B	0.47	1.35	0.83	2.47
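
ADE and FDE follow the usual displacement-error definitions: the mean and the final L2 distance between predicted and ground-truth waypoints up to the stated horizon. The sketch below assumes 2D waypoints in meters at a fixed sampling rate; the benchmark's actual rate and any averaging over multiple predicted modes are not specified on this page, so `hz` is an illustrative parameter.

```python
import numpy as np

def ade_fde(pred, gt, horizon_s, hz):
    """Average and final displacement error up to `horizon_s` seconds.

    pred, gt: (T, 2) future ego waypoints in meters, sampled at `hz` Hz,
    with the first row corresponding to time 1 / hz.
    """
    steps = int(round(horizon_s * hz))
    errors = np.linalg.norm(pred[:steps] - gt[:steps], axis=-1)  # per-step L2
    return float(errors.mean()), float(errors[-1])
```

Because FDE looks only at the last waypoint while ADE averages over the whole horizon, FDE is typically the larger of the two, as in every row of the table.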

Data Scaling

Data scaling performance. DriveWAM shows consistent improvements when scaling from 4k to 100k driving clips, demonstrating effective utilization of pretrained video generation priors.

Ablation Studies

Scene-Evolving Guidance

# Clips	SE Guidance	ADE@4s (m) ↓	FDE@4s (m) ↓
4k	✗	1.21	3.65
4k	✓	1.01	2.95
20k	✗	0.95	2.94
20k	✓	0.94	2.65
100k	✗	0.92	2.75
100k	✓	0.83	2.47

Scene-evolving guidance improves both ADE@4s and FDE@4s at every data scale, although the ADE gain at 20k clips is marginal (0.95 to 0.94).

Video Foundation Model Adaptation

Pretrained Init.	Video Sup.	ADE@4s (m) ↓	FDE@4s (m) ↓
✗	✓	1.10	3.26
✓	✗	1.23	3.79
✓	✓	0.83	2.47

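Both pretrained initialization and joint video supervision are essential: removing pretrained initialization raises ADE@4s from 0.83 to 1.10, and removing video supervision raises it to 1.23.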

Selective KV Memory Strategy

KV Memory	ADE@4s (m) ↓	FDE@4s (m) ↓	Mem. (GB) ↓	GFLOPs ↓
Full	0.83	2.47	3.07	17.37
FIFO	1.40	3.47	0.25	1.05
Selective (Ours)	0.89	2.52	0.25	1.44

Selective KV memory nearly recovers full-cache accuracy (0.89 vs. 0.83 ADE@4s) while achieving a 12× memory reduction for 300 s rollouts. The FIFO baseline, at the same memory budget, degrades substantially (1.40 ADE@4s).
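
The trade-off in the table can be checked with simple arithmetic. The snippet below copies the three rows and reports, relative to the full cache, the memory and compute reduction factors and the added ADE@4s; the helper name `versus_full` is only illustrative.

```python
# (ADE@4s in m, memory in GB, GFLOPs), copied from the table above.
ROWS = {
    "Full": (0.83, 3.07, 17.37),
    "FIFO": (1.40, 0.25, 1.05),
    "Selective": (0.89, 0.25, 1.44),
}

def versus_full(name, rows=ROWS):
    """Compare one cache strategy against the full-cache baseline."""
    ade, mem, flops = rows[name]
    full_ade, full_mem, full_flops = rows["Full"]
    return {
        "memory_reduction": full_mem / mem,      # how many times less memory
        "flops_reduction": full_flops / flops,   # how many times less compute
        "ade_increase_m": ade - full_ade,        # accuracy cost in meters
    }

for name in ("FIFO", "Selective"):
    print(name, {k: round(v, 2) for k, v in versus_full(name).items()})
```

Both bounded strategies use 3.07 / 0.25 ≈ 12.3 times less memory than the full cache. The selective variant adds 0.06 m of ADE@4s, while FIFO adds 0.57 m.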

Qualitative Results

Qualitative results on NAVSIM (left) and PhysicalAI-Autonomous-Vehicles (right). The predicted ego trajectories are consistent with the jointly generated future scenes.

Additional qualitative results showcasing DriveWAM's joint video-action generation across diverse driving scenarios.

BibTeX

@article{shi2025drivewam,
  title   = {DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
  author  = {Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
  journal = {arXiv preprint arXiv:2605.28544},
  year    = {2025}
}