Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.
Unified Video-Action Policy
Adapts a pretrained video diffusion transformer into an end-to-end driving policy via joint flow-matching over video and action tokens.
Scene-Evolving Guidance
A frozen VLM generates chunk-specific semantic intent injected via temporally localized cross-attention, enabling scene-level reasoning.
Selective KV Memory
Training-free, modality-aware cache selection retains salient tokens and discards redundant ones, giving a 12× memory reduction for 300 s rollouts.
Overview of DriveWAM. Video and action tokens are organized into a unified temporal sequence within a pretrained video diffusion transformer. A frozen VLM provides scene-evolving driving guidance via chunk-specific cross-attention, and selective KV memory maintains bounded context for long-horizon autoregressive rollout.
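As a rough illustration of the unified temporal sequence and the joint flow-matching objective, the NumPy sketch below lays out each chunk's video tokens followed by its action tokens and applies one velocity-regression loss to all of them. It assumes a rectified-flow style linear path between noise and data and a single noise level per sample; the page does not state DriveWAM's exact path, time sampling, or loss weighting. The names `interleave_chunks`, `joint_flow_matching_loss`, and `velocity_fn`, and all token shapes, are hypothetical.

```python
import numpy as np

def interleave_chunks(video_tokens, action_tokens):
    """Lay out each chunk as [video tokens, action tokens] along one time axis.

    video_tokens:  (num_chunks, n_video, dim)
    action_tokens: (num_chunks, n_action, dim)
    returns:       (num_chunks * (n_video + n_action), dim)
    """
    per_chunk = np.concatenate([video_tokens, action_tokens], axis=1)
    return per_chunk.reshape(-1, per_chunk.shape[-1])

def joint_flow_matching_loss(velocity_fn, clean_tokens, rng):
    """Mean-squared velocity error shared by video and action tokens.

    Assumed path: x_t = (1 - t) * noise + t * clean, so the target
    velocity is (clean - noise). This is an illustrative choice only.
    """
    noise = rng.standard_normal(clean_tokens.shape)
    t = rng.uniform(0.0, 1.0)
    x_t = (1.0 - t) * noise + t * clean_tokens
    target = clean_tokens - noise
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
video = rng.standard_normal((3, 4, 8))   # 3 chunks, 4 video tokens each
action = rng.standard_normal((3, 2, 8))  # 3 chunks, 2 action tokens each
seq = interleave_chunks(video, action)

# A stand-in "model" that always predicts zero velocity.
loss = joint_flow_matching_loss(lambda x, t: np.zeros_like(x), seq, rng)
```

In a real model `velocity_fn` would be the video diffusion transformer; the point of the sketch is only that both modalities share one sequence and one loss.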
Training attention mask. Block-diagonal design ensures causal dependencies while preventing future information leakage.
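One plausible reading of this mask is chunk-level causality: every token attends to all tokens in its own chunk and in earlier chunks, and never to later chunks. The sketch below builds such a boolean mask with NumPy. The function name `block_causal_mask` and the equal chunk sizes are assumptions for illustration; the actual mask may treat video and action tokens differently.

```python
import numpy as np

def block_causal_mask(num_chunks, tokens_per_chunk):
    """Boolean (N, N) mask; entry [q, k] is True when query q may attend to key k.

    Tokens see their own chunk and all earlier chunks, never later ones.
    """
    chunk_id = np.repeat(np.arange(num_chunks), tokens_per_chunk)
    return chunk_id[:, None] >= chunk_id[None, :]

mask = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
```

The blocks on the diagonal are fully visible, which is what lets tokens inside a chunk be denoised jointly while later chunks stay hidden.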
Selective KV memory. Naturally filters static background while retaining moving objects and safety-critical cues (12× memory reduction).
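A relevance-redundancy selection of this kind can be sketched as a greedy procedure: keep the cached key most relevant to the current query, then repeatedly add the key whose relevance minus its similarity to already kept keys is highest. The code below is a minimal sketch under that assumption, using cosine similarity and a fixed budget per modality. The scoring rule, `redundancy_weight`, and the function names are hypothetical and not taken from DriveWAM.

```python
import numpy as np

def select_kv(keys, query, budget, redundancy_weight=1.0):
    """Return sorted indices of at most `budget` cached keys to keep.

    Relevance:  cosine similarity between each key and the query.
    Redundancy: highest cosine similarity to any key already kept.
    """
    if len(keys) <= budget:
        return list(range(len(keys)))
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    relevance = unit @ q
    kept = [int(np.argmax(relevance))]
    while len(kept) < budget:
        redundancy = (unit @ unit[kept].T).max(axis=1)
        score = relevance - redundancy_weight * redundancy
        score[kept] = -np.inf  # never pick the same entry twice
        kept.append(int(np.argmax(score)))
    return sorted(kept)

def select_memory(video_keys, action_keys, query, video_budget, action_budget):
    """Keep separate bounded pools for video and action entries."""
    return {
        "video": select_kv(video_keys, query, video_budget),
        "action": select_kv(action_keys, query, action_budget),
    }

keys = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
query = np.array([1.0, 0.2])
kept = select_kv(keys, query, budget=2)
pools = select_memory(keys, keys[:1], query, video_budget=2, action_budget=4)
```

In the toy example the second key duplicates the first, so the selection keeps the first key and the orthogonal third key rather than the duplicate. Repeated static background tokens would be dropped by the same mechanism.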
NAVSIM v1
90.1 PDMS with a single front-view camera and a simple regression head, without anchors or reinforcement learning.
PhysicalAI-AV
0.83 m ADE@4s, substantially outperforming NVIDIA's Alpamayo-1.5 (1.44 m).
Data Scaling
Consistent improvements from 4k to 100k clips, confirming scalability of video generative priors.
MV: multi-view cameras; SV: single-view camera; L: LiDAR.
| Method | Sensors | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|
| End-to-End Methods | | | | | | | |
| UniAD | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 |
| TransFuser | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 |
| PARA-Drive | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
| LAW | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| DiffusionDrive | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| VLA-based Methods | | | | | | | |
| ReCogDrive | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 |
| DriveVLA-W0 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 |
| AutoVLA | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| DriveDreamer-Policy | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
| WA-based Methods | | | | | | | |
| Epona | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |
| WorldDrive | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 |
| DriveWAM (Ours) | SV | 98.3 | 98.1 | 95.2 | 100.0 | 84.3 | 90.1 |
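
For readers unfamiliar with the columns above, the NAVSIM v1 PDM score (PDMS) combines the sub-metrics multiplicatively and additively: no at-fault collision (NC) and drivable area compliance (DAC) act as multiplicative penalties, while ego progress (EP), time-to-collision (TTC), and comfort are averaged with weights 5, 5, and 2. The sketch below writes this out for a single scenario with all inputs in [0, 1]. The benchmark averages per-scenario scores, so plugging the column averages from the table into this function will not in general reproduce the reported PDMS. The function name `pdms` is illustrative.

```python
def pdms(nc, dac, ttc, comfort, ep):
    """Per-scenario PDM score; every argument is a sub-score in [0, 1].

    NC and DAC gate the score; EP, TTC, and comfort are weighted 5 : 5 : 2.
    """
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted

perfect = pdms(nc=1.0, dac=1.0, ttc=1.0, comfort=1.0, ep=1.0)
collided = pdms(nc=0.0, dac=1.0, ttc=1.0, comfort=1.0, ep=1.0)
```

Because NC and DAC multiply the rest, a collision or leaving the drivable area zeroes the scenario score regardless of progress or comfort.
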
| Method | Source | # Params | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|---|---|
| VaVAM | Valeo | 1.3B | 2.31 | 4.32 | — | — |
| Alpamayo-1.5 | NVIDIA | 10B | 0.80 | 2.31 | 1.44 | 4.18 |
| DriveWAM (Ours) | — | 5B + 8B | 0.47 | 1.35 | 0.83 | 2.47 |
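
ADE and FDE in the table above are displacement errors in metres between predicted and ground-truth ego waypoints: ADE averages the Euclidean error over all waypoints up to the horizon, and FDE takes the error at the last waypoint. The sketch below computes both with NumPy. The waypoint sampling rate `hz` and the helper names are assumptions; the benchmark's own evaluation code may differ in details such as sampling and aggregation.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 2) arrays of x/y waypoints in metres. Returns (ADE, FDE)."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(dist.mean()), float(dist[-1])

def ade_fde_at(pred, gt, horizon_s, hz):
    """Evaluate only the first `horizon_s` seconds, assuming `hz` waypoints per second."""
    steps = int(round(horizon_s * hz))
    return ade_fde(pred[:steps], gt[:steps])

gt = np.zeros((4, 2))
pred = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
ade_full, fde_full = ade_fde(pred, gt)
ade_1s, fde_1s = ade_fde_at(pred, gt, horizon_s=1.0, hz=2)
```

Errors typically grow along the trajectory, which is why FDE is larger than ADE at the same horizon throughout the table.
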
Data scaling performance. DriveWAM shows consistent improvements when scaling from 4k to 100k driving clips, demonstrating effective utilization of pretrained video generation priors.
| # Clips | SE Guidance | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|
| 4k | ✗ | 1.21 | 3.65 |
| 4k | ✓ | 1.01 | 2.95 |
| 20k | ✗ | 0.95 | 2.94 |
| 20k | ✓ | 0.94 | 2.65 |
| 100k | ✗ | 0.92 | 2.75 |
| 100k | ✓ | 0.83 | 2.47 |
Scene-evolving guidance improves both ADE@4s and FDE@4s at every data scale, although the ADE gain at 20k clips is marginal (0.95 to 0.94).
| Pretrained Init. | Video Sup. | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|
| ✗ | ✓ | 1.10 | 3.26 |
| ✓ | ✗ | 1.23 | 3.79 |
| ✓ | ✓ | 0.83 | 2.47 |
Both pretrained initialization and joint video supervision are essential: removing either raises ADE@4s from 0.83 to at least 1.10.
| KV Memory | ADE@4s ↓ | FDE@4s ↓ | Mem. (GB) ↓ | GFLOPs ↓ |
|---|---|---|---|---|
| Full | 0.83 | 2.47 | 3.07 | 17.37 |
| FIFO | 1.40 | 3.47 | 0.25 | 1.05 |
| Selective (Ours) | 0.89 | 2.52 | 0.25 | 1.44 |
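Selective KV memory nearly recovers full-cache accuracy (0.89 vs. 0.83 ADE@4s) while cutting memory from 3.07 GB to 0.25 GB, a 12× reduction, for 300 s rollouts. The FIFO baseline uses the same memory but degrades substantially (1.40 ADE@4s).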
Qualitative results on NAVSIM (left) and PhysicalAI-Autonomous-Vehicles (right). The predicted ego trajectories are consistent with the jointly generated future scenes.
Additional qualitative results showcasing DriveWAM's joint video-action generation across diverse driving scenarios.
@article{shi2025drivewam,
title = {DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
author = {Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
journal = {arXiv preprint arXiv:2605.28544},
year = {2025}
}