
Flex: Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Jiawei Yang^π*, Ziyu Chen^ρ*, Yurong You*, Yan Wang*, Yiming Li*, Yuxiao Chen*, Boyi Li*,
Boris Ivanovic*, Marco Pavone^ρ*, Yue Wang^π*
π University of Southern California  ·  ρ Stanford University  ·  * NVIDIA Research

A lightweight, geometry-agnostic Transformer that compresses all camera views and timesteps into a compact set of learnable scene tokens. Flex removes 3D/4D priors and delivers a better efficiency–accuracy trade-off for LLM-based driving policies.

Highlights

  • 2.2× faster inference throughput (41.08 vs. 18.60 clips/s)
  • 4.64% lower driving error (minADE6 0.761 vs. 0.798)
  • 68.75% fewer scene tokens (900 vs. 2880) for 18 input images
  • 20k hours of driving data across 1,700 cities in 25 countries

Abstract

We introduce Flex, a scene encoder for end-to-end driving that addresses the cost of processing high-bandwidth multi-camera inputs. Flex prepends a small set of learnable scene tokens to all image tokens across views and timesteps and performs lightweight joint self-attention. After encoding, only the updated scene tokens are passed to the LLM policy, enforcing a data-driven compression bottleneck—no BEV, voxels, or tri/hex-planes are required. On a proprietary dataset of 20,000 driving hours, Flex achieves 2.2× higher inference throughput (41.08 vs. 18.60 clips/s) while improving driving accuracy (minADE6 0.761 vs. 0.798), and shows emergent specialization toward destinations, lane markers, and safety-critical regions.

Method in 30 seconds

Figure 1. Flex concatenates K learnable scene tokens with all image tokens (multi-view × multi-timestep), runs a small Transformer encoder with joint self-attention, then keeps only the updated scene tokens as the scene representation for the policy head.

Joint, geometry-agnostic compression

Rather than imposing 3D priors (BEV/voxels/planes), Flex lets data decide what to keep. Joint attention across cameras and time suppresses redundancy and yields a compact, action-relevant scene summary.
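The mechanism fits in a few lines. The PyTorch sketch below illustrates the idea rather than the released implementation: the class name, embedding width, head count, and depth are placeholder assumptions, while the scene-token count (K = 900) and the concatenate, attend, slice structure follow the description above.

import torch
import torch.nn as nn

class SceneTokenEncoder(nn.Module):
    """Flex-style compressor sketch: prepend K learnable scene tokens to all
    image tokens (views x timesteps flattened together), run joint
    self-attention, and keep only the updated scene tokens for the policy."""

    def __init__(self, dim: int = 768, num_scene_tokens: int = 900,
                 depth: int = 8, heads: int = 12):
        super().__init__()
        self.scene_tokens = nn.Parameter(0.02 * torch.randn(num_scene_tokens, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, V*T*N, dim) -- all views and timesteps flattened together
        B = image_tokens.shape[0]
        K = self.scene_tokens.shape[0]
        scene = self.scene_tokens.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([scene, image_tokens], dim=1)   # (B, K + V*T*N, dim)
        x = self.encoder(x)                           # joint self-attention over everything
        return x[:, :K]                               # only the scene tokens go on

The policy (e.g., an LLM behind a projection layer) then consumes only these K scene tokens rather than all V·T·N image tokens.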

Compact token budget

  • Typical setup: 2 cameras × 9 timesteps → K=900 scene tokens (<1K to the policy).
  • Baseline passes ~2880 tokens directly to the policy (token arithmetic sketched just below).
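For concreteness, the token-budget arithmetic behind these numbers (the ~160 patch tokens per image is inferred from 2880 / 18 rather than stated explicitly):

cams, steps = 2, 9                                  # 18 input images per clip
tokens_per_image = 2880 // (cams * steps)           # ~160 patch tokens per image (inferred)
baseline_tokens = cams * steps * tokens_per_image   # 2880 image tokens passed to the policy
flex_tokens = 900                                   # K scene tokens passed to the policy
reduction = 1 - flex_tokens / baseline_tokens       # 0.6875, i.e. 68.75% fewer tokens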

Interleaved prediction

Train with supervision at every prefix (varying context lengths) via attention masks. This increases supervision density and robustness and is key to Flex’s gains.
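One way such every-prefix supervision can be realized is a block-causal attention mask over per-timestep token groups, so a single forward pass yields a prediction (and a loss) at every context length. The helper below is a sketch under that assumption; the per-step token layout and the 100-tokens-per-step example are hypothetical, and the boolean convention follows torch.nn.TransformerEncoder (True = not allowed to attend).

import torch

def interleaved_prefix_mask(num_steps: int, tokens_per_step: int) -> torch.Tensor:
    """Block-causal mask: tokens of timestep t may attend to every token at
    timesteps <= t, so one forward pass supervises every prefix length."""
    total = num_steps * tokens_per_step
    step = torch.arange(total) // tokens_per_step     # timestep index of each token
    allowed = step.unsqueeze(1) >= step.unsqueeze(0)  # query step >= key step
    return ~allowed                                   # True = masked out

# Example: 9 timesteps with a hypothetical 100 tokens per step -> (900, 900) mask
mask = interleaved_prefix_mask(num_steps=9, tokens_per_step=100)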

Emergent scene decomposition

Scene tokens specialize—top-ranked tokens focus on destination; mid-ranked exhibit look-ahead scanning; lower-ranked capture lane markings—without explicit supervision.

Results

System-level (2 cams × 9 steps, LLM policy)

  • Throughput: 41.08 vs. 18.60 clips/s (2.2×).
  • Driving error: minADE6 0.761 vs. 0.798 (lower is better).
  • Training cost: ~60% of baseline at Stage 1; fewer tokens passed to the policy account for most of the savings.

Numbers are from Table 1 and the main text.

Ablations (high-level)

  • K tokens: accuracy improves as K grows and saturates around K ≈ 900; throughput decreases as K grows.
  • Encoder depth: gains up to ~8 layers; deeper brings little extra.
  • Attention design: joint self-attention > cross-attention; per-image compression underperforms joint scene encoding.
  • Interleaving: critical for both the baseline and Flex (e.g., Flex minADE6 improves from 0.991 to 0.833 with interleaving at Stage 1).

Scalability across cameras

As camera count increases (2→4→7), Flex maintains accuracy and widens the throughput gap (up to ~3.4×), while baseline accuracy degrades due to exploding token counts.

Figure: Pareto frontiers for patchifier size, token count, and encoder depth, alongside attention, interleaving, and camera-count ablations.
Figure: Emergent specialization heatmaps: destination-focused tokens (ranks 1–3), look-ahead scanning (mid ranks), and lane-marking tokens (lower ranks).

BibTeX

@article{yang2025flex,
  title     = {Flex: Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving},
  author    = {Yang, Jiawei and Chen, Ziyu and You, Yurong and Wang, Yan and Li, Yiming and Chen, Yuxiao and Li, Boyi and Ivanovic, Boris and Pavone, Marco and Wang, Yue},
  journal   = {arXiv preprint arXiv:tbd},
  year      = {2025},
  note      = {Flex scene encoder with compact learned tokens for LLM-based driving policies}
}