Latent Denoising Makes Good Tokenizers
A simple tokenizer trained to reconstruct clean images from heavily corrupted latent embeddings—aligning tokens with downstream denoising objectives and improving image generation across six representative models.
Jiawei Yang1, Tianhong Li2, Lijie Fan3, Yonglong Tian4, Yue Wang1
1University of Southern California 2MIT CSAIL 3Google DeepMind 4OpenAI
Abstract
Despite their fundamental role, it remains unclear what properties could make tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective—reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking—a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under significant corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MSCOCO) image generation benchmarks demonstrate that our l-DeTok consistently improves generation quality across six representative generative models compared to prior tokenizers. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.
Method in 30 seconds
Motivation: Why train tokenizers for denoising?
Modern diffusion and autoregressive models both learn to reconstruct clean signals from corrupted contexts (noise or masking). Tokenizers are usually optimized for pixel MSE, not for what generators actually learn. We ask what makes a tokenizer effective for generation. Our answer: make tokens easy to recover under the same kinds of degradations used by downstream models.
Unified denoising view
Non-Autoregressive (Non-AR) models remove injected noise; Autoregressive (AR) models fill in masked context. Both are reconstruction-from-deconstruction. We therefore train the tokenizer so that its latents remain reconstructable even when heavily corrupted.
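The parallel can be made concrete in a few lines. Below is an illustrative sketch of the two downstream corruption processes; tensor names and shapes are assumptions for exposition, not code from the paper:

```python
import torch

tokens = torch.randn(8, 256, 16)   # a batch of tokenized images: (B, N, D)

# Non-AR (diffusion/flow) training: corrupt tokens with injected Gaussian noise,
# then learn to recover the clean signal.
t = torch.rand(8, 1, 1)                                        # per-sample noise level
noisy_tokens = (1 - t) * tokens + t * torch.randn_like(tokens)

# AR / masked-prediction training: corrupt by hiding tokens, then learn to fill
# the missing context back in from the visible ones.
mask = torch.rand(8, 256) < 0.75                               # hide 75% of tokens
visible_tokens, targets = tokens[~mask], tokens[mask]          # predict targets from visible
```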
l-DeTok in a nutshell
- Design principle: Treat tokenization as reconstruction-from-deconstruction: deliberately corrupt latents, then learn to reconstruct the original image.
- Interpolative latent noise: mix encoder latents with Gaussian noise at a random strength; the decoder learns to reconstruct original pixels.
- Optional random masking: hide a random fraction of patches and provide shared [MASK] tokens to the decoder for context completion.
Why this improves generation
By training tokens to survive strong corruption, the tokenizer's latent embeddings match downstream training (noise removal or mask filling). This alignment eases optimization and yields better sample quality across both AR and non-AR families—without relying on large pretrained vision encoders.
Minimal recipe (You can implement this in less than 10 lines of code)
- Corrupt latents with interpolative noise (and optional masking), then decode back to the original signals, e.g., pixels; see the sketch below.
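A minimal sketch of one training step, assuming a ViT-style encoder/decoder pair and a learnable `mask_token`; the hyperparameter names (`tau_max`, `mask_ratio`) and the plain MSE loss are illustrative placeholders, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def l_detok_step(encoder, decoder, images, mask_token, tau_max=1.0, mask_ratio=0.7):
    # mask_token: learnable (D,) parameter shared across all masked positions.
    z = encoder(images)                                     # latent tokens: (B, N, D)
    # Interpolative latent noise: blend latents with Gaussian noise at a random strength.
    tau = tau_max * torch.rand(z.size(0), 1, 1, device=z.device)
    z = (1 - tau) * z + tau * torch.randn_like(z)
    # Optional random masking: replace a random fraction of tokens with the shared [MASK] token.
    keep = torch.rand(z.shape[:2], device=z.device) > mask_ratio
    z = torch.where(keep.unsqueeze(-1), z, mask_token.expand_as(z))
    recon = decoder(z)                                      # reconstruct the clean image
    return F.mse_loss(recon, images)                        # real training adds perceptual/GAN terms
```

Because the corruption is applied only during tokenizer training, the trained encoder and decoder can be dropped into existing generators unchanged, as in the results below.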
Interactive demo: Deconstruction & Reconstruction
Each sample spans a 21×21 grid of masking ratios and noise strengths τ, both ranging from 0 to 1 in steps of 0.05.
*Feature maps are visualized via PCA reduction to 3 RGB components, computed from clean features (τ=0, mask ratio 0%) and applied consistently across all corruption conditions. The 16×16 feature maps are upsampled to 256×256 with nearest-neighbor interpolation; masked regions are dimmed to 30% intensity to indicate missing tokens.
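For reference, a rough sketch of this visualization procedure; the function name, shapes, and normalization are assumptions, not the authors' script:

```python
import torch
import torch.nn.functional as F

def pca_rgb(clean_feats, feats, grid=16, out_hw=256):
    # clean_feats, feats: (N, D) token features from a grid x grid patch layout (N = grid**2).
    mean = clean_feats.mean(0, keepdim=True)
    _, _, v = torch.pca_lowrank(clean_feats - mean, q=3)    # top-3 components of clean features
    proj = (feats - mean) @ v                               # project every condition with the same basis
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    img = proj.reshape(grid, grid, 3).permute(2, 0, 1)      # (3, 16, 16) pseudo-RGB
    return F.interpolate(img[None], size=(out_hw, out_hw), mode="nearest")[0]  # (3, 256, 256)
```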
Results
Plug-and-play across 6 models (vs SD-VAE)
Drop-in replacement for SD-VAE; no architecture changes. ImageNet 256x256. All models trained for 100 epochs.
FID-50K with optimal CFG. Lower FID is better.
- MAR-B: 4.64 → 2.43 (+47.6%)
- RandomAR-B: 13.11 → 5.22 (+60.2%)
- RasterAR-B: 8.26 → 4.46 (+46.0%)
- SiT-B: 7.66 → 5.13 (+33.0%)
- DiT-B: 8.33 → 6.58 (+21.0%)
- LightningDiT-B: 4.24 → 3.63 (+14.4%)
Average relative FID improvement: AR models 51.3%, non-AR models 22.8%, overall 37.0%.
System-level (ImageNet)
- MAR-B + l-DeTok: FID 1.61 on ImageNet 256x256, a new SoTA result at only 208M params.
- MAR-L + l-DeTok: FID 1.35 on ImageNet 256x256, a new SoTA result at 479M params; beats MAR-H (1.55) and MAR-L (1.78) trained with MAR-VAE.
- MAR-L + l-DeTok (fine-tuned): FID 1.61 / IS 315.7 on ImageNet 512x512, vs. MAR-L + MAR-VAE (1.73 / 279.9).
- MAR-B + l-DeTok (from scratch): FID 1.83 / IS 279.6 on ImageNet 512x512, competitive with larger diffusion systems.
Text-to-Image (MS-COCO 256x256, vs MAR-VAE)
- MAR-B: FID 4.97 (+60%); CLIP 24.82 (+12.5%).
- SiT-B: FID 4.31 (+25%); CLIP 24.61 (+4.7%).
- Improves both diversity (FID↓) and alignment (CLIP↑) without external semantic distillation.
Parameter efficiency
- MAR-L + l-DeTok hits FID 1.35 at 479M params — better FID than VAR-d30 (1.92 @ 2.0B) with ~4.2x fewer params.
- MAR-B + l-DeTok delivers FID 1.61 at 208M — surpasses LlamaGen-3B (2.18 @ 3.1B) with ~15x fewer params.
Our l-DeTok serves as a drop-in tokenizer that unifies latent representations across generative paradigms—yielding more expressive tokens and better downstream generative performance.
Generalizability across 6 generative models
All generation numbers are FID on ImageNet 256x256 (lower is better), with CFG tuned per model and generators trained for 100 epochs; rFID is the tokenizer's reconstruction FID.
| Tokenizer | rFID | MAR-B (AR) | RandomAR-B (AR) | RasterAR-B (AR) | SiT-B (Non-AR) | DiT-B (Non-AR) | LightningDiT-B (Non-AR) |
|---|---|---|---|---|---|---|---|
| *Tokenizers trained with semantics distillation from external pretrained models* | | | | | | | |
| VA-VAE | 0.28 | 16.66 | 38.13 | 15.88 | 4.33 | 4.91 | 2.86 |
| MAETok | 0.48 | 6.99 | 24.83 | 15.92 | 4.77 | 5.24 | 3.92 |
| Our l-DeTok + Distillation | 0.85 | 2.52 | 5.57 | 11.99 | 3.40 | 3.91 | 2.18 |
| *Tokenizers trained without semantics distillation* | | | | | | | |
| SD-VAE | 0.61 | 4.64 | 13.11 | 8.26 | 7.66 | 8.33 | 4.24 |
| MAR-VAE | 0.53 | 3.71 | 11.78 | 7.99 | 6.26 | 8.20 | 3.98 |
| Our l-DeTok | 0.68 | 2.43 | 5.22 | 4.46 | 5.13 | 6.58 | 3.63 |
Our l-DeTok generalizes better than prior tokenizers and further benefits from privileged information such as semantics distillation when available.
System-level comparison — ImageNet 256x256
w/o vs. w/ Classifier-Free Guidance (CFG).
| Model | #Params | FID ↓ (w/o CFG) | IS ↑ (w/o CFG) | FID ↓ (w/ CFG) | IS ↑ (w/ CFG) |
|---|---|---|---|---|---|
| *With semantics distillation from external pretrained models* | | | | | |
| SiT-XL + REPA | 675M | 5.90 | 157.8 | 1.42 | 305.7 |
| SiT-XL + MAETok | 675M | 2.31 | 216.5 | 1.67 | 311.2 |
| LightningDiT + MAETok | 675M | 2.21 | 208.3 | 1.73 | 308.4 |
| LightningDiT + VAVAE | 675M | 2.17 | 205.6 | 1.35 | 295.3 |
| DDT-XL | 675M | 6.27 | 154.7 | 1.26 | 310.6 |
| *Without semantics distillation* | | | | | |
| DiT-XL/2 | 675M | 9.62 | 121.5 | 2.27 | 278.2 |
| SiT-XL/2 | 675M | 8.30 | – | 2.06 | 270.3 |
| VAR-d30 | 2.0B | – | – | 1.92 | 323.1 |
| LlamaGen-3B | 3.1B | – | – | 2.18 | 263.3 |
| RandAR-XXL | 1.4B | – | – | 2.15 | 322.0 |
| CausalFusion | 676M | 3.61 | 180.9 | 1.77 | 282.3 |
| MAR-B + MAR-VAE | 208M | 3.48 | 192.4 | 2.31 | 281.7 |
| MAR-L + MAR-VAE | 479M | 2.60 | 221.4 | 1.78 | 296.0 |
| MAR-H + MAR-VAE | 943M | 2.35 | 227.8 | 1.55 | 303.7 |
| MAR-B + l-DeTok | 208M | 2.79 | 195.9 | 1.61 | 289.7 |
| MAR-B + l-DeTok† | 208M | 2.94 | 195.5 | 1.55 | 291.0 |
| MAR-L + l-DeTok | 479M | 1.84 | 238.4 | 1.43 | 303.5 |
| MAR-L + l-DeTok† | 479M | 1.86 | 238.6 | 1.35 | 304.1 |
† With additional decoder fine-tuning.
ImageNet 512x512 (w/ CFG)
All numbers reported at 512x512 resolution.
| Model | #Params | FID ↓ | IS ↑ |
|---|---|---|---|
| ADM | 554M | 7.72 | 172.7 |
| DiT-XL/2 | 675M | 3.04 | 240.8 |
| SiT-XL/2 | 675M | 2.62 | 252.2 |
| SiT-XL/2 + REPA | 675M | 2.08 | 274.6 |
| MAR-L + MAR-VAE | 479M | 1.73 | 279.9 |
| MAR-B + l-DeTok (scratch, 400 ep) | 208M | 1.83 | 279.6 |
| MAR-L + l-DeTok (fine-tune, 200 ep) | 479M | 1.61 | 315.7 |
MS-COCO text-to-image (w/ CFG)
FID ↓ / CLIP ↑ scores for 256x256 text-to-image generation.
| Tokenizer | MAR-B FID ↓ | MAR-B CLIP ↑ | SiT-B FID ↓ | SiT-B CLIP ↑ |
|---|---|---|---|---|
| VA-VAE | 34.64 | 21.98 | 5.83 | 25.07 |
| SD-VAE | 19.75 | 22.86 | 6.63 | 24.00 |
| MAR-VAE | 12.49 | 22.07 | 5.74 | 23.50 |
| Ours (l-DeTok) | 4.97 | 24.82 | 4.31 | 24.61 |
Citation
If you find this work useful, please consider citing:
@article{yang2025detok,
title={Latent Denoising Makes Good Visual Tokenizers},
author={Yang, Jiawei and Li, Tianhong and Fan, Lijie and Tian, Yonglong and Wang, Yue},
journal={arXiv preprint arXiv:2507.15856},
year={2025}
}