l-DeTok

Latent Denoising Makes Good Tokenizers

A simple tokenizer trained to reconstruct clean images from heavily corrupted latent embeddings—aligning tokens with downstream denoising objectives and improving image generation across six representative models.

Jiawei Yang1, Tianhong Li2, Lijie Fan3, Yonglong Tian4, Yue Wang1

1University of Southern California    2MIT CSAIL    3Google DeepMind    4OpenAI


Abstract

Despite their fundamental role, it remains unclear what properties make tokenizers effective for generative modeling. We observe that modern generative models share a conceptually similar training objective: reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking, a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under heavy corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MS-COCO) image generation benchmarks demonstrate that l-DeTok consistently improves generation quality across six representative generative models compared to prior tokenizers. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope they motivate new perspectives for future tokenizer design.

Method in 30 seconds

Figure 1. Our latent denoising tokenizer (l-DeTok) framework. During tokenizer training, we randomly mask input patches (masking noise) and interpolate encoder-produced latent embeddings with Gaussian noise (interpolative latent noise). The decoder processes these deconstructed latents, together with mask tokens, to reconstruct the original images in pixel space, a process we refer to as denoising. When the model serves as a tokenizer for downstream generative models, both corruptions are disabled.

Motivation: Why train tokenizers for denoising?

Modern diffusion and autoregressive models both learn to reconstruct clean signals from corrupted contexts (noise or masking). Tokenizers are usually optimized for pixel MSE, not for what generators actually learn. We ask what makes a tokenizer effective for generation. Our answer: make tokens easy to recover under the same kinds of degradations used by downstream models.

Unified denoising view

Non-Autoregressive (Non-AR) models remove injected noise; Autoregressive (AR) models fill in masked context. Both are reconstruction-from-deconstruction. We therefore train the tokenizer so that its latents remain reconstructable even when heavily corrupted.

l-DeTok in a nutshell

  • Design principle: Treat tokenization as reconstruction-from-deconstruction: deliberately corrupt latents, then learn to reconstruct the original image.
  • Interpolative latent noise: mix encoder latents with Gaussian noise at a random strength; the decoder learns to reconstruct original pixels.
  • Optional random masking: hide a random fraction of patches; provide shared [MASK] tokens to the decoder for context completion.
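Concretely, the interpolative corruption blends each latent token with Gaussian noise at a randomly sampled strength (a sketch of the idea; the exact noise-level distribution below is our assumption):

    z̃ = (1 − τ)·z + τ·ε,   ε ~ N(0, I),   τ ~ U(0, τ_max)

Setting τ = 0 recovers the clean latent; larger τ pushes the embedding toward pure noise.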

Why this improves generation

By training tokens to survive strong corruption, the tokenizer's latent embeddings match downstream training (noise removal or mask filling). This alignment eases optimization and yields better sample quality across both AR and non-AR families—without relying on large pretrained vision encoders.

Minimal recipe (you can implement this in fewer than 10 lines of code)

  • Corrupt latents with interpolative noise (and optional masking), then decode back to the original signal, e.g. pixels; see the sketch below.
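A minimal PyTorch-style sketch of this recipe (our illustration, not the authors' released code; encoder, decoder, mask_token, and the default hyperparameters are placeholder assumptions):

```python
import torch
import torch.nn.functional as F

def detok_loss(images, encoder, decoder, mask_token, mask_ratio=0.7, tau_max=1.0):
    # Encode images into latent tokens of shape (B, N, D).
    z = encoder(images)
    # Interpolative latent noise: blend latents with Gaussian noise
    # at a per-sample random strength tau in [0, tau_max] (assumed range).
    tau = torch.rand(z.size(0), 1, 1, device=z.device) * tau_max
    z = (1 - tau) * z + tau * torch.randn_like(z)
    # Optional random masking: replace a random subset of tokens with a
    # shared learnable [MASK] embedding of shape (D,).
    keep = torch.rand(z.shape[:2], device=z.device) >= mask_ratio
    z = torch.where(keep.unsqueeze(-1), z, mask_token.expand_as(z))
    # Denoising objective: reconstruct clean pixels from corrupted latents.
    return F.mse_loss(decoder(z), images)
```

At inference time both corruptions are disabled (tau_max = 0, mask_ratio = 0), so the model behaves as a standard tokenizer. The paper's full training loss presumably also includes the perceptual and adversarial terms standard for image tokenizers; only the reconstruction term is sketched here.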

Interactive demo — Deconstruction & Reconstruction!

Each sample is precomputed on a 21×21 grid of settings: mask ratio ∈ {0, 0.05, …, 1} and τ ∈ {0, 0.05, …, 1}. Sliders snap to these values.

Columns (left to right): original image, input after optional masking, feature map (clean), feature map (noisy), reconstruction.
Rows: baseline tokenizer vs. l-DeTok (ours).

*Feature maps are visualized using PCA dimensionality reduction to 3 RGB components, computed from clean features (τ=0, mask ratio=0%) and applied consistently across all noise conditions. The 16×16 feature maps are upsampled to 256×256 using nearest-neighbor interpolation. Masked regions are dimmed to 30% intensity to indicate missing tokens.
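A sketch of this visualization procedure in PyTorch (our reading of the description above; function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def pca_feature_rgb(clean_feats, feats, mask=None, grid=16, out_size=256):
    # Fit a 3-component PCA basis on clean features only (tau=0, mask ratio=0%)
    # so that colors stay comparable across all noise conditions.
    mean = clean_feats.mean(0)                                  # feats: (N, D)
    _, _, basis = torch.pca_lowrank(clean_feats - mean, q=3)    # basis: (D, 3)
    proj = (feats - mean) @ basis                               # (N, 3)
    # Min-max normalize each component to [0, 1] for display as RGB.
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    img = proj.reshape(grid, grid, 3)
    if mask is not None:
        # Dim masked token positions to 30% intensity.
        img = img * torch.where(mask.reshape(grid, grid, 1), 0.3, 1.0)
    img = img.permute(2, 0, 1).unsqueeze(0)                     # (1, 3, 16, 16)
    # Nearest-neighbor upsample to the display resolution.
    return F.interpolate(img, size=(out_size, out_size), mode="nearest")[0]
```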

Results

Plug-and-play across 6 models (vs SD-VAE)

Drop-in replacement for SD-VAE with no architecture changes. ImageNet 256x256; all models trained for 100 epochs.
FID-50K at each model's optimal CFG. Lower FID is better.

Autoregressive (AR)
  • MAR-B: 4.64 → 2.43 (+47.6%)
  • RandomAR-B: 13.11 → 5.22 (+60.2%)
  • RasterAR-B: 8.26 → 4.46 (+46.0%)
Non-autoregressive (Diffusion)
  • SiT-B: 7.66 → 5.13 (+33.0%)
  • DiT-B: 8.33 → 6.58 (+21.0%)
  • LightningDiT-B: 4.24 → 3.63 (+14.4%)

Average ΔFID: AR models +51.3%, non-AR models +22.8%, overall +37.0%.
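Here ΔFID denotes the relative FID reduction over the SD-VAE baseline:

    ΔFID = (FID_SD-VAE − FID_l-DeTok) / FID_SD-VAE × 100%,  e.g. MAR-B: (4.64 − 2.43) / 4.64 ≈ 47.6%.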

System-level (ImageNet)

256x256 (w/ CFG)
  • MAR-B + l-DeTok: FID 1.61 — new SoTA with 208M params.
  • MAR-L + l-DeTok: FID 1.35 — new SoTA with 479M params; beats MAR-H (1.55) and MAR-L (1.78) trained with MAR-VAE.
512x512 (w/ CFG)
  • MAR-L + l-DeTok (ft): FID 1.61 / IS 315.7 vs MAR-L + MAR-VAE (1.73 / 279.9).
  • MAR-B + l-DeTok (scratch): FID 1.83 / IS 279.6, competitive with larger diffusion systems.

Text-to-Image (MS-COCO 256x256, vs MAR-VAE)

  • MAR-B: FID 4.97 (+60%); CLIP 24.82 (+12.5%).
  • SiT-B: FID 4.31 (+25%); CLIP 24.61 (+4.7%).
  • Improves both diversity (FID↓) and alignment (CLIP↑) without external semantic distillation.

Parameter efficiency

  • MAR-L + l-DeTok hits FID 1.35 at 479M params — better FID than VAR-d30 (1.92 @ 2.0B) with ~4.2x fewer params.
  • MAR-B + l-DeTok delivers FID 1.61 at 208M — surpasses LlamaGen-3B (2.18 @ 3.1B) with ~15x fewer params.

Our l-DeTok serves as a drop-in tokenizer that unifies latent representations across generative paradigms—yielding more expressive tokens and better downstream generative performance.

Generalizability across 6 generative models

All numbers are generation FID-50K on ImageNet 256x256 (lower is better); rFID is the tokenizer's reconstruction FID. CFG tuned per model; all models trained for 100 epochs.

Tokenizer                    rFID |  Autoregressive models         |  Non-autoregressive models
                                  |  MAR-B  RandomAR-B  RasterAR-B |  SiT-B  DiT-B  LightningDiT-B

Tokenizers trained with semantics distillation from external pretrained models:
VA-VAE                       0.28 |  16.66  38.13       15.88      |  4.33   4.91   2.86
MAETok                       0.48 |   6.99  24.83       15.92      |  4.77   5.24   3.92
Our l-DeTok + Distillation   0.85 |   2.52   5.57       11.99      |  3.40   3.91   2.18

Tokenizers trained without semantics distillation:
SD-VAE                       0.61 |   4.64  13.11        8.26      |  7.66   8.33   4.24
MAR-VAE                      0.53 |   3.71  11.78        7.99      |  6.26   8.20   3.98
Our l-DeTok                  0.68 |   2.43   5.22        4.46      |  5.13   6.58   3.63

Our l-DeTok generalizes better than prior tokenizers and can further benefit from privileged information such as semantic distillation when it is available.

System-level comparison — ImageNet 256x256

w/o vs. w/ Classifier-Free Guidance (CFG).
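For reference, CFG mixes the model's conditional and unconditional predictions at sampling time with a guidance weight w (the standard formulation; w is tuned per model):

    ε̂(x, c) = ε(x, ∅) + w · (ε(x, c) − ε(x, ∅))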

Model                   #Params |  w/o CFG        |  w/ CFG
                                |  FID↓    IS↑    |  FID↓    IS↑

With semantics distillation from external pretrained models:
SiT-XL + REPA           675M    |  5.90    157.8  |  1.42    305.7
SiT-XL + MAETok         675M    |  2.31    216.5  |  1.67    311.2
LightningDiT + MAETok   675M    |  2.21    208.3  |  1.73    308.4
LightningDiT + VAVAE    675M    |  2.17    205.6  |  1.35    295.3
DDT-XL                  675M    |  6.27    154.7  |  1.26    310.6

Without semantics distillation:
DiT-XL/2                675M    |  9.62    121.5  |  2.27    278.2
SiT-XL/2                675M    |  8.30    –      |  2.06    270.3
VAR-d30                 2.0B    |  –       –      |  1.92    323.1
LlamaGen-3B             3.1B    |  –       –      |  2.18    263.3
RandAR-XXL              1.4B    |  –       –      |  2.15    322.0
CausalFusion            676M    |  3.61    180.9  |  1.77    282.3
MAR-B + MAR-VAE         208M    |  3.48    192.4  |  2.31    281.7
MAR-L + MAR-VAE         479M    |  2.60    221.4  |  1.78    296.0
MAR-H + MAR-VAE         943M    |  2.35    227.8  |  1.55    303.7
MAR-B + l-DeTok         208M    |  2.79    195.9  |  1.61    289.7
MAR-B + l-DeTok†        208M    |  2.94    195.5  |  1.55    291.0
MAR-L + l-DeTok         479M    |  1.84    238.4  |  1.43    303.5
MAR-L + l-DeTok†        479M    |  1.86    238.6  |  1.35    304.1

† With additional decoder fine-tuning.

ImageNet 512x512 (w/ CFG)

All numbers reported at 512x512 resolution.

Model                                 #Params |  FID↓   IS↑
ADM                                   554M    |  7.72   172.7
DiT-XL/2                              675M    |  3.04   240.8
SiT-XL/2                              675M    |  2.62   252.2
SiT-XL/2 + REPA                       675M    |  2.08   274.6
MAR-L + MAR-VAE                       479M    |  1.73   279.9
MAR-B + l-DeTok (scratch, 400 ep)     208M    |  1.83   279.6
MAR-L + l-DeTok (fine-tune, 200 ep)   479M    |  1.61   315.7

MS-COCO text-to-image (w/ CFG)

FID ↓ / CLIP ↑ scores for 256x256 text-to-image generation.

Tokenizer        |  T2I MAR-B      |  T2I SiT-B
                 |  FID↓    CLIP↑  |  FID↓    CLIP↑
VA-VAE           |  34.64   21.98  |  5.83    25.07
SD-VAE           |  19.75   22.86  |  6.63    24.00
MAR-VAE          |  12.49   22.07  |  5.74    23.50
Ours (l-DeTok)   |   4.97   24.82  |  4.31    24.61

Citation

If you find this work useful, please consider citing:

@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Yang, Jiawei and Li, Tianhong and Fan, Lijie and Tian, Yonglong and Wang, Yue},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}