l-DeTok

Latent Denoising Makes Good Tokenizers

A simple tokenizer trained to reconstruct clean images from heavily corrupted latent embeddings—aligning tokens with downstream denoising objectives and improving image generation across six representative models.

Jiawei Yang1, Tianhong Li2, Lijie Fan3, Yonglong Tian4, Yue Wang1

1University of Southern California    2MIT CSAIL    3Google DeepMind    4OpenAI


Abstract

Despite their fundamental role, it remains unclear what properties make tokenizers effective for generative modeling. We observe that modern generative models share a conceptually similar training objective: reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking, a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under heavy corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MS-COCO) image generation benchmarks demonstrate that l-DeTok consistently improves generation quality across six representative generative models compared to prior tokenizers. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope they motivate new perspectives for future tokenizer design.

Method in 30 seconds

Figure 1. Our latent denoising tokenizer (l-DeTok) framework. During tokenizer training, we randomly mask input patches (masking noise) and interpolate encoder-produced latent embeddings with Gaussian noise (interpolative latent noise). The decoder processes these deconstructed latents, together with mask tokens, to reconstruct the original images in pixel space, a process we refer to as denoising. When the model serves as a tokenizer for downstream generative models, both corruptions are disabled.

Motivation: Why train tokenizers for denoising?

Modern diffusion and autoregressive models both learn to reconstruct clean signals from corrupted contexts (noise or masking). Tokenizers are usually optimized for pixel MSE, not for what generators actually learn. We ask what makes a tokenizer effective for generation. Our answer: make tokens easy to recover under the same kinds of degradations used by downstream models.

Unified denoising view

Non-Autoregressive (Non-AR) models remove injected noise; Autoregressive (AR) models fill in masked context. Both are reconstruction-from-deconstruction. We therefore train the tokenizer so that its latents remain reconstructable even when heavily corrupted.

l-DeTok in a nutshell

  • Design principle: Treat tokenization as reconstruction-from-deconstruction: deliberately corrupt latents, then learn to reconstruct the original image.
  • Interpolative latent noise: mix encoder latents with Gaussian noise at a random strength; the decoder learns to reconstruct original pixels.
  • Optional random masking: hide a random fraction of patches; provide shared [MASK] tokens to the decoder for context completion.
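Concretely, the interpolative corruption blends each latent token with Gaussian noise at a randomly sampled strength (a sketch of the idea; the exact noise-level distribution below is our assumption):

    z̃ = (1 − τ)·z + τ·ε,   ε ~ N(0, I),   τ ~ U(0, τ_max)

Setting τ = 0 recovers the clean latent; larger τ pushes the embedding toward pure noise.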

Why this improves generation

By training tokens to survive strong corruption, the tokenizer's latent embeddings match downstream training (noise removal or mask filling). This alignment eases optimization and yields better sample quality across both AR and non-AR families—without relying on large pretrained vision encoders.

Minimal recipe (you can implement this in fewer than 10 lines of code)

  • Corrupt latents with interpolative noise (and optional masking), then decode back to the original signal, e.g. pixels; see the sketch below.
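A minimal PyTorch-style sketch of this recipe (our illustration, not the authors' released code; encoder, decoder, mask_token, and the default hyperparameters are placeholder assumptions):

```python
import torch
import torch.nn.functional as F

def detok_loss(images, encoder, decoder, mask_token, mask_ratio=0.7, tau_max=1.0):
    # Encode images into latent tokens of shape (B, N, D).
    z = encoder(images)
    # Interpolative latent noise: blend latents with Gaussian noise
    # at a per-sample random strength tau in [0, tau_max] (assumed range).
    tau = torch.rand(z.size(0), 1, 1, device=z.device) * tau_max
    z = (1 - tau) * z + tau * torch.randn_like(z)
    # Optional random masking: replace a random subset of tokens with a
    # shared learnable [MASK] embedding of shape (D,).
    keep = torch.rand(z.shape[:2], device=z.device) >= mask_ratio
    z = torch.where(keep.unsqueeze(-1), z, mask_token.expand_as(z))
    # Denoising objective: reconstruct clean pixels from corrupted latents.
    return F.mse_loss(decoder(z), images)
```

At inference time both corruptions are disabled (tau_max = 0, mask_ratio = 0), so the model behaves as a standard tokenizer. The paper's full training loss presumably also includes the perceptual and adversarial terms standard for image tokenizers; only the reconstruction term is sketched here.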

Interactive demo — Deconstruction & Reconstruction!

Each sample is precomputed on a 21×21 grid of settings: mask ratio ∈ {0, 0.05, …, 1} and τ ∈ {0, 0.05, …, 1}. Sliders snap to these values.

Columns (left to right): original image, input after optional masking, feature map (clean), feature map (noisy), reconstruction.
Rows: baseline tokenizer vs. l-DeTok (ours).

*Feature maps are visualized using PCA dimensionality reduction to 3 RGB components, computed from clean features (τ=0, mask ratio=0%) and applied consistently across all noise conditions. The 16×16 feature maps are upsampled to 256×256 using nearest-neighbor interpolation. Masked regions are dimmed to 30% intensity to indicate missing tokens.
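A sketch of this visualization procedure in PyTorch (our reading of the description above; function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def pca_feature_rgb(clean_feats, feats, mask=None, grid=16, out_size=256):
    # Fit a 3-component PCA basis on clean features only (tau=0, mask ratio=0%)
    # so that colors stay comparable across all noise conditions.
    mean = clean_feats.mean(0)                                  # feats: (N, D)
    _, _, basis = torch.pca_lowrank(clean_feats - mean, q=3)    # basis: (D, 3)
    proj = (feats - mean) @ basis                               # (N, 3)
    # Min-max normalize each component to [0, 1] for display as RGB.
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    img = proj.reshape(grid, grid, 3)
    if mask is not None:
        # Dim masked token positions to 30% intensity.
        img = img * torch.where(mask.reshape(grid, grid, 1), 0.3, 1.0)
    img = img.permute(2, 0, 1).unsqueeze(0)                     # (1, 3, 16, 16)
    # Nearest-neighbor upsample to the display resolution.
    return F.interpolate(img, size=(out_size, out_size), mode="nearest")[0]
```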

Results

Plug-and-play across 6 models (vs SD-VAE)

Drop-in replacement for SD-VAE with no architecture changes. ImageNet 256x256; all models trained for 100 epochs.
FID-50K at each model's optimal CFG. Lower FID is better.

Autoregressive (AR)
  • MAR-B: 4.64 → 2.43 (+47.6%)
  • RandomAR-B: 13.11 → 5.22 (+60.2%)
  • RasterAR-B: 8.26 → 4.46 (+46.0%)
Non-autoregressive (Diffusion)
  • SiT-B: 7.66 → 5.13 (+33.0%)
  • DiT-B: 8.33 → 6.58 (+21.0%)
  • LightningDiT-B: 4.24 → 3.63 (+14.4%)

Average ΔFID: AR models +51.3%, non-AR models +22.8%, overall +37.0%.
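Here ΔFID denotes the relative FID reduction over the SD-VAE baseline:

    ΔFID = (FID_SD-VAE − FID_l-DeTok) / FID_SD-VAE × 100%,  e.g. MAR-B: (4.64 − 2.43) / 4.64 ≈ 47.6%.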

System-level (ImageNet)

256x256 (w/ CFG)
  • MAR-B + l-DeTok: FID 1.61 — new SoTA with 208M params.
  • MAR-L + l-DeTok: FID 1.35 — new SoTA with 479M params; beats MAR-H (1.55) and MAR-L (1.78) trained with MAR-VAE.
512x512 (w/ CFG)
  • MAR-L + l-DeTok (ft): FID 1.61 / IS 315.7 vs MAR-L + MAR-VAE (1.73 / 279.9).
  • MAR-B + l-DeTok (scratch): FID 1.83 / IS 279.6, competitive with larger diffusion systems.

Text-to-Image (MS-COCO 256x256, vs MAR-VAE)

  • MAR-B: FID 4.97 (+60%); CLIP 24.82 (+12.5%).
  • SiT-B: FID 4.31 (+25%); CLIP 24.61 (+4.7%).
  • Improves both diversity (FID↓) and alignment (CLIP↑) without external semantic distillation.

Parameter efficiency

  • MAR-L + l-DeTok hits FID 1.35 at 479M params — better FID than VAR-d30 (1.92 @ 2.0B) with ~4.2x fewer params.
  • MAR-B + l-DeTok delivers FID 1.61 at 208M — surpasses LlamaGen-3B (2.18 @ 3.1B) with ~15x fewer params.

Our l-DeTok serves as a drop-in tokenizer that unifies latent representations across generative paradigms—yielding more expressive tokens and better downstream generative performance.

Generalizability across 6 generative models

All numbers are generation FID-50K on ImageNet 256x256 (lower is better); rFID is the tokenizer's reconstruction FID. CFG tuned per model; all models trained for 100 epochs.

Tokenizer                    rFID |  Autoregressive models         |  Non-autoregressive models
                                  |  MAR-B  RandomAR-B  RasterAR-B |  SiT-B  DiT-B  LightningDiT-B

Tokenizers trained with semantics distillation from external pretrained models:
VA-VAE                       0.28 |  16.66  38.13       15.88      |  4.33   4.91   2.86
MAETok                       0.48 |   6.99  24.83       15.92      |  4.77   5.24   3.92
Our l-DeTok + Distillation   0.85 |   2.52   5.57       11.99      |  3.40   3.91   2.18

Tokenizers trained without semantics distillation:
SD-VAE                       0.61 |   4.64  13.11        8.26      |  7.66   8.33   4.24
MAR-VAE                      0.53 |   3.71  11.78        7.99      |  6.26   8.20   3.98
Our l-DeTok                  0.68 |   2.43   5.22        4.46      |  5.13   6.58   3.63

Our l-DeTok generalizes better than prior tokenizers and can further benefit from privileged information such as semantic distillation when it is available.

System-level comparison — ImageNet 256x256

w/o vs. w/ Classifier-Free Guidance (CFG).
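For reference, CFG mixes the model's conditional and unconditional predictions at sampling time with a guidance weight w (the standard formulation; w is tuned per model):

    ε̂(x, c) = ε(x, ∅) + w · (ε(x, c) − ε(x, ∅))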

Model                   #Params |  w/o CFG        |  w/ CFG
                                |  FID↓    IS↑    |  FID↓    IS↑

With semantics distillation from external pretrained models:
SiT-XL + REPA           675M    |  5.90    157.8  |  1.42    305.7
SiT-XL + MAETok         675M    |  2.31    216.5  |  1.67    311.2
LightningDiT + MAETok   675M    |  2.21    208.3  |  1.73    308.4
LightningDiT + VAVAE    675M    |  2.17    205.6  |  1.35    295.3
DDT-XL                  675M    |  6.27    154.7  |  1.26    310.6

Without semantics distillation:
DiT-XL/2                675M    |  9.62    121.5  |  2.27    278.2
SiT-XL/2                675M    |  8.30    –      |  2.06    270.3
VAR-d30                 2.0B    |  –       –      |  1.92    323.1
LlamaGen-3B             3.1B    |  –       –      |  2.18    263.3
RandAR-XXL              1.4B    |  –       –      |  2.15    322.0
CausalFusion            676M    |  3.61    180.9  |  1.77    282.3
MAR-B + MAR-VAE         208M    |  3.48    192.4  |  2.31    281.7
MAR-L + MAR-VAE         479M    |  2.60    221.4  |  1.78    296.0
MAR-H + MAR-VAE         943M    |  2.35    227.8  |  1.55    303.7
MAR-B + l-DeTok         208M    |  2.79    195.9  |  1.61    289.7
MAR-B + l-DeTok†        208M    |  2.94    195.5  |  1.55    291.0
MAR-L + l-DeTok         479M    |  1.84    238.4  |  1.43    303.5
MAR-L + l-DeTok†        479M    |  1.86    238.6  |  1.35    304.1

† With additional decoder fine-tuning.

ImageNet 512x512 (w/ CFG)

All numbers reported at 512x512 resolution.

Model                                 #Params |  FID↓   IS↑
ADM                                   554M    |  7.72   172.7
DiT-XL/2                              675M    |  3.04   240.8
SiT-XL/2                              675M    |  2.62   252.2
SiT-XL/2 + REPA                       675M    |  2.08   274.6
MAR-L + MAR-VAE                       479M    |  1.73   279.9
MAR-B + l-DeTok (scratch, 400 ep)     208M    |  1.83   279.6
MAR-L + l-DeTok (fine-tune, 200 ep)   479M    |  1.61   315.7

MS-COCO text-to-image (w/ CFG)

FID ↓ / CLIP ↑ scores for 256x256 text-to-image generation.

Tokenizer        |  T2I MAR-B      |  T2I SiT-B
                 |  FID↓    CLIP↑  |  FID↓    CLIP↑
VA-VAE           |  34.64   21.98  |  5.83    25.07
SD-VAE           |  19.75   22.86  |  6.63    24.00
MAR-VAE          |  12.49   22.07  |  5.74    23.50
Ours (l-DeTok)   |   4.97   24.82  |  4.31    24.61

Citation

If you find this work useful, please consider citing:

@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Yang, Jiawei and Li, Tianhong and Fan, Lijie and Tian, Yonglong and Wang, Yue},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}