DVT: Denoising Vision Transformers

Project | Paper | arXiv | Code (to be updated) | Video Gallery | Image Gallery

Method recap:

Our decomposition relies on this approximation:

ViT(x) ≈ F(x) + [G(position) + h(x, position)],

where F(x) represents the denoised semantic features, G(position) denotes the position-dependent artifacts shared across all views, and h(x, position) models the interdependence between position and semantic content.
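
As a rough illustration (not the released DVT code), the sketch below fits such a decomposition for a single image given ViT features from several augmented views. Here `semantic_field`, `artifact`, and `residual` are hypothetical stand-ins for F, G, and h, and the shapes, coordinate encoding, and toy optimization loop are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical setup: feature dim, patch-grid size, and number of views.
C, H, W = 768, 24, 24
num_views = 8

# Stand-ins for real inputs: feats[v, :, i, j] is the ViT feature of view v at
# grid cell (i, j); coords[v, :, i, j] are the normalized image coordinates
# that cell was sampled from.
feats = torch.randn(num_views, C, H, W)
coords = torch.rand(num_views, 2, H, W)

# F: semantic term queried at image coordinates (a tiny MLP as a placeholder).
semantic_field = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, C))
# G: artifact term tied to the patch position within a view, shared across views.
artifact = nn.Parameter(torch.zeros(C, H, W))
# h: residual term coupling semantic content and position (1x1 conv placeholder).
residual = nn.Conv2d(C + 2, C, kernel_size=1)

params = list(semantic_field.parameters()) + [artifact] + list(residual.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(200):
    pts = coords.permute(0, 2, 3, 1)                # (V, H, W, 2)
    F_x = semantic_field(pts).permute(0, 3, 1, 2)   # (V, C, H, W)
    # Detaching F_x here is a modeling choice in this sketch, meant to keep the
    # residual from absorbing the semantics.
    h_x = residual(torch.cat([F_x.detach(), coords], dim=1))
    pred = F_x + artifact + h_x                     # ViT(x) ≈ F + G + h
    loss = ((pred - feats) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After fitting, evaluating the semantic term on a regular coordinate grid
# yields the denoised feature map for this image.
```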


Additional per-image denoising examples (CLIP)

Our method works on feature maps extracted at different stride sizes.


Timm model id: `vit_base_patch16_clip_384.laion2b_ft_in12k_in1k`
*Please note that the clustering package we use sometimes fails to produce meaningful clusters.
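
As a rough sketch of how stride-8 feature maps could be pulled from the timm checkpoint above, the snippet below overrides the patch-embedding stride and resamples the positional embedding with timm's `resample_abs_pos_embed`. This is a common extraction trick and an assumption on our part, not necessarily the exact procedure behind these galleries; it assumes a recent timm release.

```python
import torch
import timm
from timm.layers import resample_abs_pos_embed

model = timm.create_model(
    "vit_base_patch16_clip_384.laion2b_ft_in12k_in1k", pretrained=True
).eval()

# Keep the 16x16 patch projection but slide it with stride 8, giving a denser
# 47x47 token grid for a 384x384 input: (384 - 16) // 8 + 1 = 47.
stride = 8
model.patch_embed.proj.stride = (stride, stride)
new_grid = (384 - 16) // stride + 1

# Resample the absolute positional embedding to match the new grid.
model.pos_embed = torch.nn.Parameter(
    resample_abs_pos_embed(
        model.pos_embed,
        new_size=(new_grid, new_grid),
        num_prefix_tokens=model.num_prefix_tokens,
    )
)

# Random tensor standing in for a normalized 384x384 image batch.
x = torch.randn(1, 3, 384, 384)
with torch.no_grad():
    tokens = model.forward_features(x)              # (1, 1 + 47*47, 768)
patches = tokens[:, model.num_prefix_tokens:]       # drop the CLS token
feat_map = patches.transpose(1, 2).reshape(1, -1, new_grid, new_grid)
print(feat_map.shape)                               # torch.Size([1, 768, 47, 47])
```

Leaving the stride at its default 16 and skipping the positional-embedding resampling gives the coarser 24x24 feature map shown in the stride-16 examples.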

CLIP ViT-Base (stride 8)

CLIP ViT-Base (stride 16)


Citation

@article{yang2024denoising,
  author = {Yang, Jiawei and Luo, Katie Z. and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q. and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}