DVT: Denoising Vision Transformers

Links: Project page · Paper (arXiv) · Code (to be updated) · Video Gallery · Image Gallery

Method recap:

Our decomposition relies on this approximation:

ViT(x) ≈ F(x) + [G(position) + h(x, position)],

where:
- F(x) represents the denoised semantic features,
- G(position) denotes the artifacts shared across all views, and
- h(x, position) models the interdependency between position and semantic content.
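
To make the decomposition concrete, below is a schematic PyTorch sketch of the per-image fitting objective, not the released implementation: the synthetic `views`, all names and shapes, the simple learnable grid standing in for F(x), and the normalized patch coordinates used as the positional input to h are illustrative assumptions.

```python
# Schematic sketch of fitting ViT(x) ≈ F(x) + G(position) + h(x, position)
# across multiple views (crops) of one image. Shapes and data are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

C, H, W = 768, 37, 37   # feature dim and patch grid of the full image (assumed)
h, w = 16, 16           # patch grid of each crop (assumed)

# Learnable terms of the decomposition.
semantic_field = nn.Parameter(torch.zeros(1, C, H, W))   # F: view-consistent semantics
shared_artifact = nn.Parameter(torch.zeros(1, C, h, w))  # G: artifacts tied to patch position, shared by all views
residual_mlp = nn.Sequential(                            # h: content/position-dependent residual
    nn.Linear(C + 2, 256), nn.GELU(), nn.Linear(256, C)
)

# Per-crop patch positions (identical for every view), used here as a
# stand-in for the positional embedding fed to h.
xs = torch.linspace(-1, 1, w)
ys = torch.linspace(-1, 1, h)
patch_pos = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1).reshape(1, -1, 2)

params = [semantic_field, shared_artifact, *residual_mlp.parameters()]
opt = torch.optim.Adam(params, lr=1e-2)

def predict(coords):
    """Reconstruct one view's features from the three decomposition terms."""
    grid = coords.view(1, h, w, 2)                                   # crop's patch locations in the image, in [-1, 1]
    sem = Fn.grid_sample(semantic_field, grid, align_corners=False)  # F(x): (1, C, h, w)
    inp = torch.cat([sem.flatten(2).transpose(1, 2), patch_pos], dim=-1)
    res = residual_mlp(inp).transpose(1, 2).reshape(1, C, h, w)      # h(x, position)
    return sem + shared_artifact + res, sem

# Toy "observations": random features at random crop locations (stand-ins for real ViT outputs).
def random_view():
    cx, cy = (torch.rand(2) * 0.8 - 0.4).tolist()
    xs_v = torch.linspace(-0.5, 0.5, w) + cx
    ys_v = torch.linspace(-0.5, 0.5, h) + cy
    coords = torch.stack(torch.meshgrid(xs_v, ys_v, indexing="xy"), dim=-1)  # (h, w, 2), (x, y) order
    return torch.randn(h, w, C), coords.reshape(-1, 2)

views = [random_view() for _ in range(4)]

for step in range(200):
    total = 0.0
    for feats, coords in views:
        target = feats.permute(2, 0, 1).unsqueeze(0)   # (1, C, h, w)
        pred, _ = predict(coords)
        total = total + Fn.mse_loss(pred, target)
    opt.zero_grad()
    total.backward()
    opt.step()

# After fitting, the semantic field queried at the crop's coordinates (the F(x) term)
# is the denoised feature map; G and h absorb the positional artifacts.
```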


Additional per-image denoising examples (DINOv2)

Our method works on feature maps extracted with different stride sizes.


Timm model id: `vit_base_patch14_dinov2.lvd142m`
*Please note that the clustering package we use sometimes fails to produce meaningful clusters.
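
For reference, here is a minimal sketch (assumed usage, not part of the released code) of extracting a stride-14 patch-feature map with this timm model id; the 518×518 input and the 37×37 grid follow timm's defaults for this checkpoint. Producing the stride-7 maps shown below would additionally require overlapping patch embedding, which is not shown here.

```python
# Minimal sketch: extract a DINOv2 ViT-Base patch-feature map via timm (assumed usage).
import timm
import torch

model = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True)
model.eval()

# timm's DINOv2 checkpoints default to 518x518 inputs -> a 37x37 patch grid at stride 14.
x = torch.randn(1, 3, 518, 518)  # stand-in for a preprocessed image

with torch.no_grad():
    tokens = model.forward_features(x)               # (1, 1 + 37*37, 768): cls token + patch tokens
patches = tokens[:, model.num_prefix_tokens:]        # drop the prefix (cls) token
feat_map = patches.transpose(1, 2).reshape(1, -1, 37, 37)  # (1, 768, 37, 37) spatial feature map
print(feat_map.shape)
```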

DINOv2 ViT-Base (stride 7)

DINOv2 ViT-Base (stride 14)


Citation

```bibtex
@article{yang2024denoising,
  author = {Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}
```