DVT: Denoising Vision Transformers

Project | Paper (arXiv) | Code (to be updated) | Video Gallery | Image Gallery

Method recap:

Our decomposition relies on this approximation:

ViT(x) ≈ F(x) + [G(position) + h(x, position)],

where: F(x) represents the denoised semantic features, G(position) denotes the artifacts shared across all views, and h(x, position) models the interdependency between position and semantic content.
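To make the decomposition concrete, below is a minimal, illustrative sketch (not our released per-image denoising code, which fits neural fields and also models the residual term h). It assumes we already have ViT feature maps of several shifted views of one image, drops h, and fits a shared semantic map F and a position-locked artifact map G by gradient descent; all tensor shapes and the random stand-in features are assumptions for illustration.

```python
import torch
import torch.nn.functional as F_nn

torch.manual_seed(0)

C, H, W = 768, 14, 14   # channel dim and patch-grid size of a hypothetical ViT
num_views = 8           # number of shifted views of the same image

# Stand-in tensors playing the role of ViT feature maps of the shifted views.
# In practice these would come from the frozen ViT (e.g. timm forward_features).
views = [torch.randn(C, H, W) for _ in range(num_views)]
offsets = [(torch.randint(0, 3, (1,)).item(), torch.randint(0, 3, (1,)).item())
           for _ in range(num_views)]

# Learnable terms: a slightly larger semantic map F (content moves with the crop)
# and an artifact map G tied to the patch grid (position), shared across views.
F_sem = torch.zeros(C, H + 2, W + 2, requires_grad=True)
G_art = torch.zeros(C, H, W, requires_grad=True)

opt = torch.optim.Adam([F_sem, G_art], lr=1e-1)
for step in range(200):
    loss = torch.zeros(())
    for (dy, dx), feat in zip(offsets, views):
        # Content is sampled at the view's offset; the artifact term stays put.
        pred = F_sem[:, dy:dy + H, dx:dx + W] + G_art
        loss = loss + F_nn.mse_loss(pred, feat)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After fitting, G_art absorbs whatever is shared across views at the same patch
# positions, while F_sem captures content that moves with the image.
```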


Additional per-image denoising examples (EVA02)

Our method works on feature maps extracted with different stride sizes.


Timm model id: `eva02_base_patch16_clip_224.merged2b`
*Please note that the clustering package we use sometimes fails to produce meaningful clusters.

EVA02 ViT-Base (stride 8)

EVA02 ViT-Base (stride 16)
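For reference, a minimal sketch of pulling a patch-token feature map from the timm model listed above (weights download and the single-class-token layout are assumptions; the stride-8 hint at the end refers to the common trick of shrinking the patch-embedding stride and resizing position embeddings, not to code shipped with DVT):

```python
import torch
import timm

# Model id as listed above; set pretrained=True to download the weights.
model = timm.create_model('eva02_base_patch16_clip_224.merged2b', pretrained=False)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    tokens = model.forward_features(x)      # (1, prefix + N, C)

patch_tokens = tokens[:, 1:]                # drop the class token (assumes 1 prefix token)
n = patch_tokens.shape[1]
h = w = int(n ** 0.5)                       # 14x14 grid for patch size 16 at 224x224
feat_map = patch_tokens.reshape(1, h, w, -1).permute(0, 3, 1, 2)   # (1, C, 14, 14)

# A denser "stride 8" grid is typically obtained by shrinking the patch-embed
# convolution stride and interpolating the position embeddings; the attribute
# name below is an assumption about timm's EVA implementation:
# model.patch_embed.proj.stride = (8, 8)
```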


Citation

@article{yang2024denoising,
  author = {Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}