DVT: Denoising Vision Transformers

Jiawei Yang†,*
University of Southern California
Katie Z Luo*
Cornell University
Jiefeng Li
Shanghai Jiao Tong University
Kilian Q. Weinberger
Cornell University
Yonglong Tian
Google Research
Yue Wang
University of Southern California
Nvidia Research

*Equal technical contribution. †Project lead.
Teaser Image

Abstract

We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean ViT features for offline applications. Furthermore, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, which generalizes to unseen data without per-image optimization. Our two-stage approach, which we term Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, and DINOv2-reg). Extensive evaluations demonstrate that DVT consistently and significantly improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.

TL;DR:

We identify pervasive grid-like artifacts in ViT feature maps caused by positional embeddings and propose a two-stage approach to remove them, which significantly improves the feature quality of different pre-trained ViTs.

Example denoised results

Despite the significant strides made by ViTs, our work reveals a crucial yet often overlooked challenge: persistent noise artifacts in ViT outputs, observable across various training algorithms. Beyond being visually distracting, these artifacts hinder feature interpretability and disrupt semantic coherence. For example, the examples below show that applying clustering algorithms directly to raw ViT outputs results in noisy clusters.

Our proposed DVT effectively removes these artifacts, using a streamlined and generalizable denoiser. See our paper for more details.
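The decomposition described above can be illustrated with a toy sketch. In the paper, the semantics term is parameterized with neural fields and optimized per image; the simplified version below instead uses plain alternating averaging on synthetic features, purely to show how cross-view consistency can separate a semantics term (fixed in image coordinates) from a positional artifact term (fixed in patch coordinates). All shapes, offsets, and the estimation loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a "semantic" feature canvas fixed in image coordinates, plus a
# grid artifact fixed in *patch* coordinates (a hypothetical stand-in for the
# positional-embedding artifact discussed above).
H, W, C = 32, 32, 4                      # canvas size and feature dimension
h, w = 16, 16                            # crop ("ViT output") size
canvas = rng.normal(size=(H, W, C))      # ground-truth semantics
artifact = rng.normal(size=(h, w, C))    # ground-truth artifact

# Sample several crops; each raw output = aligned semantics + shared artifact.
offsets = [(0, 0), (3, 5), (8, 2), (12, 10), (5, 12)]
outputs = [canvas[r:r + h, c:c + w] + artifact for r, c in offsets]

# Alternating estimation exploiting cross-view consistency: semantics must
# agree after aligning crops in image coordinates, while the artifact repeats
# verbatim in every crop.
art_est = np.zeros((h, w, C))
for _ in range(30):
    # Update semantics: average the artifact-subtracted crops per canvas pixel.
    acc = np.zeros((H, W, C))
    cnt = np.zeros((H, W, 1))
    for (r, c), y in zip(offsets, outputs):
        acc[r:r + h, c:c + w] += y - art_est
        cnt[r:r + h, c:c + w] += 1
    sem_est = acc / np.maximum(cnt, 1)
    # Update artifact: average residual after removing the aligned semantics.
    art_est = np.mean([y - sem_est[r:r + h, c:c + w]
                       for (r, c), y in zip(offsets, outputs)], axis=0)

# The artifact is identifiable only up to a per-channel constant offset
# (gauge ambiguity), so we compare the two after centering each channel.
err = (art_est - art_est.mean((0, 1))) - (artifact - artifact.mean((0, 1)))
print(np.abs(err).mean())
```

Subtracting the estimated artifact from each raw output then yields denoised, view-consistent features, mirroring the role of the per-image optimization stage; the learnable denoiser amortizes this step so unseen images need no optimization at all.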

*All results showcased are derived from our generalizable denoiser applied to DINOv2 ViT-base outputs.
*None of the videos was seen during training.
*Best viewed in Chrome.



More results

DVT denoises DINOv2 models trained with register tokens

Appending register tokens to the input of ViTs is a recently proposed method aimed at removing artifacts in ViTs and enhancing their performance. However, we find that models trained with registers still exhibit artifacts, albeit to a lesser extent. Our method further reduces these artifacts.

*None of the videos was seen during training.
*Best viewed in Chrome.



DVT denoises different pre-trained ViTs

Our method generalizes to a variety of pre-trained ViTs, as shown below.

*None of the videos was seen during training.
*Best viewed in Chrome.


Citation

@article{yang2024denoising,
  author = {Jiawei Yang and Katie Z Luo and Jiefeng Li and Kilian Q Weinberger and Yonglong Tian and Yue Wang},
  title = {Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}

Acknowledgement

We are grateful to many friends, including Congyue Deng, Jiageng Mao, Justin Lovelace, Varsha Kishore, Christian Belardi, and Junjie Ye, for their fruitful discussions on this work and follow-ups. We acknowledge an unrestricted gift from Google in support of this project.


The website template was borrowed from FreeNeRF and ZipNeRF.