DVT: Denoising Vision Transformers


Denoised DINOv2

Emergent Properties of Denoised DINOv2

We empirically identify several properties that are hidden in the original DINOv2 features but revealed in our denoised version.

First, we collect features from randomly sampled frames of the DAVIS dataset and compute a global PCA reduction matrix that maps the high-dimensional features to three-dimensional color values. This reduction matrix is then applied to the features of the denoised DINOv2 model.
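The global PCA color mapping described above can be sketched as follows. This is a minimal illustration using scikit-learn with stand-in random features; the feature shapes and the per-channel min-max normalization are assumptions, not the exact pipeline used on the page:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assume `features` stacks patch features from randomly sampled frames:
# shape (num_patches_total, feature_dim), e.g. DINOv2 features with dim 768.
rng = np.random.default_rng(0)
features = rng.normal(size=(4096, 768)).astype(np.float32)  # stand-in data

# Fit one global PCA so that every frame shares the same color mapping.
pca = PCA(n_components=3)
pca.fit(features)

def features_to_rgb(feats):
    """Project features to 3 PCA components, min-max scaled to [0, 1]."""
    proj = pca.transform(feats)               # (N, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)     # per-channel normalization

rgb = features_to_rgb(features)
print(rgb.shape)  # one RGB triple per patch
```

Because the PCA is fit once on features pooled across frames, the resulting colors are comparable across the whole gallery rather than varying frame by frame.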

The 3-channel PCA visualizations are shown in (h), and the individual components in (c), (d), and (e). Each component highlights a different aspect of the image; in particular, the second component captures the object's prominence, so we use it as a foreground mask to separate the object from the background and visualize the foreground's PCA feature map separately in (b). This foreground disentanglement is not observed in the first three principal components of the original DINOv2 model. Examining up to 10 PCA components of the original model, we find the 5th to be the most informative for the object, though still noisy. Since comparing directly against these noisy foreground masks would arguably be unfair, we instead use the officially provided `standard_array` to compute the foreground mask for the original DINOv2 model, although it remains unclear how this `standard_array` was generated. For reference, we also provide the original model's foreground PCA results obtained from its 5th PCA component.
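The foreground separation above can be sketched roughly as follows. The mean threshold on the second component and the per-foreground re-fit are assumptions for illustration; the actual threshold and sign convention would be tuned per dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in denoised patch features for a single frame (H*W patches, dim 768).
rng = np.random.default_rng(0)
feats = rng.normal(size=(37 * 37, 768)).astype(np.float32)

proj = PCA(n_components=3).fit_transform(feats)  # (N, 3)

# Hypothetical masking rule: patches whose 2nd PCA component exceeds its
# mean are treated as foreground (threshold/sign are illustrative only).
second = proj[:, 1]
fg_mask = second > second.mean()

# Re-run PCA on foreground patches only, for the foreground visualization.
fg_rgb = PCA(n_components=3).fit_transform(feats[fg_mask])
print(fg_mask.sum(), fg_rgb.shape)
```

Re-fitting PCA on the masked patches spends all three color channels on the object itself, which is why the foreground maps in (b) show finer part-level structure than the global PCA in (h).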

Additionally, as discussed in the paper, the feature norm of the denoised features can also serve as an object indicator. We visualize these feature norms in (f), noting how they complement the PCA results. Lastly, we visualize the KMeans clustering of the denoised features in (g) to show how the features group in the feature space.
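The norm and clustering visualizations can be sketched as below. Stand-in random features and the cluster count are assumptions; the page does not specify the number of KMeans clusters used:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in denoised patch features for a single frame.
rng = np.random.default_rng(0)
feats = rng.normal(size=(37 * 37, 768)).astype(np.float32)

# (f) Feature norm: the L2 norm of each patch feature, min-max scaled for
# display; in denoised features, higher norms tend to fall on the object.
norms = np.linalg.norm(feats, axis=1)
norm_map = (norms - norms.min()) / (norms.max() - norms.min() + 1e-8)

# (g) KMeans: cluster patch features into a small number of segments and
# color each patch by its cluster id (6 clusters is an illustrative choice).
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(feats)
print(norm_map.shape, labels.shape)
```

Both maps are computed per frame from the same patch features, so they can be laid out side by side with the PCA panels for direct comparison.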

We apply the same procedure to the original DINOv2 model for comparison. Its top 3 PCA components do not capture the object's prominence as well as the denoised model's; they mainly reflect positional patterns. Its feature norm likewise fails to serve as a good object indicator, and its KMeans clustering is adversely affected by the noise in the features.

*All results are produced by our generalizable denoiser (to be released) on frames unseen during denoiser training.
*Best viewed in Chrome.

(a) Input

(b) Foreground PCA (*ref)

(c) 1st Component of Feature PCA

(d) 2nd Component of Feature PCA

(e) 3rd Component of Feature PCA

(f) Feature Norm

(g) KMeans Cluster

(h) Feature PCA


(a) Input

(b) Foreground PCA (*ref)

(c) 1st Component of Feature PCA

(d) 2nd Component of Feature PCA

(e) 3rd Component of Feature PCA

(f) Feature Norm

(g) KMeans Cluster

(h) Feature PCA


Citation

@article{yang2024denoising,
  author = {Yang, Jiawei and Luo, Katie Z. and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q. and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}