DINO — A Foundation Model for Computer Vision | by Sascha Kirch | Sep, 2023

The DINO framework shares its overall structure with other similarity-learning frameworks such as BYOL or the mean teacher, but also with knowledge distillation. Let’s first look at how DINO does it and then differentiate it from the other frameworks.

Fig. 2: DINO architecture. Source + annotations by Sascha Kirch

Networks and Update Rule

Let’s start from the middle. DINO implements two networks with the exact same architecture but a different set of weights. Those are the student and the teacher. The student is trained with back propagation and the teacher updates its weights with an exponential moving average of its own weights and those of the student.

Equation 1: Update rule of the teacher’s weights. Source + annotations by Sascha Kirch
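To make the update rule concrete, here is a minimal sketch of an exponential moving average over weights. The function name `ema_update`, the momentum name `lam`, and the toy weight lists are assumptions made for illustration; real models hold tensors, and the momentum value 0.9 is chosen only so the movement is visible.

```python
def ema_update(teacher, student, lam):
    """Return new teacher weights: lam * teacher + (1 - lam) * student."""
    return [lam * t + (1.0 - lam) * s for t, s in zip(teacher, student)]


# Toy "weights": plain floats stand in for the tensors of a real network.
teacher = [0.0, 0.0]
student = [1.0, -2.0]

# The student is kept fixed here; in training it would change after every
# back propagation step, and the teacher would follow it slowly.
for step in range(3):
    teacher = ema_update(teacher, student, lam=0.9)
    print(step, teacher)
```

With a momentum close to 1 the teacher changes slowly, so it behaves like a smoothed average of the students from previous steps.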

Backbones are either a ResNet50 or DeiT (which is a ViT adapted for knowledge distillation). An MLP-based projection head is connected to the backbone and maps its features to the output on which the loss is computed; it is removed for inference.

Nice, but which model is used for inference: student or teacher? Well, that’s a good question and, funnily enough, the paper does not say a single word about it. Intuitively you might think the student, at least I did at first. But as we will see later, the teacher outperforms the student throughout training. The only hint besides the better performance is that in the code implementation the teacher checkpoint is the default for evaluations such as video segmentation, linear probing and k-NN. Since this parameter can be changed though, I cannot tell you with certainty.

Inputs and Outputs

From an input image x, different views x1 and x2 are created by cropping and applying image augmentations like in BYOL (e.g. color jitter, Gaussian blur and solarization). The technique used for cropping is called multi-crop, where multiple crops of different sizes are generated to save memory while providing more data. Small crops are called local views and consist of 96×96 pixels; they are fed into the student only. Larger crops are called global views and consist of 224×224 pixels; they are the only views fed into the teacher, and the student processes them as well. As we will see later in the ablation section, 2 global views and 10 local views have been used during training.

NOTE: The paper is a bit confusing regarding the multi-crop technique because neither the provided pseudo-code nor the architecture shown in Fig. 2 above reflects it. The pseudo-code feeds x1 and x2 into both the student and the teacher like in BYOL, which covers only the two global views; the local views, which go to the student alone, do not appear in it.

In contrast to similarity learning, where the objective is to maximize the similarity of embeddings, DINO minimizes the cross-entropy between the teacher’s and the student’s output distributions. As indicated by the equation below, the cross-entropy is calculated for each pair of a global view seen by the teacher and a different view seen by the student, and is then summed up.

Equation 2: Optimization objective. Source + annotations by Sascha Kirch
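The objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes, as in the DINO paper, that the student processes all views while the teacher processes only the global ones, and that a view is never paired with itself. The helper names, the toy random outputs and the student temperature of 0.1 are assumptions, and centering is left out to keep the sketch short.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled SoftMax, shifted by the max for numerical stability."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_teacher, p_student):
    """H(p_teacher, p_student) = -sum(p_teacher * log(p_student))."""
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def summed_cross_entropy(teacher_out, student_out, t_temp=0.04, s_temp=0.1):
    """Sum the cross-entropy over all (teacher view, student view) pairs.

    teacher_out holds outputs for the global views only. student_out holds
    outputs for all views, with the global views first and in the same order,
    so equal indices refer to the same view and are skipped.
    """
    total, pairs = 0.0, 0
    for i, t_logits in enumerate(teacher_out):
        p_t = softmax(t_logits, t_temp)
        for j, s_logits in enumerate(student_out):
            if i == j:
                continue  # never compare a view with itself
            total += cross_entropy(p_t, softmax(s_logits, s_temp))
            pairs += 1
    return total, pairs


rng = random.Random(0)


def toy_output(dim=8):
    """Hypothetical network output: random values instead of a real model."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


n_global, n_local = 2, 3
teacher_out = [toy_output() for _ in range(n_global)]
student_out = [toy_output() for _ in range(n_global + n_local)]

loss, pairs = summed_cross_entropy(teacher_out, student_out)
print(pairs, loss)
```

Under these assumptions each of the 2 teacher views is compared with the 4 other student views, which gives 8 terms in the sum for this toy setup.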

And what do the models output? Like in similarity learning, the student and the teacher output an embedding for a given image, rather than a prediction score. Like in knowledge distillation, the output is transformed via a SoftMax into a probability distribution. The SoftMax has a temperature parameter that controls the smoothing or sharpening of the resulting distribution. This temperature plays a crucial role in knowledge distillation because it controls the balance between transferring general knowledge and fine-grained details from a teacher network to a student network, making the distillation process more effective for different tasks.
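The effect of the temperature is easy to reproduce with a small, self-contained sketch. The logits and the three temperature values below are arbitrary choices for illustration; the entropy is printed as a rough measure of how flat the distribution is.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled SoftMax, shifted by the max for numerical stability."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.5, 0.1]

# A low temperature sharpens the distribution, a high one smooths it.
for temperature in (0.1, 1.0, 10.0):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

The lower the temperature, the more probability mass ends up on the largest logit; the higher the temperature, the closer the output gets to a uniform distribution.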

Fig. 3: Effect of temperature value on the SoftMax output. Illustration by Sascha Kirch created with this python notebook

I created a notebook for you so you can investigate the impact of the temperature on the resulting distribution.

Avoiding Collapse

As mentioned earlier, student and teacher have the exact same architecture. Without countermeasures, this kind of setup is unstable and might result in collapsing solutions, where all features are mapped to a certain region in the latent space, in the worst case a single point. BYOL addressed this issue with an extra prediction head for only one of the models, introducing an asymmetry. Since DINO has symmetric models, another trick is required: centering and sharpening. Both are applied to the teacher network only. Centering is a technique that prevents a single dimension in the latent space from dominating by adding a bias term c to the teacher’s output, g(x) ← g(x) + c, where

Equation 3: Update rule of the centering term. Source + annotations by Sascha Kirch
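As a rough sketch of the centering term: in the DINO paper the center is an exponential moving average of the batch mean of the teacher outputs, with rate m. The function name `update_center` and the toy batch below are assumptions. The sketch subtracts the center from the teacher output, which follows the paper's pseudo-code; treat the sign convention as an assumption.

```python
def update_center(center, teacher_batch, m=0.9):
    """Move the center towards the per-dimension mean of the teacher outputs."""
    batch_size = len(teacher_batch)
    batch_mean = [sum(column) / batch_size for column in zip(*teacher_batch)]
    return [m * c + (1.0 - m) * b for c, b in zip(center, batch_mean)]


# Toy teacher outputs for a batch of two images; the first dimension dominates.
teacher_batch = [
    [4.0, 1.0, 0.0],
    [6.0, -1.0, 0.0],
]

center = [0.0, 0.0, 0.0]
for _ in range(50):  # the batch is kept fixed only to show the convergence
    center = update_center(center, teacher_batch, m=0.9)

# Centering removes the shared offset, so no dimension wins for every image.
centered = [[v - c for v, c in zip(row, center)] for row in teacher_batch]
print(center)
print(centered)
```

After enough updates the center approaches the batch mean, so the large shared offset in the first dimension disappears from the centered outputs.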

While centering has a positive effect, it also encourages the output to collapse into a uniform distribution. Sharpening has the opposite effect, hence applying both balances their effects and stabilizes training. Sharpening is achieved by using a smaller temperature in the SoftMax (see Fig. 3) for the teacher than for the student.

To avoid collapse, the hyperparameter m from equation 3 and the temperature of the teacher are crucial. In their ablation study in the appendix, the authors show that m = 0.9 … 0.999 works best and that the teacher temperature is linearly increased from 0.04 to 0.07 during warm-up.
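A linear warm-up of the teacher temperature could look like the sketch below. The function name and the warm-up length of 30 epochs are assumptions for illustration; only the start and end values of 0.04 and 0.07 come from the text above.

```python
def teacher_temperature(epoch, warmup_epochs=30, start=0.04, end=0.07):
    """Linearly increase the temperature during warm-up, then keep it constant."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs


for epoch in (0, 10, 20, 30, 100):
    print(epoch, round(teacher_temperature(epoch), 4))
```

Starting with the lower temperature means the teacher output is sharpest at the beginning of training and is relaxed slightly as the warm-up proceeds.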

What does DINO do? Knowledge Distillation or Similarity Learning?

The answer is a little bit of both!

While knowledge distillation usually distils knowledge from an already trained, larger and more accurate teacher model into a smaller student model, it could also be seen as some sort of similarity learning because it encourages the student network to produce predictions that are similar to those of the teacher. In similarity learning, the two models are usually trained jointly and often align their latent space predictions rather than probability distributions.

Since the authors of DINO phrase their objective as knowledge distillation, let’s look at some differences compared with “standard” knowledge distillation:

  1. DINO’s teacher is not available a priori but “trained” alongside the student. It can even be considered as a co-distillation since knowledge is also distilled from the student into the teacher.
  2. DINO’s teacher and student are not acting on the same input but on different views of the image cropped to different sizes.
  3. DINO uses different temperatures in the SoftMax of both models to perform sharpening.
  4. DINO calculates the cross-entropy over the temperature-scaled SoftMax of the embeddings rather than prediction scores.

And how is it similar to knowledge distillation?

  1. DINO consists of a student and a teacher network, where the teacher performs better than the student as we will see in the experiments.
  2. Rather than maximizing a similarity metric, DINO minimizes the cross-entropy loss of a temperature-scaled SoftMax output.
