From MOCO v1 to v3: Towards Building a Dynamic Dictionary for Self-Supervised Learning — Part 1

by Mengliu Zhao, Jul 2024


A gentle recap on the momentum contrast learning framework

Have we reached the era of self-supervised learning?

Data is flowing in every day. People are working 24/7. Jobs are distributed to every corner of the world. But still, so much data is left unannotated, waiting to be used by a new model, a new training run, or a new upgrade.

Or maybe that will never happen. It will never happen as long as the world keeps running in a purely supervised fashion.

The rise of self-supervised learning in recent years has unveiled a new direction. Instead of creating annotations for every task, self-supervised learning splits the problem into pretext/pre-training tasks (see my previous post on pre-training here) and downstream tasks. The pretext task focuses on extracting representative features from the whole dataset without the guidance of any ground-truth annotations. Still, it requires labels, which are generated automatically from the dataset itself, usually by extensive data augmentation. Hence, we use the terms unsupervised learning (the dataset is unannotated) and self-supervised learning (the tasks are supervised by self-generated labels) interchangeably in this article.
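To make "self-generated labels" concrete, here is a minimal sketch of how a contrastive pretext task typically builds its own supervision: two random augmentations of the same image are treated as a positive pair, while views of other images act as negatives. The augmentation recipe below is purely illustrative, not the exact pipeline of any particular paper.

```python
# A minimal sketch of self-generated supervision for a contrastive pretext task.
# The augmentation choices below are illustrative, not tied to a specific paper.
import torchvision.transforms as T
from PIL import Image

# Two random views of the same image form a positive pair; views of other
# images serve as negatives. No human annotation is involved.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def two_views(img: Image.Image):
    """Return two independently augmented views of the same image."""
    return augment(img), augment(img)
```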

Contrastive learning is a major category of self-supervised learning. It uses unlabelled datasets and losses that encode contrastive information (e.g., the contrastive loss, the InfoNCE loss, and the triplet loss) to train the deep learning network. Major contrastive learning frameworks include SimCLR, SimSiam, and the MOCO series.

MOCO is short for “momentum contrast.” The core idea was laid out in the first MOCO paper, which frames the computer vision self-supervised learning problem as follows:

“[quote from original paper] Computer vision, in contrast, further concerns dictionary building, as the raw signal is in a continuous, high-dimensional space and is not structured for human communication… Though driven by various motivations, these (note: recent visual representation learning) methods can be thought of as building dynamic dictionaries… Unsupervised learning trains encoders to perform dictionary look-up: an encoded ‘query’ should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss.”
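To make the dictionary look-up view concrete, below is a minimal sketch of the InfoNCE loss under the usual assumptions (L2-normalized query and key embeddings, a small temperature): the query is scored against its matching key and against every other key in the dictionary, and the loss is a cross-entropy that labels the matching key as the correct "class." The tensor names, shapes, and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, temperature=0.07):
    """InfoNCE: the query should score high against its matching key and
    low against all other keys in the dictionary.

    q:     (N, C) query embeddings
    k_pos: (N, C) matching (positive) key embeddings
    k_neg: (K, C) other (negative) key embeddings
    All embeddings are assumed to be L2-normalized.
    """
    # positive logits: (N, 1), one similarity per query
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # negative logits: (N, K), similarity of each query to every negative key
    l_neg = torch.einsum("nc,kc->nk", q, k_neg)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # the positive key sits at index 0 of each row
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

In MOCO, the pool of negative keys in this look-up is exactly what the dynamic dictionary supplies.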

In this article, we’ll do a gentle review of MOCO v1 to v3:

  • v1 — the paper “Momentum contrast for unsupervised visual representation learning” was published in CVPR 2020. It proposes a momentum update for the key ResNet encoder, paired with a queue of encoded samples as negatives and the InfoNCE loss (see the sketch after this list).
  • v2 — the paper “Improved baselines with momentum contrastive learning” came out immediately after, adopting two improvements from SimCLR: a) replacing the FC projection head with a 2-layer MLP and b) extending the original data augmentation with blur.
  • v3 — the paper “An empirical study of training self-supervised vision transformers” was published in ICCV 2021. The framework extends the single key-query pair to two key-query pairs, which are used to form a SimSiam-style symmetric contrastive loss. The backbone was also extended from ResNet-only to both ResNet and ViT.
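Putting the v1 ingredients together, here is a simplified, hedged sketch of one MOCO-style training step: the query encoder is updated by backpropagation, the key encoder follows it through a momentum (moving-average) update, and a queue of past key embeddings plays the role of the dictionary of negatives. Function names, the momentum and temperature values, and the queue bookkeeping are illustrative, and details such as shuffling batch norm are omitted.

```python
import torch
import torch.nn.functional as F

momentum = 0.999    # key-encoder momentum (illustrative value)
temperature = 0.07  # softmax temperature (illustrative value)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=momentum):
    """Key encoder parameters follow the query encoder as a moving average."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def moco_step(encoder_q, encoder_k, queue, x_q, x_k):
    """One simplified MOCO-style step on two augmented views x_q, x_k.

    queue: (K, C) tensor of past key embeddings acting as negatives.
    Returns the InfoNCE loss and the new keys to be enqueued afterwards.
    """
    q = F.normalize(encoder_q(x_q), dim=1)       # queries: (N, C), with gradient
    with torch.no_grad():
        momentum_update(encoder_q, encoder_k)    # update key encoder first
        k = F.normalize(encoder_k(x_k), dim=1)   # keys: (N, C), no gradient

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum("nc,kc->nk", q, queue)            # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    # after the optimizer step, k is enqueued and the oldest keys are dequeued
    return loss, k
```

In v3, the same loss is computed symmetrically: each of the two augmented crops is encoded once as a query and once as a key, and the two resulting InfoNCE terms are summed.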
Image source: https://pxhere.com/en/photo/760197


