Deep Dive into Transformers for Object Detection (DETR)


A deep dive into, and clear explanations of, the paper “End-to-End Object Detection with Transformers”


Note: This article delves into the intricate world of Computer Vision, specifically focusing on Transformers and the Attention Mechanism. It’s recommended to be acquainted with the key concepts from the paper “Attention is All You Need.”

DETR, short for DEtection TRansformer, pioneered a new wave in object detection when it was introduced by Nicolas Carion and his team at Facebook AI Research in 2020.

While not currently holding the SOTA (State Of The Art) status, DETR’s innovative reformulation of object detection tasks has significantly influenced subsequent models, such as CO-DETR, which is the current State-of-the-Art in Object Detection and Instance Segmentation.

Moving away from the conventional one-to-many scenario, where each ground truth corresponds to myriad anchor candidates, DETR introduces a fresh perspective by viewing object detection as a set prediction problem, with a one-to-one correspondence between predictions and ground truths, thereby eliminating the need for certain post-processing techniques.

Object detection is a domain of computer vision that focuses on identifying and locating objects within images or video frames. Beyond merely classifying an object, it provides a bounding box, indicating the object’s location in the image, thereby enabling systems to understand the spatial context and positioning of various identified objects.

YOLOv5 video segmentation, source

Object detection is useful in itself, for example in autonomous driving, but it is also a preliminary task for instance segmentation, where we seek a more precise contour of each object while being able to differentiate between different instances (unlike semantic segmentation).

Non-Maximum Suppression, Image by author

Non-Maximum Suppression (NMS) has long been the cornerstone of object detection algorithms, playing an indispensable role in post-processing to refine the prediction outputs. In traditional object detection frameworks, the model proposes a plethora of bounding boxes around potential object regions, some of which invariably exhibit substantial overlap (as seen in the picture above).

NMS addresses this by preserving the bounding box with the maximum predicted objectness score while concurrently suppressing the neighboring boxes that manifest a high degree of overlap, quantified by the Intersection over Union (IoU) metric. Specifically, given a pre-established IoU threshold, NMS iteratively selects the bounding box with the highest confidence score and nullifies those with IoU exceeding this threshold, ensuring a singular, highly confident prediction per object.
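To make this concrete, here is a minimal NMS sketch in PyTorch. It is an illustration only (in practice you would reach for the battle-tested torchvision.ops.nms), and it assumes boxes in (x1, y1, x2, y2) format:

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    """Minimal NMS sketch. boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,).
    Returns the indices of the kept boxes, in decreasing score order."""
    order = scores.argsort(descending=True)  # process highest-confidence boxes first
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the selected box with all remaining boxes
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Suppress boxes that overlap too much with the selected one
        order = rest[iou <= iou_threshold]
    return torch.tensor(keep)
```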

Despite the ubiquity of NMS, DETR (DEtection TRansformer) audaciously sidesteps it, reinventing object detection by formulating it as a set prediction problem.

By leveraging transformers, DETR directly predicts a fixed-size set of bounding boxes and obviates the necessity for the traditional NMS, remarkably simplifying the object detection pipeline while preserving, if not enhancing, model performance.

At a high level, DETR is:

  • An Image Encoder (actually a double image encoder: first a CNN backbone, followed by a Transformer Encoder for more expressivity)
  • A Transformer Decoder, which produces the bounding boxes from the image encoding.
DETR architecture, image from article
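Before going block by block, here is a condensed PyTorch sketch of this high-level picture, in the spirit of the minimal implementation the authors provide in the paper. The class name MinimalDETR is mine, the positional encodings are simplified to learned embeddings, and all training logic is omitted:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Condensed DETR sketch: CNN backbone -> transformer -> class & box heads.
    Hyperparameters follow the paper (d = 256, N = 100 queries)."""

    def __init__(self, num_classes: int, d_model: int = 256, num_queries: int = 100):
        super().__init__()
        resnet = resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)           # 1x1 conv: C=2048 -> d
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, d_model))  # the N object queries
        self.row_embed = nn.Parameter(torch.rand(50, d_model // 2))        # simplified 2D
        self.col_embed = nn.Parameter(torch.rand(50, d_model // 2))        # positional encoding
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.bbox_head = nn.Linear(d_model, 4)                 # normalized (cx, cy, w, h)

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)                 # (B, 2048, H0/32, W0/32)
        z = self.proj(f)                     # (B, d, H, W)
        B, d, H, W = z.shape
        pos = torch.cat([                    # (H, W, d) grid of positional encodings
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)                  # -> (HW, 1, d)
        src = pos + z.flatten(2).permute(2, 0, 1)              # image sequence: (HW, B, d)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)    # object queries: (N, B, d)
        h = self.transformer(src, tgt)                         # all N outputs in parallel
        return self.class_head(h), self.bbox_head(h).sigmoid() # (N, B, classes+1), (N, B, 4)
```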

Let’s go into more details in each part:

1. Backbone:

We start from an initial image x_img ∈ ℝ^(3 × H0 × W0), i.e. with 3 color channels.

This image is fed into the backbone, a Convolutional Neural Network (ResNet-50 in the paper), which produces a lower-resolution activation map f ∈ ℝ^(C × H × W).

Typical values are C = 2048 and H = H0/32, W = W0/32.
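As a quick sanity check of these shapes, here is a sketch with torchvision's ResNet-50, dropping its average-pooling and classification layers so the spatial map is preserved:

```python
import torch
from torch import nn
from torchvision.models import resnet50

resnet = resnet50(weights="IMAGENET1K_V1")
backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep everything up to the C5 stage

x = torch.rand(1, 3, 800, 800)   # a dummy image, H0 = W0 = 800
f = backbone(x)                  # the high-level activation map
print(f.shape)                   # torch.Size([1, 2048, 25, 25]): C = 2048, H = W = 800/32
```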

2. Transformer encoder:

The Transformer Encoder is theoretically not mandatory but it adds more expressivity to the backbone, and ablation studies show improved performance.

First, a 1×1 convolution reduces the channel dimension of the high-level activation map f from C to a smaller dimension d.

After the convolution, we obtain a new feature map z0 ∈ ℝ^(d × H × W).

But as you know, Transformers map an input sequence of vectors to an output sequence of vectors, so we need to collapse the spatial dimension:

After collapsing the spatial dimensions, we obtain z0 ∈ ℝ^(d × HW): a sequence of HW vectors of dimension d.

Now we are ready to feed this to the Transformer Encoder.
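Concretely, the projection and the collapse look like this (a sketch continuing from the feature map f above, with d = 256 as in the paper):

```python
from torch import nn

d = 256                                     # the reduced dimension used in the paper
proj = nn.Conv2d(2048, d, kernel_size=1)    # 1x1 convolution: C -> d
z0 = proj(f)                                # (1, 256, 25, 25)
seq = z0.flatten(2).permute(2, 0, 1)        # collapse H x W: (HW, batch, d) = (625, 1, 256)

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(seq)                       # same shape; positional encodings omitted here
print(memory.shape)                         # torch.Size([625, 1, 256])
```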

It is important to note that the Transformer Encoder only uses the self-attention mechanism. And since the transformer is permutation-invariant, fixed positional encodings are added to the inputs of the attention layers.

That’s it for the Image Encoding part!

More details on the Decoder, Image from article

3. Transformer decoder:

This is the most difficult part to understand. Hang on: if you understand this, you understand most of the article.

The Decoder uses a combination of Self-Attention and Cross-Attention mechanisms. It is fed N object queries, and each query will be transformed into an output box and class.

What does a box prediction look like?

It is actually made of 2 components.

  • A bounding box, given by four coordinates that locate it in the image (DETR actually predicts the normalized center coordinates, width and height of the box).
  • A class (for example, seagull, but it can also be the special “no object” class ∅).

It is important to note that N is fixed: DETR always predicts exactly N bounding boxes, some of which can be empty. We just have to make sure that N is large enough to cover the number of objects typically found in an image (the authors use N = 100).

Then the Cross-Attention mechanism can attend to the image features produced by the encoding part (Backbone + Transformer Encoder).

If you are unsure about the mechanism, this scheme should clarify things:

Detailed Attention mechanism, image from article

In the original Transformer architecture, we produce one token, then use a combination of self-attention and cross-attention to produce the next token, and repeat. But here we do not need this autoregressive formulation: we can produce all the outputs at once and exploit parallelism.

As mentioned before, DETR produces exactly N outputs (boxes + classes), and each output corresponds to at most one ground truth. If you remember well, this is the whole point: we do not want to apply any post-processing to filter out overlapping boxes.
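The only “filtering” needed at inference is dropping the ∅ predictions, which is a simple threshold on the class probabilities. A sketch, reusing the MinimalDETR toy model from earlier (the 0.7 threshold is just an example value):

```python
import torch

# Reusing the MinimalDETR sketch from earlier (a toy model, not trained weights)
model = MinimalDETR(num_classes=91).eval()   # 91 classes, as in COCO
x = torch.rand(1, 3, 800, 800)
with torch.no_grad():
    logits, boxes = model(x)                 # (N, 1, 92) and (N, 1, 4)

probs = logits.softmax(-1)[..., :-1]         # drop the last column: the "no object" class
scores, labels = probs.max(-1)               # best real class per query
keep = scores > 0.7                          # example threshold: confident, non-empty queries
print(boxes[keep].shape)                     # only the surviving detections, no NMS involved
```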

Bipartite Matching, Image by author

We basically want to associate each prediction with the closest ground truth. So we are in fact looking for a bijection between the prediction set and the ground truth set which minimizes a total loss.

So how do we define this loss?

1. The matching loss (pairwise loss)

We first need to define a matching loss, which corresponds to the loss between one prediction box and one ground truth box:

This loss needs to account for 2 components:

  • The classification loss (is the class predicted inside the bounding box the same as the ground truth)
  • The bounding box loss (is the bounding box close to the ground truth)
L_match(y_i, ŷ_σ(i)) = −1[c_i ≠ ∅] · p̂_σ(i)(c_i) + 1[c_i ≠ ∅] · L_box(b_i, b̂_σ(i))

Matching loss, as defined in the paper (1[·] is the indicator function, c_i the target class, p̂ the predicted class probability, b the box coordinates)

More precisely, the bounding box component L_box has 2 sub-components (see the sketch after this list):

  • A generalized Intersection over Union (GIoU) loss
  • An L1 loss (the absolute difference between coordinates)
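A sketch of this box loss using torchvision's GIoU helper, with the paper's default weights (λ_iou = 2, λ_L1 = 5) and assuming boxes are given as normalized (cx, cy, w, h):

```python
import torch
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_box: torch.Tensor, tgt_box: torch.Tensor,
             lambda_iou: float = 2.0, lambda_l1: float = 5.0) -> torch.Tensor:
    """Box loss between one predicted and one ground-truth box, both given as
    normalized (cx, cy, w, h) tensors of shape (4,)."""
    l1 = torch.abs(pred_box - tgt_box).sum()                   # L1 term on raw coordinates
    pred_xyxy = box_convert(pred_box.unsqueeze(0), "cxcywh", "xyxy")
    tgt_xyxy = box_convert(tgt_box.unsqueeze(0), "cxcywh", "xyxy")
    giou = generalized_box_iou(pred_xyxy, tgt_xyxy)[0, 0]      # scalar GIoU in [-1, 1]
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1
```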

2. Total Loss of the Bijection

To compute the total loss of a bijection σ (a permutation of N elements), we just sum the matching losses over the N instances:

L_total(σ) = Σ_i L_match(y_i, ŷ_σ(i))

So basically our problem is to find the bijection σ̂ that minimizes this total loss:

σ̂ = argmin_{σ ∈ S_N} Σ_i L_match(y_i, ŷ_σ(i))

Reformulation of the problem as an optimal assignment
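This assignment problem is solved with the Hungarian algorithm, for which SciPy provides an off-the-shelf solver. A sketch with a placeholder cost matrix (in the real loss, cost[i, j] combines the classification and box terms defined above):

```python
import torch
from scipy.optimize import linear_sum_assignment

# cost[i, j] = matching loss between prediction i and ground truth j; here a
# random placeholder standing in for the classification + box terms above.
N, num_targets = 100, 6
cost = torch.rand(N, num_targets)

pred_idx, tgt_idx = linear_sum_assignment(cost.numpy())  # Hungarian algorithm
# Prediction pred_idx[k] is matched to ground truth tgt_idx[k]; the remaining
# N - num_targets predictions are supervised toward the "no object" class.
print(list(zip(pred_idx, tgt_idx)))
```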
Performance of DETR vs Faster R-CNN
  • DETR: This refers to the original model, which uses a transformer for object detection and a ResNet-50 as a backbone.
  • DETR-R101: This is a variant of DETR that employs a ResNet-101 backbone instead of ResNet-50. Here, “R101” refers to “ResNet-101”.
  • DETR-DC5: This version of DETR uses the modified, dilated C5 stage in its ResNet-50 backbone, improving the model’s performance on smaller objects due to the increased feature resolution.
  • DETR-DC5-R101: This variant combines both modifications. It uses a ResNet-101 backbone and includes the dilated C5 stage, benefiting from both the deeper network and the increased feature resolution.

DETR significantly outperforms the baselines on large objects, which is very likely enabled by the non-local computations allowed by the transformer. Interestingly enough, though, DETR achieves lower performance on small objects.

Attention on Overlapping instances, image from article

Very interestingly, we can observe that in the case of overlapping instances, the attention mechanism is able to correctly separate individual instances, as shown in the picture above.

Attention mechanism on extremities

It is also very interesting to note that attention is focused on the extremities of objects to produce the bounding box, which is exactly what we expect.

DETR is not merely a model; it is a paradigm shift, transforming object detection from a one-to-many problem into a set prediction problem, effectively utilizing Transformer architecture advancements.

Enhancements have unfolded since its inception, with models like DETR++ and CO-DETR now steering the ship as State of the Art in Instance Segmentation and Object Detection on the COCO dataset.

Thanks for reading!