In this post we will take an alternative look at RAFT. The head-on approach in part 1 broke down the details of the network; here we will visualize those details and build valuable intuition. In part 1 we aimed to understand RAFT well enough to use it as is; in part 2 we aim to understand it in a way that lets us leverage different parts of its architecture in our own models.

Here’s an overview of the post:

- **Motivation**
- **RAFT Architecture**
- **The Lookup Operator**
- **Iterative Updates**
- **Conclusion**

The concepts from RAFT are used in many subsequent works, and understanding RAFT is key to understanding these newer approaches. How can you know which parts of RAFT can or should be leveraged? Why do many subsequent works make use of the correlation volume? Answers to these questions come from grasping the inner workings of RAFT. It’s not always enough to understand what a paper presents at face value; sometimes we need to go deeper, and RAFT is no exception.

To begin, let’s get a quick refresher on RAFT. Its architecture can be broken down into three fundamental blocks, shown below.

## Feature Extraction

The **Feature Encoder** is a Convolutional Neural Network (CNN) that extracts features from images *I₁* and *I₂* using shared weights. The **Context Encoder** extracts context features and hidden features, both of which are input into the iterative update block.

## Visual Similarity

The dot product of the two feature maps forms a 4D **all-pairs correlation volume**, where each pixel of *g¹* maps to all pixels of *g²*; each of these mappings is referred to as a **2D response map**. Here *g¹* and *g²* are the feature map tensors extracted from *I₁* and *I₂*, respectively. Average pooling is performed over the last two dimensions of the correlation volume with kernel sizes 1, 2, 4, and 8.
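To make the shapes concrete, here is a minimal NumPy sketch of the all-pairs correlation and the pooled levels. It is not the RAFT implementation: the function names, the toy feature sizes, and the choice to pool directly with each kernel size are assumptions made for illustration.

```python
import numpy as np

def all_pairs_correlation(g1, g2):
    """Dot product between every feature pixel of g1 and every feature pixel of g2.

    g1, g2: feature maps of shape (D, H, W).
    Returns a 4D volume of shape (H, W, H, W).
    """
    return np.einsum("dij,dkl->ijkl", g1, g2)

def avg_pool_last_two(corr, k):
    """Average-pool the last two dimensions with kernel size and stride k."""
    h1, w1, h2, w2 = corr.shape
    h2c, w2c = h2 // k, w2 // k
    cropped = corr[:, :, :h2c * k, :w2c * k]
    blocks = cropped.reshape(h1, w1, h2c, k, w2c, k)
    return blocks.mean(axis=(3, 5))

# Toy feature maps (hypothetical sizes): D=16 channels, 8x16 feature pixels.
rng = np.random.default_rng(0)
g1 = rng.standard_normal((16, 8, 16))
g2 = rng.standard_normal((16, 8, 16))

corr = all_pairs_correlation(g1, g2)
pyramid = [avg_pool_last_two(corr, k) for k in (1, 2, 4, 8)]
for level, c in enumerate(pyramid):
    print(level, c.shape)
```

Note that only the last two dimensions shrink: every level still has one 2D response map per fine pixel of *g¹*.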

We stack these correlation volumes into a 5D **correlation pyramid**, with each level relating the fine pixels of *g¹* to the increasingly coarse pixel features of *g²*. This allows us to capture information about both large and small pixel displacements. The **lookup operator** extracts correlation features from the correlation volume. It takes a fine feature pixel of *g¹* along with its corresponding optical flow estimate and computes its new apparent location, which is known as its **correspondence**. It then forms a 2D grid around the correspondence with a predefined radius *r* and performs subpixel bilinear resampling along the grid to get new grid values. This resampled feature grid contains flow (descent) direction information. The lookup operator does this for each pixel at each level of the pyramid. These pixel-wise feature grids are referred to as **correlation features**, which are then reshaped and input into the **iterative update block**.

## Iterative Updates

The iterative update block takes four inputs: context features, correlation features, the current flow estimate, and hidden features. The flow and correlation features are encoded together as motion features since they both describe the relative motion of the feature pixels. The context features do not change; they function as a stable reference for the update block. The block itself consists of a ConvGRU that performs a set number of recurrent updates, followed by a Flow Head of convolutional layers that converts the hidden state to a flow estimate at 1/8 of the original input resolution.

The update operator functions like an Optimization Algorithm, meaning that it starts from an initial flow *f₀* and iteratively computes new flow values *Δfₖ* that are added to the previous flow estimate, *fₖ₊₁ = fₖ + Δfₖ*, until it converges to a fixed value, *fₖ → f\**. As we perform the iterative updates, both the flow estimate and the correlation features (not the correlation pyramid) are continuously being refined. Once the iterations are exhausted, the flow estimate is upsampled from 1/8 to the original resolution.

## Convex Upsampling

**Convex Upsampling** estimates each fine pixel as the convex combination of its neighboring 3×3 grid of coarse pixels. The weights are parameterized by a neural network that is able to learn optimal weights for each fine pixel. An example is shown below.
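The convex combination can be sketched in NumPy as follows. This is an illustration under stated assumptions, not RAFT's code: the weights are passed in as random logits instead of being predicted by a network, the function name is hypothetical, the upsampling factor of 8 matches the 1/8 resolution mentioned earlier, and edge padding at the borders is a choice made for this sketch.

```python
import numpy as np

def convex_upsample(flow, logits, factor=8):
    """Upsample a coarse flow field with per-pixel convex combinations.

    flow:   (2, H, W) coarse flow.
    logits: (9, factor, factor, H, W) unnormalized weights, one set of 9
            per fine pixel (here an input, in RAFT predicted by a network).
    Returns a (2, factor*H, factor*W) flow field.
    """
    _, h, w = flow.shape
    # Softmax over the 9 neighbors: weights are non-negative and sum to 1.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)

    # Gather the 3x3 coarse neighborhood of every coarse pixel. The flow is
    # scaled by `factor` because displacements grow with the resolution.
    padded = np.pad(factor * flow, ((0, 0), (1, 1), (1, 1)), mode="edge")
    neighbors = np.stack(
        [padded[:, dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=1,
    )  # (2, 9, H, W)

    # Weighted sum over the 9 neighbors for each factor x factor fine pixel.
    up = np.einsum("kabhw,ckhw->chawb", weights, neighbors)
    return up.reshape(2, h * factor, w * factor)

rng = np.random.default_rng(0)
flow = rng.standard_normal((2, 4, 5))
logits = rng.standard_normal((9, 8, 8, 4, 5))
up = convex_upsample(flow, logits)
print(up.shape)  # (2, 32, 40)
```

Because the weights form a convex combination, every upsampled value stays within the range of its (scaled) coarse neighbors, so the upsampling cannot invent displacements larger than the ones already present.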

## Learning the Flow Offset

It’s important to remember that RAFT doesn’t estimate the flow directly; it estimates the *flow offset* from a starting point, and the output is an accumulation of these flow offsets. The first estimate is an update of the initial flow of **0** at the pixel locations of *I₁*. Information about *I₁* comes from the initial hidden state and the context features. This provides continuous feedback in the update block and guides the learning so that RAFT estimates the flow offset rather than estimating the flow afresh at each iteration.

RAFT doesn’t estimate the flow directly; its update block estimates the flow offset from a starting point, and the model outputs the accumulation of these flow offsets.
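The accumulation of offsets can be illustrated with a toy loop. Everything here is a stand-in: `toy_update` is a hypothetical function that peeks at a target flow, which the real ConvGRU update block never does, and the gain of 0.5 is arbitrary. The target is the estimated flow of test pixel 1 reported later in this post.

```python
def toy_update(flow, target, gain=0.5):
    """Hypothetical stand-in for the update block: returns a flow offset."""
    return tuple(gain * (t - f) for f, t in zip(flow, target))

def accumulate_flow(target, iters=12):
    """Start from zero flow and repeatedly add the predicted offsets."""
    flow = (0.0, 0.0)  # zero-flow initialization f_0
    history = [flow]
    for _ in range(iters):
        delta = toy_update(flow, target)
        flow = tuple(f + d for f, d in zip(flow, delta))  # f_{k+1} = f_k + delta_k
        history.append(flow)
    return history

history = accumulate_flow(target=(-5.8, -26.4))
for k, (u, v) in enumerate(history):
    print(k, round(u, 3), round(v, 3))
```

The offsets shrink at every step while their running sum approaches the target, which is the behavior we expect from the trained update block.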

During training, the recurrent updates of the network mimic the steps of an Optimization Algorithm, where each new flow estimate *fₖ₊₁ = fₖ + Δfₖ* is increasingly scrutinized by the objective function, forcing the network to learn more conservative estimates for *Δfₖ* as the number of iterations increases. The objective function captures all flow updates: it is the sum of weighted *L1* distances between the flow predictions and the ground truth, with weights that increase exponentially with the iteration index.
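A minimal sketch of such an objective is given below, assuming the form *γ^(N-i)* for the weight of iteration *i* out of *N*. The value `gamma=0.8` is, to our knowledge, the setting used in the paper [1], but treat it as an assumption here; the averaging over pixels and the function name are choices made for this sketch.

```python
def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Weighted sum of L1 distances over all iterations.

    flow_preds: list of predictions (one per iteration), each a list of (u, v).
    flow_gt:    list of ground-truth (u, v) tuples.
    """
    n = len(flow_preds)
    total = 0.0
    for i, pred in enumerate(flow_preds, start=1):
        weight = gamma ** (n - i)  # later iterations get larger weights
        l1 = sum(
            abs(pu - gu) + abs(pv - gv)
            for (pu, pv), (gu, gv) in zip(pred, flow_gt)
        )
        total += weight * l1 / len(flow_gt)  # mean over pixels (assumption)
    return total

# One pixel, three iterations that move toward the ground truth.
flow_gt = [(1.0, 0.0)]
flow_preds = [[(0.0, 0.0)], [(0.5, 0.0)], [(1.0, 0.0)]]
loss = sequence_loss(flow_preds, flow_gt)
print(loss)
```

With three iterations the weights are 0.64, 0.8, and 1, so the same error costs more the later it occurs.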

This imitation of an Optimization Algorithm and the radius of the lookup operator work together to constrain the search space for each update, which in turn reduces the risk of over-fitting, leads to faster training, and improves generalization.

Now we can start unpacking RAFT and gain insight into how it makes predictions. The code for this tutorial is located on GitHub, where the RAFT code has been modified to output its intermediate feature maps at each major block. The test images are of a counter-clockwise rotating ceiling fan, which provides flows in almost all directions. To get a large flow displacement we will skip a couple of images in the sequence; this makes the flow features more obvious and easier to study.

Now let’s inspect the maps from the Feature Encoder Block.

The Feature Encoder produces *fmap 1* and *fmap 2*. They both look like noise, but in reality they hold crucial information for the flow estimate. The hidden feature maps resemble the input image *I₁*, and they directly seed the hidden state of the ConvGRU in the update block, providing valuable information about the pixels in *I₁*. The context feature maps bear some resemblance to the input image in that they highlight strong features such as corners and edges. The original paper suggests that the context features improve the network’s ability to accurately discern spatial motion boundaries.

## The Correlations

The Correlation Pyramid is crucial for computing accurate optical flow, since it is able to capture the correspondences of pixels from *I₁* to *I₂*. As we will see shortly, **the Correlation Volume is the backbone of RAFT**, and we will visualize its superb ability to capture pixel displacement. We will approach this by inspecting several test pixels and see how their 2D response maps are able to capture the relative displacements. We can only see features, yet information about the pixel displacements will still be evident.

The Correlation Pyramid captures the correspondences of pixels in the image sequence across multiple levels of resolution.

The figure below shows the estimated flow of a test image with a few annotated test pixels.

The estimated horizontal and vertical flow field components for each pixel are:

- Pixel 0: (-49.4, -4.3)
- Pixel 1: (-5.8, -26.4)
- Pixel 2: (23.5, -9.3)

## Accessing the Correlation Features

To access the correlation maps for a given test pixel, we add the following function to the corr.py script which obtains the integer index for a given test pixel at any pyramid level.

```python
import numpy as np

def get_corr_idx(loc, lvl, w=71, h=40):
    """ Obtains index of test pixel location in correlation volume.
        loc - test pixel location (x, y)
        lvl - Pyramid level
        w - 1/8 of padded horizontal image width
        h - 1/8 of padded vertical image height
    """
    # loc[0] is the horizontal coordinate and loc[1] the vertical one
    u = np.clip(np.round(loc[0]/(lvl*8)), 0, (w-1))
    v = np.clip(np.round(loc[1]/(lvl*8)), 0, (h-1))
    # flatten (u, v) into a row-major index
    return int(u + w*v)
```

Once we obtain the test pixel index we can obtain its 2D response map, resampled correlation features, and correspondence for each pyramid level.

```python
test_pixel_idx = get_corr_idx(test_pixel, lvl=(2**i))

# get the 2D response map
corr_response = corr[test_pixel_idx, 0, :, :].detach().cpu().numpy()

# get the resampled correlation feature grids
resampled_corr = bilinear_sampler(corr, coords_lvl)
resampled_corr_response = resampled_corr[test_pixel_idx, 0, :, :].detach().cpu().numpy()

# get the correspondence
pixel_loc = centroid_lvl[test_pixel_idx, :, :, :].cpu().numpy().squeeze()
```

## Visualizing the Correlation Features

For each pixel, the 2D response map GIFs of the first pyramid level for the first 15 updates are shown below in figures 8-10. The correspondence is denoted by the red square. The large correlation values (bright spots) indicate the relative pixel location in *I₂*. Notice how the correspondence converges around the high correlation value as the network learns the flow offsets. Even though these are large displacements, the all-pairs correlation at the first pyramid level is still able to capture them.

The Correlation Pyramid is able to capture all levels of correlation, but it’s not always obvious. As we go up the pyramid levels, things become more abstract and it becomes increasingly difficult to determine what RAFT is actually doing. The fact that we are looking at correlations of extracted features makes things even more ambiguous.

## Retrieving the Descent Direction

The correspondence is used to propose the descent direction via the lookup operator. The correlation lookup operator places a grid of radius *r* around the new correspondence location *x’ = (u + f¹(u), v + f²(v))*, where *(f¹, f²)* is the current flow field estimate. The grid around *x’* is used to bilinearly resample from the correlation volume. These resampled grids are the correlation features that are fed into the Update Operator to predict the next flow estimate. The image below shows the first pyramid level 2D correlation response at pixel 1 along with its corresponding bilinearly resampled grid; the top row is the first iteration with a zero flow initialization and the bottom row is the second iteration.

Notice in the top row how the correlation response on the left is the same as the resampled grid on the right; this is due to the zero flow initialization. We also notice something very important about the bilinearly resampled grid in the top right: the largest value is to the right of the center. If we move three pixels to the right and one pixel up, or (3, -1), we land on this large value. This is the **proposed** descent direction that has been retrieved from the correlation volumes. In the iterative update block the network uses this information to formulate the actual descent direction *Δfₖ*.

On the bottom row, we can see that the correspondence has moved roughly from (39, 34) to (41.93, 33.3), which is a displacement of (2.93, -0.7), showing that the network has actually utilized the proposed descent direction. In the resampled grid on the bottom right, we see that the largest value is in the center and aligned with the correspondence, indicating that the network already has a flow prediction that is close to convergence.
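The lookup and the proposed descent direction described above can be reproduced on a synthetic response map. This is a simplified sketch, not the code from corr.py: the helper names are hypothetical, the radius `r=4` is an assumed value, samples outside the map are treated as zero, and the response map is a single artificial peak placed so that the numbers mirror the pixel 1 example.

```python
import numpy as np

def bilinear_sample(response, x, y):
    """Bilinearly sample a 2D map at a subpixel location (zero outside the map)."""
    h, w = response.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xx < w and 0 <= yy < h:
                value += wx * wy * response[yy, xx]
    return value

def lookup(response, pixel, flow, r=4):
    """Resample a (2r+1) x (2r+1) grid centered on x' = pixel + flow."""
    cx, cy = pixel[0] + flow[0], pixel[1] + flow[1]
    grid = np.zeros((2 * r + 1, 2 * r + 1))
    for j, dy in enumerate(range(-r, r + 1)):
        for i, dx in enumerate(range(-r, r + 1)):
            grid[j, i] = bilinear_sample(response, cx + dx, cy + dy)
    return grid

def proposed_direction(grid):
    """Offset (dx, dy) from the grid center to its largest value."""
    r = grid.shape[0] // 2
    j, i = np.unravel_index(np.argmax(grid), grid.shape)
    return int(i - r), int(j - r)

# Synthetic 40 x 71 response map with a single peak at (x=42, y=33).
response = np.zeros((40, 71))
response[33, 42] = 1.0

first = lookup(response, pixel=(39, 34), flow=(0.0, 0.0))
second = lookup(response, pixel=(39, 34), flow=(2.93, -0.7))
print(proposed_direction(first))   # (3, -1)
print(proposed_direction(second))  # (0, 0)
```

With zero flow the peak sits three pixels right and one pixel up from the grid center; once the flow has moved the correspondence to (41.93, 33.3), the peak lands in the center of the resampled grid.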

## Motion Features

The motion features are the convolutional encoding of the correlation features and the current flow estimate. They provide pixel flow information to be refined by the update block. Some of the motion feature maps at each iteration are displayed below.

It seems like the motion features correspond to pixels with large movements, and different feature maps seem to correspond to different pixel flows; this is apparent with features 126 and 127 on the right of figure 12. They all converge in a similar manner to the actual flow predictions.

In this post we have learned about RAFT and its inner workings. We have seen how the extracted hidden features provide useful information about *I₁*, while the extracted context features provide reference information about the strong features of *I₁*. We have visualized how the correlation volume is able to capture information about small and large pixel displacements. The correlation concepts from RAFT are used in many subsequent works, and the intuition built from the visualizations helps explain why. If you have made it this far, congratulations! You now have a deeper understanding of RAFT than what is presented at surface level.

[1] Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. *Computer Vision — ECCV 2020*, 402–419. https://doi.org/10.1007/978-3-030-58536-5_24
