
Introduction
In this tutorial we are going to reproduce in Python and explain the RoIAlign operation from torchvision.ops.roi_align. I could not find any code online that exactly reproduces the torchvision results, so I went through the C++ implementation in torchvision (which you can find here) and translated it into Python.
Background
In Computer Vision, a Region of Interest (RoI) can be defined as a region of an image where a potential object might be located in an object detection task. An example of RoI proposals is shown in Figure 1 below.

One of the object detection models that relies on RoIs is Faster R-CNN. Faster R-CNN is composed of two stages: a Region Proposal Network (RPN), which proposes RoIs and predicts whether each RoI is foreground (contains an object) or background, and a classification network, which predicts the object class contained in each RoI together with offsets, i.e., transformations (shift and resize) that refine the RoI into a final proposal so that the bounding box encloses the object more tightly. The classification network also rejects negative proposals, which do not contain objects and are classified as background. It is important to know that RoIs are predicted not in the original image space but in the feature space extracted by a vision model. The image below illustrates this idea:

We pass the original image through a pretrained vision model and extract a 3D tensor of features, in the case above of spatial size 20×15. The size can, however, differ depending on which layer we extract the features from and which vision model we use. As we can see, we can find the exact correspondence between the box in original image coordinates and the box in feature map coordinates.
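As a quick illustration of that correspondence, mapping a box from image coordinates to feature map coordinates just means scaling it by the ratio between the feature map size and the image size (this is what the spatial_scale argument of torchvision's roi_align does). The image size, stride and box below are made-up numbers for the sketch:
# hypothetical example: a 640x480 image whose backbone produces a 20x15 feature map,
# i.e. an effective stride of 32, so spatial_scale = 1 / 32
spatial_scale = 20 / 640  # = 15 / 480 = 0.03125
# a box in original image coordinates (x1, y1, x2, y2)
box_image = [128.0, 96.0, 320.0, 288.0]
# the same box in feature map coordinates
box_feats = [c * spatial_scale for c in box_image]
# box_feats = [4.0, 3.0, 10.0, 9.0]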

Now, why do we really need RoI pooling? The problem with RoIs is that they all have different sizes, while the classification network requires fixed-size features. Thus, RoI pooling enables us to map all the RoIs to the same size, e.g. to fixed 3×3 features, and to predict the classes they contain and the offsets. There are several variations of RoI pooling; in this article we will focus on RoIAlign. Let's finally see how this is implemented!
Set up
Let's first define an example feature map to work with. We assume we are at the stage where we have extracted 7×7 features from the image of interest.

Now, let’s assume we extracted a RoI with the following coordinates in red in Figure 4 (we omit features values in the boxes):

In Figure 4 we also divided our RoI into 4 sub-regions because we are pooling into a 2×2 feature. With RoIAlign we usually do average pooling.

Now the question is, how do we average pool these sub-regions? We can see they are misaligned with the grid, so we cannot simply average the cells within each sub-region. The solution is to sample regularly spaced points in each sub-region and obtain their values with bi-linear interpolation.
Bi-linear interpolation and pooling
First we need to come up with the points to interpolate in each sub-region of the RoI. Below we choose to pool into a 2×2 region and print the points we want to interpolate values for.
import numpy as np

# 7x7 image features
img_feats = np.array(
    [[0.5663671 , 0.2577112 , 0.20066682, 0.0127351 , 0.07388048, 0.38410962, 0.2822853 ],
     [0.3358975 , 0.        , 0.        , 0.        , 0.        , 0.        , 0.07561569],
     [0.23596162, 0.        , 0.        , 0.        , 0.        , 0.        , 0.04612046],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.18630868, 0.        ],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.00289604, 0.        ]],
    dtype=np.float32)
# roi proposal
roi_proposal = [2.2821481227874756, 0.3001725673675537, 4.599632263183594, 5.58889102935791]
roi_start_w, roi_start_h, roi_end_w, roi_end_h = roi_proposal
# pooling regions size
pooled_height = 2
pooled_width = 2
# RoI width and height
roi_width = roi_end_w - roi_start_w
roi_height = roi_end_h - roi_start_h
# roi_height= 5.288, roi_width = 2.317
# we divide each RoI sub-region into roi_bin_grid_h x roi_bin_grid_w areas.
# These will define the number of sampling points in each sub-region
roi_bin_grid_h = np.ceil(roi_height / pooled_height)
roi_bin_grid_w = np.ceil(roi_width / pooled_width)
# roi_bin_grid_h = 3, roi_bin_grid_w = 2
# Thus overall we have 6 sampling points in each sub-region
# raw height and weight of each RoI sub-regions
bin_size_h = roi_height / pooled_height
bin_size_w = roi_width / pooled_width
# bin_size_h = 2.644, bin_size_w = 1.158
# variable to be used to calculate pooled value in each sub-region
output_val = 0
# ph and pw define each square (sub-region) the RoI is divided into.
ph = 0
pw = 0
# iy and ix represent sampled points within each sub-region in RoI.
# In this example roi_bin_grid_h = 3 and roi_bin_grid_w = 2, thus we
# have overall 6 points for which we interpolate the values and then average
# them to come up with a value for each of the 4 areas in pooled RoI
# sub-regions
for iy in range(int(roi_bin_grid_h)):
    # ph * bin_size_h - which square in RoI to pick vertically (on y axis)
    # (iy + 0.5) * bin_size_h / roi_bin_grid_h - which of the roi_bin_grid_h
    # points vertically to select within the square
    yy = roi_start_h + ph * bin_size_h + (iy + 0.5) * bin_size_h / roi_bin_grid_h
    for ix in range(int(roi_bin_grid_w)):
        # pw * bin_size_w - which square in RoI to pick horizontally (on x axis)
        # (ix + 0.5) * bin_size_w / roi_bin_grid_w - which of the roi_bin_grid_w
        # points horizontally to select within the square
        xx = roi_start_w + pw * bin_size_w + (ix + 0.5) * bin_size_w / roi_bin_grid_w
        print(xx, yy)
# xx and yy values:
# 2.57 0.74
# 3.15 0.74
# 2.57 1.62
# 3.15 1.62
# 2.57 2.50
# 3.15 2.50
In Figure 6 we can see the coordinates of the corresponding 6 sample points for sub-region 1.

To do the bi-linear interpolation of the value corresponding to the first point with coordinates (2.57, 0.74), we find the box in which this point lies. We take the floor of these values, (2, 0), which corresponds to the top-left point of the box (x_low, y_low), and then, adding 1 to these coordinates, we find the bottom-right point (x_high, y_high) of the box – (3, 1). This is represented in the figure below:

According to Figure 3, the feature value at point (0, 2) is 0.2, at point (0, 3) it is 0.012, and so on. Following the previous code, inside the innermost loop we find the interpolated value for the red point inside the sub-region:
# feature map dimensions, needed to clamp the point at the border
height, width = img_feats.shape
x = xx; y = yy
if y <= 0: y = 0
if x <= 0: x = 0
y_low = int(y); x_low = int(x)
if y_low >= height - 1:
    y_high = y_low = height - 1
    y = y_low
else:
    y_high = y_low + 1
if x_low >= width - 1:
    x_high = x_low = width - 1
    x = x_low
else:
    x_high = x_low + 1
# compute weights and bilinear interpolation
ly = y - y_low; lx = x - x_low
hy = 1. - ly; hx = 1. - lx
w1 = hy * hx; w2 = hy * lx; w3 = ly * hx; w4 = ly * lx
output_val += (w1 * img_feats[y_low, x_low] + w2 * img_feats[y_low, x_high] +
               w3 * img_feats[y_high, x_low] + w4 * img_feats[y_high, x_high])
So we have for the red point the following result:

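If you want to check the arithmetic by hand, the (rounded) weights and interpolated value for the red point work out roughly as follows:
# red point (xx, yy) ≈ (2.57, 0.74); taking the floor gives x_low = 2, y_low = 0
# lx ≈ 0.57, ly ≈ 0.74, hx ≈ 0.43, hy ≈ 0.26
# w1 ≈ 0.26 * 0.43 ≈ 0.111  -> img_feats[0, 2] = 0.2007
# w2 ≈ 0.26 * 0.57 ≈ 0.148  -> img_feats[0, 3] = 0.0127
# w3 ≈ 0.74 * 0.43 ≈ 0.318  -> img_feats[1, 2] = 0.0
# w4 ≈ 0.74 * 0.57 ≈ 0.422  -> img_feats[1, 3] = 0.0
# value ≈ 0.111 * 0.2007 + 0.148 * 0.0127 ≈ 0.0241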
If we then do it for all the 6 points in the sub-region, we get the following results:
# interpolated values for each point in the sub-region
[0.0241, 0.0057, 0., 0., 0., 0.]
# if we then take the average we get the pooled average value for
# the first region:
0.004973
At the end we get the following average pooled results:

The full code:
import numpy as np

img_feats = np.array(
    [[0.5663671 , 0.2577112 , 0.20066682, 0.0127351 , 0.07388048, 0.38410962, 0.2822853 ],
     [0.3358975 , 0.        , 0.        , 0.        , 0.        , 0.        , 0.07561569],
     [0.23596162, 0.        , 0.        , 0.        , 0.        , 0.        , 0.04612046],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.18630868, 0.        ],
     [0.        , 0.        , 0.        , 0.        , 0.        , 0.00289604, 0.        ]],
    dtype=np.float32)
roi_proposal = [2.2821481227874756, 0.3001725673675537, 4.599632263183594, 5.58889102935791]
def precalc_bilinear(height, width, pooled_height, pooled_width, roi_start_h, roi_start_w,
                     bin_size_h, bin_size_w, roi_bin_grid_h, roi_bin_grid_w):
    precalc = dict()
    pre_calc_index = 0
    # ph and pw define each square (sub-region) the RoI is divided into. For example,
    # with pooled_height = pooled_width = 2 we divide the RoI into 4 areas, and for each
    # (ph, pw) we sample points from that area (e.g. (0, 0) - sample from square (0, 0))
    for ph in range(int(pooled_height)):
        for pw in range(int(pooled_width)):
            # iy and ix represent sampled points within each area of the RoI.
            # For example, with roi_bin_grid_h = 3 and roi_bin_grid_w = 2 we will
            # have overall 6 points whose values we interpolate and then average to
            # come up with a value for each of the 4 areas in the pooled RoI region
            # (which is 2 x 2 if pooled_height = pooled_width = 2)
            for iy in range(int(roi_bin_grid_h)):
                # ph * bin_size_h - which square in RoI to pick vertically (on y axis)
                # (iy + 0.5) * bin_size_h / roi_bin_grid_h - which of the roi_bin_grid_h
                # points vertically to select within the square
                yy = roi_start_h + ph * bin_size_h + (iy + 0.5) * bin_size_h / roi_bin_grid_h
                for ix in range(int(roi_bin_grid_w)):
                    # pw * bin_size_w - which square in RoI to pick horizontally (on x axis)
                    # (ix + 0.5) * bin_size_w / roi_bin_grid_w - which of the roi_bin_grid_w
                    # points horizontally to select within the square
                    xx = roi_start_w + pw * bin_size_w + (ix + 0.5) * bin_size_w / roi_bin_grid_w
                    x = xx
                    y = yy
                    if y <= 0: y = 0
                    if x <= 0: x = 0
                    y_low = int(y); x_low = int(x)
                    if y_low >= height - 1:
                        y_high = y_low = height - 1
                        y = y_low
                    else:
                        y_high = y_low + 1
                    if x_low >= width - 1:
                        x_high = x_low = width - 1
                        x = x_low
                    else:
                        x_high = x_low + 1
                    # bilinear interpolation weights and the four neighbouring positions
                    ly = y - y_low; lx = x - x_low
                    hy = 1. - ly; hx = 1. - lx
                    w1 = hy * hx; w2 = hy * lx; w3 = ly * hx; w4 = ly * lx
                    pos1 = (y_low, x_low); pos2 = (y_low, x_high)
                    pos3 = (y_high, x_low); pos4 = (y_high, x_high)
                    precalc[pre_calc_index] = (pos1, pos2, pos3, pos4, w1, w2, w3, w4)
                    pre_calc_index += 1
    return precalc
aligned = False
pooled_height = 2
pooled_width = 2
sampling_ratio = -1
height, width = img_feats.shape
if aligned:
    offset = 0.5
else:
    offset = 0
roi_start_w, roi_start_h, roi_end_w, roi_end_h = roi_proposal
roi_start_w -= offset
roi_start_h -= offset
roi_end_w -= offset
roi_end_h -= offset
roi_width = roi_end_w - roi_start_w
roi_height = roi_end_h - roi_start_h
if not aligned:
    roi_width = max(roi_width, 1)
    roi_height = max(roi_height, 1)
bin_size_h = roi_height / pooled_height
bin_size_w = roi_width / pooled_width
# we divide each RoI subregion into roi_bin_grid_h x roi_bin_grid_w areas
roi_bin_grid_h = sampling_ratio if sampling_ratio > 0 else np.ceil(roi_height / pooled_height)
roi_bin_grid_w = sampling_ratio if sampling_ratio > 0 else np.ceil(roi_width / pooled_width)
precalc = precalc_bilinear(height, width, pooled_height, pooled_width, roi_start_h, roi_start_w, bin_size_h, bin_size_w, roi_bin_grid_h, roi_bin_grid_w)
# We do average (integral) pooling inside a bin
# When the grid is empty, output zeros.
count = max(roi_bin_grid_h * roi_bin_grid_w, 1)
output = np.zeros((pooled_height, pooled_width))
pre_calc_index = 0
for ph in range(int(pooled_height)):
    for pw in range(int(pooled_width)):
        output_val = 0
        for iy in range(int(roi_bin_grid_h)):
            for ix in range(int(roi_bin_grid_w)):
                (pos1, pos2, pos3, pos4, w1, w2, w3, w4) = precalc[pre_calc_index]
                (y_low, x_low) = pos1
                (y_low, x_high) = pos2
                (y_high, x_low) = pos3
                (y_high, x_high) = pos4
                output_val += (w1 * img_feats[y_low, x_low] + w2 * img_feats[y_low, x_high] +
                               w3 * img_feats[y_high, x_low] + w4 * img_feats[y_high, x_high])
                pre_calc_index += 1
        # we do average pooling here
        output[ph, pw] = output_val / count
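As a sanity check, you can compare the output above against torchvision itself. The snippet below is a minimal sketch of such a comparison (it assumes torch and torchvision are installed; roi_align expects an NCHW tensor and boxes prefixed with a batch index):
import torch
from torchvision.ops import roi_align

# feature map as an NCHW tensor: batch of 1, 1 channel, 7x7
feats = torch.from_numpy(img_feats)[None, None]
# boxes are given as (batch_index, x1, y1, x2, y2)
boxes = torch.tensor([[0.0] + roi_proposal])
ref = roi_align(feats, boxes, output_size=(pooled_height, pooled_width),
                spatial_scale=1.0, sampling_ratio=-1, aligned=False)
print(ref[0, 0])  # should match the `output` array computed above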
Additional comments to the code
The code above contains some additional features we have not discussed yet, which I will briefly explain here:
- you can set the aligned variable to either True or False. If True, the box coordinates are shifted by -0.5 pixel for better alignment with the two neighboring pixel indices. This version is used in Detectron2.
- sampling_ratio defines the number of sampling points in each sub-region of a RoI, as illustrated in Figure 6 where 6 sampling points were used. If sampling_ratio = -1, then it is computed adaptively as we saw in the first code snippet:
roi_bin_grid_h = np.ceil(roi_height / pooled_height)
roi_bin_grid_w = np.ceil(roi_width / pooled_width)
Conclusions
In this article we have seen how RoIAlign works and how it is implemented in the torchvision library. RoIAlign can be seen as a layer in a neural network architecture, and as with every layer, you can propagate forward and backward through it, which enables end-to-end training of your models. After reading this article I would encourage you to also read about RoI pooling and why RoIAlign is preferred over it. If you understood RoIAlign, understanding RoI pooling shouldn't be a problem.