MiCT-Net for Human Action Recognition in Videos

How to mix 3D & 2D convolutions using cross domain residual connections

Florent Mahoudeau
Towards Data Science



Recently, a group of researchers from Microsoft published a paper [1] introducing a hybrid 3D/2D convolutional neural network architecture for human action recognition in videos. The paper reports state-of-the-art performance on the UCF-101 and HMDB-51 data sets while reducing model complexity by using half as many 3D convolutions as previous work.

The authors observed that 3D ConvNets exhibit disappointing performance on this task because they are difficult to train and their memory requirements limit their depth. Their architecture, dubbed Mixed Convolutional Tube Network or MiCT-Net, revolves around the idea of combining the efficiency of a 2D-CNN backbone with additional 3D residual convolutions introduced at key locations, to generate deeper and more informative feature maps.

The authors’ source code is not public, and this post is the result of my attempt to reproduce some of their work, which led me to implement MiCT-Net in PyTorch using a ResNet backbone; I named it MiCT-ResNet. My code is available on this repository and free to use in your own projects.

Mixed 3D/2D Convolutional Tube (MiCT)

The MiCT block is built around the observation that 2D and 3D convolutions can complement each other to increase the overall network performance.

On the one hand, 2D ConvNets can learn deep spatial representations but miss the temporal structure of videos that is needed to separate similar classes.

On the other hand, 3D ConvNets are efficient at extracting spatio-temporal features, but the exponential growth of their solution space makes them hard to optimize and therefore hard to stack into deep networks.

Illustration of the MiCT block borrowed from [1] which makes use of a skip connection to combine 3D and 2D convolutions.

The authors’ idea is to combine the best of both worlds by mixing a limited number of 3D convolutional layers with a 2D-CNN backbone. As depicted above, the 3D and 2D convolutions are mixed in two ways:

A 3D convolution is added between the input and output of a 2D convolution. This 3D convolution adds another level of feature learning at the temporal level. The features of the two branches are then merged with a cross-domain element-wise summation. This operation reduces the complexity of the spatio-temporal fusion: the 3D branch learns only residual temporal features, namely the motion of objects and persons in videos, while the spatial features are learned by the 2D convolution.

Then, a 2D convolution is appended after the summation to extract much deeper features during each round of spatio-temporal fusion.

MiCT Block Implementation

The network input is a mini-batch of video clips represented as a 5-dimensional tensor of size NxCxDxHxW, where N denotes the mini-batch size, C the number of channels, D the clip duration, and H and W the height and width in the spatial domain.
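
For concreteness, here is what such a mini-batch looks like in PyTorch. The batch size of 8 is arbitrary; the 16-frame, 160x160 clip size matches the experimental setting described later in this post.

```python
import torch

# A mini-batch of 8 clips, each with 3 RGB channels, 16 frames and 160x160 resolution
clips = torch.randn(8, 3, 16, 160, 160)   # N x C x D x H x W
print(clips.shape)                        # torch.Size([8, 3, 16, 160, 160])
```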

A first issue immediately arises, as 3D and 2D convolutions operate on tensors of different ranks: 3D convolutions require 5D input tensors whereas 2D convolutions require 4D input tensors. As a consequence, the outputs of the 3D and 2D convolutions cannot be directly summed.

Therefore, we need to transform tensors back and forth between 5D and 4D, for example before the skip connection and before the fusion operation. Fortunately, 5D video tensors are nothing more than mini-batches of image sequences, which can be stacked together to form a larger 4D mini-batch of size (NxD)xCxHxW.

The _to_4d_tensor function implements this transformation along with an optional temporal down-sampling specified with the depth_stride parameter.
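
Below is a minimal sketch of what this helper could look like; it is not the exact code from the repository, just an illustration of the reshaping logic.

```python
import torch

def _to_4d_tensor(x, depth_stride=None):
    """Collapse a 5D clip tensor NxCxDxHxW into a 4D tensor (N*D)xCxHxW.
    Optionally down-sample the temporal dimension first with depth_stride."""
    if depth_stride is not None:
        x = x[:, :, ::depth_stride]        # temporal down-sampling: NxCxD'xHxW
    depth = x.size(2)
    x = x.transpose(1, 2)                  # NxCxDxHxW => NxDxCxHxW
    x = x.reshape(-1, *x.shape[2:])        # NxDxCxHxW => (N*D)xCxHxW
    return x, depth
```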

The _to_5d_tensor function performs the inverse transformation taking a 4D input tensor and a depth parameter specifying the sequence length to restore.
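
Again as a sketch, the inverse operation only needs to split the stacked mini-batch back into sequences of the given depth:

```python
def _to_5d_tensor(x, depth):
    """Restore a 4D tensor (N*D)xCxHxW to the 5D clip layout NxCxDxHxW."""
    x = x.reshape(-1, depth, *x.shape[1:])   # (N*D)xCxHxW => NxDxCxHxW
    x = x.transpose(1, 2)                    # NxDxCxHxW => NxCxDxHxW
    return x
```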

We are now ready to dive into the forward pass of the MiCT block with the next code snippet. The first part performs the 3D convolution after padding the input tensor to make the 3D convolution process the current frame and the next few frames. Next, we perform the first 2D convolution and fuse the result with the output of the 3D convolution in 5D space. Finally, we perform the second 2D convolution and return the result as a 5D tensor ready for processing by the next MiCT block.
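
Since the authors’ code is unavailable, the snippet below is my reconstruction of this logic rather than their exact implementation: the kernel sizes and the amount of temporal padding are assumptions, and batch normalization and ReLU are left out for brevity. It builds on the two helper functions above.

```python
import torch.nn as nn
import torch.nn.functional as F

class MiCTBlock(nn.Module):
    """Sketch of a MiCT block: a 3D residual branch fused with a 2D branch
    by cross-domain element-wise summation (BN/ReLU omitted for brevity)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv3d = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                                stride=(1, stride, stride), bias=False)
        self.conv2d_1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                                  stride=stride, padding=1, bias=False)
        self.conv2d_2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                                  padding=1, bias=False)

    def forward(self, x):                        # x: NxCxDxHxW
        # 3D branch: pad the clip so the 3x3x3 kernel sees the current frame
        # and the next two frames, with spatial padding of 1 on each side
        out3d = F.pad(x, (1, 1, 1, 1, 0, 2))
        out3d = self.conv3d(out3d)               # NxC'xDxH'xW'

        # First 2D convolution on the stacked frames
        x4d, depth = _to_4d_tensor(x)            # (N*D)xCxHxW
        out2d = _to_5d_tensor(self.conv2d_1(x4d), depth)

        # Cross-domain fusion by element-wise summation in 5D space
        out = out3d + out2d

        # Second 2D convolution to extract deeper features
        out4d, depth = _to_4d_tensor(out)
        return _to_5d_tensor(self.conv2d_2(out4d), depth)
```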

The MiCT-ResNet Architecture

The paper uses a custom backbone inspired by Inception, made of four MiCT blocks each containing several Inception blocks. I have chosen to use the ResNet backbone instead, to be able to compare results with 3D-ResNets and to benefit from weights pre-trained on ImageNet.

The MiCT-ResNet-18 architecture is essentially a ResNet-18 augmented with five 3D residual convolutions.

As shown above using ResNet-18, the shallowest ResNet backbone, the architecture uses five 3D convolutions: one at the entrance of the network and one at the beginning of each of the four main ResNet blocks. The BasicBlock is the standard ResNet block. The batch normalization and ReLU layers after each convolution are omitted for clarity.
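
To make the layout concrete, here is a heavily simplified, hypothetical outline of the network, assuming the MiCTBlock sketched above and the standard BasicBlock from torchvision. The actual implementation in the repository also applies temporal down-sampling (the depth_stride mentioned earlier), batch normalization and ReLU, which are omitted here.

```python
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class MiCTResNet18(nn.Module):
    """Simplified outline: five 3D convolutions in total, one in the stem MiCT
    block and one at the entrance of each of the four main ResNet stages."""

    def __init__(self, num_classes=101):
        super().__init__()
        self.stem = MiCTBlock(3, 64, stride=2)                                  # 3D conv #1
        self.pool = nn.MaxPool3d((1, 3, 3), (1, 2, 2), (0, 1, 1))
        self.stages = nn.ModuleList([
            nn.ModuleList([MiCTBlock(64, 64), BasicBlock(64, 64)]),             # 3D conv #2
            nn.ModuleList([MiCTBlock(64, 128, stride=2), BasicBlock(128, 128)]),   # #3
            nn.ModuleList([MiCTBlock(128, 256, stride=2), BasicBlock(256, 256)]),  # #4
            nn.ModuleList([MiCTBlock(256, 512, stride=2), BasicBlock(512, 512)]),  # #5
        ])
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                          # x: NxCxDxHxW
        out = self.pool(self.stem(x))
        for mict, basic in self.stages:
            out = mict(out)                        # mixed 3D/2D block (5D in, 5D out)
            out4d, depth = _to_4d_tensor(out)      # the standard 2D BasicBlock
            out = _to_5d_tensor(basic(out4d), depth)   # runs on the stacked frames
        out = out.mean(dim=(2, 3, 4))              # global spatio-temporal pooling
        return self.fc(out)
```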

If you are not familiar with the ResNet implementation in PyTorch, this post provides a step-by-step walk-through to get you up to speed.

UCF-101 Data Set

UCF-101 [3] is a famous action recognition data set of realistic action videos collected from YouTube, comprising 101 action categories. All videos are 320x240 pixels at 25 frames per second.

One thing to note is that despite its 13,320 videos and 100+ clips per category, it is a relatively small data set for the task and prone to over-fitting. All the clips are taken from only about 2,500 distinct videos; for example, one video of the same person playing the piano is cut into 7 clips. This means there is far less variation than if the action in each clip were performed by a different person under different lighting conditions.

Example frames borrowed from [3] for 15 of the human actions.

Experimental Setting

The goal of the experiment is to compare the performance of MiCT-ResNet and 3D-ResNet in a context with limited training data. Pre-training on large scale datasets such as Kinetics is out of scope.

To facilitate the comparison of results, both networks are based on the ResNet-18 backbone and have a temporal stride of 16 and a spatial stride of 32. Weights are initialized from ImageNet pre-trained weights. For 3D-ResNet, the 3D filters are bootstrapped by repeating the weights of the 2D filters N times along the temporal dimension and re-scaling them by dividing by N.
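
As an illustration of this bootstrapping step (a sketch, not the repository’s exact code), a 2D filter bank can be inflated as follows:

```python
import torch

def inflate_2d_weights(w2d, depth):
    """Inflate a 2D filter bank of shape (C_out, C_in, kH, kW) into a 3D filter
    bank of shape (C_out, C_in, depth, kH, kW) by repeating it `depth` times
    along the temporal dimension and dividing by `depth`, so that the inflated
    filter initially responds to a static clip like the original 2D filter."""
    return w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
```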

To support training with a large number of video clips per batch, the models’ input size is set to 160x160. Each video is randomly down-sampled along the temporal dimension, and a set of 16 consecutive frames is randomly chosen. The sequence is looped as necessary to obtain 16-frame clips. At test time, the first 16 frames of the video are selected.
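
A possible implementation of this sampling scheme is sketched below; the maximum temporal down-sampling factor of 4 is my assumption, as the post does not specify it.

```python
import random

def sample_train_clip(frames, clip_len=16, max_stride=4):
    """Randomly down-sample a list of decoded frames along time, loop it if it
    is too short, and return a random window of `clip_len` consecutive frames."""
    stride = random.randint(1, max_stride)
    frames = frames[::stride]
    while len(frames) < clip_len:
        frames = frames + frames          # loop the sequence as necessary
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len]
```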

The SGD optimizer is used with a learning rate of 1e-2 and a batch size of 128. Weight decay, dropout and data augmentation are applied to reduce over-fitting. More details about the training procedure can be found in the repository.
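
For reference, a minimal optimizer setup matching these hyper-parameters could look like the following; the momentum and weight-decay values are assumptions, since the post only states that weight decay is used.

```python
import torch

model = MiCTResNet18(num_classes=101)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)  # momentum and decay assumed
```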

Results

The models are evaluated against the standard Top-1 and Top-5 accuracies. All results are averaged across the three standard splits of UCF-101. MiCT-ResNet-18 leads by 1.5 points while being 3.1 times faster, which confirms the validity of the authors’ approach.

Performance comparison. The memory size is given for a batch size of one.

In the second experiment, the temporal stride of MiCT-ResNet is reduced from 16 to 4, and the network is tested on a varying number of clip lengths. The best results, a Top-1 accuracy of 69.3% for MiCT-ResNet-18 and 72.8% for MiCT-ResNet-34 (both cross-validated), are achieved for sequences of 300 frames.

MiCT-ResNet-18 and MiCT-ResNet-34 validation accuracies as a function of video length.

To Conclude

We saw that mixing 3D and 2D convolutions is a good strategy for improving performance compared to deep 3D convolutional networks on the UCF-101 data set. MiCT-ResNet delivers higher accuracy and much faster inference.

Yet it is not possible to draw a definitive conclusion on the relative performance of these two architectures. In parallel to Microsoft’s work on MiCT-Net, another team from Google DeepMind has shown [2] that pre-training 3D ConvNets on very large video data sets like Kinetics considerably increases their performance on transfer-learning tasks like UCF-101.

Therefore, it remains an open question how these two architectures would compare if both were pre-trained on ImageNet and Kinetics. Let me know if you have access to the Kinetics data set and are willing to provide the answer!

References

[1] Y. Zhou, X. Sun, Z-J. Zha and W. Zeng. MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition, June 2018.

[2] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, February 2018.

[3] K. Soomro, A. Roshan Zamir and M. Shah. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild, CRCV-TR-12-01, November 2012.
