Abstract
Convolutional Neural Networks (CNNs) are expensive in terms of compute, but even more so in terms of working RAM, primarily because they produce large intermediate activation maps. This is a major barrier to deployment on low-resource devices. Broadly, that leaves us with two options for shrinking the activation maps:
- Reduce the number of channels, which in general leads to poor accuracy, as observed empirically.
- Reduce the number of rows/columns via pooling or strided convolutions, which raises the question: how aggressively can we down-sample without hurting accuracy?
Standard pooling operations, when used to rapidly reduce the number of rows/columns, lead to a heavy drop in accuracy. RNNPool is a new pooling operator that enables a significant reduction in rows/columns without a significant loss in accuracy. Our work was presented in a NeurIPS ’20 paper and a NeurIPS spotlight talk. This Microsoft Research blog post provides a high-level overview of the new operator and its implications.
This post walks through how RNNPool can be used to reduce the memory requirement of CNNs. We illustrate its versatility by going to the extreme, i.e., by enabling face detection on a tiny microcontroller. Another critical aspect of deploying solutions in resource-constrained settings is quantizing the model (as explained lucidly here by Joel Nicholls) so that it can be executed on microcontrollers without floating-point operations, exploiting the speed and memory advantages of 8-bit and 16-bit integer computations. This post also illustrates the capabilities of the SeeDot (now Shiftry) compiler from Microsoft Research India, which can be used to quantize the RNNPool-based PyTorch models and generate C code with optimal memory usage.
Finally, this post shows how you can put RNNPool and SeeDot together to train and deploy a face detection model on an ST microcontroller housing an ARM Cortex-M3/M4-class processor (or above) with less than 256 KB of RAM.
The RNNPool Operator: What and Why?
Pooling operators are often used to down-sample activation maps; average pooling and max pooling are the typical choices. However, they perform coarse, hand-crafted aggregation over their receptive field and hence lose accuracy when used to down-sample aggressively.

RNNs are interesting here because they require only a small amount of memory, though they can be more expensive than convolutions in terms of compute. However, by bringing down the activation map size substantially, RNN-based pooling can reduce the overall computation of several standard vision architectures by 2–4x.
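Concretely, RNNPool summarizes each activation patch with two learned passes: a first RNN sweeps every row and every column of the patch, and a second bidirectional RNN summarizes those row and column sweeps, with the four final hidden states concatenated into the pooled output. The sketch below captures this structure in PyTorch; standard GRU cells stand in for the FastGRNN cells used in the paper, and the hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class RNNPoolSketch(nn.Module):
    """Minimal sketch of the RNNPool operator. GRUs stand in for the paper's
    FastGRNN cells; hidden sizes h1 and h2 are illustrative."""
    def __init__(self, in_channels, h1=8, h2=8):
        super().__init__()
        self.rnn1 = nn.GRU(in_channels, h1, batch_first=True)             # row/column sweeps
        self.rnn2 = nn.GRU(h1, h2, batch_first=True, bidirectional=True)  # summary sweep

    def forward(self, patch):                                   # patch: (B, C, r, c)
        B, C, r, c = patch.shape
        rows = patch.permute(0, 2, 3, 1).reshape(B * r, c, C)   # each row as a sequence
        cols = patch.permute(0, 3, 2, 1).reshape(B * c, r, C)   # each column as a sequence
        _, row_h = self.rnn1(rows)                              # final states: (1, B*r, h1)
        _, col_h = self.rnn1(cols)                              # final states: (1, B*c, h1)
        _, hr = self.rnn2(row_h.reshape(B, r, -1))              # bidirectional pass over row summaries
        _, hc = self.rnn2(col_h.reshape(B, c, -1))              # bidirectional pass over column summaries
        return torch.cat([hr[0], hr[1], hc[0], hc[1]], dim=1)   # pooled vector of size 4*h2

# Pool a 4-channel 8x8 patch down to a single 32-dimensional vector.
pooled = RNNPoolSketch(in_channels=4)(torch.randn(1, 4, 8, 8))  # shape: (1, 32)
```

In a full RNNPoolLayer, this cell is slid over the activation map (e.g., 8x8 patches with stride 4), so one operator can replace a stack of convolution and pooling blocks while only ever holding a single small patch in working memory.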
╔═══════════════╦═════════════════════════╦═════════════════════════╗
║ Model ║ Standard Models ║ RNNPool-based Models ║
╠═══════════════╬═════╦══════╦══════╦═════╬═════╦══════╦══════╦═════╣
║ ║Acc. ║Params║ RAM ║MAdds║Acc. ║Params║ RAM ║MAdds║
║ ║ ║ ║ ║ ║ ║ ║ ║ ║
║MobileNetV2 ║94.20║2.20M ║2.29MB║0.30G║94.40║2.00M ║0.24MB║0.23G║
║EfficientNet-B0║96.00║4.03M ║2.29MB║0.39G║96.40║3.90M ║0.25MB║0.33G║
║ResNet18 ║94.80║11.20M║3.06MB║1.80G║94.40║10.60M║0.38MB║0.95G║
║DenseNet121 ║95.40║6.96M ║3.06MB║2.83G║94.80║5.60M ║0.77MB║1.04G║
║GoogLeNet ║96.00║9.96M ║3.06MB║1.57G║95.60║9.35M ║0.78MB║0.81G║
╚═══════════════╩═════╩══════╩══════╩═════╩═════╩══════╩══════╩═════╝
Comparison of inference complexity and accuracy, with and without the RNNPool layer, on the ImageNet-10 dataset.
Because RNNPool is a finer-grained, learned pooling operator, it can also lead to higher accuracy.
╔═════════════════════════════════╦══════════╦════════╦════════╗
║ Method ║ Accuracy ║ MAdds ║ Params ║
╠═════════════════════════════════╬══════════╬════════╬════════╣
║ Base Network ║ 94.20 ║ 0.300G ║ 2.2M ║
║ Last Layer RNNPool ║ 95.00 ║ 0.334G ║ 2.9M ║
║ Average Pooling ║ 90.80 ║ 0.200G ║ 2.0M ║
║ Max Pooling ║ 92.80 ║ 0.200G ║ 2.0M ║
║ Strided Convolution ║ 93.00 ║ 0.258G ║ 2.1M ║
║ ReNet ║ 92.20 ║ 0.296G ║ 2.3M ║
║ RNNPoolLayer ║ 94.40 ║ 0.226G ║ 2.0M ║
║ RNNPoolLayer+Last Layer RNNPool ║ 95.60 ║ 0.260G ║ 2.7M ║
╚═════════════════════════════════╩══════════╩════════╩════════╝
Impact of various down-sampling and pooling operators on the accuracy, inference complexity, and model size of MobileNetV2 on the ImageNet-10 dataset.
What Can We Do With The RNNPool Operator?
We will see how to use RNNPool to build a compact face detection architecture that is suitable for deployment on microcontrollers.
We will start with the popular S3FD architecture. S3FD uses a CNN-based backbone network, VGG16, for feature extraction. We modify this backbone using RNNPool and MBConv blocks for efficiency. Detection layers are attached at various depths in this network to detect faces with bounding boxes at different scales. Each detection layer uses a single convolution to predict 6 output values per spatial location: 4 bounding-box coordinates and 2 class confidence scores.
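As a concrete illustration, a single detection layer can be as small as the sketch below (layer sizes here are illustrative, not the exact heads used in the repository):

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of one S3FD-style detection layer: a single 3x3 convolution
    emitting 6 values per spatial location (4 box offsets + 2 class scores)."""
    def __init__(self, in_channels):
        super().__init__()
        self.predict = nn.Conv2d(in_channels, 4 + 2, kernel_size=3, padding=1)

    def forward(self, feature_map):
        out = self.predict(feature_map)          # (B, 6, H, W)
        loc, conf = out[:, :4], out[:, 4:]       # box offsets and face/background scores
        return loc, conf
```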
╔═══════════╦══════════════╦═══════════╦═════════════╦════════╗
║ Input ║ Operator ║ Expansion ║ Out Channel ║ Stride ║
╠═══════════╬══════════════╬═══════════╬═════════════╬════════╣
║ 320x240x1 ║ Conv2D 3x3 ║ 1 ║ 4 ║ 2 ║
║ 160x120x4 ║ RNNPoolLayer ║ 1 ║ 64 ║ 4 ║
║ 40x30x64 ║ Bottleneck ║ 2 ║ 32 ║ 1 ║
║ 40x30x32 ║ Bottleneck ║ 2 ║ 32 ║ 1 ║
║ 40x30x32 ║ Bottleneck ║ 2 ║ 64 ║ 2 ║
║ 20x15x64 ║ Bottleneck ║ 2 ║ 64 ║ 1 ║
╚═══════════╩══════════════╩═══════════╩═════════════╩════════╝
The initial layers of the modified backbone: RNNPool is used right at the beginning, after a single strided convolution.
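A rough PyTorch rendering of the first two rows of this table is shown below, reusing the RNNPoolSketch cell sketched earlier. The 8x8 patch, stride 4, padding, and h2 = 16 (so that 4 x 16 = 64 output channels) are assumptions chosen only to reproduce the shapes in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNPoolStem(nn.Module):
    """Sketch of the backbone stem above: a strided 3x3 convolution followed by
    an RNNPool layer taking the 160x120x4 map down to 40x30x64. Patch size,
    stride, padding, and hidden sizes are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1)
        self.pool_cell = RNNPoolSketch(in_channels=4, h1=16, h2=16)   # 4*h2 = 64 channels
        self.patch, self.stride = 8, 4

    def forward(self, x):                        # x: (B, 1, 240, 320) monochrome QVGA
        x = F.relu(self.conv(x))                 # -> (B, 4, 120, 160)
        x = F.pad(x, (2, 2, 2, 2))               # pad so the patch grid tiles the map exactly
        patches = x.unfold(2, self.patch, self.stride).unfold(3, self.patch, self.stride)
        B, C, nh, nw, ph, pw = patches.shape     # nh = 30, nw = 40
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B * nh * nw, C, ph, pw)
        pooled = self.pool_cell(patches)         # (B*nh*nw, 64)
        return pooled.reshape(B, nh, nw, -1).permute(0, 3, 1, 2)   # (B, 64, 30, 40)

features = RNNPoolStem()(torch.randn(1, 1, 240, 320))   # shape: (1, 64, 30, 40)
```

The MBConv bottleneck blocks from the remaining rows of the table are then stacked on top of this 40x30x64 map, and the detection layers hang off various depths of that stack.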
EdgeML + SeeDot: Model Training, Quantization and Code Generation
Here, we present an end-to-end pipeline for training our designed models (written in PyTorch) on the WIDER FACE dataset, fine-tuning them on the SCUT-Head Part B dataset, quantizing them using the SeeDot compiler, and generating C code, which can be customized for deployment on a choice of low-end microcontrollers. We chose these datasets specifically to train the models for a conference-room head-counting scenario.
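To give a feel for what quantization buys us, the toy example below shows the kind of 16-bit fixed-point arithmetic that the generated C code ultimately relies on. The scales here are made up for illustration; SeeDot chooses per-variable scales automatically from the data.

```python
import numpy as np

def to_fixed(x, frac_bits, dtype=np.int16):
    """Quantize a float to fixed point with the given number of fractional bits."""
    info = np.iinfo(dtype)
    return np.clip(np.round(x * (1 << frac_bits)), info.min, info.max).astype(dtype)

w, a = np.float32(0.7312), np.float32(-1.25)   # a weight and an activation
wq, aq = to_fixed(w, 14), to_fixed(a, 12)      # Q1.14 and Q3.12 representations
prod = np.int32(wq) * np.int32(aq)             # exact integer multiply at scale 2^-26
print(prod / float(1 << 26), w * a)            # both print roughly -0.914
```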
We showcase two sample models here, both of which can be deployed directly on M4-class microcontrollers:
- RNNPool_Face_QVGA_Monochrome: A model with 14 MBConv layers that provides a high mAP score (a measure of the overlap between predicted and ground-truth bounding boxes) on the validation dataset. This model is ideal for situations where accuracy is of the utmost importance.
- RNNPool_Face_M4: A smaller model with lower prediction latency that has 4 MBConv layers and sparse weights in the RNNPool layer, and provides a reasonable mAP score. This model is ideal for situations where latency is of the utmost importance.
╔═════════════════════╦═════════════════╦═════════╗
║ Metrics ║ QVGA MONOCHROME ║ FACE M4 ║
╠═════════════════════╬═════════════════╬═════════╣
║ Flash Size (KB) ║ 450 ║ 160 ║
║ Peak RAM Usage (KB) ║ 185 ║ 185 ║
║ Accuracy (mAP) ║ 0.61 ║ 0.58 ║
║ Compute (MACs) ║ 228M ║ 110M ║
║ Latency (seconds) ║ 20.3 ║ 10.5 ║
╚═════════════════════╩═════════════════╩═════════╝
The Jupyter notebook can be configured in the first cell to switch between the two models and other training environment configurations. Each command is accompanied by explanatory comments detailing its purpose. For a more detailed explanation, one can also refer to the README files.
As of the time of publication of this blog, we’re also adding support for automated PyTorch-to-TFLite converters to allow interested users to build and train custom models using RNNPool + SeeDot, suitable for deployment on low-resource devices.
All of the requisite code for putting this end-to-end pipeline together is available at the EdgeML repository maintained by Microsoft Research India. Additionally, the Jupyter notebook can be accessed here.
After setting the configuration in the first cell and running the entire notebook, the generated code will be dumped inside the EdgeML/tools/SeeDot/m3dump/ folder. Thus, we can now deploy this code on a microcontroller.
Setting Up The Deployment Environment
We deployed our models on a NUCLEO-F439ZI board and ran benchmarks on our code in the same environment. Alternatively, one can pick any other microcontroller board that offers at least 192 KB of RAM and 450 KB of flash (for the RNNPool_Face_QVGA_Monochrome model) or 170 KB of flash (for the RNNPool_Face_M4 model).
For a detailed look at setting up development environments for ARM Cortex-M class devices, we'd recommend referring to Carmine Noviello's fantastic book as well as his engineering blog on the subject.
- Install Java (version 8 or better).
- Install Eclipse IDE for C/C++ (version 2020–09 tested on both Linux and Windows).
- Install the CDT and GNU ARM Embedded CDT plugins for Eclipse (go to Help -> Install New Software... and see the images below).

Click on the CDT link available in the drop-down menu and choose the CDT Main Features option. Click Next > and follow the remaining installation instructions in the window.
Additionally, use Add... to register the ARM Embedded CDT plugin link (provided above) as a software installation source, as shown below, and install the Embedded CDT tools as well.



- Install libusb-1.0 (for Linux-based systems only, using the command):
$ sudo apt-get install libusb-1.0
- Install the GNU Arm Embedded Toolchain, which includes the GCC-based cross-compiler for Cortex-M devices supplied directly by ARM (version 2020-q2 tested on both Linux and Windows).
- Install OpenOCD software for debugging the board of choice (version 0.10.0 tested on both Linux and Windows).
- Optional: Install ST-LINK Firmware Upgrade Software and upgrade the board firmware (only if the board is being used for the first time).
- Optional: Install STM32CubeMX Software for generating the initialization code for your microcontroller, according to specific requirements.
Now, one can deploy code in either of the following two ways:
- Set up a new project from scratch by generating initialization code using STM32CubeMX (depending on the specific functionality required), and add the generated .c and .h files to the src/ and include/ folders respectively.
- Use one of our sample deployment snippets (available [here](https://github.com/microsoft/EdgeML/blob/SeeDOT-Face-Detection/examples/pytorch/vision/Face_Detection/sample%20codes/M4%20Initialization%20Codes.zip)), where we generated minimal initialization code using STM32CubeMX (available here) and added our SeeDot-generated code on top of this base sample. These sample snippets have been optimized further for speed and can provide around 15% faster execution times.
Here, we focus on the latter option only. However, we encourage interested readers to try out the former option as well.
To deploy the code, simply copy the entire folder into your Eclipse workspace directory, as shown below.





Set up the path to the downloaded compiler toolchain using the Eclipse GUI and build the project binaries with it as follows.




As a final step, we need to set up a debugging configuration. We achieve this through OpenOCD, which allows us to control step-by-step debugging from the host PC itself.




Now we're set to run our compiled binary on the microcontroller of choice. Simply connect the board to the PC and click the Debug button. A console running OpenOCD should fire up, and the target should halt immediately.

Final Deployment
Taking a scaled image as input (stored in the main.h file as the input_image variable), we execute the entire pipeline on our model of choice and set the output directory (specified in the run_test() function of the main.c file) as desired. Arbitrary images can be scaled using the scale_image.py script. Simply run
$ IS_QVGA_MONO=1 python scale_image.py --image_path <path_to_image> --scale <scaleForX value in m3dump/scales.h file> > output.txt
then copy the entire text from the output.txt file and use it to replace the input_image variable in main.h.
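The snippet below is a hypothetical sketch of what this scaling step amounts to (loading the image, converting it to 320x240 monochrome, and mapping pixels to fixed-point integers). The actual scale_image.py in the EdgeML repository is the authoritative version, and its exact preprocessing and scale convention may differ.

```python
# Hypothetical sketch only -- use scale_image.py from the EdgeML repo in practice.
import numpy as np
from PIL import Image

def image_to_c_array(image_path, scale):
    """Turn an image into the comma-separated fixed-point values expected in main.h.
    Whether `scale` acts as a plain multiplier or a power-of-two shift depends on
    the SeeDot-generated scales.h, so treat this as illustrative."""
    img = Image.open(image_path).convert("L").resize((320, 240))
    pixels = np.asarray(img, dtype=np.float32) / 255.0        # normalize to [0, 1]
    fixed = np.round(pixels * scale).astype(np.int16)         # map to fixed point
    return ", ".join(str(v) for v in fixed.flatten())

print(image_to_c_array("test_image.jpg", scale=4096))         # illustrative scale value
```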
The model generates bounding-box locations (loc) and confidence scores (conf) for each region of the image that it considers to be a face.



Finally, we need to superimpose the obtained outputs over our original image to get the final desired output. For that, one can simply make use of the eval_fromquant.py script as:
$ IS_QVGA_MONO=1 python eval_fromquant.py --save_dir <output image directory> --thresh <threshold value for bounding boxes> --image_dir <input image directory containing original test image> --trace_file <path to output file generated by microcontroller> --scale <scaleForY value in m3dump/scales.h file>
which gives us our desired output image, a copy of which is stored in the save_dir folder.
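For reference, the superimposition step boils down to something like the sketch below; eval_fromquant.py in the repository is the authoritative implementation and, judging by its --scale flag, also converts the microcontroller's fixed-point outputs back to real-valued coordinates.

```python
# Illustrative sketch of drawing predicted boxes; eval_fromquant.py is the real script.
from PIL import Image, ImageDraw

def draw_detections(image_path, boxes, scores, thresh=0.5, out_path="out.jpg"):
    """boxes: iterable of (x1, y1, x2, y2) pixel coordinates; scores: matching confidences."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score >= thresh:                                   # keep confident detections only
            draw.rectangle([x1, y1, x2, y2], outline=(255, 0, 0), width=2)
    img.save(out_path)

draw_detections("test_image.jpg", boxes=[(60, 40, 120, 110)], scores=[0.9])
```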



Conclusion
Thus, we have a functional face detection pipeline in C, which is executable on a tiny microcontroller. We can input an image to the microcontroller, execute the code to generate these conf and loc outputs, and send these outputs back to the host PC to obtain the final image with the bounding boxes superimposed as shown.
One can further extend this idea by integrating a camera module with the microcontroller for on-the-fly image capture and processing. A detailed exposition of that work is unfortunately beyond the scope of this post. We experimented with an OmniVision OV7670 camera module for this purpose, on top of the RNNPool_Face_M4 model. The code for the same can be checked out here (titled RNNPool_Face_M4 Camera).


This blog post was a joint effort by a number of amazing researchers, research fellows, and interns at Microsoft Research India. Credits go to Oindrila Saha, Aayan Kumar, Harsha Vardhan Simhadri, Prateek Jain, and Rahul Sharma for contributing to the different sections of this write-up.