
Enabling Accurate Computer Vision on Tiny Microcontrollers with RNNPool Operator and SeeDot Compiler

A tutorial on using the Microsoft EdgeML repository to train, quantize and deploy accurate face detection models on ARM Cortex-M class devices.

Shikhar Jaiswal
Towards Data Science
11 min read · Jan 8, 2021


Abstract

Convolutional Neural Networks (CNNs) are expensive in terms of compute, but even more so in terms of working RAM, primarily because they produce large intermediate activation maps. This is a major barrier to deployment on low-resource devices, which leaves us with broadly two options:

  1. Reduce the number of channels, which leads to poor performance in general, as observed empirically.
  2. Reduce the number of rows/columns, which raises the question: how should we reduce rows/columns, via pooling or strided convolutions?

Standard pooling operations, when applied to rapidly reduce the number of rows/columns, lead to a heavy drop in accuracy. RNNPool is a new pooling operator that enables a sharp reduction in rows/columns without any significant loss in accuracy. Our work has been presented in a NeurIPS ’20 paper and a NeurIPS spotlight talk. This Microsoft Research blogpost provides a high-level overview of the new operator and its implications.
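To see why these intermediate activation maps, rather than the weights, dominate the working RAM, a quick back-of-the-envelope calculation helps (the feature-map size below is merely illustrative):

# Rough illustration: holding even one intermediate activation map of a
# typical CNN already exceeds the SRAM budget of most microcontrollers.
height, width, channels = 112, 112, 64   # an early feature map (illustrative)
bytes_per_value = 4                      # float32 activations

activation_kb = height * width * channels * bytes_per_value / 1024
print(f"{activation_kb:.0f} KB")         # ~3136 KB, versus ~256 KB of SRAM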

This post walks through how RNNPool can be used to reduce CNNs’ memory requirement. We illustrate its versatility by going to the extreme, i.e., by enabling face detection on a tiny microcontroller. Another critical aspect of deploying solutions in resource-constrained settings is quantizing the model (as lucidly explained here by Joel Nicholls) so that it can be executed on microcontrollers without floating-point operations, exploiting the speed and memory advantages of 8-bit and 16-bit integer computations. This post also illustrates the capabilities of the SeeDot (now Shiftry) compiler from Microsoft Research India, which can be used to quantize RNNPool-based PyTorch models and generate C code with optimal memory usage.
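As a rough illustration of the quantization idea (this is a generic symmetric 8-bit scheme, not SeeDot’s actual fixed-point analysis), a float tensor can be mapped to integers with a single scale factor:

import numpy as np

# Minimal symmetric 8-bit quantization sketch; SeeDot performs a far more
# careful per-variable fixed-point analysis when generating C code.
def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                             # one scale per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print(np.abs(weights - dequantize(q, scale)).max())             # small reconstruction error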

Finally, this post shows how you can put RNNPool and SeeDot together to train and deploy a face detection model on an ST microcontroller housing an ARM Cortex-M3/M4 class microprocessor (or above) with less than 256 KB of RAM.

The RNNPool Operator: What and Why?

Pooling operators are often used to down-sample activation maps; Average Pooling and Max Pooling are the typical choices. They perform coarse, “gross” aggregations over the receptive field, and hence are not very accurate.

RNNPool uses 4 passes of RNNs to aggregate large patches of activation maps. (Photo by author)

RNNs are interesting here because they require only a small amount of memory, though they can be more expensive than convolutions in terms of compute. However, by bringing down the activation map size substantially, RNNs can reduce the overall computation of several standard vision architectures by 2–4x.
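To make the structure of the operator concrete, here is a simplified PyTorch sketch of RNNPool. The actual EdgeML implementation uses FastGRNN cells and a more careful formulation; plain GRUs are used here purely for illustration:

import torch
import torch.nn as nn

class RNNPoolSketch(nn.Module):
    """Simplified sketch of RNNPool: one RNN summarizes each row and each
    column of a patch, and a bidirectional RNN summarizes those summaries."""
    def __init__(self, in_channels, h1=8, h2=8):
        super().__init__()
        self.rnn1 = nn.GRU(in_channels, h1, batch_first=True)
        self.rnn2 = nn.GRU(h1, h2, batch_first=True, bidirectional=True)

    def forward(self, patch):
        # patch: (B, C, r, c), one receptive-field patch of the activation map
        B, C, r, c = patch.shape
        rows = patch.permute(0, 2, 3, 1).reshape(B * r, c, C)   # each row as a sequence
        cols = patch.permute(0, 3, 2, 1).reshape(B * c, r, C)   # each column as a sequence
        _, row_sum = self.rnn1(rows)                            # (1, B*r, h1)
        _, col_sum = self.rnn1(cols)                            # (1, B*c, h1)
        # Second pass: bidirectional RNN over the row and column summaries.
        _, hr = self.rnn2(row_sum.reshape(B, r, -1))            # (2, B, h2)
        _, hc = self.rnn2(col_sum.reshape(B, c, -1))            # (2, B, h2)
        return torch.cat([hr[0], hr[1], hc[0], hc[1]], dim=1)   # (B, 4*h2)

pool = RNNPoolSketch(in_channels=4, h1=8, h2=8)
out = pool(torch.randn(1, 4, 8, 8))   # an 8x8 patch pooled to a vector of size 4*h2 = 32

Applied in a strided, sliding-window fashion, each such patch collapses into a single 4*h2-channel output pixel, which is how RNNPool cuts the number of rows and columns so aggressively in a single layer.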

╔═══════════════╦═════════════════════════╦═════════════════════════╗
║     Model     ║     Standard Models     ║  RNNPool-based Models   ║
╠═══════════════╬═════╦══════╦══════╦═════╬═════╦══════╦══════╦═════╣
║               ║Acc. ║Params║ RAM  ║MAdds║Acc. ║Params║ RAM  ║MAdds║
║MobileNetV2    ║94.20║2.20M ║2.29MB║0.30G║94.40║2.00M ║0.24MB║0.23G║
║EfficientNet-B0║96.00║4.03M ║2.29MB║0.39G║96.40║3.90M ║0.25MB║0.33G║
║ResNet18       ║94.80║11.20M║3.06MB║1.80G║94.40║10.60M║0.38MB║0.95G║
║DenseNet121    ║95.40║6.96M ║3.06MB║2.83G║94.80║5.60M ║0.77MB║1.04G║
║GoogLeNet      ║96.00║9.96M ║3.06MB║1.57G║95.60║9.35M ║0.78MB║0.81G║
╚═══════════════╩═════╩══════╩══════╩═════╩═════╩══════╩══════╩═════╝

Comparison of inference complexity and accuracy with and without the RNNPool layer on the ImageNet-10 dataset.

As RNNPool is a more “fine-grained” and better learnt pooling operator, it can lead to higher accuracy as well.

╔═════════════════════════════════╦══════════╦════════╦════════╗
║             Method              ║ Accuracy ║ MAdds  ║ Params ║
╠═════════════════════════════════╬══════════╬════════╬════════╣
║ Base Network                    ║  94.20   ║ 0.300G ║  2.2M  ║
║ Last Layer RNNPool              ║  95.00   ║ 0.334G ║  2.9M  ║
║ Average Pooling                 ║  90.80   ║ 0.200G ║  2.0M  ║
║ Max Pooling                     ║  92.80   ║ 0.200G ║  2.0M  ║
║ Strided Convolution             ║  93.00   ║ 0.258G ║  2.1M  ║
║ ReNet                           ║  92.20   ║ 0.296G ║  2.3M  ║
║ RNNPoolLayer                    ║  94.40   ║ 0.226G ║  2.0M  ║
║ RNNPoolLayer+Last Layer RNNPool ║  95.60   ║ 0.260G ║  2.7M  ║
╚═════════════════════════════════╩══════════╩════════╩════════╝

Impact of various down-sampling and pooling operators on the accuracy, inference complexity and model size of MobileNetV2 on the ImageNet-10 dataset.

What Can We Do With The RNNPool Operator?

We will see how to use RNNPool to build a compact face detection architecture that is suitable for deployment on microcontrollers.

We will start with the popular S3FD architecture. S3FD uses a CNN-based backbone network, VGG16, for feature extraction. We modify this backbone using RNNPool and MBConv blocks for efficiency. Detection layers are attached at various depths in this network to detect faces with bounding boxes at different scales. Each detection layer uses a single convolution layer to predict 6 output values per spatial location, 4 of which are bounding-box coordinates and 2 of which are class scores (a sketch of one such detection layer follows the architecture table below).

╔═══════════╦══════════════╦═══════════╦═════════════╦════════╗
║   Input   ║   Operator   ║ Expansion ║ Out Channel ║ Stride ║
╠═══════════╬══════════════╬═══════════╬═════════════╬════════╣
║ 320x240x1 ║ Conv2D 3x3   ║     1     ║      4      ║   2    ║
║ 160x120x4 ║ RNNPoolLayer ║     1     ║     64      ║   4    ║
║ 40x30x64  ║ Bottleneck   ║     2     ║     32      ║   1    ║
║ 40x30x32  ║ Bottleneck   ║     2     ║     32      ║   1    ║
║ 40x30x32  ║ Bottleneck   ║     2     ║     64      ║   2    ║
║ 20x15x64  ║ Bottleneck   ║     2     ║     64      ║   1    ║
╚═══════════╩══════════════╩═══════════╩═════════════╩════════╝

Using RNNPool at the beginning of the network, right after an initial convolution layer.
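For context, each detection layer mentioned above reduces to a single convolution that predicts six values per spatial location. The sketch below is only illustrative (the channel count and class names are ours, not the repository’s):

import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative S3FD-style detection layer: a 3x3 convolution predicting
    4 bounding-box offsets and 2 class scores per spatial location."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.pred = nn.Conv2d(in_channels, 4 + 2, kernel_size=3, padding=1)

    def forward(self, feature_map):
        out = self.pred(feature_map)           # (B, 6, H, W)
        loc, conf = out[:, :4], out[:, 4:]     # box offsets and class logits
        return loc, conf

head = DetectionHead(64)
loc, conf = head(torch.randn(1, 64, 20, 15))   # loc: (1, 4, 20, 15), conf: (1, 2, 20, 15)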

EdgeML + SeeDot: Model Training, Quantization and Code Generation

Here, we present an end-to-end pipeline for training our designed models (written in PyTorch) on the WIDER-FACE dataset, fine-tuning them on the SCUT-Head Part B dataset, quantizing them using the SeeDot compiler, and generating the C code, which can be customized for deployment on a choice of low-end microcontrollers. We chose these datasets specifically to train the models for a conference-room head-counting scenario.

We showcase two sample models here, both of which can be deployed on M4-class microcontrollers directly:

  1. RNNPool_Face_QVGA_Monochrome: A model with 14 MBConv layers that provides a high mAP score (overlap between predicted and actual bounding boxes) on the validation dataset. This model is ideal for situations where accuracy is of the utmost importance.
  2. RNNPool_Face_M4: A smaller model with lower prediction latency that has 4 MBConv layers and sparse weights for the RNNPool layer, providing a reasonable mAP score. This model is ideal for situations where latency is of the utmost importance.

╔═════════════════════╦═════════════════╦═════════╗
║       Metrics       ║ QVGA MONOCHROME ║ FACE M4 ║
╠═════════════════════╬═════════════════╬═════════╣
║ Flash Size (KB)     ║       450       ║   160   ║
║ Peak RAM Usage (KB) ║       185       ║   185   ║
║ Accuracy (mAP)      ║      0.61       ║  0.58   ║
║ Compute (MACs)      ║      228M       ║  110M   ║
║ Latency (seconds)   ║      20.3       ║  10.5   ║
╚═════════════════════╩═════════════════╩═════════╝

The Jupyter notebook can be configured in the first cell to switch between the two models and other training environment configurations. Each command is accompanied by explanatory comments detailing its purpose. For a more detailed explanation, one can also refer to the README files.

Jupyter Notebook for End-to-End Training, Quantization and Code Generation.

As of the time of publication of this blog, we’re also adding support for automated PyTorch-to-TFLite converters to allow interested users to build and train custom models using RNNPool + SeeDot, suitable for deployment on low-resource devices.

All of the requisite code for putting this end-to-end pipeline together is available at the EdgeML repository maintained by Microsoft Research India. Additionally, the Jupyter notebook can be accessed here.

After setting the configuration in the first cell and running the entire notebook, the generated code will be dumped inside the EdgeML/tools/SeeDot/m3dump/ folder. Thus, we can now deploy this code on a microcontroller.

Setting Up The Deployment Environment

We deployed our models on a NUCLEO-F439ZI board and ran benchmarks on our code in the same environment. Alternatively, one can pick any other microcontroller board that offers at least 192 KB of RAM and either 450 KB of flash (for the RNNPool_Face_QVGA_Monochrome model) or 170 KB of flash (for the RNNPool_Face_M4 model).

For a detailed look at setting up development environments for ARM Cortex-M class devices, we’d recommend that the reader refer to Carmine Noviello’s fantastic book as well as his amazing engineering blog on the subject.

1. Install Java (version 8 or later).

2. Install the Eclipse IDE for C/C++ (version 2020-09 tested on both Linux and Windows).

3. Install the CDT and GNU ARM Embedded CDT plug-ins for Eclipse (go to Help -> Install New Software... and see the images below).

Installing CDT Plugins. (Screenshot by author)

Click on the CDT link available in the drop-down menu and choose CDT Main Features option. Click on Next> and follow the remaining installation instructions from the window.

Additionally, use the link provided above with the Add... option to register the ARM Embedded CDT plug-ins as a software installation source, as shown below, and install the Embedded CDT tools as well.

Pick the stable plug-in installation channel. (Screenshot by author)
Add the channel to the list of software installation sites and name it GNU MCU Eclipse Plug-ins. (Screenshot by author)
Install all the resources shown by clicking Next> and following the instructions from there. (Screenshot by author)

4. Install libusb-1.0 (for Linux-based systems only) using the command:

$ sudo apt-get install libusb-1.0-0

5. Install the GNU Arm Embedded Toolchain, which includes the GNU GCC-based cross-compiler for Cortex-M devices supplied directly by Arm (version 2020-q2 tested on both Linux and Windows).

6. Install OpenOCD software for debugging the board of choice (version 0.10.0 tested on both Linux and Windows).

7. Optional: Install ST-LINK Firmware Upgrade Software and upgrade the board firmware (only if the board is being used for the first time).

8. Optional: Install STM32CubeMX Software for generating the initialization code for your microcontroller, according to specific requirements.

Now, one can deploy code in either of the following two ways:

  1. Set up a new project from scratch by generating initialization code using STM32CubeMX (depending on the specific functionality required), and add the generated .c and .h files to the src/ and include/ folders respectively.
  2. Use one of our sample deployment snippets (available here), where we generate minimum-requirement initialization code using STM32CubeMX (available here) and add our SeeDot-generated code on top of this base sample. These sample snippets have been optimized further for speed, and can provide around 15% faster execution times.

Here, we focus on the latter option only. However, we encourage interested readers to try out the former option as well.

To deploy the code, simply copy the entire folder into your Eclipse workspace directory, as shown below.

1) Open a new/existing workspace in Eclipse. (Screenshot by author)
2) Click on the Import projects... option. (Screenshot by author)
3) Click on the Existing Projects into Workspace option to load the downloaded folder. (Screenshot by author)
4) Browse to the downloaded folder containing our sample code and click Finish. (Screenshot by author)
5) The project import process is now complete, as seen from the Project Explorer menu. (Screenshot by author)

Set up the path to the downloaded toolchain using the Eclipse GUI and build the project binaries with the compiler toolchain as follows.

1) Click on the Project tab, choose the Properties option, and navigate to Arm Toolchain Paths in the MCU drop-down. (Screenshot by author)
2) Browse to the installation directory of the GNU ARM Embedded Toolchain mentioned earlier. (Screenshot by author)
3) Navigate to the bin/ directory inside and click Open. (Screenshot by author)
4) Click Apply and Close, navigate again to the Project tab, and choose the Build Project option to generate the cross-compiled binaries. (Screenshot by author)

As a final step, we need to set up a debugging configuration. We achieve this through OpenOCD, which allows us to control step-by-step debugging from the host PC itself.

1) Click on the Run tab, choose the Debug Configurations... option, and double-click on the GDB OpenOCD Debugging option to create a new configuration. (Screenshot by author)
2) Select the generated configuration and fill in the Main tab details, and finally click Apply. (Screenshot by author)
3) Navigate to the Debugger tab, fill in the OpenOCD installation paths, and click Apply when done. (Screenshot by author)
4) Navigate to the Startup tab, set the configurations as shown and click Apply. (Screenshot by author)

Now, we’re set to run our compiled binary on our microcontroller of choice. Simply connect the board to the PC and click on the Debug button. A console running OpenOCD should fire up and halt immediately.

On clicking the Debug option, you should get the following prompt. Select the Switch option and the program should halt right at the first line of execution of main(). (Screenshot by author)

Final Deployment

Taking a scaled image as input (stored in the main.h file as the input_image variable), we execute the entire pipeline on our model of choice, and set the output directory (specified in the run_test() function of the main.c file) as desired. The scaling process for arbitrary images can be done using the scale_image.py script. Simply run:

$ IS_QVGA_MONO=1 python scale_image.py --image_path <path_to_image> --scale <scaleForX value in m3dump/scales.h file> > output.txt

and copy the entire text from the output.txt file into main.h, replacing the input_image variable.
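Conceptually, the scaling step converts a grayscale, QVGA-sized image into a fixed-point integer array matching the model’s input scale. The sketch below only approximates that idea; scale_image.py in the repository is the authoritative implementation, and the normalization and the way the scale value is applied here are assumptions:

import numpy as np
from PIL import Image

# Illustrative sketch only: resize to QVGA monochrome and emit a fixed-point
# integer array. How the scaleForX value is applied is an assumption here.
def scale_image_sketch(path, scale, width=320, height=240):
    img = Image.open(path).convert("L").resize((width, height))
    pixels = np.asarray(img, dtype=np.float32)
    fixed = np.round(pixels * scale / 255.0).astype(np.int16)
    return ", ".join(str(v) for v in fixed.flatten())   # paste into input_image in main.h

# print(scale_image_sketch("test_image.jpg", scale=4096))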

The model generates bounding-box location (loc) and confidence (conf) outputs for each region of the image that it considers a face.

1) Click Resume option for continuing the execution of the full pipeline. (Screenshot by author)
(Left) We see two LEDs light up on connecting the board — a communications LED which flickers between red and green (signalling PC-device communication using OpenOCD) and a green power LED. (Right) On clicking Resume, we see a blue LED light up momentarily, signalling the beginning and the end of the execution of the core image processing pipeline on the microcontroller. This LED also serves as a measure of time spent by the device in processing the image. (Images by author)
2) After a while, we see a time elapsed counter, and a semihosting exit message, indicating that the results have been written to the mentioned output directory and now the execution is complete. (Screenshot by author)

Finally, we need to superimpose the obtained outputs over our original image to get the final desired output. For that, one can simply make use of the eval_fromquant.py script as:

$ IS_QVGA_MONO=1 python eval_fromquant.py --save_dir <output image directory> --thresh <threshold value for bounding boxes> --image_dir <input image directory containing original test image> --trace_file <path to output file generated by microcontroller> --scale <scaleForY value in m3dump/scales.h file>

which gives us our desired output image, a copy of which would be stored in the save_dir folder.
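For reference, the final visualization boils down to keeping the detections whose confidence exceeds the threshold and drawing them on the original image. The sketch below is a simplified stand-in (the box format is assumed); eval_fromquant.py remains the authoritative script:

import cv2

# Simplified sketch: keep detections above the confidence threshold and draw
# them on the original image. Each detection is assumed to be
# (score, x1, y1, x2, y2) in pixel coordinates.
def draw_detections(image_path, detections, thresh, out_path):
    img = cv2.imread(image_path)
    for score, x1, y1, x2, y2 in detections:
        if score >= thresh:
            cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
    cv2.imwrite(out_path, img)

# draw_detections("group.jpg", detections, thresh=0.45, out_path="group_boxes.jpg")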

Sample image of our research group at Microsoft Research India. (Photo replicated with permission by author)
Corresponding output from RNNPool_Face_M4 (least latency) model (threshold: 0.45). (Photo edited with permission by author)
Corresponding output from RNNPool_Face_QVGA_Monochrome (highest accuracy) model (threshold: 0.60). (Photo edited with permission by author)

Conclusion

Thus, we have a functional face detection pipeline in C, which is executable on a tiny microcontroller. We can input an image to the microcontroller, execute the code to generate these conf and loc outputs, and send these outputs back to the host PC to obtain the final image with the bounding boxes superimposed as shown.

One can further extend this idea by integrating a camera module with the microcontroller for on-the-fly image capture and processing. A detailed exposition of that work is unfortunately beyond the scope of this post. We experimented with an OmniVision OV7670 (VGA) camera module for this purpose, on top of the RNNPool_Face_M4 model. The code for the same can be checked out here (titled RNNPool_Face_M4 Camera).

A NUCLEO-F439 board with attached OV7670 camera. (Photo by author)
A sample shot from the device. (Photo by author)

This blog post was a joint effort by a number of amazing researchers, research fellows and interns at Microsoft Research India. Credits go to Oindrila Saha, Aayan Kumar, Harsha Vardhan Simhadri, Prateek Jain and Rahul Sharma for contributing to the different sections of this write-up.


I’m a Research Fellow at Microsoft Research India, working on resource-efficient machine learning, graph analysis and theory.