TFLite Micro vs GLOW AOT

Comparing TinyML Frameworks

Shmuel Branover
Towards Data Science


Miscellaneous Microcontroller development boards

The two frameworks I use for TinyML inference are TensorFlow Lite for Microcontrollers and GLOW (more specifically, the GLOW Ahead of Time (AOT) compiler). Since I haven’t seen a direct comparison between the two, I decided to compare how each framework is implemented and run some benchmarks.

Very Basic Overview

When deploying a trained model to a microcontroller (MCU), two elements are required:

1) Quantization, which optimizes the model for size and latency. Strictly speaking it is not required, but very few models, if any, can run on an MCU without it; and

2) An inference engine, which performs the actual inference on the target.

While both frameworks offer tools to quantize models, I will focus on the inference engine, since the two frameworks take very different approaches to inference.

TensorFlow converts the model into a FlatBuffer containing the model’s weights and the serialized operations required to perform inference. A library running on the target MCU interprets this FlatBuffer at runtime. To execute the operations efficiently, the TFLite Micro library provides optimized kernels for different targets.
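To make that concrete, here is a minimal sketch of what the TFLite Micro side looks like on the target. The model data, arena size, and operator list are placeholders for whatever your model needs; the calls themselves follow the standard TFLite Micro examples.

    #include <cstdint>
    #include "tensorflow/lite/micro/micro_error_reporter.h"
    #include "tensorflow/lite/micro/micro_interpreter.h"
    #include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
    #include "tensorflow/lite/schema/schema_generated.h"

    extern const unsigned char model_data[];  // the serialized FlatBuffer

    constexpr int kArenaSize = 20 * 1024;     // working memory, model dependent
    static uint8_t tensor_arena[kArenaSize];

    void run_inference() {
      static tflite::MicroErrorReporter error_reporter;
      const tflite::Model* model = tflite::GetModel(model_data);

      // Register only the operations the model actually uses.
      static tflite::MicroMutableOpResolver<3> resolver;
      resolver.AddFullyConnected();
      resolver.AddSoftmax();
      resolver.AddReshape();

      static tflite::MicroInterpreter interpreter(
          model, resolver, tensor_arena, kArenaSize, &error_reporter);
      interpreter.AllocateTensors();

      TfLiteTensor* input = interpreter.input(0);
      // ... copy quantized input data into input->data.int8 ...
      interpreter.Invoke();
      TfLiteTensor* output = interpreter.output(0);
      // ... read results from output ...
    }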

GLOW compiles the model into native code for the target before deployment, hence the name “Ahead of Time” compiler. The deployment artifact is a bundle containing a compiled object file, a header file, and weight files; the weights can be loaded dynamically (at runtime) or linked in statically (when compiling for the target).
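For comparison, here is a hedged sketch of calling a GLOW bundle generated for a model named “model”. The macros and the entry function follow the naming pattern of the generated bundle header, so the exact identifiers depend on your model and placeholder names; MODEL_input and MODEL_output in particular are stand-ins.

    #include <cstdint>
    #include "model.h"  // generated bundle header

    // Weights from model.weights.bin, linked in statically here; they could
    // instead be loaded into a RAM buffer at runtime.
    extern const uint8_t model_weights[] __attribute__((aligned(MODEL_MEM_ALIGN)));

    // Buffer sizes and alignment come straight from the generated header.
    static uint8_t mutable_mem[MODEL_MUTABLE_MEM_SIZE] __attribute__((aligned(MODEL_MEM_ALIGN)));
    static uint8_t activations[MODEL_ACTIVATIONS_MEM_SIZE] __attribute__((aligned(MODEL_MEM_ALIGN)));

    void run_inference() {
      // Inputs and outputs live at fixed offsets inside the mutable region;
      // the offset macros are named after your model's placeholders.
      int8_t* input  = reinterpret_cast<int8_t*>(mutable_mem + MODEL_input);
      int8_t* output = reinterpret_cast<int8_t*>(mutable_mem + MODEL_output);
      // ... fill input ...
      model(const_cast<uint8_t*>(model_weights), mutable_mem, activations);
      // ... read output ...
    }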

Comparing the Two Approaches

Each framework’s approach has its benefits. I will outline the advantages and distinctions below.

I am not going to get into the whole “using OOP for embedded” debate. I’ll just say that if seeing a .cpp or .cc file in your embedded project makes you uncomfortable, then you need not read further because GLOW is the framework for you. TensorFlow uses C++, which, granted, is an embedded-friendly implementation, but it is still C++.

Memory Usage

You might be wondering why I am not including memory usage in the benchmarks below; that is because I don’t feel it would be an apples-to-apples comparison.

The bundle output from GLOW contains the exact sizes required for persistent and mutable memory, and since the inference code is already compiled, its size can easily be calculated as well. This is very deterministic, and it lets you easily place your constants and/or your program in RAM; this can be especially useful if you are using an Arm Cortex-M7 with tightly coupled memory. The final binaries generated using GLOW were also smaller.
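Because those sizes are compile-time constants, pinning the buffers to fast memory is just a matter of a linker section attribute. A small sketch, assuming a .dtcm_data section exists in your linker script (the section name is an assumption):

    // Same buffers as in the bundle sketch above, now placed in DTCM.
    // ".dtcm_data" must match a region defined in your linker script.
    __attribute__((section(".dtcm_data"), aligned(MODEL_MEM_ALIGN)))
    static uint8_t mutable_mem[MODEL_MUTABLE_MEM_SIZE];

    __attribute__((section(".dtcm_data"), aligned(MODEL_MEM_ALIGN)))
    static uint8_t activations[MODEL_ACTIVATIONS_MEM_SIZE];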

TensorFlow, on the other hand, is a little less deterministic. The FlatBuffer size is the only exact value you get. Mutable memory is probably my biggest gripe, as you need to allocate an area in RAM called the tensor_arena and, to quote the TensorFlow docs, “The size required will depend on the model you are using, and may need to be determined by experimentation.” As for program size, within your program you specify which operations your model will use, and only those will be included in your executable.
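On the tensor_arena point, one way to take some of the guesswork out of that experimentation is to start with a deliberately generous arena and ask the interpreter how much it actually used after allocation, then trim the constant. A sketch, reusing the names from the earlier TFLite Micro example (arena_used_bytes() is available in recent versions of the library, as far as I know):

    // After AllocateTensors() succeeds with an oversized arena, report the
    // real usage so kArenaSize can be trimmed for the production build.
    interpreter.AllocateTensors();
    size_t used = interpreter.arena_used_bytes();
    printf("tensor arena: %u of %u bytes used\r\n",
           (unsigned)used, (unsigned)kArenaSize);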

Being an embedded software developer (read: control freak), I much prefer the GLOW approach. However, if you plan on running multiple models that use similar operators, TensorFlow might have the upper hand since you will be using the same instruction memory for all models.

Deployment

The nature of the ML beast is constant upgrading and updating; therefore, in-field upgradability is a must. Here I prefer the TensorFlow implementation. When your model changes, as long as the shapes of your input and output stay the same and you included all the operations in your original code, all you need to do is replace your FlatBuffer.

This means you can store the FlatBuffer in a filesystem and use whatever communication protocols you have implemented to update your model. From a security standpoint, you can update your model without changing your executable; that does not remove all security concerns, but it is still beneficial. That being said, the GLOW bundle is isolated enough that, with some thought, one can design a system that allows only the model to be upgraded.
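A sketch of what such an update path could look like. load_model_file() is a hypothetical helper standing in for whatever storage or communication layer you already have; the op resolver and tensor_arena from the original firmware are reused unchanged.

    // Hypothetical helper: returns a pointer to the updated FlatBuffer,
    // wherever your update mechanism stored it (filesystem, external flash, ...).
    extern const uint8_t* load_model_file(const char* path);

    const tflite::Model* load_updated_model() {
      const uint8_t* new_model_data = load_model_file("model.tflite");
      const tflite::Model* model = tflite::GetModel(new_model_data);
      if (model->version() != TFLITE_SCHEMA_VERSION) {
        return nullptr;  // reject the update: schema mismatch with the library
      }
      // Construct the MicroInterpreter exactly as in the earlier sketch,
      // reusing the same op resolver and tensor_arena.
      return model;
    }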

While I prefer TensorFlow for updating models in the field, how beneficial this is will depend on the use case. For example, if the use case won’t require many updates, or if the updates will be significant (requiring changes to the input or output shape, or introducing new operations), the advantage is less pronounced.

Portability

There are two different elements to portability: the framework used for training and the target hardware. Technically, GLOW is part of the PyTorch project, but it can easily be used with models trained in TensorFlow (be warned, though: using a model trained in PyTorch with TensorFlow Lite is a little more involved).

Target architecture portability in TensorFlow is pretty simple. If you can compile C++ code for your target, then you should be able to compile TensorFlow Lite for Microcontrollers. It probably won’t be very optimized, but it’ll work. The library does contain optimized kernels for a few different architectures (for example, Arm CMSIS-NN or Cadence Xtensa). You can also write custom kernels optimized for your target hardware. For example, suppose your target hardware has a matrix multiplication engine, and your model uses many depth-wise convolutions. In that case, you can create a custom implementation of the depth-wise convolution kernel while using the rest of the library.
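As a rough illustration of what that involves: each builtin op is resolved through a TfLiteRegistration whose Prepare/Eval function pointers do the work, so a target-specific depthwise convolution comes down to providing your own pair and wiring it in place of the reference one. The sketch below is only the general shape; the exact registration function name, namespace, and build wiring vary between library versions, and the hardware call is a placeholder.

    #include "tensorflow/lite/c/common.h"

    namespace {

    TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
      // Validate shapes/quantization parameters and request scratch buffers here.
      return kTfLiteOk;
    }

    TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
      // Hand the convolution to the hardware matrix engine instead of the
      // reference loop (hypothetical driver call):
      // matmul_engine_depthwise_conv(context, node);
      return kTfLiteOk;
    }

    }  // namespace

    // This registration stands in for the reference depthwise conv kernel when
    // built in place of the default implementation for your target.
    TfLiteRegistration Register_DEPTHWISE_CONV_2D_Custom() {
      return {/*init=*/nullptr, /*free=*/nullptr, /*prepare=*/Prepare, /*invoke=*/Eval};
    }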

GLOW uses LLVM as its backend compiler. If your target is supported by the LLVM compiler, it should work. I have no idea how well it will be optimized; for Arm, it seems to do a decent job (see the benchmarks below). If your target architecture is not supported, you can always add it; LLVM is open source, after all.

I did some light experimenting with writing a custom implementation of a TensorFlow kernel; it was pretty simple, and the library includes unit tests.

I mostly use TensorFlow for training and ARM Cortex-M for inference, so both frameworks work well for me.

Benchmarks

MLCommons recently released MLPerf Tiny, which makes benchmarking really simple. It should be noted, however, that the benchmarks performed here could still be optimized further and in no way represent the highest achievable results; rather, they show the performance of each framework with minimal digging under the hood.

The Hardware

I ran the benchmarks on two different MCUs, using the manufacturers’ development boards:

  • LPC55S69 — Has a dual-core Arm Cortex-M33 that can run at up to 150 MHz. It also has a hardware accelerator, which is essentially a mini DSP. The configuration I like using is the mini DSP for preprocessing (I experimented with writing custom TF kernels for the DSP, but I feel it was more efficient to use it for preprocessing) with one of the cores as an inference engine. While both cores are Cortex-M33, only one core supports SIMD, which significantly accelerates inference.
  • i.MX RT1010 — Has an Arm Cortex-M7 that can run at up to 500 MHz, and while it has pretty limited I/O, it costs only $0.99 in large quantities. For applications that require powerful computing but not many peripherals, this is an attractive option (an inference engine comes to mind).

The Setup

From MLPerf, I used v0.5 of the keyword spotting benchmark and the EEMBC Windows 10 runner v3.0.6, and I ran the median performance and accuracy tests.

For software, I used the MCUXpresso IDE along with NXP’s SDK. I am currently porting the different implementations into a single portable project. Once done, it can be found in the project repo. The MCUXpresso SDK version I used for both MCUs is 2.9.1.

For the TensorFlow benchmarks, I compiled the library both with no optimization and with -O3 optimization. When possible, I tested with the values stored in flash and in RAM. For the i.MX RT1010, I configured the RAM as DTCM to speed up inference.

The Results

The results can be seen in the table below.

Benchmark Results

It seems that GLOW was better at optimizing for the Cortex-M7 than for the Cortex-M33; this might be because the Cortex-M7 is a better fit (and so more effort was put into optimizing for it), or simply because it is a more mature architecture. If optimization is highly critical for your application, I’d recommend trying both frameworks on your model. If the results are similar, it should be easier to optimize TensorFlow further by hand.

Conclusion

As detailed above, both frameworks have their advantages, so you should choose what is best based on your application.

Both frameworks are straightforward to use. So if you are an embedded software engineer looking to try machine learning or a machine learning engineer looking to try the embedded side of things, I highly recommend giving this a shot.

Official Benchmarks

While writing this article, the first batch of the official benchmark results was released. Of note are the Cortex-M4 results, which were better than my tests on the Cortex-M33 (which supports the same DSP instruction set as the Cortex-M4), even though the Cortex-M33 I used ran at a higher clock speed. I tried to verify the results by compiling the Mbed reference submission for the LPC55S69 (Cortex-M33), with no luck. I was, however, able to run the reference submission on the NUCLEO-L552ZE-Q board, which is also a Cortex-M33, and the throughput was 2.693 inf./sec, which is more in line with the results I got.

This leads me to think that the optimizations for the Cortex-M33 might not be mature enough yet, or alternatively, that the DSP instructions on the Cortex-M33 take more clock cycles. Still, as I said earlier, these benchmarks reflect the out-of-the-box experience rather than absolute capabilities, and I am sure the results can be optimized further.


--

I am the Principal Engineer at HaCol, where I work on designing smart electronics firmware and hardware.