Machine Learning in Production

A Tale of Model Quantization in TF Lite

Model optimization strategies and quantization techniques to help deploy machine learning models in resource-constrained environments.

Sayak Paul
Towards Data Science
8 min read · May 7, 2020


Interact with the dashboard of results here.

State-of-the-art machine learning models are often bulky, which makes them inefficient for deployment in resource-constrained environments like mobile phones, Raspberry Pis, and microcontrollers. Even if you think you can get around this problem by hosting your model on the Cloud and serving results through an API, consider constrained environments where internet bandwidth is not always high, or where data must not leave a particular device.

We need a set of tools that make the transition to on-device machine learning seamless. In this report, I will show you how TensorFlow Lite (TF Lite) can really shine in situations like this. We’ll cover model optimization strategies and quantization techniques supported by TensorFlow.

Check out the code on GitHub

Thanks to Arun, Khanh, and Pulkit (Google) for sharing incredibly useful tips for this report.

An overview of performance between different experiments on quantization (available here)

Overview

In this article, we’ll cover the following topics –

  • Need for on-device machine learning
  • Model optimization strategies supported in TensorFlow
  • Quantization Techniques
  • Things to keep in mind while performing quantization

Need for on-device machine learning

In their talk TensorFlow Lite: ML for mobile and IoT devices (TF Dev Summit ‘20), Tim Davis and T.J. Alumbaugh emphasize the following:

  • Lower latency & close-knit interactions: There are many critical applications, self-driving cars for example, where you'd want near-zero prediction latency. You might also need to keep all the internal interactions of your system really compact so that no extra latency is introduced.
  • Network connectivity: As I mentioned earlier, when you depend on a cloud-hosted model, you essentially constrain your application to depend on a certain level of network bandwidth that might not be always achievable.
  • Privacy-preserving: There can be hard requirements on privacy, e.g. that the data must not leave the device.

To make heavyweight ML models deployable on tiny devices, we need to optimize them, for instance so that a 1.9GB model can fit in a 2GB application. To help ML developers and mobile application developers, the TensorFlow team has come up with two solutions: TensorFlow Lite and the Model Optimization Toolkit.

Model optimization strategies supported in TensorFlow

TensorFlow, via TensorFlow Lite and the Model Optimization Toolkit, supports the following model optimization strategies today -

  • Quantization, where you'd play with different lower-precision formats to reduce the size of your models.
  • Pruning, where you'd discard the parameters in your model that have very little significance for the model’s predictions.

In this article, we will focus on quantization.

Quantization Techniques

Generally, our machine learning models operate in the float32 precision format. All the model parameters are stored in this format, which often leads to heavier models, and the heaviness of a model correlates directly with the speed at which it makes predictions. So a natural question arises: what if we could reduce the precision in which our models operate and thereby cut down on prediction times? That is what quantization does - it represents the parameters of a model in lower-precision formats like float16 or int8.

Quantization can be applied to a model in two flavors -

  • Post-training quantization is applied to a model after it is trained.
  • Quantization-aware training, where a model is trained in a way that compensates for the precision loss that quantization introduces. When you reduce the precision of your model's parameters, information can be lost and you might see some reduction in your model's accuracy. In these situations, quantization-aware training can be really helpful.

We will see both these flavors in this report. Let’s get started!

Experimental settings

All of the experiments in this report were performed on Colab. I used the flowers dataset and fine-tuned a pre-trained MobileNetV2 network to start off with. Here’s the code that defines the network architecture -
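The original notebook's code isn't reproduced here, so the snippet below is only a minimal sketch of what such a fine-tuning setup typically looks like; the frozen backbone, the pooling-plus-dense head, and the compile settings are assumptions rather than the report's exact configuration.

```python
import tensorflow as tf

def get_model(num_classes=5):
    # Pre-trained MobileNetV2 backbone without its ImageNet classification head.
    base_model = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3),
        include_top=False,
        weights="imagenet",
    )
    base_model.trainable = False  # assumption: only the new head is trained

    # New classification head for the 5 flower classes.
    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = get_model()
```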

The networks were trained for 10 epochs with a batch size of 32.

Performance with normal fine-tuning

Run page available here
All of these files are available here

We see the network trains reasonably well and comes in at 35.6 MB.

Quantizing the fine-tuned model

After you have trained a model in tf.keras, the quantization part is just a matter of a few lines of code. The way you would do that is as follows -
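As a rough sketch (assuming the fine-tuned tf.keras model from above is in a variable called model), the standard dynamic range conversion flow looks like this:

```python
import tensorflow as tf

# Load the trained Keras model into a TFLiteConverter and pick an optimization policy.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Convert to a TF Lite flatbuffer with dynamic range quantization applied.
tflite_model = converter.convert()
```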

You first load your model into a TFLiteConverter instance, then specify an optimization policy, and finally ask TF Lite to convert your model with that policy. Serializing the converted TF Lite model is straightforward -
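A minimal sketch (the filename is arbitrary):

```python
# The converter returns the flatbuffer as bytes; write it to disk as a .tflite file.
with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```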

This form of quantization is also referred to as post-training dynamic range quantization. It quantizes the weights of your model to 8-bits of precision. Here you can find more details about this and other post-training quantization schemes.

A note on setting configuration options for the conversions

TF Lite allows us to specify a number of different configurations when converting our models. We saw one such configuration in the code above, where we specified the optimization policy.

Apart from tf.lite.Optimize.DEFAULT, two other policies are available - tf.lite.Optimize.OPTIMIZE_FOR_SIZE & tf.lite.Optimize.OPTIMIZE_FOR_LATENCY. As the names suggest, TF Lite will try to optimize the models according to the chosen policy.

We can specify other things like -

  • target_spec
  • representative_dataset

Learn more about the TFLiteConverter class here. It's important to note that these configuration options let us manage the trade-off between a model's prediction speed and its accuracy. Here, you can find a number of trade-offs with respect to the different post-training quantization schemes available in TF Lite.
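As a rough illustration of how those two options fit together (this is not code from the report; representative_data_gen and calibration_images are hypothetical placeholders), a full integer quantization setup might look like this:

```python
import tensorflow as tf

def representative_data_gen():
    # Yield a small number of samples that resemble the model's real inputs,
    # so TF Lite can calibrate the quantization ranges.
    for image in calibration_images[:100]:  # hypothetical array of 224x224x3 images
        yield [tf.cast(image[tf.newaxis, ...], tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Restrict the converter to 8-bit integer kernels (useful for accelerators like the Edge TPU).
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
```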

Below we can see some useful statistics on this converted model.

Table available here

We see a substantial reduction in the size of the model, but it came at the cost of accuracy. Ideally, we wouldn’t want an accuracy loss this big in our converted model. This suggests that we need to explore other quantization schemes to further improve the accuracy of the converted model.

Quantization-aware training (QAT) with the same model

A good first approach here is to train your model in a way that lets it learn to compensate for the information loss that quantization might induce. With quantization-aware training, we can do just that. To train our network in a quantization-aware manner, we just add the following lines of code -
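A sketch of those lines, assuming the fine-tuned model from earlier and the tensorflow-model-optimization package:

```python
import tensorflow_model_optimization as tfmot

# Wrap the trained model with fake-quantization nodes.
qat_model = tfmot.quantization.keras.quantize_model(model)

# The wrapped model needs to be compiled again before training.
qat_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```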

Now, you can train qat_model in the same way you would train a tf.keras model. Here you can find comprehensive coverage of QAT.

Below, we can see that this quantization-aware model does slightly better than our previous model.

Run available here

Brief comparison between the QAT & non-QAT model

In terms of model size, the QAT model is similar to the non-QAT model:

Remember these files are available under the “Files” tab of any run

But in terms of model training time, we see that the QAT model takes more time. This is because during QAT, fake quantization nodes are introduced into the model to compensate for the information loss, which makes the QAT model take more time to converge.

This is important to keep in mind in cases where you are optimizing for time to convergence. If your model already takes a long time to train, introducing QAT will increase that time further.

Quantizing a QAT model is exactly the same as what we saw in the section above (we will use the same quantization configuration).
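For completeness, a sketch of that conversion (same API as before, just starting from qat_model):

```python
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_qat_tflite_model = converter.convert()
```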

Let’s now compare the performance of the quantized version of the QAT model.

Comparison between model training times

Evaluating the quantized QAT model

In the following table, we see that the quantized version of the QAT model indeed performs better than the previous model.

Table available here

We clearly see that quantizing the model trained with QAT does not cause any accuracy drop. In the next section, we will keep both models’ parameters as floats to see how far we can push the trade-off between model size and accuracy.

Quantizing to float models

To quantize our models to float precision, we just need to discard the converter.optimizations = [tf.lite.Optimize.DEFAULT] line. This is particularly helpful if you want to take advantage of GPU delegates. Note that float16 quantization is also supported in TensorFlow Lite. In the table below, we can see the size and accuracy of the models converted using this scheme.
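As a rough sketch of this conversion path (the commented lines show the float16 variant, which uses TF Lite's target_spec.supported_types option):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# No converter.optimizations set, so the weights stay in float32.

# For float16 quantization instead, uncomment the two lines below:
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.target_spec.supported_types = [tf.float16]

float_tflite_model = converter.convert()
```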

Table available here

Although the size of these models has increased, the original performance of the models is retained. Note that converting a QAT model using this scheme is not recommended since, during QAT, the fake quantization ops that get inserted are in int precision. So, when we quantize a QAT model using this scheme, the converted model can show inconsistencies.

Additionally, hardware accelerators like the Edge TPU USB Accelerator do not support float models.

Explore other quantization schemes & concluding thoughts

There are other post-training quantization techniques available as well, such as full integer quantization, float16 quantization, etc. This is where you can learn more about them. Keep in mind that the full integer quantization scheme might not always be compatible with a QAT model.

There are a number of SoTA pre-trained TF Lite models hosted for developers to use in their applications, and they can be found here.

For mobile developers who are looking to integrate machine learning into their applications, there are a number of example applications in TF Lite worth checking out. TensorFlow Lite also provides tooling for embedded systems and microcontrollers, which you can learn more about here.

If you’d like to reproduce the results of this analysis, you can –

Check out the code on GitHub →
