Tensor Quantization: The Untold Story

A close look at the implementation details of quantization in machine learning frameworks

Dhruv Matani
Towards Data Science


Co-authored with Naresh Singh.

Introduction

Throughout the rest of this article, we will try to answer the following questions with concrete examples.

  1. What do the terms scale and zero-point mean for quantization?
  2. What are the different types of quantization schemes?
  3. How do we compute the scale and zero-point for the different quantization schemes?
  4. Why is zero-point important for quantization?
  5. How do normalization techniques benefit quantization?

What do the terms scale and zero-point mean for quantization?

Scale: When quantizing, one typically represents a floating point range [Fmin .. Fmax] in a quantized range [Qmin .. Qmax]. In this case, the scale is the ratio of the width of the floating point range to the width of the quantized range, i.e. (Fmax - Fmin) / (Qmax - Qmin).

We will see an example of how to compute it later.

Zero-point: The zero-point for quantization is the representation of floating point 0.0 in the quantized range. Specifically, the zero-point is a quantized value, and it represents the floating point value 0.0 for all practical purposes. We shall see how it’s computed with examples later, along with why such a representation is of practical interest to us.

Next, let’s take a look at the main quantization schemes used in practice and familiarize ourselves with how they are similar and how they differ.

Types of Quantization Schemes

When considering the types of quantization available for use during model compression, there are two main types to pick from:

  1. Symmetric quantization: In this case, the zero-point is zero, i.e. the floating point value 0.0 maps to 0 in the quantized range. Typically, this is more efficient to compute at runtime, but it may result in lower accuracy if the floating point range is unequally distributed around 0.0.
  2. Affine (or asymmetric) quantization: In this case, the zero-point is non-zero in general, which lets the scheme handle floating point ranges that are not evenly distributed around 0.0.

But before we jump into the details, let’s work through some concrete examples of how the scale and zero-point are computed.

Quantization Scale and Zero Point examples

Let’s start with a very simple example and build up from there.

Example-1: Symmetric uint8 quantization

Let’s say we wish to map the floating point range [0.0 .. 1000.0] to the quantized range [0 .. 255]. The range [0 .. 255] is the set of values that can fit in an unsigned 8-bit integer.

To perform this transformation, we want to rescale the floating point range so that the following is true:

Floating point 0.0 = Quantized 0

Floating point 1000.0 = Quantized 255

This is called symmetric quantization because the floating point value 0.0 maps to the quantized value 0.

Hence, we define a scale, which is equal to

scale = (Fmax - Fmin) / (Qmax - Qmin)

Where, Fmax = 1000.0, Fmin = 0.0, Qmax = 255, and Qmin = 0.

In this case, scale = (1000.0 - 0.0) / (255 - 0) = 3.9215

To convert from a floating point value to a quantized value, we can simply divide the floating point value by the scale and round to the nearest integer. For example, the floating point value 500.0 corresponds to the quantized value

quantized(500.0) = round(500.0 / 3.9215) = round(127.5) = 128
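
To make the arithmetic concrete, here is a minimal sketch of this scheme in plain Python. The helper names (quantize, dequantize) are ours, not from any particular framework.

```python
# Minimal sketch of symmetric uint8 quantization for the range [0.0 .. 1000.0].
# The helper names below are illustrative, not from any specific framework.

f_min, f_max = 0.0, 1000.0
q_min, q_max = 0, 255

scale = (f_max - f_min) / (q_max - q_min)  # 1000.0 / 255 ≈ 3.9215

def quantize(x: float) -> int:
    q = round(x / scale)
    return max(q_min, min(q_max, q))  # clamp to the quantized range

def dequantize(q: int) -> float:
    return q * scale

print(quantize(0.0))     # 0
print(quantize(1000.0))  # 255
print(quantize(500.0))   # 128 (127.5 rounded)
```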

In this simple example, the 0.0 of the floating point range maps exactly to the 0 in the quantized range. This is called symmetric quantization. Let’s see what happens when this is not the case.

Example-2: Affine uint8 quantization

Let’s say we wish to map the floating point range [-20.0 .. 1000.0] to the quantized range [0 .. 255].

In this case, we have a different scaling factor, since our Fmin is different.

scale = (Fmax - Fmin) / (Qmax - Qmin) = (1000.0 - (-20.0)) / (255 - 0) = 4.0

Let’s see what the floating point number 0.0 is represented by in the quantized range if we simply apply this scaling factor:

quantized(0.0) = round(0.0 / 4.0) = 0, and likewise quantized(-20.0) = round(-20.0 / 4.0) = -5

Well, this doesn’t quite seem right, since we would have expected the floating point value -20.0 to map to the quantized value 0 (the lower end of the quantized range), not -5.

This is where the concept of zero-point comes in. The zero-point acts as a bias that shifts the scaled floating point value, and it is the value in the quantized range that represents the floating point value 0.0. In our case, the zero-point is the negative of the scaled representation of the minimum floating point value -20.0, which is -(-5) = 5. The zero-point is always the negative of the scaled representation of the minimum floating point value, since the minimum is always negative or zero. We’ll find out more about why this is the case in the section that explains example 4.

Whenever we quantize a value, we always add the zero-point to the scaled value to get the actual quantized value in the valid quantized range. If we wish to quantize the value -20.0, we compute it as the scaled value of -20.0 plus the zero-point, which is -5 + 5 = 0. Hence, quantized(-20.0, scale=4, zp=5) = 0.
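
Here is a minimal sketch of the affine scheme in plain Python, again with helper names of our own choosing rather than any framework’s API.

```python
# Sketch of affine (asymmetric) uint8 quantization for the range [-20.0 .. 1000.0].
# Helper names are illustrative only.

f_min, f_max = -20.0, 1000.0
q_min, q_max = 0, 255

scale = (f_max - f_min) / (q_max - q_min)   # 1020.0 / 255 = 4.0
zero_point = q_min - round(f_min / scale)   # 0 - (-5) = 5

def quantize(x: float) -> int:
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))  # clamp to the quantized range

def dequantize(q: int) -> float:
    return (q - zero_point) * scale

print(quantize(-20.0))   # 0
print(quantize(0.0))     # 5  (the zero-point)
print(quantize(1000.0))  # 255
```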

Example-3: Affine int8 quantization

What happens if our quantized range is a signed 8-bit integer instead of an unsigned 8-bit integer? Well, the range is now [-128 .. 127].

In this case, -20.0 in the float range maps to -128 in the quantized range, and 1000.0 in the float range maps to 127 in the quantized range.

The way we calculate the zero-point is to compute it as if the quantized range were [0 .. 255] and then offset it by -128, so the zero-point in the new range is

zero-point = 5 - 128 = -123

Hence, the zero-point for the new range is -123. The scale stays 4.0, since the widths of the floating point and quantized ranges are unchanged.
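
The same computation, sketched for the signed int8 range (the helper name is still our own):

```python
# Sketch of affine int8 quantization for the range [-20.0 .. 1000.0].
f_min, f_max = -20.0, 1000.0
q_min, q_max = -128, 127

scale = (f_max - f_min) / (q_max - q_min)   # 1020.0 / 255 = 4.0
zero_point = q_min - round(f_min / scale)   # -128 - (-5) = -123

def quantize(x: float) -> int:
    return max(q_min, min(q_max, round(x / scale) + zero_point))

print(quantize(-20.0))   # -128
print(quantize(0.0))     # -123 (the zero-point)
print(quantize(1000.0))  # 127
```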

So far, we’ve looked at examples where the floating point range includes the value 0.0. In the next set of examples, we’ll take a look at what happens when the floating point range doesn’t include the value 0.0.

The importance of 0.0

Why is it important for the floating point value 0.0 to be represented in the floating point range?

When using a padded convolution, we expect the border pixels to be padded using the value 0.0 in the most common case. Hence, it’s important for 0.0 to be represented in the floating point range. Similarly, if the value X is going to be used for padding in your network, you need to make sure that the value X is represented in the floating point range and that quantization is aware of this.
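
To illustrate, here is a tiny sketch that reuses the scale and zero-point from example-2 (values assumed from that example, not from any framework): since floating point 0.0 quantizes exactly to the zero-point, a border padded with 0.0 in the float domain is simply a border filled with the integer zero_point in the quantized domain.

```python
# With scale = 4.0 and zero_point = 5 (example-2), floating point 0.0
# quantizes exactly to the integer zero-point, so a zero-padded border
# in the float domain becomes a border filled with `zero_point` in the
# quantized domain, with no rounding error introduced.
scale, zero_point = 4.0, 5

def quantize(x: float) -> int:
    return round(x / scale) + zero_point

assert quantize(0.0) == zero_point  # exact: padding stays lossless
```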

Example-4: The untold story — skewed floating point range

Now, let’s take a look at what happens if 0.0 isn’t part of the floating point range.

In this example, we’re trying to quantize the floating point range [40.0 .. 1000.0] into the quantized range [0 .. 255].

Since the value 0.0 is not part of the floating point range [40.0 .. 1000.0], we need to extend the lower limit of the range down to 0.0.

We can see that some part of the quantized range is wasted. To determine how much, let’s compute the quantized value that the floating point value 40.0 maps to. With the extended range [0.0 .. 1000.0], the scale is the same as in example-1, i.e. 3.9215, and the zero-point is 0.

quantized(40.0) = round(40.0 / 3.9215) = round(10.2) = 10

Hence, we’re wasting the range [0 .. 9] in the quantized range (10 of the 256 quantized values), which is about 3.92% of the range. This could significantly affect the model’s accuracy post-quantization.

This skewing is necessary if we wish to make sure that the value 0.0 in the floating point range can be represented in the quantized range.
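
Here is a small sketch of the waste calculation in plain Python, using the numbers from this example (nothing framework-specific):

```python
# Sketch: how much of the uint8 range is wasted when quantizing
# [40.0 .. 1000.0] after extending the range to include 0.0.
f_min, f_max = 0.0, 1000.0   # extended range so that 0.0 is representable
q_min, q_max = 0, 255

scale = (f_max - f_min) / (q_max - q_min)   # ≈ 3.9215
q_of_40 = round(40.0 / scale)               # round(10.2) = 10

wasted_values = q_of_40 - q_min             # quantized values 0..9 are never used
print(wasted_values / (q_max - q_min + 1))  # 10 / 256 ≈ 0.039, i.e. about 3.92%
```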

Another reason for including the value 0.0 in the floating point range is that it allows us to efficiently check whether a quantized value corresponds to 0.0 in the floating point range. Think of operators such as ReLU, which clip all values below 0.0 in the floating point range to 0.0.

It is important for us to be able to represent the zero-point using the same data type (signed or unsigned int8) as the quantized values. This enables us to perform these comparisons quickly and efficiently.
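
For instance, a ReLU over quantized values reduces to an integer comparison against the zero-point. A minimal sketch, assuming the affine uint8 parameters from example-2 (the helper name is ours):

```python
# Quantized ReLU sketch: because floating point 0.0 maps exactly to the
# integer zero-point, clipping at 0.0 becomes an integer max() against it.
zero_point = 5  # from example-2 (scale = 4.0)

def quantized_relu(q: int) -> int:
    # Equivalent to dequantize -> relu -> requantize, but with no
    # floating point arithmetic at all.
    return max(q, zero_point)

print(quantized_relu(0))   # 5  (float -20.0 clips to 0.0)
print(quantized_relu(30))  # 30 (float 100.0 passes through unchanged)
```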

Next, let’s take a look at how activation normalization helps with model quantization. We’ll specifically focus on how the standardization of the activation values allows us to use the entire quantized range effectively.

Quantization and Activation Normalization

Batch/Layer Normalization changes the activation tensor to have zero mean and unit variance per channel or per layer.

Suppose we have an input tensor with a floating point range of [2000.0 .. 4000.0], and we quantize it symmetrically into the signed int8 range.

We observe that the entire negative half of the quantized range [-127 .. -1] is unused. This is problematic, since we’re quantizing the entire floating point range using just 7 of the available 8 bits. This will undoubtedly result in a higher quantization error and reduced model accuracy. To address this, let’s apply layer normalization to the activation tensor.

After applying layer normalization to an activation tensor, the activation tensor will have a floating point range of [-2.0 .. 2.0]. This can be represented in the signed int8 range [-128 .. 127]. To ensure symmetry of the distribution, we restrict the quantized range to be [-127 .. 127].
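
Here is a small sketch of that effect in plain Python, using the ranges from this section and a symmetric quantizer of our own (not a framework API):

```python
# Sketch: with activations in [2000.0 .. 4000.0], symmetric int8
# quantization never produces a negative quantized value, so the
# half-range [-127 .. -1] sits unused. After layer normalization the
# activations land in roughly [-2.0 .. 2.0] and both halves are used.

def quantize_symmetric(x: float, f_absmax: float, q_max: int = 127) -> int:
    scale = f_absmax / q_max  # zero-point is 0 for symmetric quantization
    return max(-q_max, min(q_max, round(x / scale)))

# Before normalization: every activation maps to a non-negative value.
print(quantize_symmetric(2000.0, 4000.0))  # 64  -- never below 0
print(quantize_symmetric(4000.0, 4000.0))  # 127

# After normalization: the full [-127 .. 127] range is reachable.
print(quantize_symmetric(-2.0, 2.0))       # -127
print(quantize_symmetric(2.0, 2.0))        # 127
```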

Hence, normalization avoids holes or unused parts in the quantized range.

Conclusion

We saw what affine (asymmetric) and symmetric quantization are and how they differ. We also learned about what scale and zero-point mean and how to compute them for both these types of quantization schemes.

Next, we saw the need to include float 0.0 in the floating point range and why and how this is done in practice. This results in a downside, namely wasted space in the quantized range.

Lastly, we saw how normalization helps quantization by bringing the activations into a fixed range and avoiding wasted space in the quantized range. In fact, zero-mean normalization can help convert affine quantization into symmetric quantization, which can speed things up during inference.

All images in this post are created by the author(s).

