Integer-Only Inference for Deep Learning in Native C
Tutorial: converting a deep neural network for deployment on low-latency, low-compute devices via uniform quantization and the fixed-point representation.
Integer-only inference allows for the compression of deep learning models for deployment on low-compute and low-latency devices. Many embedded devices are programmed using native C and do not support floating-point operations or dynamic allocation. Nevertheless, small deep learning models can be deployed to such devices with an integer-only inference pipeline through uniform quantization and the fixed-point representation.
We employed these methods to deploy a deep reinforcement learning (RL) model on a network interface card (NIC) (Tessler et al. 2021[1]). Successfully deploying the RL model required inference latency on the order of microseconds on a device with no floating-point operation support. The tight latency constraint ruled out running the model on a GPU. To achieve our goal, we used Post-Training Quantization (PTQ) techniques specified by Wu et al. 2020[2] with TensorRT’s pytorch-quantization package.
In this tutorial, we will learn about quantization and PTQ techniques following Wu et al. 2020[2]. In addition, we will learn about the fixed-point representation and give a detailed example of converting a PyTorch classifier trained on MNIST to native C for integer-only inference with minimal performance loss.
Table of Contents
· Quantization
· Post-Training Quantization
∘ Calibration
· Integer Only Inference
∘ The Fixed-Point Representation
∘ Fixed-Point Arithmetics
· Tutorial — integer-only inference in native C for MNIST classification
∘ Model Training and Quantization in Python
∘ Model Inference in Native C-Code
∘ Model Evaluation
· Conclusion
Quantization
Quantization is the process of mapping numerical values from a large set (often continuous) to a smaller set. In our case, we are interested in reducing a neural network’s size and speeding up computations such that it can fit and run inference on low-compute low-memory devices. We will focus on uniform integer quantization, such that we can perform matrix multiplications and convolutions in the integer domain. More specifically, we will quantize a network’s inputs and parameters from float32 to int8.
Uniform Quantization: in uniform quantization, an input x ∈ [β, α] is transformed to the range [-2^(b-1), 2^(b-1) - 1], where b is the number of bits. Values outside the input range are clipped to the nearest bound. In our case (b = 8), the transformed range is [-128, 127], and we will use [-127, 127] for symmetry.
Uniform quantization is done via two functions:
- QUANTIZE: maps float32 values to the int8 range, followed by rounding and clipping (float32 → int8)
- DEQUANTIZE: maps int32 values to float32 (int32 → float32)
Assuming we want to minimize the number of operations, we will implement these operations using scale quantization.
In scale quantization, the input range is x ∈ [-α, α], and the maximum α value (amax) is calibrated to maximize precision. Once α is calibrated, the mapping is performed by multiplying/dividing by a scale factor s.
With scale quantization, our operations are defined as follows:
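As a floating-point reference, scale quantization can be sketched as below. This is a minimal sketch assuming the int8 case with scale factor s = 127 / amax, as used later in this tutorial; the function names (scale_from_amax, quantize_ref, dequantize_ref) are illustrative and not taken from the tutorial's code. Because it uses float, it is only suitable as an offline reference, not for the target device.

```c
#include <stdint.h>

#define QMAX 127  /* symmetric int8 range [-127, 127] */

/* Scale factor s = 127 / amax, assuming a calibrated amax > 0. */
float scale_from_amax(float amax)
{
    return (float)QMAX / amax;
}

/* QUANTIZE: multiply by s, clip to [-127, 127], round to nearest
 * (ties are rounded away from zero). */
int8_t quantize_ref(float x, float s)
{
    float v = x * s;
    if (v >= (float)QMAX) return QMAX;
    if (v <= -(float)QMAX) return -QMAX;
    return (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* DEQUANTIZE: divide an integer value by s to return to the real domain. */
float dequantize_ref(int32_t q, float s)
{
    return (float)q / s;
}
```

With amax = 2.0 the scale factor is 63.5, so an input of 1.0 quantizes to 64 and any input above 2.0 is clipped to 127.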
Tensor Quantization Granularity
Quantization granularity refers to how quantization parameters (e.g., the scale factor) are shared among the elements of a tensor. There is a trade-off between precision and computational costs: scaling each tensor element with a different scale factor results in the highest precision and the highest computational costs, whereas scaling the entire tensor with a single factor results in the lowest precision and the lowest computational costs.
Linear Layer Quantization
We quantize layer inputs with a per-tensor granularity (all tensor elements share the same scale factor). In contrast, the layer’s weights are quantized with a per-row granularity (each row has a different scale factor). This choice allows us to factor out the scale factors and to perform the matrix multiplication with integers.
Bias quantization is easily achieved by concatenating a vector of ones to the input and adding another dimension to a layer’s weights.
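To see why this choice of granularity lets the scale factors be factored out, consider a single output element. The notation here is an assumption made for illustration: x^q and w^q denote the quantized input and weights, s_x is the per-tensor input scale factor, s_{w,j} is the scale factor of weight row j, and each weight row is assumed to correspond to one output unit (PyTorch's layout for nn.Linear).

```latex
y_{j} = \sum_{k} x_{k}\, w_{jk}
\approx \sum_{k} \frac{x^{q}_{k}}{s_{x}} \cdot \frac{w^{q}_{jk}}{s_{w,j}}
= \frac{1}{s_{x}\, s_{w,j}} \sum_{k} x^{q}_{k}\, w^{q}_{jk}
```

The sum on the right contains only integers, and the factor in front depends only on the output index j, so it can be applied once per output element after the integer matrix multiplication.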
Convolution Layer Quantization
As with the linear case, we can computationally benefit from factoring out the scale factors. This is done by choosing a per-tensor granularity for layer inputs and per-channel granularity for layer weights.
Post-Training Quantization
In post-training quantization, we calibrate the scale factors for each model layer by constructing histograms (one for layer inputs and one for parameters) of absolute values. Histograms of layer inputs are collected by running inference on a sample dataset, while layer parameter calibration can be done offline.
Calibration
The calibration part selects the α that minimizes quantization precision loss. There are three main methods to choose the proper α per histogram:
- Max — calibrate using the maximum of the absolute value distribution
- Entropy — minimize KL divergence between the quantized distribution and the original distribution
- Percentile — select the bin corresponding to the k-th percentile of the absolute value distribution
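The Max and Percentile methods can be sketched as simple passes over a histogram. This is a hedged illustration, not the implementation used by pytorch-quantization: the function names, the equal-width bin layout, and the choice of returning the upper bin edge are all assumptions. Calibration runs offline, so using float here does not affect the integer-only inference code. Entropy calibration is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * hist[i] counts absolute values that fall in [i * bin_width, (i + 1) * bin_width).
 * Returns the upper edge of the first bin at which the cumulative count
 * reaches `percentile` percent of all samples (0 < percentile <= 100).
 */
float percentile_amax(const uint32_t *hist, size_t n_bins,
                      float bin_width, float percentile)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n_bins; i++) total += hist[i];

    double target = (double)total * percentile / 100.0;
    uint64_t cumulative = 0;
    for (size_t i = 0; i < n_bins; i++) {
        cumulative += hist[i];
        if ((double)cumulative >= target) return (float)(i + 1) * bin_width;
    }
    return (float)n_bins * bin_width;
}

/* Max calibration: upper edge of the last non-empty bin (0 if all bins are empty). */
float max_amax(const uint32_t *hist, size_t n_bins, float bin_width)
{
    for (size_t i = n_bins; i > 0; i--) {
        if (hist[i - 1] != 0) return (float)i * bin_width;
    }
    return 0.0f;
}
```

A lower percentile clips more outliers, which shrinks amax and leaves more of the int8 range for the bulk of the distribution.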
Once calibration is finished, we can construct a quantized linear layer in inference.
Integer Only Inference
Up until this point, we discussed quantization for performing linear operations such as convolutions and matrix multiplications with integers. Non-linear activations, on the other hand, are not suited for uniform quantization and must be approximated, for example with lookup tables or piecewise functions; these methods deserve an article of their own. We will not discuss non-linear approximations here and instead use the ReLU activation function, as it does not require approximation.
However, we are still left with float32 as an input/output to the Quantize/Dequantize operations. When we multiply/divide values with the scale factors during the Quantize/Dequantize operations, we cannot avoid dealing with fractions. To handle such operations with integers, we will use the fixed-point representation.
The Fixed-Point Representation
The fixed-point representation is a way to express fractions with integers. We split the K bits making up an integer into a part representing the integer portion of a number and a part representing its fractional portion. Using the sign-magnitude format, we reserve 1 bit for the sign and the remaining K-1 bits for the magnitude. The radix point splits these K-1 bits into M most significant bits (MSB) representing the integer value and N least significant bits (LSB) representing the fraction. The choice of M and N results in a trade-off between representation range and precision.
Conversion between a float and its fixed-point representation is done by multiplying/dividing by 2 to the power of N, the number of fractional bits (referred to below as the fixed-point value).
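The conversion can be sketched in C as follows. This is a minimal sketch assuming 32-bit integers with N = 16 fractional bits; the names FXP_VALUE, float_to_fixed, and fixed_to_float are illustrative. Since it uses float, it is meant for offline preparation of constants (for example, scale factors), not for code running on the device.

```c
#include <stdint.h>

#define FXP_VALUE 16  /* N: number of fractional bits (assumed) */

/* float -> fixed-point: multiply by 2^N and round to nearest.
 * Assumes |x| < 2^15 so the result fits in an int32_t. */
int32_t float_to_fixed(float x)
{
    float scaled = x * (float)(1 << FXP_VALUE);
    return (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* fixed-point -> float: divide by 2^N. */
float fixed_to_float(int32_t x)
{
    return (float)x / (float)(1 << FXP_VALUE);
}
```

For example, 1.5 is stored as 98304 (1.5 × 65536), while a small value such as 0.001 is stored as 66, which already carries a relative rounding error of about 0.7%.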
Fixed-Point Arithmetics
Three rules of thumb for fixed-point arithmetics:
1. The sum of two fixed-point numbers is a fixed-point number.
2. The product of an integer with a fixed-point number is a fixed-point number.
3. The product of two fixed-point numbers divided by two to the power of the fixed-point value is a fixed-point number.
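These rules follow directly from the representation. As a short derivation, assume X = x · 2^N and Y = y · 2^N are the fixed-point representations of the real numbers x and y with N fractional bits, and k is a plain integer (the symbols are introduced here for illustration):

```latex
\begin{aligned}
X + Y &= (x + y)\cdot 2^{N} \\
k \cdot X &= (k\,x)\cdot 2^{N} \\
\frac{X \cdot Y}{2^{N}} &= \frac{(x \cdot 2^{N})(y \cdot 2^{N})}{2^{N}} = (x\,y)\cdot 2^{N}
\end{aligned}
```

In each case the right-hand side is again a real number multiplied by 2^N, i.e., a fixed-point number with the same number of fractional bits.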
The fixed point representation replaces the remaining floating-point operations with integers.
Tutorial — integer-only inference in native C for MNIST classification
We will train a simple classifier on the MNIST dataset in PyTorch. Next, we will quantize the network’s parameters to int8 and calibrate their scale factors. Finally, we will write an integer-only inference code in native C.
Model Training and Quantization in Python
The model — A Multilayer Perceptron (MLP) with two hidden layers (128, 64 respectively) and ReLU trained on the MNIST dataset. The MLP contains no biases to minimize the model parameters and computational overhead.
Let’s define a simple MLP model in PyTorch.
and train it for ten epochs
Epoch: 1 — train loss: 0.35650 validation loss: 0.20097
Epoch: 2 — train loss: 0.14854 validation loss: 0.13693
Epoch: 3 — train loss: 0.10302 validation loss: 0.11963
Epoch: 4 — train loss: 0.07892 validation loss: 0.11841
Epoch: 5 — train loss: 0.06072 validation loss: 0.09850
Epoch: 6 — train loss: 0.04874 validation loss: 0.09466
Epoch: 7 — train loss: 0.04126 validation loss: 0.09458
Epoch: 8 — train loss: 0.03457 validation loss: 0.10938
Epoch: 9 — train loss: 0.02713 validation loss: 0.09077
Epoch: 10 — train loss: 0.02135 validation loss: 0.09448
Evaluate model on test data
Accuracy: 97.450%
Model Quantization — TensorRT is an SDK for high-performance deep-learning inference. It contains pytorch-quantization, which allows for straightforward quantization (fake, quantization-aware training, PTQ) of PyTorch models.
Let’s begin by importing the relevant modules.
- quant_nn: contains quantized versions of PyTorch layers such as nn.Linear.
- calib: a calibration module for constructing histograms and calculating scale factors.
- quant_modules: allows for dynamically replacing PyTorch layers with their quantized versions.
- QuantDescriptor: defines how to quantize a tensor.
The first step is to define quantization using a histogram calibrator. We then set our quantized linear layer with the desired quantization description. Finally, monkey-patch PyTorch modules with quantized versions by calling initialize().
We load the trained MLP model only after defining the quantization scheme and monkey patching PyTorch modules.
Let’s define a function to collect input statistics for calibration during inference. First, we disable quantization to construct a histogram from the float values. We then perform inference and re-enable quantization for cases where we want to run inference with a quantized model.
Now let’s define a function to compute the amax values (scale factor = 127 / amax).
To minimize the number of operations at inference, we calculate the scale factor values offline. To avoid division in fixed-point, we can invert them and multiply the inverted values in C.
It is possible to combine the Dequantize scale factors into a single variable. Still, it may reduce accuracy as the resulting number might be very small for fixed-point 16 (see the Dequantize section in the tutorial for an example).
Once we have defined our functions, we can run the code.
Model Inference in Native C-Code
Implementing an MLP model requires the definition of:
- Matrix-Multiplication
- Fixed-point Multiplication
- Quantize
- Dequantize
- ReLU
- argmax (replaces softmax for inference)
- Linear Layer
Assuming the neural network’s architecture and parameters are pre-determined, and we cannot use dynamic allocation, we will not define general structures for matrices and tensors. Instead, in this tutorial, we will treat matrices/tensors as flattened 1D arrays and use their shapes to apply operations.
Matrix-multiplication
We perform matrix multiplication between two flattened int8 matrices. To avoid overflow, the result is stored in a larger integer than int8.
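A minimal sketch of such a matrix multiplication is shown below. The function name, argument order, and the weight layout (M rows of length K, one row per output unit, as in PyTorch's nn.Linear) are assumptions made for this illustration.

```c
#include <stdint.h>

/*
 * out (N x M) = x (N x K) times the transpose of w (M x K).
 * All matrices are flattened row-major 1D arrays.
 */
void mat_mult(const int8_t *x, const int8_t *w, int32_t *out,
              unsigned int N, unsigned int K, unsigned int M)
{
    for (unsigned int n = 0; n < N; n++) {
        for (unsigned int m = 0; m < M; m++) {
            int32_t acc = 0;  /* int32 accumulator: int8 products do not fit in int8 */
            for (unsigned int k = 0; k < K; k++) {
                acc += (int32_t)x[n * K + k] * (int32_t)w[m * K + k];
            }
            out[n * M + m] = acc;
        }
    }
}
```

Each product is at most 127 × 127 = 16129 in magnitude, so an int32 accumulator can safely sum well over 100,000 terms, far more than the 784 inputs of an MNIST image.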
Fixed-Point Multiplication
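A simple multiplication between two fixed-point 16 numbers, a · 2^16 and b · 2^16, looks like this:
(a · 2^16) · (b · 2^16) = (a · b) · 2^32
The result is scaled by 2^32 rather than 2^16. Therefore, we need to divide the result by 2 to the power of 16.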
In C, the left shift operator can be seen as multiplication by a power of two, and the right shift operator as division by a power of two. We can define a fixed-point multiplication function in C, where both inputs share the same fixed-point value, as follows:
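A minimal sketch of such a function, assuming 32-bit fixed-point values with 16 fractional bits and an available 64-bit type for the intermediate product (the names FXP_VALUE and fixed_mul are illustrative):

```c
#include <stdint.h>

#define FXP_VALUE 16  /* number of fractional bits (assumed) */

/*
 * Multiply two fixed-point numbers with the same fixed-point value.
 * The raw product is scaled by 2^(2 * FXP_VALUE); shifting right by
 * FXP_VALUE brings it back to the original scaling. The low bits are
 * discarded (truncation). Right-shifting a negative signed value is
 * implementation-defined in C; mainstream compilers use an arithmetic shift.
 */
int32_t fixed_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> FXP_VALUE);
}
```

For example, multiplying 98304 (1.5) by 131072 (2.0) yields 196608, which is 3.0 in fixed-point 16.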
There are two major issues when multiplying fixed-point values
- Rounding
- Overflow
Rounding: By default, multiplying/dividing using left and right shifts results in truncation (flooring). A simple rounding method is to round to the nearest integer via the round half up method, which adds half of the divisor, 2^(N-1), before right shifting by N.
Fixed-point multiplication with rounding looks like this:
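A minimal sketch, under the same assumptions as before (16 fractional bits, 64-bit intermediate; the names ROUND_CONST and fixed_mul_round are illustrative):

```c
#include <stdint.h>

#define FXP_VALUE 16
#define ROUND_CONST ((int64_t)1 << (FXP_VALUE - 1))  /* 0.5 in units of the discarded bits */

/*
 * Fixed-point multiplication with round half up: adding half of the
 * divisor before the right shift turns truncation into rounding.
 * As with any right shift of a signed value, the behavior for negative
 * products is implementation-defined (arithmetic shift on mainstream compilers).
 */
int32_t fixed_mul_round(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)((product + ROUND_CONST) >> FXP_VALUE);
}
```

With raw inputs 3 and 32768, the exact result is 1.5 raw units: truncation returns 1, while this version returns 2.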
Overflow: It is straightforward to see that the product of two large fixed-point numbers can easily overflow before right-shifting. A simple solution is to store the value in a larger type (such as int64), but in some cases, the fixed-point number is already represented using the largest available type. An alternative is to perform fixed-point multiplication by parts: we split each fixed-point number into an integer and a fractional component, under the assumption that the product of both integer components will not overflow.
A fixed-point multiplication by parts implementation looks like this:
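One possible sketch that uses only 32-bit arithmetic is shown below. It is an illustration under stated assumptions rather than a definitive implementation: it works on magnitudes and reapplies the sign at the end, it rounds only the fraction-times-fraction term, and it assumes the magnitude of the final result fits in 31 bits. The names are illustrative.

```c
#include <stdint.h>

#define FXP_VALUE 16
#define FXP_MASK ((1u << FXP_VALUE) - 1u)
#define ROUND_CONST (1u << (FXP_VALUE - 1))

/*
 * With a = a_int * 2^N + a_frac and b = b_int * 2^N + b_frac:
 *   a * b / 2^N = a_int * b_int * 2^N + a_int * b_frac
 *               + a_frac * b_int + (a_frac * b_frac) / 2^N
 * Every term fits in 32 bits as long as the final magnitude does.
 */
int32_t fixed_mul_parts(int32_t a, int32_t b)
{
    int negative = (a < 0) != (b < 0);
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;  /* |a| */
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;  /* |b| */

    uint32_t a_int = ua >> FXP_VALUE, a_frac = ua & FXP_MASK;
    uint32_t b_int = ub >> FXP_VALUE, b_frac = ub & FXP_MASK;

    uint32_t result = ((a_int * b_int) << FXP_VALUE)
                    + a_int * b_frac
                    + a_frac * b_int
                    + ((a_frac * b_frac + ROUND_CONST) >> FXP_VALUE);

    return negative ? -(int32_t)result : (int32_t)result;
}
```

The fractional parts are below 2^16, so their product plus the rounding constant stays below 2^32 and fits in an unsigned 32-bit integer.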
Now that we covered fixed-point multiplication, we can code the Quantize function.
Quantize
The Quantize function multiplies two fixed-point numbers, the layer’s input and the scale factor, followed by clipping to the int8 range [-127, 127]. We reduce the risk of overflow by comparing the input to the product of the inverted scale factor and the border of our clipping range before performing the fixed-point multiplication. In contrast to standard fixed-point multiplication, in Quantize, we want the product to be a rounded integer. Therefore, we will not convert the output to fixed-point.
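A hedged sketch of Quantize for a single value follows. It assumes 16 fractional bits, a 64-bit intermediate, and positive scale factors; the signature and names are illustrative. Rounding is done on the magnitude, and a final clip guards against the small mismatch between the rounded fixed-point scale factor and its rounded inverse.

```c
#include <stdint.h>

#define FXP_VALUE 16
#define QMAX 127

/*
 * x:         fixed-point input
 * scale:     fixed-point scale factor s = 127 / amax
 * scale_inv: fixed-point inverse scale factor 1 / s (assumed > 0 and
 *            small enough that 127 * scale_inv fits in an int32_t)
 * Returns round(x * s) clipped to [-127, 127] as a plain integer.
 */
int8_t quantize(int32_t x, int32_t scale, int32_t scale_inv)
{
    /* Inputs at or beyond 127 / s are clipped without multiplying. */
    int32_t bound = QMAX * scale_inv;
    if (x >= bound) return QMAX;
    if (x <= -bound) return -QMAX;

    /* The product is scaled by 2^(2N); shift by 2N to get a plain integer. */
    int64_t product = (int64_t)x * scale;
    int64_t magnitude = product < 0 ? -product : product;
    int64_t half = (int64_t)1 << (2 * FXP_VALUE - 1);
    int64_t q = (magnitude + half) >> (2 * FXP_VALUE);
    if (q > QMAX) q = QMAX;

    return (int8_t)(product < 0 ? -q : q);
}
```

For amax = 2.0, the scale factor 63.5 is 4161536 in fixed-point 16 and its inverse is 1032, so the input 1.0 (65536) quantizes to 64.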
Dequantize
The Dequantize operation requires multiplying an integer (the output of the int8 matrix multiplication) by the inverted scale factors of the layer’s input (a scalar) and the layer’s weights (a vector, as we quantized with per-row granularity). This means we need to mix regular and fixed-point multiplication. In general, multiplying an integer with a fixed-point number has a high risk of overflow. On the other hand, when the fixed-point numbers represent small fractions (and they do in our case), multiplying them together can result in precision loss. This occurs when their product before right shifting is smaller than the value we are right shifting by. As a result, the output is erroneously zeroed.
To illustrate this point, in the example below, we multiply fixed-point values of different magnitudes by two integers, 50 and 500, and compare multiplying by the integer before right shifting with multiplying after.
a: 0.001000 b: 0.000400 int: 50, c_before: 1 c_after: 0
a: 0.001000 b: 0.000400 int: 500, c_before: 13 c_after: 0
a: 0.100000 b: 0.000400 int: 50, c_before: 130 c_after: 150
a: 0.100000 b: 0.000400 int: 500, c_before: 1300 c_after: 1500
a: 0.001000 b: 0.400000 int: 50, c_before: 1300 c_after: 1300
a: 0.001000 b: 0.400000 int: 500, c_before: 13000 c_after: 13000
a: 0.100000 b: 0.400000 int: 50, c_before: -14 c_after: 131050
a: 0.100000 b: 0.400000 int: 500, c_before: -140 c_after: 1310500
We can see that for small a and b values, we benefit from multiplying before right shifting, but we overflow when their values are larger. On the other hand, we get slightly different results when we multiply after (precision loss). Finally, only when a=0.001 and b=0.4 do both methods output the same result.
A simple solution is to calculate the product of the scale factors and compare it to 1 (in fixed-point). When the product is larger than one, we risk overflow, and we should multiply by the matrix values after the fixed-point multiplication. Otherwise, we multiply by the matrix values within the fixed-point multiplication.
Finally, we define Dequantize as:
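The following is a sketch of this idea, not a definitive implementation. It assumes 16 fractional bits, positive inverse scale factors, a 64-bit intermediate type, and a dequantized value that fits in a 32-bit fixed-point number; the function name and argument order are illustrative. Signs are handled on the magnitude to avoid right-shifting negative values.

```c
#include <stdint.h>

#define FXP_VALUE 16
#define ROUND_CONST ((int64_t)1 << (FXP_VALUE - 1))

/*
 * acc:    N x M int32 outputs of the int8 matrix multiplication
 * sw_inv: M fixed-point inverse weight scale factors (one per weight row)
 * sx_inv: fixed-point inverse input scale factor
 * out:    N x M fixed-point results
 */
void dequantize_per_row(const int32_t *acc, const int32_t *sw_inv,
                        int32_t sx_inv, int32_t *out,
                        unsigned int N, unsigned int M)
{
    const int64_t one = (int64_t)1 << (2 * FXP_VALUE);  /* 1.0 scaled by 2^(2N) */

    for (unsigned int n = 0; n < N; n++) {
        for (unsigned int m = 0; m < M; m++) {
            int64_t s = (int64_t)sx_inv * sw_inv[m];  /* scaled by 2^(2N) */
            int64_t v = acc[n * M + m];
            int64_t mag = v < 0 ? -v : v;
            int64_t r;
            if (s > one) {
                /* Combined factor above 1: reduce it to N fractional bits, then multiply. */
                r = ((s + ROUND_CONST) >> FXP_VALUE) * mag;
            } else {
                /* Small combined factor: multiply first so its low bits are not lost. */
                r = (s * mag + ROUND_CONST) >> FXP_VALUE;
            }
            out[n * M + m] = (int32_t)(v < 0 ? -r : r);
        }
    }
}
```

With inverse scale factors 66 (about 0.001) and 26 (about 0.0004) and an accumulator value of 500, the small-factor branch returns 13, whereas shifting the scale-factor product first would return 0.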
ReLU
The C implementation of ReLU is straightforward
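A minimal in-place sketch over a flattened tensor (the function name and signature are illustrative):

```c
#include <stdint.h>

/* Replace every negative element with zero. Works on plain integers and
 * fixed-point values alike, since the sign does not depend on the scaling. */
void relu(int32_t *tensor, unsigned int size)
{
    for (unsigned int i = 0; i < size; i++) {
        if (tensor[i] < 0) {
            tensor[i] = 0;
        }
    }
}
```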
argmax
We calculate argmax over the columns of an NxM matrix (where N is the batch size and M is the number of labels).
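A minimal sketch for a flattened row-major matrix, assuming M ≥ 1 and returning the first index in case of ties (the function name and signature are illustrative):

```c
#include <stdint.h>

/*
 * mat:     N x M matrix (flattened, row-major), one row per sample
 * indices: N outputs, the column index of the largest value in each row
 */
void argmax_over_cols(const int32_t *mat, unsigned int *indices,
                      unsigned int N, unsigned int M)
{
    for (unsigned int n = 0; n < N; n++) {
        unsigned int best = 0;
        for (unsigned int m = 1; m < M; m++) {
            if (mat[n * M + m] > mat[n * M + best]) {
                best = m;
            }
        }
        indices[n] = best;
    }
}
```

Because softmax is monotonic, the index of the largest logit is also the index of the largest class probability, so no exponentials are needed at inference.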
Linear Layer
A linear layer is composed of all the previously defined parts:
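The composition can be sketched as follows. This is a compact, self-contained illustration and not the tutorial's actual code: the helper names, the signature, the caller-provided scratch buffer (used because dynamic allocation is unavailable), and the 64-bit intermediates are assumptions, and the overflow pre-check in Quantize is replaced by a clip after rounding.

```c
#include <stdint.h>

#define FXP_VALUE 16
#define QMAX 127

/* Fixed-point x times fixed-point scale, rounded to an integer in [-127, 127]. */
static int8_t quantize_value(int32_t x, int32_t scale)
{
    int64_t p = (int64_t)x * scale;  /* scaled by 2^(2N) */
    int64_t mag = p < 0 ? -p : p;
    int64_t q = (mag + ((int64_t)1 << (2 * FXP_VALUE - 1))) >> (2 * FXP_VALUE);
    if (q > QMAX) q = QMAX;
    return (int8_t)(p < 0 ? -q : q);
}

/* Integer accumulator times both inverse scale factors, as a fixed-point value. */
static int32_t dequantize_value(int32_t acc, int32_t sx_inv, int32_t sw_inv)
{
    const int64_t half = (int64_t)1 << (FXP_VALUE - 1);
    int64_t s = (int64_t)sx_inv * sw_inv;  /* scaled by 2^(2N) */
    int64_t mag = acc < 0 ? -(int64_t)acc : acc;
    int64_t r = (s > ((int64_t)1 << (2 * FXP_VALUE)))
                    ? ((s + half) >> FXP_VALUE) * mag
                    : (s * mag + half) >> FXP_VALUE;
    return (int32_t)(acc < 0 ? -r : r);
}

/*
 * x:   N x K fixed-point inputs     x_q: N x K int8 scratch buffer
 * w_q: M x K int8 weights           out: N x M fixed-point outputs
 * sx, sx_inv: fixed-point input scale factor and its inverse
 * sw_inv:     M fixed-point inverse weight scale factors (one per row)
 */
void linear_layer(const int32_t *x, const int8_t *w_q, int32_t *out, int8_t *x_q,
                  int32_t sx, int32_t sx_inv, const int32_t *sw_inv,
                  unsigned int N, unsigned int K, unsigned int M, int apply_relu)
{
    for (unsigned int i = 0; i < N * K; i++) {
        x_q[i] = quantize_value(x[i], sx);               /* Quantize */
    }
    for (unsigned int n = 0; n < N; n++) {
        for (unsigned int m = 0; m < M; m++) {
            int32_t acc = 0;                             /* int8 matrix multiplication */
            for (unsigned int k = 0; k < K; k++) {
                acc += (int32_t)x_q[n * K + k] * (int32_t)w_q[m * K + k];
            }
            int32_t y = dequantize_value(acc, sx_inv, sw_inv[m]);  /* Dequantize */
            out[n * M + m] = (apply_relu && y < 0) ? 0 : y;        /* optional ReLU */
        }
    }
}
```

The output of one layer is again in fixed-point, so it can be passed directly as the input x of the next layer.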
MLP — combining several linear layers together
For this tutorial, the model parameters and scale factors are coded in a header file, and the run_mlp function is called from Python using ctypes.
Model Evaluation
Evaluating integer-only C model on test data
Accuracy: 97.27%
which is 0.18 percentage points less than our float model in Python.
Conclusion
In this article, we discussed Post-Training Quantization, the fixed-point representation, and how to run integer-only inference in native C using a classifier trained with full precision in PyTorch.
For the complete code, including a convolutional neural network version, visit the Github repository: https://github.com/benja263/Integer-Only-Inference-for-Deep-Learning-in-Native-C
Thank you for Reading!
References
[1] Tessler, C., Shpigelman, Y., Dalal, G., Mandelbaum, A., Kazakov, D. H., Fuhrer, B., Chechik, G., & Mannor, S. (2021). Reinforcement Learning for Datacenter Congestion Control. http://arxiv.org/abs/2102.09337
[2] Wu, H., Judd, P., Zhang, X., Isaev, M., & Micikevicius, P. (2020). Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. http://arxiv.org/abs/2004.09602