Integer-Only Inference for Deep Learning in Native C

Tutorial: converting a deep neural network for deployment on low-latency, low-compute devices via uniform quantization and the fixed-point representation.

Benjamin Fuhrer
Towards Data Science

--

Integer-only inference allows for the compression of deep learning models for deployment on low-compute and low-latency devices. Many embedded devices are programmed using native C and do not support floating-point operations and dynamic allocation. Nevertheless, small deep learning models can be deployed to such devices with an integer-only inference pipeline through uniform quantization and the fixed-point representation.

We employed these methods to deploy a deep reinforcement learning (RL) model on a network interface card (NIC) (Tessler et al. 2021 [1]). Successfully deploying the RL model required an inference latency on the order of microseconds on a device with no floating-point support. The tight latency constraint ruled out running the model on a GPU. To achieve our goal, we used the Post-Training Quantization (PTQ) techniques specified by Wu et al. 2020 [2] with TensorRT's pytorch-quantization package.

In this tutorial, we will learn about quantization and PTQ techniques following Wu et al. 2020 [2]. In addition, we will learn about the fixed-point representation and walk through a detailed example of converting a PyTorch classifier trained on MNIST to native C for integer-only inference with minimal performance loss.

Table of Contents

· Quantization
· Post-Training Quantization
Calibration
· Integer Only Inference
The Fixed-Point Representation
Fixed-Point Arithmetic
· Tutorial — integer-only inference in native C for MNIST classification
Model Training and Quantization in Python
Model Inference in Native C-Code
Model Evaluation
· Conclusion

Quantization

Quantization is the process of mapping numerical values from a large set (often continuous) to a smaller set. In our case, we are interested in reducing a neural network’s size and speeding up computations such that it can fit and run inference on low-compute low-memory devices. We will focus on uniform integer quantization, such that we can perform matrix multiplications and convolutions in the integer domain. More specifically, we will quantize a network’s inputs and parameters from float32 to int8.

Uniform Quantization — in uniform quantization, an input x ∈ [β, α] is mapped to the range [-2^(b-1), 2^(b-1) - 1], where b is the number of bits. Values outside the input range are clipped to the nearest bound. In our case (b = 8), the transformed range is [-128, 127], and we will use [-127, 127] for symmetry.

Uniform quantization is done via two functions:

  1. QUANTIZE: scales float32 values to the int8 range, then rounds and clips (float32 → int8)
  2. DEQUANTIZE: maps int32 values back to float32 (int32 → float32)

Since we want to minimize the number of operations, we will implement these operations using scale quantization.

In scale quantization, the input range is x ∈ [-α, α], and the maximum α value (amax) is calibrated to maximize precision. Once α is calibrated, the mapping is performed by multiplying/dividing by a scale factor s.

The scale factor is s = (2^(b-1) - 1) / amax = 127 / amax, where the alpha max value (amax) is determined during calibration and b = 8 bits.

With scale quantization, our operations are defined as follows:

QUANTIZE: x_q = clip(round(s · x), -127, 127)
DEQUANTIZE: x ≈ x_q / s

Tensor Quantization Granularity

Quantization granularity refers to how quantization parameters (e.g., the scale factor) are shared across a tensor's elements. There is a trade-off between precision and computational cost: scaling each tensor element with its own scale factor gives the highest precision and the highest computational cost, while scaling the entire tensor with a single factor gives the lowest precision and the lowest computational cost.

Linear Layer Quantization

Linear layer as implemented in PyTorch: Y = XW^T + b, where X is the NxK input tensor, W is the MxK weight tensor, b is the M-length bias vector, and Y is the NxM output tensor.

We quantize layer inputs with a per-tensor granularity (all tensor elements share the same scale factor). In contrast, the layer's weights are quantized with a per-row granularity (each row has a different scale factor). This choice allows us to factor out the scale factors and perform the matrix multiplication entirely with integers.
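Written out (a sketch of the algebra; the notation is ours, with s_x the per-tensor input scale and s_{w,j} the per-row weight scale), the scale factors factor out of the integer sum:

```latex
% Quantized linear layer with the scale factors factored out of the integer sum
Y_{ij} \approx \frac{1}{s_x \, s_{w,j}} \sum_{k=1}^{K} X^{q}_{ik} W^{q}_{jk},
\qquad X^{q}_{ik} = \mathrm{round}\!\left(s_x X_{ik}\right),\quad
W^{q}_{jk} = \mathrm{round}\!\left(s_{w,j} W_{jk}\right)
```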

Bias quantization is easily achieved by concatenating a vector of ones to the input and adding another dimension to a layer’s weights.

Convolution Layer Quantization

As with the linear case, we can computationally benefit from factoring out the scale factors. This is done by choosing a per-tensor granularity for layer inputs and per-channel granularity for layer weights.

Following PyTorch's implementation of a 1D convolution with stride 1, dilation 1, and no padding, the input tensor X has dimensions (N, C_in, L) and the output tensor Y has dimensions (N, C_out, L'): Y[n, c_out, l] = Σ_{c_in} Σ_k W[c_out, c_in, k] · X[n, c_in, l + k].

Post-Training Quantization

In post-training quantization, we calibrate the scale factors for each model layer by constructing histograms of absolute values (one for the layer's inputs and one for its parameters). For the layer inputs, inference is run on a sample dataset to collect statistics, while the layer parameters can be calibrated offline.

Calibration

Calibration selects the α that minimizes the precision loss from quantization. There are three main methods for choosing the proper α per histogram:

  1. Max — calibrate using the maximum of the absolute value distribution
  2. Entropy — minimize KL divergence between the quantized distribution and the original distribution
  3. Percentile — select the bin corresponding to the k-th percentile of the absolute value distribution

Histogram of the absolute values of the inputs to the 3rd layer of ResNet-50 and calibrated ranges. Image taken from Wu et al. 2020 [2].

Once calibration is finished, we can construct a quantized linear layer for inference.

Integer Only Inference

Up until this point, we discussed quantization for performing linear operations such as convolutions and matrix multiplications with integers. Non-linear activations, on the other hand, are not suited for uniform quantization and must be approximated, e.g., with lookup tables or piecewise functions, methods that deserve an article of their own. We will not cover non-linear approximations here; instead, we use the ReLU activation function, which requires no approximation.

However, we are still left with float32 as an input/output to the Quantize/Dequantize operations. When we multiply/divide values with the scale factors during the Quantize/Dequantize operations, we cannot avoid dealing with fractions. To handle such operations with integers, we will use the fixed-point representation.

The Fixed-Point Representation

The fixed-point representation is a way to express fractions with integers. We split the K bits making up an integer into a part representing the integer portion of a number and a part representing its fractional portion. Using the sign-magnitude format, we reserve 1 bit for the sign. The radix point splits the remaining K-1 bits into M most significant bits (MSB) representing the integer part and N least significant bits (LSB) representing the fraction. The choice of M and N is a trade-off between representation range and precision.

Fixed-Point 16 representation with a 32-bit signed integer. The 16 LSB represent the fraction, and the 15 MSB represent the integer part.

Conversion between a float and its fixed-point representation is done by multiplying/dividing by 2 raised to the power of the fixed-point value (the number of fractional bits).

Conversion between a floating-point number and its fixed-point 16 representation.
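As a concrete sketch (function names are ours; the repository may differ), converting to and from fixed-point 16 in C looks like this. On the target device there are no floats, so these conversions happen offline or on the host:

```c
#include <stdint.h>
#include <math.h>

#define FXP_VALUE 16  /* number of fractional bits */

/* float -> fixed-point: multiply by 2^FXP_VALUE and round */
int32_t float_to_fxp(float x)
{
    return (int32_t)roundf(x * (float)(1 << FXP_VALUE));
}

/* fixed-point -> float: divide by 2^FXP_VALUE */
float fxp_to_float(int32_t x_fxp)
{
    return (float)x_fxp / (float)(1 << FXP_VALUE);
}
```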

Fixed-Point Arithmetic

Three rules of thumb for fixed-point arithmetic:

  1. The sum of two fixed-point numbers is a fixed-point number.
  2. The product of an integer with a fixed-point number is a fixed-point number.
  3. The product of two fixed-point numbers divided by two to the power of the fixed-point value is a fixed-point number.

The fixed-point representation lets us replace the remaining floating-point operations with integer ones.
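A quick numeric check of the three rules with FXP_VALUE = 16 (the values are chosen purely for illustration):

```c
#include <stdio.h>
#include <stdint.h>

#define FXP_VALUE 16

int main(void)
{
    int32_t a = 3 << (FXP_VALUE - 1);   /* 1.5  in fixed-point: 98304 */
    int32_t b = 1 << (FXP_VALUE - 2);   /* 0.25 in fixed-point: 16384 */

    int32_t sum      = a + b;                    /* rule 1: 114688 = 1.75  */
    int32_t prod_int = 3 * a;                    /* rule 2: 294912 = 4.5   */
    int32_t prod_fxp = (a * b) >> FXP_VALUE;     /* rule 3: 24576  = 0.375 */

    printf("%d %d %d\n", (int)sum, (int)prod_int, (int)prod_fxp);
    return 0;
}
```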

Schema of an integer-only linear layer with quantized layer inputs and weights

Tutorial — integer-only inference in native C for MNIST classification

We will train a simple classifier on the MNIST dataset in PyTorch. Next, we will quantize the network’s parameters to int8 and calibrate their scale factors. Finally, we will write an integer-only inference code in native C.

Model Training and Quantization in Python

The model — a Multilayer Perceptron (MLP) with two hidden layers (128 and 64 units, respectively) and ReLU activations, trained on the MNIST dataset. The MLP contains no biases, to minimize the number of parameters and the computational overhead.

Let's define a simple MLP model in PyTorch (the full code is in the linked repository) and train it for ten epochs:

Epoch: 1 — train loss: 0.35650 validation loss: 0.20097
Epoch: 2 — train loss: 0.14854 validation loss: 0.13693
Epoch: 3 — train loss: 0.10302 validation loss: 0.11963
Epoch: 4 — train loss: 0.07892 validation loss: 0.11841
Epoch: 5 — train loss: 0.06072 validation loss: 0.09850
Epoch: 6 — train loss: 0.04874 validation loss: 0.09466
Epoch: 7 — train loss: 0.04126 validation loss: 0.09458
Epoch: 8 — train loss: 0.03457 validation loss: 0.10938
Epoch: 9 — train loss: 0.02713 validation loss: 0.09077
Epoch: 10 — train loss: 0.02135 validation loss: 0.09448
Evaluate model on test data
Accuracy: 97.450%

Model Quantization — TensorRT is an SDK for high-performance deep-learning inference. It contains pytorch-quantization, which allows for straightforward quantization (fake, quantization-aware training, PTQ) of PyTorch models.

Let’s begin by importing the relevant modules.

  • quant_nn contains quantized versions of PyTorch layers such as nn.Linear.
  • calib is a calibration module for constructing Histograms and calculating scale factors.
  • quant_modules allows for the dynamic replacing of PyTorch layers with their quantized versions.
  • QuantDescriptor defines how to quantize a tensor.

The first step is to define a quantization descriptor that uses a histogram calibrator. We then set our quantized linear layer to use this quantization descriptor. Finally, we monkey-patch the PyTorch modules with their quantized versions by calling initialize().

We load the trained MLP model only after defining the quantization scheme and monkey patching PyTorch modules.

Let's define a function that feeds statistics to the calibrators during inference. First, we disable quantization so the histograms are built from the float values. We then run inference and re-enable quantization for the cases where we want to run inference with the quantized model.

Now let’s define a function to compute the amax values (scale factor = 127 / amax).

To minimize the number of operations at inference, we calculate the scale factor values offline. To avoid division in fixed-point, we also invert them offline and multiply by the inverted values in C.

It is possible to combine the Dequantize scale factors into a single variable. However, this may reduce accuracy, as the resulting number might be too small to represent in fixed-point 16 (see the Dequantize section below for an example).

Once we have defined our functions, we can run the code.

Model Inference in Native C-Code

Implementing an MLP model requires the definition of:

  1. Matrix-Multiplication
  2. Fixed-point Multiplication
  3. Quantize
  4. Dequantize
  5. ReLU
  6. argmax (replaces softmax for inference)
  7. Linear Layer

Since the neural network's architecture and parameters are pre-determined and we cannot use dynamic allocation, we will not define general structures for matrices and tensors. Instead, we will treat matrices/tensors as flattened 1D arrays and use their shapes to apply operations.

Matrix-multiplication

We perform matrix multiplication between two flattened int8 matrices. To avoid overflow, the result is stored in a larger integer type than int8.

Y = XW, where X is an NxK matrix and W is a KxM matrix, resulting in an NxM matrix Y. The multiply-and-accumulate result is stored in a larger integer type (int32) to avoid overflow.
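A minimal sketch of such a matrix multiplication (the function name and the flattened row-major layout are our assumptions):

```c
#include <stdint.h>

/* Y = XW for flattened row-major matrices:
   X is NxK (int8), W is KxM (int8), Y is NxM (int32).
   The multiply-accumulate is done in int32 to avoid int8 overflow. */
void mat_mult(const int8_t *X, const int8_t *W, int32_t *Y,
              const unsigned int N, const unsigned int K, const unsigned int M)
{
    for (unsigned int n = 0; n < N; n++) {
        for (unsigned int m = 0; m < M; m++) {
            int32_t acc = 0;
            for (unsigned int k = 0; k < K; k++)
                acc += (int32_t)X[n * K + k] * (int32_t)W[k * M + m];
            Y[n * M + m] = acc;
        }
    }
}
```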

Fixed-Point Multiplication

Simple multiplication of two fixed-point 16 numbers looks like this: a_fxp · b_fxp = (a · 2^16) · (b · 2^16) = (a · b) · 2^32. Therefore, we need to divide the result by 2 to the power of 16 to bring it back to fixed-point 16.

In C, the left shift operator can be seen as multiplication by a power of two and the right shift operator as division by a power of two. We can define a fixed-point multiplication function in C, where both inputs share the same fixed-point value, as follows:
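A naive version (a sketch; the function name is ours, and as discussed next it truncates and can overflow):

```c
#include <stdint.h>

#define FXP_VALUE 16

/* Naive fixed-point multiplication: the product of two numbers with
   FXP_VALUE fractional bits carries 2*FXP_VALUE fractional bits,
   so we shift right by FXP_VALUE. Truncates and may overflow. */
int32_t fxp_mult(int32_t a, int32_t b)
{
    return (a * b) >> FXP_VALUE;
}
```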

There are two major issues when multiplying fixed-point values:

  1. Rounding
  2. Overflow

Rounding — By default, multiplying/dividing using left and right shifts results in truncation (flooring). A simple rounding method is to round to the nearest integer via the round half up method.

Fixed-point multiplication with rounding looks like this:

Fixed-Point multiplication with the round half up method.
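A sketch with round-half-up (ROUND_CONST is the constant used later in the Quantize section):

```c
#define ROUND_CONST (1 << (FXP_VALUE - 1))  /* 0.5 in fixed-point */

/* Fixed-point multiplication with round-half-up instead of truncation. */
int32_t fxp_mult_round(int32_t a, int32_t b)
{
    return (a * b + ROUND_CONST) >> FXP_VALUE;
}
```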

Overflow — It is straightforward to see that the product of two large fixed-point numbers can easily overflow before right-shifting. A simple solution is to store the value in a larger type (such as int64), but in some cases, the fixed-point number is already represented using the largest available type. An alternative is to perform fixed-point multiplication by parts. This means we split the fixed-point number into an integer and a fractional component, under the assumption that the product of the integer parts will not overflow.

A fixed-point multiplication by parts implementation looks like this:
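One possible sketch that splits a single operand into its integer and fractional parts (names are ours; the repository's exact implementation may differ):

```c
/* Fixed-point multiplication "by parts": a is split so that
   a * b >> FXP = (a >> FXP) * b + ((a & frac_mask) * b >> FXP).
   Assumes the partial products fit in int32. */
int32_t fxp_mult_by_parts(int32_t a, int32_t b)
{
    const int32_t frac_mask = (1 << FXP_VALUE) - 1;
    int32_t a_int  = a >> FXP_VALUE;   /* integer part (floor) */
    int32_t a_frac = a & frac_mask;    /* fractional part      */
    return a_int * b + ((a_frac * b + ROUND_CONST) >> FXP_VALUE);
}
```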

Now that we covered fixed-point multiplication, we can code the Quantize function.

Quantize

The Quantize function multiplies two fixed-point numbers, the layer's input and the scale factor, and then clips the result to the int8 range [-127, 127]. We reduce the risk of overflow by comparing the input to the product of the inverted scale factor and the border of our clipping range (i.e., amax) before performing the fixed-point multiplication. Contrary to standard fixed-point multiplication, in Quantize we want the product to be a rounded integer, so we do not convert the output to fixed-point.

QUANTIZE function implementation with fixed-point. FXP_VALUE = 16, INT8_MAX_VALUE = 127, ROUND_CONST = (1 << (FXP_VALUE - 1)). Note: we do not shift right the final value as in the fixed-point multiplication implementation, since the fixed-point representation of (input * scale_factor) represents a rounded integer.
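Here is a sketch along those lines (function and parameter names are ours, it reuses fxp_mult_by_parts from above, and the reference implementation folds the rounding and final shift together differently, as the note above describes):

```c
#include <stdint.h>

#define FXP_VALUE       16
#define INT8_MAX_VALUE  127
#define ROUND_CONST     (1 << (FXP_VALUE - 1))

/* x_q = clip(round(x * scale_factor), -127, 127).
   x and scale_factor are fixed-point; amax_fxp = INT8_MAX_VALUE times the
   inverted scale factor (i.e. amax) in fixed-point, so clipping first also
   prevents overflow in the multiplication. */
int8_t quantize(int32_t x, int32_t scale_factor, int32_t amax_fxp)
{
    if (x >= amax_fxp)
        return INT8_MAX_VALUE;
    if (x <= -amax_fxp)
        return -INT8_MAX_VALUE;

    int32_t prod_fxp = fxp_mult_by_parts(x, scale_factor);  /* fixed-point product */
    return (int8_t)((prod_fxp + ROUND_CONST) >> FXP_VALUE); /* rounded integer     */
}
```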

Dequantize

The Dequantize operation requires multiplying an integer (the product of the int8 matrix multiplication) with the inverted scale factors of the layer's input (a scalar) and the layer's weights (a vector, as we quantized with per-row granularity). This means we need to mix regular and fixed-point multiplication. In general, multiplying an integer with a fixed-point number carries a high risk of overflow. On the other hand, when the fixed-point numbers represent small fractions (and they do in our case), multiplying them together can cause precision loss: their product, before right shifting, may be smaller than the value we are shifting by, so the output is erroneously zeroed.
To illustrate this point, in the example below, we multiply values of different sizes with two integers, 50 and 500, and compare multiplying by the integer before right shifting versus after.
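The comparison was produced with a small host-side program along these lines (a reconstruction sketch: a and b are converted to fixed-point 16 with truncation, the shifts are rounded with ROUND_CONST, and the last rows deliberately overflow int32, which is technically undefined behavior in C):

```c
#include <stdio.h>
#include <stdint.h>

#define FXP_VALUE 16
#define ROUND_CONST (1 << (FXP_VALUE - 1))

int main(void)
{
    const double a_vals[] = {0.001, 0.1};
    const double b_vals[] = {0.0004, 0.4};
    const int    ints[]   = {50, 500};

    for (int bi = 0; bi < 2; bi++)
        for (int ai = 0; ai < 2; ai++)
            for (int ii = 0; ii < 2; ii++) {
                /* convert a and b to fixed-point 16 (truncating) */
                int32_t a_fxp = (int32_t)(a_vals[ai] * (1 << FXP_VALUE));
                int32_t b_fxp = (int32_t)(b_vals[bi] * (1 << FXP_VALUE));
                int32_t n = ints[ii];

                /* multiply by the integer BEFORE the right shift
                   (the last rows overflow int32 on purpose) */
                int32_t c_before = (a_fxp * b_fxp * n + ROUND_CONST) >> FXP_VALUE;
                /* multiply by the integer AFTER the right shift */
                int32_t c_after = ((a_fxp * b_fxp + ROUND_CONST) >> FXP_VALUE) * n;

                printf("a: %f b: %f int: %d, c_before: %d c_after: %d\n",
                       a_vals[ai], b_vals[bi], (int)n, (int)c_before, (int)c_after);
            }
    return 0;
}
```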

a: 0.001000 b: 0.000400 int: 50, c_before: 1 c_after: 0
a: 0.001000 b: 0.000400 int: 500, c_before: 13 c_after: 0
a: 0.100000 b: 0.000400 int: 50, c_before: 130 c_after: 150
a: 0.100000 b: 0.000400 int: 500, c_before: 1300 c_after: 1500
a: 0.001000 b: 0.400000 int: 50, c_before: 1300 c_after: 1300
a: 0.001000 b: 0.400000 int: 500, c_before: 13000 c_after: 13000
a: 0.100000 b: 0.400000 int: 50, c_before: -14 c_after: 131050
a: 0.100000 b: 0.400000 int: 500, c_before: -140 c_after: 1310500

We can see that for small a and b values, we benefit from multiplying before right shifting, but we overflow when their values are larger. On the other hand, we get slightly different results when we multiply after (precision loss). Finally, only when a=0.001 and b=0.4 do both methods output the same result.

A simple solution is to calculate the product of the scale factors and compare it to 1 (in fixed-point). When it is larger than one, we risk overflow, so we multiply the matrix values after the fixed-point multiplication. Otherwise, we multiply the matrix values within the fixed-point multiplication.

Finally, we define Dequantize as:
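A sketch of DEQUANTIZE for one output row, following that rule (names are ours; scale_x_inv and scale_w_inv are the inverted scale factors in fixed-point, fxp_mult_by_parts is the function sketched earlier, and the intermediate products are assumed to fit in int32):

```c
#define ONE_FXP (1 << FXP_VALUE)  /* 1.0 in fixed-point */

/* y_fxp[m] = y_q[m] * (1/s_x) * (1/s_w[m]), returned in fixed-point. */
void dequantize_row(const int32_t *y_q, int32_t *y_fxp,
                    int32_t scale_x_inv, const int32_t *scale_w_inv,
                    const unsigned int M)
{
    for (unsigned int m = 0; m < M; m++) {
        int32_t scale_prod = fxp_mult_by_parts(scale_x_inv, scale_w_inv[m]);
        if (scale_prod > ONE_FXP)
            /* large combined scale: multiply the matrix value after the
               fixed-point multiplication to reduce the risk of overflow */
            y_fxp[m] = y_q[m] * scale_prod;
        else
            /* small fractions: fold the matrix value into the fixed-point
               multiplication to avoid zeroing the result */
            y_fxp[m] = fxp_mult_by_parts(y_q[m] * scale_x_inv, scale_w_inv[m]);
    }
}
```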

ReLU

The C implementation of ReLU is straightforward
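For example (a sketch operating in place on a flattened fixed-point array; the name is ours):

```c
#include <stdint.h>

/* In-place ReLU on a flattened tensor: x = max(x, 0) */
void relu(int32_t *x, const unsigned int size)
{
    for (unsigned int i = 0; i < size; i++)
        if (x[i] < 0)
            x[i] = 0;
}
```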

argmax

We calculate argmax over the columns of an NxM matrix (where N is the batch size and M is the number of labels).
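A sketch of a row-wise argmax over the flattened NxM output (names are ours):

```c
#include <stdint.h>

/* For each of the N rows, write the index of its maximum element
   (the predicted label) into out. */
void argmax_rows(const int32_t *x, unsigned int *out,
                 const unsigned int N, const unsigned int M)
{
    for (unsigned int n = 0; n < N; n++) {
        unsigned int best = 0;
        for (unsigned int m = 1; m < M; m++)
            if (x[n * M + m] > x[n * M + best])
                best = m;
        out[n] = best;
    }
}
```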

Linear Layer

A linear layer is composed of all the previously defined parts:
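A sketch that composes the pieces defined above (quantize, mat_mult, dequantize_row); the caller pre-allocates the x_q and y_q scratch buffers since there is no dynamic allocation, and all names are ours:

```c
/* One integer-only linear layer: quantize the fixed-point input to int8,
   run the int8 matrix multiplication, then dequantize the int32 result
   back to fixed-point. */
void linear_layer(const int32_t *x_fxp, const int8_t *w_q,
                  int8_t *x_q, int32_t *y_q, int32_t *y_fxp,
                  int32_t x_scale, int32_t x_amax_fxp,
                  int32_t x_scale_inv, const int32_t *w_scale_inv,
                  const unsigned int N, const unsigned int K, const unsigned int M)
{
    for (unsigned int i = 0; i < N * K; i++)
        x_q[i] = quantize(x_fxp[i], x_scale, x_amax_fxp);

    mat_mult(x_q, w_q, y_q, N, K, M);

    for (unsigned int n = 0; n < N; n++)
        dequantize_row(&y_q[n * M], &y_fxp[n * M],
                       x_scale_inv, w_scale_inv, M);
}
```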

MLP — combining several linear layers together
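A sketch of how the layers could be chained for the 784-128-64-10 MLP; the weight arrays, scale-factor constants, and dimension macros are assumed to come from the generated header, and every name here is illustrative rather than the repository's:

```c
#define INPUT_DIM 784
#define H1_DIM    128
#define H2_DIM    64
#define OUT_DIM   10
#define BATCH_MAX 64

/* Hypothetical arrays/constants provided by the generated header. */
extern const int8_t  W1_Q[INPUT_DIM * H1_DIM], W2_Q[H1_DIM * H2_DIM], W3_Q[H2_DIM * OUT_DIM];
extern const int32_t X1_SCALE, X1_AMAX, X1_SCALE_INV, W1_SCALE_INV[H1_DIM];
extern const int32_t X2_SCALE, X2_AMAX, X2_SCALE_INV, W2_SCALE_INV[H2_DIM];
extern const int32_t X3_SCALE, X3_AMAX, X3_SCALE_INV, W3_SCALE_INV[OUT_DIM];

void run_mlp(const int32_t *x_fxp, const unsigned int N, unsigned int *labels)
{
    /* statically sized scratch buffers (no dynamic allocation) */
    static int8_t  xq[BATCH_MAX * INPUT_DIM];
    static int32_t yq[BATCH_MAX * H1_DIM];
    static int32_t h1[BATCH_MAX * H1_DIM], h2[BATCH_MAX * H2_DIM], out[BATCH_MAX * OUT_DIM];

    linear_layer(x_fxp, W1_Q, xq, yq, h1, X1_SCALE, X1_AMAX, X1_SCALE_INV, W1_SCALE_INV, N, INPUT_DIM, H1_DIM);
    relu(h1, N * H1_DIM);
    linear_layer(h1, W2_Q, xq, yq, h2, X2_SCALE, X2_AMAX, X2_SCALE_INV, W2_SCALE_INV, N, H1_DIM, H2_DIM);
    relu(h2, N * H2_DIM);
    linear_layer(h2, W3_Q, xq, yq, out, X3_SCALE, X3_AMAX, X3_SCALE_INV, W3_SCALE_INV, N, H2_DIM, OUT_DIM);
    argmax_rows(out, labels, N, OUT_DIM);
}
```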

For this tutorial, the model parameters and scale factors are coded in a header file, and the run_mlp function is called from Python using ctypes.

Model Evaluation

Evaluating integer-only C model on test data
Accuracy: 97.27%

which is 0.18% lower than our float32 model in Python.

Conclusion

In this article, we discussed Post-Training Quantization, the fixed-point representation, and how to run integer-only inference in native C using a classifier trained with full precision in PyTorch.

For the complete code, including a convolutional neural network version, visit the Github repository: https://github.com/benja263/Integer-Only-Inference-for-Deep-Learning-in-Native-C

Thank you for Reading!

References

[1] Tessler, C., Shpigelman, Y., Dalal, G., Mandelbaum, A., Kazakov, D. H., Fuhrer, B., Chechik, G., & Mannor, S. (2021). Reinforcement Learning for Datacenter Congestion Control. http://arxiv.org/abs/2102.09337

[2] Wu, H., Judd, P., Zhang, X., Isaev, M., & Micikevicius, P. (2020). Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. http://arxiv.org/abs/2004.09602
