Introduction to OpenVINO

Dhairya Kumar
Towards Data Science
7 min read · Nov 28, 2019


A comprehensive guide to understanding how the OpenVINO toolkit works

OpenVINO stands for Open Visual Inference and Neural Network Optimization. It is a toolkit provided by Intel to facilitate faster inference of deep learning models. It helps developers create cost-effective and robust computer vision applications. It enables deep learning inference at the edge and supports heterogeneous execution across computer vision accelerators — CPU, GPU, Intel® Movidius™ Neural Compute Stick, and FPGA. It supports a large number of deep learning models out of the box. You can check out this link to learn more about the model zoo.

Prerequisites

If you want to run the code sample provided at the end of this article, then make sure that you have properly downloaded and configured the OpenVINO toolkit.

Overview

The execution process is as follows —

  • We feed a pre-trained model to the Model Optimizer. It optimizes the model and converts it into its Intermediate Representation (.xml and .bin files).
  • The Inference Engine helps in the proper execution of the model on different devices. It manages the libraries required to run the code properly on different platforms.

The two main components of the OpenVINO toolkit are the Model Optimizer and the Inference Engine. We will dive into their details to better understand their roles and inner workings.

Model Optimizer

The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environments. It adjusts deep learning models for optimal execution on end-point target devices.

Working

Model Optimizer loads a model into memory, reads it, builds the internal representation of the model, optimizes it, and produces the Intermediate Representation. Intermediate Representation is the only format that the Inference Engine accepts and understands.

The Model Optimizer does not infer models. It is an offline tool that runs before the inference takes place.

Model Optimizer has two main purposes:

  • Produce a valid Intermediate Representation. The primary responsibility of the Model Optimizer is to produce two files (.xml and .bin) that form the Intermediate Representation.
  • Produce an optimized Intermediate Representation. Pretrained models contain layers that are important for training, such as the Dropout layer. These layers are useless during inference and might increase the inference time. In many cases, these layers can be automatically removed from the resulting Intermediate Representation. However, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with only one. The result is an Intermediate Representation that has fewer layers than the original model. This decreases the inference time.

Operations

1. Reshaping

  • The Model Optimizer allows us to reshape our input images. If you trained your model with an image size of 256 * 256 but want to run inference on 100 * 100 images, you can simply pass the new input shape as a command-line argument and the Model Optimizer will handle the rest (a command-line sketch covering these options follows this list).

2. Batching

  • We can change the batch size of our model at inference time. We just pass the new batch size as a command-line argument.
  • We can also pass the full input shape, e.g. [4,3,100,100]. Here we are specifying a batch of 4 RGB images, each with 3 channels and a height and width of 100 (NCHW layout). The important thing to note here is that each inference request will now take longer, since it processes a batch of 4 images rather than a single image.

3. Modifying the Network Structure

  • We can modify the structure of our network, i.e. remove layers from the top or the bottom. We can specify a particular layer where we want execution to begin or end.

4. Standardizing and Scaling

  • We can perform operations such as normalization (mean subtraction) and standardization (scaling) on our input data.
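
As a rough illustration of how these options fit together, here is a minimal Python sketch that builds a Model Optimizer command line and runs it with subprocess. The flag names (--input_model, --input_shape, --mean_values, --scale_values, --data_type, --input, --output) follow the Model Optimizer documentation, but the installation path, model file and concrete values are placeholders you would adapt to your own setup.

import subprocess

# Placeholder paths -- adjust to your own model and OpenVINO installation.
MO = "/opt/intel/openvino/deployment_tools/model_optimizer/mo.py"
MODEL = "frozen_inference_graph.pb"

cmd = [
    "python3", MO,
    "--input_model", MODEL,                    # pre-trained model to convert
    "--input_shape", "[4,3,100,100]",          # reshaping + batching: 4 RGB images of 100*100 (NCHW)
    "--mean_values", "[127.5,127.5,127.5]",    # mean subtraction baked into the IR
    "--scale_values", "[127.5,127.5,127.5]",   # scaling / standardization
    "--data_type", "FP16",                     # store weights in FP16 instead of FP32
    "--output_dir", "ir_model",                # where the .xml and .bin files are written
    # "--input", "conv1", "--output", "pool5", # optionally cut the network (layer names are hypothetical)
]
subprocess.run(cmd, check=True)

Running a command like this produces the .xml/.bin pair that the Inference Engine consumes.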

Quantization

It is an important step in the optimization process. Most deep learning models use the FP32 format for their data. FP32 consumes a lot of memory and hence increases the inference time. Intuitively, we might think that we can reduce inference time by switching to a lower-precision format. Formats such as FP16 and INT8 are available, but we need to be careful while performing quantization, as it can also result in a loss of accuracy.
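
To make the memory argument concrete, here is a small NumPy sketch comparing the footprint of the same tensor at different precisions (the 1*3*224*224 shape is just an illustrative input size):

import numpy as np

x = np.random.rand(1, 3, 224, 224)       # an illustrative input tensor
print(x.astype(np.float32).nbytes)       # 602112 bytes in FP32
print(x.astype(np.float16).nbytes)       # 301056 bytes in FP16 (half the memory)
print(x.astype(np.int8).nbytes)          # 150528 bytes in INT8 (a quarter of the memory)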

Using the INT8 format can reduce our inference time significantly, but currently only certain layers are compatible with it: Convolution, ReLU, Pooling, Eltwise and Concat. So we essentially perform hybrid execution, where some layers use the FP32 format and others use INT8. A separate layer handles these conversions, i.e. we don’t have to explicitly specify the type conversion from one layer to another.

The Calibrate layer handles all these intricate type conversions. It works as follows —

  • Initially, we need to define a threshold value. It determines the drop in accuracy we are willing to accept.
  • The Calibrate layer then takes a subset of the data and tries to convert the data format of the supported layers from FP32 to INT8 or FP16.
  • It then checks the accuracy drop, and if it is less than the specified threshold value, the conversion takes place (see the sketch after this list).
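
The snippet below is a minimal sketch of that accept-or-reject logic, not the actual OpenVINO calibration tool, whose API is different. The helpers evaluate_fn and quantize_fn are hypothetical callables that measure accuracy and convert the supported layers, respectively.

# Illustrative logic only -- not the real OpenVINO calibration tool.
def calibrate(model, calibration_data, evaluate_fn, quantize_fn, max_drop=0.01):
    baseline_acc = evaluate_fn(model, calibration_data)     # accuracy of the FP32 model
    int8_model = quantize_fn(model)                         # convert supported layers to INT8
    int8_acc = evaluate_fn(int8_model, calibration_data)    # accuracy after quantization
    if baseline_acc - int8_acc <= max_drop:                 # drop within the accepted threshold?
        return int8_model                                   # keep the quantized model
    return model                                            # otherwise fall back to FP32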

Inference Engine

After using the Model Optimizer to create an intermediate representation (IR), we use the Inference Engine to infer input data.

The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices.

The heterogeneous execution of the model is possible because of the Inference Engine. It uses different plug-ins for different devices.
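
As a minimal sketch (assuming the openvino.inference_engine Python API of the 2019/2020-era releases; newer versions expose a different API), loading and running an IR looks roughly like this, where model.xml and model.bin are placeholder names:

from openvino.inference_engine import IECore, IENetwork
import numpy as np

ie = IECore()
net = IENetwork(model="model.xml", weights="model.bin")     # the IR produced by the Model Optimizer
input_blob = next(iter(net.inputs))                         # name of the first input layer
n, c, h, w = net.inputs[input_blob].shape                   # e.g. (1, 3, H, W)

exec_net = ie.load_network(network=net, device_name="CPU")  # the device name selects the plug-in
frame = np.zeros((n, c, h, w), dtype=np.float32)            # placeholder input batch
result = exec_net.infer(inputs={input_blob: frame})         # synchronous inference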

Heterogeneous Plug-in

  • We can execute the same program on multiple devices. We just need to pass the target device as a command-line argument and the Inference Engine will take care of the rest, i.e. we can run the same piece of code on a CPU, GPU, VPU or any other device compatible with the OpenVINO toolkit.
  • We can also execute parts of our program on different devices, i.e. some parts might run on a CPU while others run on an FPGA or a GPU. If we specify HETERO:FPGA,CPU, the code will run primarily on the FPGA, but whenever it encounters an operation that is not supported on the FPGA it will fall back to the CPU (see the sketch after this list).
  • We can also execute certain layers on a specific device. Suppose you want to run only the Convolution layers on your GPU, then you can explicitly specify that.
  • The important thing to note here is that we need to be careful about data formats when targeting different hardware, as not all devices work with all data types. For example, the Neural Compute Stick 2 (NCS2), which comes with a Movidius chip, doesn’t support the INT8 format. You can check out this link for complete information about the supported devices and their respective formats.
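
A sketch of how this looks with the same Python API as in the previous snippet; the per-layer affinity attributes belong to the older API and may differ between releases, so treat this as illustrative rather than definitive.

from openvino.inference_engine import IECore, IENetwork

ie = IECore()
net = IENetwork(model="model.xml", weights="model.bin")

# Same code, different target: the FPGA is tried first and the CPU is used
# as a fallback for any layer the FPGA plug-in does not support.
exec_net = ie.load_network(network=net, device_name="HETERO:FPGA,CPU")

# Per-layer placement: query_network reports which device the HETERO plug-in
# would assign to each layer, and the affinity attribute overrides it.
assignments = ie.query_network(network=net, device_name="HETERO:GPU,CPU")
for layer_name, device in assignments.items():
    net.layers[layer_name].affinity = device   # e.g. force "GPU" here for the Convolution layers
exec_net = ie.load_network(network=net, device_name="HETERO:GPU,CPU")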

Code Sample

The original code sample provided by Intel can be found here.
I modified the code sample to make it simpler and my version can be found here.

I will only explain the OpenVINO specific code here.

# Initialize the class
infer_network = Network()

# Load the network to IE plugin to get shape of input layer
n, c, h, w = infer_network.load_model(args.model, args.device, 1, 1, 2,
                                      args.cpu_extension)[1]

We are initializing the Network class and loading the model using the load_model function.
The load_model function returns the plugin along with the input shape.
We only need the input shape, which is why we index the return value with [1] after the function call.

infer_network.exec_net(next_request_id, in_frame)

The exec_net function will start an asynchronous inference request.
We need to pass in the request id and the input frame.

res = infer_network.get_output(cur_request_id)
for obj in res[0][0]:
    if obj[2] > args.prob_threshold:
        xmin = int(obj[3] * initial_w)
        ymin = int(obj[4] * initial_h)
        xmax = int(obj[5] * initial_w)
        ymax = int(obj[6] * initial_h)
        class_id = int(obj[1])

This is the most important part of the code.
The get_output function will give us the model’s result.
Each detection is represented in the following format —
[image_id,class_label,confidence,x_min,y_min,x_max,y_max]

Here, we have extracted the bounding box coordinates and the class id.
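
For completeness, a typical way to visualise one such detection with OpenCV would look like the lines below. The frame variable and the coordinates come from the surrounding sample, so this is only an illustrative continuation, not part of the original code.

import cv2

# Draw the detection on the frame and label it with the class id.
cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
cv2.putText(frame, str(class_id), (xmin, ymin - 5),
            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)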

And with that, we have come to the end of this article. Thanks a ton for reading it.

My LinkedIn, Twitter and Github.
You can check out my website to know more about me and my work.
