From cloud computing to fog computing
While machine learning inference models are already transforming computing as we know it, the hard truth is that training them on multiple, gigantic datasets still takes a ton of processing power. Using a trained model to predict an event requires much less, but still more than our pocket devices can currently handle. Processing an image with a deep neural network involves billions of operations: huge multi-dimensional matrices being multiplied, inverted, reshaped, and Fourier transformed.
Given ML’s potential, it’s no surprise that more than 60 companies, including Intel, Nvidia, and Google, are designing chips that can execute these operations far more quickly. With this technology, it takes even a Raspberry Pi only a few milliseconds to run an input through a deep neural network and produce an inference (essentially, a prediction). The opportunities for these chips are manifold, and they are increasingly being deployed in devices like cameras and drones. You can find a Myriad chip from Intel Movidius in your DJI drone, for example, or soon in the European Space Agency’s satellites, where it will do in-orbit image processing.
Like many of you, I am passionate about machine learning and eager to test the technologies powering the fourth industrial revolution. So when Google publicly released two new products that accelerate inference, I jumped on the bandwagon. I was particularly eager to understand what it takes to run a deep neural network using these devices.
In what follows, I’ll walk you through my attempt to create a deep neural network in Tensorflow, port the model to a Raspberry Pi, and run inference on an external accelerator to increase performance. I ran this experiment with chips from the two leading players: Intel and Google.
My goal here is simply to evaluate both environments and to understand the depth of knowledge required to use these devices. It is not to squeeze out the best possible performance in terms of models, losses, metrics, or architecture; one can easily imagine improving many aspects of what I did.
All code snippets from the project are available in my GitHub repository.
Let’s dig in!
Setup and goal
The plan is to design a fresh convolutional neural network with Tensorflow. As these chips work faster with integers or half-precision floating points, we will build a model that runs on a low-precision data type. To do this, we will perform quantization-aware training. Then, we will freeze the graph and compile it into the different representations the two accelerators support. Finally, we’ll try our model on the edge.
You may be asking: why build a new model when a glut of excellent pre-trained models already exist?
Good question. There are many available models pre-trained on huge datasets, some of which are even compiled for the device we want to use. You can get amazing results in a few hours doing object detection with a MobileNetV2 SSD.
Our goal, however, isn’t to drive a racing car but to understand the challenges involved in building one: the nitty-gritty of what "inference on the edge" actually means.
For this particular project, I used chips from Intel and Google. In front of me, from left to right, we have the Intel Neural Compute Stick 2, the Google Coral Accelerator and the Google Coral Board.

The Coral Accelerator and the Intel Neural Compute Stick 2 will each be connected to a Raspberry Pi 3B. Both Raspberry Pis are running a freshly deployed Raspbian (Raspbian GNU/Linux 9 stretch). The Coral Dev Board runs Mendel out of the box (Mendel GNU/Linux 2 Beaker). The code I include in this article also runs on the Coral Dev Board, but here I’ll focus on the USB keys.
Build and quantize a Tensorflow model with CIFAR10
CIFAR10 is a public dataset that can be downloaded here. It’s a set of 60,000 pictures (32×32 pixels), each belonging to one of 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck.
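As a quick, hedged illustration (assuming TensorFlow 1.x with the bundled Keras datasets helper; the notebook handles the train/validation split a bit differently), loading and inspecting the dataset looks roughly like this:

```python
import numpy as np
import tensorflow as tf

# Pull CIFAR-10 through the Keras datasets helper (downloads ~170 MB on first use).
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.cifar10.load_data()

# Scale the 32x32 RGB images to [0, 1] and flatten the label arrays.
train_x = train_x.astype(np.float32) / 255.0
test_x = test_x.astype(np.float32) / 255.0
train_y, test_y = train_y.flatten(), test_y.flatten()

print(train_x.shape, test_x.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)
```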

The aim is to build a deep neural network that can predict what’s in the pictures. As you can see with the above examples, doing this is not easy.
There are different ways to build a deep neural network in Tensorflow. The fastest is certainly to use an off-the-shelf Keras model, or to build your own. I did both: first, I instantiated a MobileNetV2 network, then I built a convnet with 3 modules. I used post-training quantization on both models, and both performed very well. Unfortunately, post-training quantization is not currently supported on the Google edge-tpu. It is worth mentioning that I did successfully run the quantized convnet on the Intel NCS2.
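For reference, here is a minimal sketch of that first path: instantiating a Keras MobileNetV2 and applying post-training quantization with the TF 1.x converter. File names are illustrative, and newer TensorFlow versions replace the boolean flag with converter.optimizations = [tf.lite.Optimize.DEFAULT].

```python
import tensorflow as tf

# A Keras MobileNetV2 sized for CIFAR-10 (32x32x3 inputs, 10 classes).
# weights=None, since the ImageNet weights expect larger inputs.
model = tf.keras.applications.MobileNetV2(
    input_shape=(32, 32, 3), weights=None, classes=10)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# ... train with model.fit(train_x, train_y, ...) and save the trained model.
model.save('mobilenetv2_cifar10.h5')

# Post-training quantization with the TF 1.x TFLite converter.
converter = tf.lite.TFLiteConverter.from_keras_model_file('mobilenetv2_cifar10.h5')
converter.post_training_quantize = True
with open('mobilenetv2_cifar10.tflite', 'wb') as f:
    f.write(converter.convert())
```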
This preliminary recon shows how sensitive these devices are to the type of data they process. This is certainly a domain the players will invest in heavily in the coming years. In any case, since post-training quantization is not supported on the edge-tpu, we need to build a quantized deep learning network ourselves. As Keras does not support this, we have two options: use the low-level Tensorflow API, or build an Estimator. Let’s try to save some time by building an Estimator with Tensorflow and doing quantization-aware training.
Our first layers will be composed of two convnet modules. Then, we’ll process the tensors in a dense layer of 10 units to cover our 10 categories. An Estimator has different methods for training, evaluating, and predicting, and the training and evaluation methods each build their own graph. If you go step by step through the notebook, you’ll notice the function cnn_model_fn. This is where we add the layers and do the plumbing.
This is the TensorBoard representation of the graph we built. As you can see, the network is pretty small compared to the pre-trained models. We can even print it on a page!

In the cnn_model_fn function, two lines were added for quantization. The calls tf.contrib.quantize.create_training_graph() and tf.contrib.quantize.create_eval_graph() respectively create the quantization-aware training graph and the evaluation graph (the one we ultimately want to export). These functions add a lot of nodes to the graph to simulate the change in data type and help produce the quantized graph.
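Here is a minimal sketch of what such a model function can look like. The layer sizes are illustrative and the exact plumbing lives in the notebook, but it shows where the two quantization calls sit:

```python
import tensorflow as tf

def cnn_model_fn(features, labels, mode):
    # Two convnet modules: conv + ReLU6 + max-pool. ReLU6 keeps activations
    # in a known [0, 6] range, which plays nicely with quantization.
    with tf.variable_scope('model_input'):
        net = tf.identity(features['x'], name='input')
    for filters in (32, 64):
        net = tf.layers.conv2d(net, filters, 3, padding='same', activation=tf.nn.relu6)
        net = tf.layers.max_pooling2d(net, 2, 2)
    net = tf.layers.flatten(net)
    logits = tf.layers.dense(net, 10)                       # one unit per category
    predictions = tf.nn.softmax(logits, name='softmax_tensor')

    if mode == tf.estimator.ModeKeys.PREDICT:
        # Rewrite the inference graph with fake-quantization nodes.
        tf.contrib.quantize.create_eval_graph()
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        # Simulate low-precision arithmetic during training (optionally delayed
        # via the quant_delay argument).
        tf.contrib.quantize.create_training_graph(quant_delay=0)
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    tf.contrib.quantize.create_eval_graph()
    return tf.estimator.EstimatorSpec(mode, loss=loss)
```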
After defining the model function, we train our Estimator on the CIFAR dataset using the AdamOptimizer and a softmax cross-entropy loss. Interestingly, the contrib function that creates the training graph lets you start training with a high-precision data type and only simulate the low-precision data type later in the run.
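Assuming the cnn_model_fn sketched above and the NumPy arrays from the data-loading snippet, training the Estimator boils down to something like this (batch size and step count are illustrative):

```python
import tensorflow as tf

# Wrap the model function in an Estimator; checkpoints and summaries land in
# model_dir, so TensorBoard can simply be pointed at it:
#   tensorboard --logdir ./cifar10_model
classifier = tf.estimator.Estimator(model_fn=cnn_model_fn, model_dir='./cifar10_model')

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': train_x}, y=train_y.astype('int32'),
    batch_size=128, num_epochs=None, shuffle=True)

classifier.train(input_fn=train_input_fn, steps=20000)
```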
Here is an example of what you should see in Tensorboard for the loss.

Once the model is trained, we’ll run an evaluation on the 5,000 examples we kept aside to validate the model. Then we’ll ask the model to predict a random sample to roughly assess its predictive accuracy.
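A rough sketch of that evaluation and spot-check step, assuming val_x and val_y hold the held-out examples:

```python
import numpy as np
import tensorflow as tf

# Evaluate on the validation split (the Estimator builds a fresh eval graph here).
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': val_x}, y=val_y.astype('int32'), num_epochs=1, shuffle=False)
print(classifier.evaluate(input_fn=eval_input_fn))

# Predict a small random sample and compare against the ground-truth labels.
idx = np.random.choice(len(val_x), 10, replace=False)
pred_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': val_x[idx]}, num_epochs=1, shuffle=False)
for i, probs in zip(idx, classifier.predict(input_fn=pred_input_fn)):
    print('true:', val_y[i], 'predicted:', int(np.argmax(probs)))
```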

The graph we need to export was created during the evaluation phase. Estimators make it easy to create a saved_model. The classifier.experimental_export_all method is worth spending some time on, as it lets you select the tag map you want to export.
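As a hedged sketch (the exact export method name varies slightly across TF 1.x releases), exporting a saved_model can look like this:

```python
import tensorflow as tf

def serving_input_receiver_fn():
    # A placeholder matching the model input: a batch of 32x32x3 images.
    inputs = {'x': tf.placeholder(tf.float32, [None, 32, 32, 3], name='serving_input')}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

# Export a saved_model per mode; here only the PREDICT graph is mapped, which in
# the model_fn sketched above already carries the fake-quantization nodes.
export_dir = classifier.experimental_export_all_saved_models(
    export_dir_base='./export',
    input_receiver_fn_map={tf.estimator.ModeKeys.PREDICT: serving_input_receiver_fn})
print(export_dir)
```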
Once you export your saved_model, you have to freeze it by transforming the variables of your latest checkpoint into constants (like the weights of the network). Check out this jupyter notebook for loading, checking, and freezing your saved_model.
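The freezing step itself is roughly this: a sketch assuming the export directory from the previous step and the softmax_tensor output name used by the toco command below.

```python
import tensorflow as tf

# Load the exported saved_model and replace the checkpoint variables with
# constants, so that a single frozen.pb can be fed to the converters.
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], export_dir)
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), output_node_names=['softmax_tensor'])

with tf.gfile.GFile('frozen.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())
```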
Test #1: Inference with the Google Accelerator
Google announced the Coral Accelerator and the Dev Board on March 26, 2019. Resources for them are relatively limited right now, but Google is adding more on a daily basis.
Deploying the environment for the Google Accelerator is simple. Just follow Google’s getting started tutorial and, in 10 minutes, you’re good to go. I highly recommend using a virtual environment. Once everything is installed, you can play with both detection and classification models.

Now that we have our environment, let’s have our model make an inference. First, we need to compile our saved_model into a tflite model. To do this, we use TOCO, Google’s converter.
If you follow the steps of the notebook until the end, you’ll just have to run:
toco --graph_def_file=frozen.pb --output_file=tflite_model.tflite --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE --inference_type=QUANTIZED_UINT8 --input_shape="1,32,32,3" --input_array=model_input/input --output_array=softmax_tensor --std_dev_values=127 --mean_value=127 --default_ranges_min=0 --default_ranges_max=6
Besides the names of the input and output tensors, you can see that it’s important to specify the standard deviation, the mean, and the default ranges for the ReLU. This is why we decided to use ReLU6: its activations are bounded between 0 and 6, which gives us known default ranges.
Just one more step before moving the assets to the Raspberry: convert the tflite model into an edgetpu.tflite model. There is an online tool for that. You should see something like:

If you don’t, it most probably means that your model doesn’t comply with the compiler’s requirements.
Now let’s write the code to make the inference. The edge-tpu API is really simple to use: you use either the ClassificationEngine or the DetectionEngine. Both extend the BasicEngine and provide simple methods for loading the model and running inference. The python file I used to run our model can be found here.
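For illustration, a minimal classification script along these lines might look like the following (the model and image file names are hypothetical; the real script is in the repository):

```python
from edgetpu.classification.engine import ClassificationEngine
from PIL import Image

CIFAR10_LABELS = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                  'dog', 'frog', 'horse', 'ship', 'truck']

# ClassificationEngine extends BasicEngine: it loads the compiled model and
# opens the Edge TPU device for us.
engine = ClassificationEngine('model_edgetpu.tflite')

# Classify one CIFAR-10 picture; the engine resizes it to the model's input size.
img = Image.open('cifar_sample.png').convert('RGB')
for label_id, score in engine.ClassifyWithImage(img, top_k=3):
    print('%s: %.2f' % (CIFAR10_LABELS[label_id], score))
```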
Once the python file and the resources (i.e. the edge-tpu model and at least one picture from CIFAR10) are uploaded, you can run the model and verify that it works. You should get something like:

Congratulations! You have successfully built and run your quantized model on the Raspberry Pi using the Google Accelerator.
It’s worth noting that the Google edge-tpu API allows for training on the device with the Accelerator. This is an amazing feature if you want to do online learning, for example.
Test #2: Inference with the NCS
Movidius announced the Neural Compute Stick back in July 2017, and Intel is currently a few laps ahead of Google: the technology supports more environments and has several APIs to work with.
I started playing with the Intel Movidius NCSDK. The NCSDK includes a set of software tools to compile, profile, and validate deep neural networks, and also includes the Intel Movidius NCAPI for application development in C/C++ or Python. Unfortunately, the NCSDK V2 only supports the NCS V1; it does not support the NCS V2. (Yes, this is confusing.)
For the NCS V2, you have to use Openvino. OpenVino is a recent development that aims to unify the different Intel platforms for CPUs, GPUs, VPUs (Movidius), and FPGAs (Altera). I really like the OpenVino architecture, which lets you use different OpenVino plugins depending on your target.
My only regret is that the NCSDK was really powerful and offered a great deal of customization; it was too bad I couldn’t play with it here. On the other hand, I like that OpenVino is coupled with OpenCV, which will certainly help in the future. Last but not least, the great news is that both the NCSDK and OpenVino are open source, licensed under the Apache License 2.0.
With regards to finding the right OpenVino package for your Raspberry, I recommend visiting the Intel download center. I am using l_openvino_toolkit_raspbi_p_2019.1.094.tgz. Remember that you also need to install OpenVino on your desktop, as that is where you’ll use all the tools to compile, profile, and validate your DNNs.
To deploy, follow Intel’s install documentation for the Raspberry. This is a more complicated process than the one in the Google environment (which makes sense: the more options you have, the more complicated the setup becomes).
Once you have installed the OpenVino environment, you can play with a really cool demo that chains three models. In the picture below, you can see how they work. The first model detects the car. Then a second model detects the license plate. The third model reads the plate.

To port our frozen model to the Raspberry, we have to use the Model Optimizer. The Model Optimizer does not run on the Raspberry, which is why you need OpenVino on your desktop.
The model optimizer is a very powerful converter, and comes with many different options. You have generic parameters and also specific parameters for the different ML frameworks.
You’ll find the parameters I used at the end of this notebook. Once you have the two files (.xml and .bin) of your intermediate representation, move them to your workspace on the Raspberry, along with at least one CIFAR10 picture and the python file that makes the inference. You should see something like this:

I spent some time playing with the OpenVino API. I tried to understand how to connect a pair of cameras and use the depth module, and I also tried to use net.forwardAsync() to increase performance. I was unsuccessful: the OpenVino API is very recent, and it seems to lag behind the Movidius API in terms of capabilities. Unfortunately, I think we’ll have to wait for one or two more releases to get our hands on these advanced features.
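For completeness, here is a minimal sketch of what the synchronous inference path looks like through the OpenCV dnn module, assuming the OpenCV build bundled with the OpenVino Raspbian package (file names are hypothetical):

```python
import cv2
import numpy as np

# Load the .xml/.bin intermediate representation and target the Myriad X in the NCS 2.
net = cv2.dnn.readNet('frozen.xml', 'frozen.bin')
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

# One CIFAR-10 picture, packed as an NCHW blob of the expected 32x32 size.
image = cv2.imread('cifar_sample.png')
net.setInput(cv2.dnn.blobFromImage(image, size=(32, 32)))
scores = net.forward().flatten()
print('predicted class:', int(np.argmax(scores)))
```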
Relative Performance: Google vs. Intel
Both SoCs have very impressive specifications and plenty of processing power.
The Myriad X is the third generation of Intel Movidius VPU (vision processing unit). The SoC is designed to handle 8 RGB sensors and 4K processing at both 30 Hz and 60 Hz, amounting to 700 million pixels per second. The chip uses over 20 hardware accelerators to perform tasks such as optical flow and stereo depth. On top of that, 16 programmable 128-bit vector processors let you run multiple concurrent vision application pipelines.
Unfortunately, Google gives far less information about the edge-tpu. They do share that the graphics processing unit accompanying the edge-tpu has 4 shaders and handles a staggering 1.6 gigapixels per second. In addition, Google claims the video processing unit can handle 4Kp60 in H.265 and 1080p60 in MPEG-2.
On paper, the Intel SoC seems more powerful than the edge-tpu.
The performance-per-watt of both devices seems to be really good. The two pictures below show the power consumption of the two devices at rest. It looks like the NCS V2 draws more milliwatts, but this needs to be confirmed under load. Google kindly warns us during setup that the Accelerator can become very hot, yet I didn’t notice any significant change in temperature when running object detection on either Raspberry Pi. Most likely the Pi Cam specifications and the USB 2 limitation make it hard to put the devices under enough stress.


To load the devices, I wrote a very short python program with a single loop that repeatedly runs inference on the same image. The idea was to reduce the impact of the USB 2 bus, but a quick Wireshark session showed that the image is transferred over the bus for every inference.
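The loop itself is nothing more than this sketch, where run_inference() is a hypothetical wrapper around whichever engine is under test (ClassifyWithImage on the Coral, net.forward() through OpenCV on the NCS 2):

```python
import time

N = 1000
start = time.time()
for _ in range(N):
    # Same image every iteration; Wireshark shows it is still sent over USB each time.
    run_inference(image)
elapsed = time.time() - start
print('%.1f ms per inference' % (1000.0 * elapsed / N))
```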
I measured 70 ms per inference on a MobileNet V1 for the Myriad, 60 ms on a MobileNet V2 for the edge-tpu, and 550 ms on the Raspberry alone. The MobileNet V2 (3.47 million parameters), with its bottleneck layers, requires fewer computations than the V1 (4.24 million parameters); it is generally accepted that the V2 is between 15 and 30% faster. Therefore, on the Raspberry Pi, the two devices are in the same ballpark, with a small lead for the Myriad.
It is also important to keep in mind that python is not the best option for performance. Both devices also have a C++ API available, which would be my choice for optimal performance. Interestingly, you can stack several Intel Movidius NCS sticks, which is a really nice feature.

Considering the specifications of the two devices, the performance-per-dollar ends up being quite similar. The Google Accelerator is priced at $74.99, and the Intel NCS V2 at $99. You can still buy the Movidius NCS V1 at $74.
The form factors are also very similar, making it easy to integrate the USB keys into any project. The Google Accelerator is thinner than the NCS V2 but has to be connected with a USB-C-to-USB-2 cable. Intel addressed one complaint about the V1: on the Raspberry, the V1 design covered more than one USB port, whereas the V2 covers only one.
Conclusion
The processing power of both devices allows developers to handle a broad scope of use cases.
The many pre-compiled networks make it easy to get good results quickly. Still, fully quantizing your own networks remains an advanced task: the conversion requires a deep understanding of your network and of how its operations work. Additionally, the loss in accuracy is significant when going from FP32 to FP16, and from FP16 to UINT8. It’s interesting to note that the Myriad processes half-precision floating point, while the Google Accelerator only processes 8-bit fixed-point; this implies that the Myriad will generally preserve more accuracy.
Intel and Google have clearly taken two very different approaches. Google’s strength lies in the ease with which one can build and market an integrated workflow from the Google Cloud Platform to the edge-tpu; I really like how all the components fit together. Intel, on the other hand, offers its intermediate representation and the OpenVino plugins that developers can use to optimize their networks to run on all kinds of hardware. OpenVino currently supports Intel CPUs, GPUs, FPGAs, and VPUs. The challenge for Intel is that a unifying strategy will always struggle to leverage the advanced features of each component.
Both systems offer plenty of advanced features. The Google Accelerator is able to retrain your networks on-device, which is crucial for transfer learning; clearly Google believes that their pre-trained networks plus transfer learning make a very efficient combination. The Intel NCS has built-in hardware support for three stereo-depth camera pairs, which can be priceless for many use cases, like object avoidance.
I really enjoyed being part of Intel’s community; the members are very helpful. I got no help from Google, but this is certainly because its product line is very young. Intel’s documentation is abundant, if sometimes confusing, while Google’s documentation was almost non-existent at the time I performed this experiment. Since then, the situation has been improving rapidly, and I expect to soon see a very good learning platform (Colab) with lots of resources.
Inference on the edge is definitely exploding, and one can see astonishing market predictions. According to ABI Research, shipment revenues from edge AI processing were US$1.3 billion in 2018, and this figure is expected to grow to US$23 billion by 2023.
Clearly, one solution won’t fit all as entrepreneurs figure out new ways to deploy machine learning. It is going to be interesting to see what market shares both players end up capturing. My bet is that the Google Cloud Platform will drive many transfer learning applications that will shine on the edge, while Intel will manage a very diverse scope of projects on all kinds of hardware with many partners.
In either case, the one certainty is that we now have the tools to innovate on the edge and accelerate the fourth industrial revolution.
I hope this report will help you achieve your goals, or at least shave a bit of time off your projects. I welcome any feedback, and am always open to new opportunities.