Benchmarking Hardware for CNN Inference in 2018

Mike Liao
Towards Data Science
8 min read · Aug 27, 2018


CNN inference and edge computing are coming: just about every mobile platform (Apple, Samsung, Huawei) is about to get dedicated hardware for inference. Moreover, with the advent of self-driving cars, robotics platforms like Nvidia's Jetson Tx2 are bringing inference to the edge. This summer I had the opportunity to play around with various hardware platforms and see how they stack up against one another for CNN inference.

Motivation: This summer I had the chance to intern at Aquifi, a 3D vision startup that aims to automate logistics processes in manufacturing. Usually, cameras are attached to a remote server that does the inference. However, their smart camera also includes a "System on Chip," so we wanted to see if we could run networks at the edge. Sample networks in 3D vision can be small or large, for example:

Small: KCNN for Keypoint Detection
Keypoints have many uses in 3D vision such as recognizing good features for 3D Reconstruction.

Large: MVCNN for recognizing 3D Shapes
MVCNN is a large network that takes many 2D views to recognize a 3D shape. It uses VGG as a backbone, plus a "View Pooling" layer that aggregates the feature maps of the individual 2D images.

The various hardware platforms that I tested are:
1) Nvidia Jetson Tx2
2) Movidius 2450 (Intel Neural Compute Stick / Google Vision Kit)
3) Nvidia 1080ti (baseline)
4) Kirin 970 (Huawei phones)
5) Qualcomm 660
6) ActionSemi S900

Hardware Platforms

Factors to Consider

In CNN inference, there are many things to consider. I’ll mostly be looking into inference speed.
1) Power Usage
2) Cost
3) Ease of Development
4) Inference Speed

Where can CNNs be run for inference?

1) FPGA
2) GPU
3) DSP
4) ASIC
5) x86 or ARM CPU
6) Accelerators (NPU, TPU)

All of these hardware options involve trade-offs between flexibility and efficiency. Moreover, companies are now coining new names such as NPU (Neural Processing Unit) or TPU (Tensor Processing Unit), but these are really just accelerators for matrix multiplication and related arithmetic operations. Devices often combine multiple types on the same platform. For example, the Qualcomm 660 has a CPU, a GPU, and a DSP (Digital Signal Processor) on the same board. This makes implementing the algorithms difficult, because you must implement all the primitive layers for each piece of hardware. The table linked below shows compatibility for different layer types on the Qualcomm platform.

https://developer.qualcomm.com/docs/snpe/network_layers.html
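To make the "just accelerators for matrix multiplication" point concrete, here is a minimal NumPy sketch (my own illustration, not taken from any vendor SDK) of im2col lowering, the standard trick for turning a convolution into one big matrix multiplication that an NPU/TPU-style unit can execute:

```python
import numpy as np

def conv2d_via_matmul(x, w):
    """Lower a 2D convolution (valid padding, stride 1) to a single matmul.

    x: input feature map, shape (H, W, C_in)
    w: filters, shape (KH, KW, C_in, C_out)
    """
    H, W, C_in = x.shape
    KH, KW, _, C_out = w.shape
    out_h, out_w = H - KH + 1, W - KW + 1

    # im2col: each output position becomes one row of flattened patch values.
    patches = np.stack([
        x[i:i + KH, j:j + KW, :].ravel()
        for i in range(out_h) for j in range(out_w)
    ])                                       # (out_h * out_w, KH * KW * C_in)

    # The convolution is now one big GEMM, which is what the accelerator runs.
    out = patches @ w.reshape(-1, C_out)     # (out_h * out_w, C_out)
    return out.reshape(out_h, out_w, C_out)

x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 16)
print(conv2d_via_matmul(x, w).shape)         # (6, 6, 16)
```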

Compilers

So you've trained a model in TensorFlow or PyTorch; how do you actually get it to run on this hardware? Although you can generally run networks on a GPU or CPU for training, for inference most platforms require you to run their own proprietary compiler on the network.

Nvidia: TensorRT
Google Vision Kit: Vision Bonnet Compiler
Kirin 970: Huawei Graph Compiler
Qualcomm: SnapDragon-Tensorflow-to-DLC
Neural Compute Stick: Neural Stick Graph Compiler

Although these compilers have different implementations, under the hood we can assume they do some combination of the following (a sketch of layer fusing follows the list):

1) Auto-Layer Fusing
2) Memory Optimization
3) Allocation between different Hardware processors(CPU, GPU, DSP, NPU)
4) Weight & Activation Calibration(INT8 or FP16)
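As one example of layer fusing, here is a minimal NumPy sketch (my own illustration of the general idea, not any particular compiler's implementation) of folding a batch-norm layer into the preceding convolution's weights and bias, so the fused layer costs nothing extra at inference time:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer (scale gamma, shift beta, running mean/var)
    into the preceding convolution, so that
    BN(conv(x, w) + b) == conv(x, w_fold) + b_fold.

    w: conv weights, shape (KH, KW, C_in, C_out)
    b: conv bias, shape (C_out,)
    """
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale
    w_fold = w * scale                   # broadcasts over the last (C_out) axis
    b_fold = (b - mean) * scale + beta
    return w_fold, b_fold
```

TensorRT documents this kind of fusion (along with conv + activation fusion), and it is reasonable to assume the other compilers do something similar.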

Nvidia TensorRT

Models

InceptionV3: ~92 Mb, ~23.8 Million Parameters
InceptionV2_Resnet: ~214Mb, ~55.8 Million Parameters
MobilenetV1_0p5_128: ~1.89Mb, ~0.5 Million Parameters
MobilenetV1_1p_224: ~16.3Mb, ~4.2 Million Parameters
MobilenetV1_0p5_160: ~5.2Mb, ~1.3 Million Parameters

The models chosen are the ones that ended up working on the most platforms. I excluded VGG because it's rather large and would never be run on mobile. Resnets are another popular choice that I excluded.
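As a rough sanity check on the sizes above (my own back-of-the-envelope, not part of the benchmarks): an FP32 model stores about 4 bytes per parameter, which is also why 8-bit quantization shrinks models by roughly 4x.

```python
# Approximate on-disk size from parameter count (FP32 = 4 bytes per parameter).
models = {
    "InceptionV3": 23.8e6,
    "InceptionResnetV2": 55.8e6,
    "MobilenetV1_0p25_128": 0.5e6,
}

for name, params in models.items():
    fp32_mb = params * 4 / 1e6   # float32 weights
    int8_mb = params * 1 / 1e6   # 8-bit quantized weights
    print("%s: ~%.0f MB fp32, ~%.1f MB int8" % (name, fp32_mb, int8_mb))

# InceptionV3 comes out to ~95 MB, close to the ~92 MB file above; the small
# gap is down to exact parameter counts and file-format overhead.
```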

Results

InceptionV3

It is sometimes hard to evaluate fairly when each platform has different capabilities. For example, in the graph, Qualcomm quantizes the inference to 8 bits. With Nvidia TensorRT, you are given the choice of FP32 or FP16. The Kirin 970 supports both 8-bit and 1-bit quantization. Some interesting points from this graph:
1) The Intel Neural Compute Stick was the slowest of the bunch, 3x slower than the Intel i7-8700k CPU.
2) The Nvidia Jetson Tx2 GPU ran at about the same speed as the Intel i7-8700k CPU.
3) The 1080ti is ~10x faster than the Intel i7-8700k CPU.
4) The Kirin 970 and Qualcomm 660 mobile platforms run at similar speeds.
5) The Jetson Tx2 (float TensorRT) is similar in speed to the mobile platforms, although this is not exactly a fair comparison because it is float versus 8-bit inference.

Mobilenets

Smaller Mobilenets enable less powerful hardware like the ActionSemi S900, or even a Raspberry Pi, to do inference. Mobilenets are the state of the art in edge computing, and there are two main parameters you can tune: 1) the width multiplier and 2) the size of the input image (a rough cost sketch follows this paragraph).
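To give a feel for how these two knobs trade accuracy for compute, here is a back-of-the-envelope sketch of the multiply-accumulate count for one depthwise-separable layer, following the cost model in the MobileNet paper (the example layer shape is my own choice):

```python
def separable_conv_madds(h, w, c_in, c_out, k=3, alpha=1.0, rho=1.0):
    """Approximate multiply-accumulates for one depthwise-separable conv layer.

    alpha: width multiplier (scales the channel counts)
    rho:   resolution multiplier (scales the feature-map height/width)
    """
    h, w = int(h * rho), int(w * rho)
    c_in, c_out = int(c_in * alpha), int(c_out * alpha)
    depthwise = k * k * c_in * h * w   # per-channel spatial filtering
    pointwise = c_in * c_out * h * w   # 1x1 conv that mixes channels
    return depthwise + pointwise

# Example layer: 112x112 feature map, 64 -> 128 channels.
full  = separable_conv_madds(112, 112, 64, 128, alpha=1.0, rho=224 / 224)
small = separable_conv_madds(112, 112, 64, 128, alpha=0.25, rho=128 / 224)
print("full: %.2e MAdds, 0.25x/128: %.2e MAdds, ~%.0fx cheaper"
      % (full, small, full / small))   # roughly alpha^2 * rho^2 cheaper
```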

The graphs below show two variations:
1) width_multiplier = 0.25 & input_size = 128
2) width_multiplier = 0.5 & input_size = 160

With the smallest Mobilenet, the hardware can do inference in a few milliseconds. At this point, the spread between the hardware platforms is less than 50 ms, so things like loading the model weights, transferring data from the CPU to the GPU, and other overhead may take longer than the inference itself.
Interesting points:
1) The Intel Neural Compute Stick is 4x faster than the Google Vision Kit, even though both use the same underlying Movidius 2450 chip. Software implementations of the same layers matter.

Nvidia Tx2 Insights

In terms of latency, TensorRT (half) < TensorRT (float) < TensorFlow (float), with each step roughly 2x faster than the next. We can expect the fastest configuration, Tx2 TensorRT (half), to be roughly 2-5x slower than the 1080ti (float) for various models; check out the repo below to reproduce it.

https://github.com/NVIDIA-Jetson/tf_to_trt_image_classification
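For reference, here is a minimal sketch of FP16 conversion using the TF-TRT integration that shipped in TensorFlow's contrib module around this time (my own illustration under that assumption; the linked repo has its own conversion scripts, and the file path and output node name below are placeholders):

```python
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load a frozen inference graph.
with tf.gfile.GFile("frozen_inception_v3.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Ask TF-TRT to rewrite supported subgraphs into TensorRT engines at FP16.
trt_graph = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=["InceptionV3/Predictions/Reshape_1"],  # placeholder output node
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16",   # also accepts "FP32" and "INT8"
)

# The returned GraphDef can be imported and run like any other TF graph.
with tf.Graph().as_default():
    tf.import_graph_def(trt_graph, name="")
```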

Qualcomm 660 Insights

The Qualcomm 660 is an older version of the platform; the current version is the 845. For smaller networks like Mobilenets, MobilenetSSD, and InceptionV3, the Qualcomm 660 offers good speeds. For example, it can do 10 fps for MobilenetSSD with a Mobilenet_0p25_128 backbone. While it is fast, the downside is that the SNPE platform is still relatively new. One issue I had was compiling certain state-of-the-art models with the snpe-tensorflow-to-dlc compiler. I've read about similar issues on the forums, but this is expected for a new platform.

Kirin 970 Insights

The Kirin 970 is a bit faster than the Qualcomm 660 for InceptionV3. Its software stack is even newer than Qualcomm's SNPE, as the Huawei HiAI platform was only released in May 2018. The platform mainly supports Caffe, and there are limitations on the TensorFlow side: it only supports TensorFlow 1.3, and model sizes have to be under 100 MB. Most of the newest Mobilenets are trained with TensorFlow 1.6 or above, so it is currently hard to compile some of the pretrained models found online.

ActionSemiS900 & CPU TF Results

The ActionSemi S900 is a low-power board with a 64-bit quad-core Cortex-A53 CPU and a PowerVR G6230 GPU. Although PowerVR introduced its CLDNN SDK for AI-oriented applications, as of right now it only supports Chromebooks that have a PowerVR GPU. Thus, I did some testing on the Cortex-A53 CPU with both TensorFlow and TFLite; you can run all the Mobilenets in real time. TFLite is Google's approach to edge computing and the successor to TF Mobile. TFLite is a little faster for specific models, but as of July 2018 it is not production ready; it is even orders of magnitude slower for certain bigger models like InceptionResnetV2 (not shown in the graph).
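For context, converting and timing a frozen Mobilenet with TFLite around TensorFlow 1.9 looked roughly like this (a sketch under that assumption: the paths and tensor names are placeholders, and this contrib converter class was later renamed TFLiteConverter):

```python
import time
import numpy as np
import tensorflow as tf

# Convert a frozen graph to a .tflite flatbuffer (TF ~1.9 contrib API).
converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
    "mobilenet_v1_0.25_128_frozen.pb",              # placeholder path
    input_arrays=["input"],
    output_arrays=["MobilenetV1/Predictions/Reshape_1"],
)
open("mobilenet.tflite", "wb").write(converter.convert())

# Time CPU inference with the TFLite interpreter.
interpreter = tf.contrib.lite.Interpreter(model_path="mobilenet.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(1, 128, 128, 3).astype(np.float32)
start = time.time()
for _ in range(100):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
print("avg latency: %.1f ms" % ((time.time() - start) / 100 * 1e3))
```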

Other Notable Companies

It really is the wild wild west in terms of hardware. Shown below are four companies that specifically want to be in the AI system-on-chip business. For hardware beyond system-on-chips, check out this comprehensive list. A brief summary of the four:
NXP: Has a variety of SoCs and other solutions, like the i.MX 8 series, for CNN inference; uses the DeepView ML Toolkit.
MediaTek: Chip supplier for mid-tier phones; the Helio P60 is going to be similar to Qualcomm's or Huawei's platforms. Uses their NeuroPilot AI platform, which is expected to support TF, Caffe, ONNX, and the Android NN API.
STMicroelectronics: A really giant company that announced at CES in January that it wants to get into this space.
RockChip: A company in Fuzhou that claims its SoC is faster than the Jetson Tx2, Kirin 970, Apple A11…

Looking Forward

We can expect hardware platforms to play a big role in the future of ML. As you can see, every phone chip supplier (Apple, Huawei, Qualcomm, MediaTek...) is rushing to build its own accelerator. Although the hardware is available now or will be within a year, the software platforms will probably take much longer to catch up. Really, Nvidia has a 10-year head start on these companies, and the chip suppliers are writing their software from scratch. Working with different proprietary compilers can be unwieldy, and in the future I hope there is a universal interface, like the Android NN API, that can give access to the hardware.

Android NN API
