Machine Learning at the Edge!

Getting machine learning and image processing onto the GPU of a small processor running embedded Linux, using OpenCL, for real-time performance

Peter Gaston
Towards Data Science

--

(Image by Wikipedia)

Contents:

  • Overview
  • Travels to find a workable solution
  • Build your own working solution
  • Lessons Learned

Overview

So, I have a working machine learning (ML) model that I want to move to the edge.

By the way, my ML model processes images for depth estimation to provide perception capabilities for an autonomous robot. It is based on FastDepth from MIT, a U-Net-style architecture focused on speed: a MobileNet encoder with a matching decoder and skip connections.

(Image by Author)

It was developed using Keras (PyTorch at first) in Python with a CUDA backend.
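To give a feel for the shape of the network, here is a minimal Keras sketch of that style of model: a MobileNet encoder feeding an upsampling decoder with skip connections. This is only an illustration, not the actual FastDepth code; the skip-layer names come from keras.applications' MobileNet, while the decoder filter counts and the final head are made up for the example.

from tensorflow import keras
from tensorflow.keras import layers

def build_depth_net(input_shape=(224, 224, 3)):
    # MobileNet encoder, no classification head.
    encoder = keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights=None)

    # Encoder feature maps (112x112 down to 14x14) reused as skip connections.
    skip_names = ['conv_pw_1_relu', 'conv_pw_3_relu',
                  'conv_pw_5_relu', 'conv_pw_11_relu']
    skips = [encoder.get_layer(name).output for name in skip_names]

    # Decoder: repeatedly upsample the 7x7 bottleneck and merge the matching skip.
    x = encoder.output
    for skip, filters in zip(reversed(skips), [512, 256, 128, 64]):
        x = layers.UpSampling2D(interpolation='bilinear')(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

    # One last upsample back to full resolution and a single-channel depth map.
    x = layers.UpSampling2D(interpolation='bilinear')(x)
    depth = layers.Conv2D(1, 3, padding='same', name='depth')(x)
    return keras.Model(encoder.input, depth)

model = build_depth_net()
model.summary()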

By moving to the edge, I mean I needed this running on a small CPU/GPU (a Qualcomm Snapdragon 820) running embedded Linux, where the GPU (the Adreno 530) can only be accessed via OpenCL (i.e., not CUDA).

Caveat — If you’re on iOS or Android you’re already relatively home free. This article is for getting machine learning on the GPU on embedded Linux using OpenCL.

Travels to Find a Workable Solution

This is going to be easy. Hah! Turns out once you leave CUDA behind you’re in the wilderness…

There are really two broad approaches to deploying a model at the edge.

  1. Try to duplicate your development environment on the edge and let it run there. If this is possible, it is always a good first step, if only to convince yourself how slowly it performs. I explored this world to the extent possible, though in my case I didn't even have a full Linux on my device, so I couldn't support Python at all.
  2. Find an inferencing framework that can run your model in a higher-performance, lower-resource context. This requires more work on your part, especially as you are going to have to take your C++ coding skills out for a spin. This is the path we are going to explore.

Through googling as well as references, I came across a laundry list of potential approaches.

Again, I have a Keras model that I want to run as fast as possible on the GPU of a Qualcomm 820 (the Adreno 530) under embedded Linux. No quantized approaches were evaluated.

Here is the world I explored (i.e., lots and lots of 'dead ends'):

Image by Author
Map of the GPU ‘Frontier’

Explanations:

  • Keras — the model was developed in Keras.
  • ONNX — some options offered an ONNX import, e.g., OpenCV. It turns out ONNX, at least at that time (2019), did not support all the Keras operators I was using.
  • Tensorflow — this is an important layer, as most if not all of the engines import/convert from Tensorflow and not Keras. The final solution, MACE, required a relatively older version of Tensorflow: they called for v1.8, though v1.14 worked… See SYCL below for trying to run the Tensorflow runtime itself on OpenCL.
  • Tensorflow Lite/TF-Lite. Designed for this situation; however, there was no way (that I could find) to hook it up to the GPU via OpenCL.
  • SYCL — supposedly a way to hook Tensorflow up to OpenCL. I had to give up after making little progress.
  • Arm Compute Library. This works, but with the downside that there is no way (that I could find) to import models. You either use what they provide or hand-code your network against their API.
  • ARM Neural Network (Arm NN). An 'official' way to do this from the manufacturer, ARM. It should be easy. Heck, I got stuck somewhere in the massive installation process and could never get it to work. This is one group that could really benefit from providing a Docker environment.
  • Mobile AI Compute Engine — MACE. Designed, at least partly, for this use case! From Xiaomi. This actually works! And it's efficient! And their support is good; these guys are great! More below.
  • Mobile Neural Network — MNN. I'm only playing with this now, as I only recently 'found' it, but it looks great, with a potentially even easier transition from Keras to the platform as well as potentially more optimization. From Alibaba.
  • tvm.ai. Designed for executing on GPUs, especially for efficiency. However, it required more than my mini-Linux could provide just to get started, and their heart doesn't really seem to be in OpenCL. I would like to get this working, especially to play it off against MNN for efficiency.
  • OpenCV. The DNN package is designed to run ONNX models, and it does. If verrryyyy slowww. Very disappointing. (A sketch of this route is shown after this list.) Also, it turns out I couldn't use OpenCV in my runtime, so this was a non-starter anyway.
  • Snapdragon Neural Processing Engine. From Qualcomm, heck this should be easy. First, support from Qualcomm makes Comcast look like stars. Heck, stop there — this tool is a non-starter in my environment.
  • OpenVINO — looked promising but seems to be for Intel only.
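For reference, here is roughly what the OpenCV route above looks like in code. This is a generic sketch of OpenCV's DNN module loading an ONNX export and requesting the OpenCL target, not my production code; the file names are placeholders.

import cv2

# Load an ONNX export of the model (path is a placeholder).
net = cv2.dnn.readNetFromONNX('fastdepth.onnx')

# Ask OpenCV to schedule the network on the GPU via OpenCL.
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_OPENCL)

image = cv2.imread('frame.png')
# Resize to the network input and scale pixels to [0, 1];
# mean/std handling depends on how the model was trained.
blob = cv2.dnn.blobFromImage(image, scalefactor=1.0 / 255.0, size=(224, 224))
net.setInput(blob)
depth = net.forward()  # NCHW output, e.g. (1, 1, 224, 224)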

Build Your Own Working Solution on MACE

Here are the rough steps to get your model running on the GPU (on an embedded Linux system using OpenCL!). This is pretty much a voice-over of their documentation, with a few key hints tossed in.

  1. Clone MACE. https://github.com/XiaoMi/mace
  2. Get the Docker environment working. I used the full environment.
  3. Configure for your target. There are multiple architecture options as well as a Linux/Android choice. In my case, the architecture option was armeabi-v7a. (It's a longer tale why that and not the 64-bit architecture.) Unfortunately, the configured OS for that architecture was Android, so I had to mix and match a little to get Linux.
  4. Build MACE for that target.

All of that is a one-time, startup event (once it works).

Now, to get the MACE conversion of your Keras model…

Enter Tensorflow

Oh, wait. It turns out that MACE only supports Tensorflow up to roughly v1.14, and no Keras at all. So I had to convert my Keras model to Tensorflow of the right vintage. This was some work: basically building the exact same model in Tensorflow and then copying over the weights. And iterating, because nothing is ever that easy. A good tool here is Netron, for looking inside and comparing the nitty-gritty of your models.
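Here is a rough sketch of that weight copy, under the assumption that the TF 1.x graph is rebuilt with its variables created in the same order as the Keras layers. build_tf1_model is a hypothetical stand-in for the hand-rebuilt network, and the file names and shapes are placeholders; this shows the idea, not a drop-in script.

import tensorflow as tf  # a 1.x release, e.g. v1.14
from tensorflow import keras

# Pull the trained weights out of the Keras model as plain numpy arrays.
keras_model = keras.models.load_model('fastdepth_keras.h5')
keras_weights = keras_model.get_weights()

graph = tf.Graph()
with graph.as_default():
    # Rebuild the exact same architecture with plain TF 1.x ops.
    input_ph = tf.placeholder(tf.float32, [1, 480, 640, 3], name='input')
    output = build_tf1_model(input_ph)  # hypothetical re-implementation

    # Copy the weights over; this relies on variable order matching layer order,
    # which in practice took some iterating (Netron helps to compare the two).
    assign_ops = [tf.assign(var, w)
                  for var, w in zip(tf.global_variables(), keras_weights)]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(assign_ops)
        # Freeze to a .pb that the MACE converter can consume.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, graph.as_graph_def(), [output.op.name])
        tf.train.write_graph(frozen, '.', 'fastdepth_tf.pb', as_text=False)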

One optimization: you can also easily fold image preprocessing into the graph for a performance gain. For example, the lines below are not in the Keras model; they appear only up front in the Tensorflow model. This moves the preprocessing, specifically an image resize and data normalization, off the CPU and onto the GPU.

# resize the full camera frame down to the 224x224 network input, on the GPU
Rs224 = tf.image.resize_bilinear(input, (224, 224), name='resize224')
# normalize exactly as in training: scale [0, 255] to [0, 1], then to [-1, 1]
Norm224 = tf.scalar_mul(1. / 255., Rs224)
Norm224 = tf.scalar_mul(2., Norm224)
Norm224 = tf.math.add(-1., Norm224)

Continuing the Conversion to MACE

Back to the conversion:

  1. Create your YAML deployment file (an example is shown after these steps). Note that you need to generate a new sha256sum checksum for every new version of your model.
  2. Do your conversion.
python tools/python/convert.py --config ../mace-models/yourmodel.yml

3. Test your conversion. Okay, MACE sort of fell down here. It turns out the Linux version didn't really work without some jumping in to move things around. What should work is…

python tools/python/run_model.py  \
--config ../../mace-models/yourmodel/yourmodel.yml \
--validate \
--target_abi arm-linux-gnueabihf \
--target_socs root@192.168.8.1 \
--device_conf ../../mace-models/820.yml

But it doesn't. It turns out their script fails. (The fix: manually move '/tmp/mace_run/yourmodel/validate_in/xxx' up one level… and then run the ./mace_run command that's in their script.)
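For reference, here is roughly what the YAML deployment file from step 1 looks like. The field names follow the MACE deployment-file format as I understand it, but the model name, paths, checksum, and shapes below are placeholders for your own model, so double-check everything against the current MACE documentation.

library_name: yourmodel
target_abis: [armeabi-v7a]
model_graph_format: file
model_data_format: file
models:
  yourmodel:
    platform: tensorflow
    model_file_path: /path/to/yourmodel.pb
    model_sha256_checksum: <output of sha256sum yourmodel.pb>
    subgraphs:
      - input_tensors:
          - input
        input_shapes:
          - 1,480,640,3
        output_tensors:
          - depth_output          # whatever your final node is named
        output_shapes:
          - 1,224,224,1
    runtime: gpu
    winograd: 0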

C++ for Runtime

And lastly, write your C++ code to run on your target environment. I was running inside a ROS node that listened for new camera image events, processed them, and created a depth-cloud output for downstream processing by, say, an OctoMap.

The key parts of calling the MACE inference engine included:

// define the input node and its NHWC shape (output nodes are defined the same way, not shown)
const std::vector<std::string> mace_input_nodes{"input"};
const std::vector<std::vector<int64_t>> mace_input_shape{{1, 480, 640, 3}};
...
// initialize the 'engine' from the converted graph and weights
// (though of course more surrounds this: loading the files, building 'config', error checks)
create_engine_status = CreateMaceEngineFromProto(
    reinterpret_cast<const unsigned char *>(model_graph_data->data()),
    model_graph_data->length(),
    reinterpret_cast<const unsigned char *>(model_weights_data->data()),
    model_weights_data->length(),
    mace_input_nodes,
    mace_output_nodes,
    config,
    &engine);
...
// execute the model; 'inputs' and 'outputs' map node names to MaceTensor buffers
run_status = engine->Run(inputs, &outputs, nullptr);

The really good news: I was hoping for 10 frames per second. Even after adding in the image preprocessing shown previously (which would normally sit in the C++ code), I ended up achieving 25 fps. Awesome. So much so that other parts of the overall system became the gating factor. I actually have to throttle down the ML inferencing so as not to overload other parts of the system!

Lessons Learned

  • For all these inferencing approaches, make sure the operators in your model are supported. For example, OpenCV did not support UpSample.
  • You have got to try things out to see if they are going to work.
  • If the tool vendor provides a Docker environment, chances are much higher that it is going to work.
  • If they’re responding to problems on their discussion forum — that’s a good sign. If not, well — run, don’t walk away.
  • Try multiple approaches, as many will fail in some way, usually one you don’t suspect.

Good luck!

--

Consultant in AI covering Robot Autonomy and Perception, Financial Services, and Fashion Science. Background includes McKinsey & Co. and MIT.