A 3-minute read on how to accelerate NLP model inference on commodity hardware.

A compass to navigate the landscape of accelerated inference

Prithivi Da
Towards Data Science


Prologue

Accelerating server-side model inference in data centers with specialized hardware, in serial and/or parallel, is proven and simple (read: simple, not easy). Data centers have been using specialized hardware for some time now, namely graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and ASICs (e.g. Google's TPU), and cloud service providers have made them very accessible. Specialized hardware like FPGAs is even showing up in consumer devices like smartphones (FPGAs on iPhones), so ASICs on phones aren't a stretch by any imagination. Companies like LeapMind in Tokyo are actively working on specialized hardware for training and accelerated inference on constrained and edge devices. But accelerating inference with specialized hardware covers only a narrow slice of the problem and leaves a huge opening. Luckily, there has been a lot of research on accelerating inference without any specialized hardware, i.e. on commodity hardware, and this post will act as a compass to help you navigate the options available.

The big question

How do we accelerate model inference on commodity hardware, both server-side and client-side (fat clients as well as constrained/edge devices)?

But why?

There are several reasons to accelerate model inference on commodity hardware, especially on the client side, where the “inference budget” is usually very tight. Also, not all ML-based solutions can offer an optimal user experience with server-side inference because of latency and bandwidth issues. For some apps, data privacy is another challenge: consumers don’t want their data to leave their device, such as a smartphone. Innovations like differential privacy and federated learning help alleviate these concerns but are complex to implement. Hence we are better off looking at options that let us operate within tighter inference budgets.

A mental model for NLP-based models

On commodity hardware, accelerating inference is tricky because latency isn’t the only variable in the inference-budget equation: model size and how much accuracy we are willing to trade off are also crucial. So here are some techniques that can help compress models (optionally) and/or accelerate inference without making huge compromises on accuracy.
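
To make that budget tangible before picking a technique, here is a minimal sketch (the checkpoint name is an illustrative assumption, not a recommendation) that measures two of its terms, parameter count and CPU latency:

```python
# Minimal sketch: measure two terms of the "inference budget" on commodity hardware,
# model size (parameter count as a rough proxy) and CPU latency.
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative assumption
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")

# Model size: number of parameters (a rough proxy for the on-disk / in-memory footprint).
num_params = sum(p.numel() for p in model.parameters())

# Latency: average over a few runs on CPU.
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(10):
        model(**inputs)
    latency_ms = (time.perf_counter() - start) / 10 * 1000

print(f"Parameters: {num_params / 1e6:.1f}M, avg CPU latency: {latency_ms:.1f} ms")
```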

The scope of this post is to share a broad perspective on the available options and give general recommendations, so you won’t find much detail here on how each of these techniques works internally. I will cover that in a follow-up.

Options available

Any NLP inference request has two parts: input tokenization and the actual model prediction call. HuggingFace has nailed the tokenization part by democratizing the latest research in tokenization. Here is a summary.

NLP Inference on commodity hardware — Options available (Image by Author)
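
To make the two parts concrete, here is a minimal sketch using a HuggingFace fast tokenizer followed by the model prediction call; the checkpoint is an illustrative assumption:

```python
# Minimal sketch of the two parts of an NLP inference request:
# (1) input tokenization, (2) the model prediction call.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative assumption
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # fast (Rust) tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Part 1: tokenization
inputs = tokenizer("This post was a quick, useful read.", return_tensors="pt")

# Part 2: model prediction
with torch.no_grad():
    logits = model(**inputs).logits

prediction = model.config.id2label[int(logits.argmax(dim=-1))]
print(prediction)  # e.g. "POSITIVE"
```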

Fat clients / Server-side inference (e.g. Outlook auto-response)

Path for fat clients and server-side inference on commodity hardware (Image by Author)
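
The recommended path lives in the figure above; as one widely used option on this branch (my assumption, not necessarily the figure’s exact recommendation), here is a sketch of post-training dynamic quantization in PyTorch, which stores the linear-layer weights in int8 and typically speeds up CPU inference:

```python
# Sketch: post-training dynamic quantization for CPU (commodity hardware) inference.
# One common option for the fat-client / server-side path; the figure's exact path may differ.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
)
model.eval()

# Quantize only the Linear layers: weights are stored as int8 and
# activations are quantized dynamically at runtime.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model is a drop-in replacement for the original model on CPU.
```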

Thin client inference (e.g. browser-based)

Path for thin clients like browser-based inference on commodity hardware (Image by Author)
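
A common pattern for the browser path (again an assumption about the figure’s content, not the author’s definitive recipe) is to export the model to ONNX in Python and run it on the client with a JavaScript runtime such as onnxruntime-web or TensorFlow.js; a minimal export sketch:

```python
# Sketch: exporting a model to ONNX so a browser-side runtime (e.g. onnxruntime-web)
# can run it on the client. Checkpoint and shapes are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

dummy = tokenizer("placeholder input", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)
# The resulting model.onnx can be loaded in the browser with onnxruntime-web.
```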

Constrained / edge device inference (e.g. smartphones, IoT devices)

Path for edge or constrained device inference (Image by Author)
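
For constrained devices, one frequently used route (an assumption on my part, not necessarily the path in the figure) is converting the model to TensorFlow Lite with default optimizations, i.e. dynamic-range quantization; a sketch:

```python
# Sketch: converting a model to TensorFlow Lite for constrained / edge devices.
# The checkpoint, sequence length, and choice of TFLite are illustrative assumptions.
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Wrap the model in a tf.function with a fixed input signature for conversion.
@tf.function(
    input_signature=[
        tf.TensorSpec([1, 128], tf.int32, name="input_ids"),
        tf.TensorSpec([1, 128], tf.int32, name="attention_mask"),
    ]
)
def serving(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [serving.get_concrete_function()]
)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization
# Fall back to TF ops for anything without a TFLite builtin kernel.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```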
