
From cloud to device

The future of AI and machine learning on the Edge

Unmesh Kurup
Towards Data Science
12 min read · Sep 10, 2020


A brief overview of the state-of-the-art in training ML models on devices. For a more comprehensive survey, read our full paper on this topic.

We are surrounded by smart devices: from mobile phones and watches to glasses, jewelry, and even clothes. But while these devices are small and powerful, they are merely the tip of a computing iceberg that starts at your fingertips and ends in giant data and compute centers across the world. Data is transmitted from devices to the cloud, where it is used to train models that are then sent back to be deployed on the device. Unless it is used for learning simple concepts like wake words or recognizing your face to unlock your phone, machine learning is computationally expensive, and data has no choice but to travel these thousands of miles before it can be turned into useful information.

This journey from device to data center and back to device has its drawbacks. The privacy and security of user data are probably the most obvious, as this data needs to be transmitted to the cloud and stored there, most often indefinitely. Transmission of user data is open to interference and capture, and stored data leaves open the possibility of unauthorized access. But there are other significant drawbacks. Cloud-based AI and ML models have higher latencies, cost more to implement, lack autonomy, and, depending on the frequency of model updates, are often less personalized.

As devices become more powerful, it becomes possible to address the drawbacks of the cloud model by moving some or all of the model development onto the device itself. This transfer of model development onto the device is usually referred to as Edge Learning or On-device Learning. The biggest roadblock to Edge Learning is model training, which is the most computationally expensive part of the model development process, especially in the age of deep learning. Speeding up training is possible either by adding more resources to the device, by using these resources more effectively, or by some combination of the two.


Fig 1: A hierarchical view of the various approaches to edge/on-device learning. The boxes in grey are the topics covered in this article and corresponding paper. Image by Author

Fig 1 gives a hierarchical view of the ways to improve model training on devices. On the left are the hardware approaches that work with the actual chipsets. Fundamental research in this area aims at improving existing chip design (by developing chips with more compute and memory, and lower power consumption and footprint) or developing new designs with novel architectures that speed up model training. While hardware research is a fruitful avenue for improving on-device learning, it is an expensive process that requires large capital expenditure to build laboratories and fabrication facilities, and usually involves long timescales for development.

Software approaches encompass a large part of current work in this field. Every machine learning algorithm depends on a small set of computing libraries for the efficient execution of a few key operations (such as Multiply-Add in the case of neural networks). These libraries are the interface between the hardware and the algorithms and allow for algorithm development that is not tied to any specific hardware architecture. However, they are heavily tuned to the unique aspects of the hardware on which the operations are executed, and this dependency limits the amount of improvement that can be gained from new libraries alone. The algorithms side of software approaches gets the most attention when it comes to improving ML on the edge, as it involves the development and improvement of the machine learning algorithms themselves.
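To make the Multiply-Add point concrete, here is a minimal sketch in plain NumPy (not any particular vendor library) showing that a dense layer's forward pass is just a block of multiply-add operations, which is exactly what these low-level compute libraries are tuned to execute on a given chip. The layer sizes are arbitrary.

```python
import numpy as np

def dense_forward(x, W, b):
    # x: (in_features,), W: (out_features, in_features), b: (out_features,)
    # W @ x performs out_features * in_features multiply-adds.
    return W @ x + b

in_features, out_features = 128, 64
x = np.random.randn(in_features).astype(np.float32)
W = np.random.randn(out_features, in_features).astype(np.float32)
b = np.zeros(out_features, dtype=np.float32)

y = dense_forward(x, W, b)
macs = out_features * in_features  # 8192 multiply-adds for this single layer
print(y.shape, macs)
```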

Finally, theoretical approaches help direct new research on ML algorithms. These approaches improve our understanding of existing techniques and their generalizability to new problems, environments, and hardware.

This article focuses on developments in algorithms and theoretical approaches. While hardware and computing libraries are equally important, given the long lead times for novel hardware and the interdependency between hardware and libraries, the state-of-the-art changes faster in the algorithms and theoretical spaces.

Algorithms

Most of the work in on-device ML has been on deploying models. Deployment focuses on improving model size and inference speed using techniques like model quantization and model compression. Training models on devices, by contrast, requires advances in areas such as model optimization and Hyperparameter Optimization (HPO). But advances in these fields improve accuracy and the rate of convergence, often at the expense of compute and memory usage. To improve model training on devices, it is important to have training techniques that are aware of the resource constraints under which these techniques will be run.
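For contrast, here is what a typical deployment-side technique looks like in practice: post-training dynamic quantization in PyTorch, which stores Linear weights as int8 and shrinks a trained model without retraining. Note that this helps inference, not training, and the two-layer network is only a stand-in model for illustration.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a trained network.
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Post-training dynamic quantization: Linear weights become int8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(model(x).shape, quantized(x).shape)  # same interface, smaller weights
```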


The mainstream approach to such resource-aware model training is to design ML algorithms that satisfy a surrogate, software-centric resource constraint alongside a standard loss function. Such surrogate measures are designed to approximate the hardware constraints through asymptotic analysis, resource profiling, or resource modeling. For a given software-centric resource constraint, state-of-the-art algorithm designs adopt one of the following approaches:

Lightweight ML Algorithms — Existing algorithms such as linear/logistic regression or SVMs have low resource footprints and need no additional modifications for resource-constrained model building. This low footprint makes these techniques an easy and obvious starting point for building resource-constrained learning models. However, in cases where the device's available resources are smaller than the resource footprint of the selected lightweight algorithm, this approach will fail. Additionally, in many cases, lightweight ML algorithms result in models with low complexity that may fail to fully capture the underlying process, resulting in underfitting and poor performance.
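As one possible illustration of this approach (not taken from the paper), a linear model trained with scikit-learn's SGDClassifier has a footprint of one weight vector per class plus an intercept, and its partial_fit method lets it learn incrementally from small batches, which fits a constrained device. The synthetic data stream here is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss")  # logistic regression trained via SGD

classes = np.array([0, 1])
for _ in range(100):                            # stream of small batches
    X = rng.normal(size=(16, 8)).astype(np.float32)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)      # incremental, low-memory update

print(clf.coef_.shape, clf.intercept_.shape)    # (1, 8) and (1,): a tiny model
```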

Reducing Model complexity — A better approach to controlling the size (memory footprint) and computational complexity of the learning algorithm is to constrain the model architecture (e.g., by selecting a smaller hypothesis class). This approach has the added advantage that these models can be trained using traditional optimization routines. Apart from model building, this is also one of the dominant approaches for deploying resource-efficient models for inference. Most importantly, this approach extends even to Deep Neural Networks (DNNs) where, as evidenced by Fig 2, there has been a slow but steady progression towards smaller, faster, leaner architectures. This progression has been helped by the increased use of Neural Architecture Search (NAS) techniques that show a preference for smaller, more efficient networks. Compared to the lightweight ML algorithms approach, model complexity reduction techniques can accommodate a broader class of ML algorithms and can more effectively capture the underlying process.

Fig 2. Ball chart of the chronological evolution of model complexity. Top-1 accuracy is measured on the ImageNet dataset. Model complexity is represented by FLOPS and reflected in the ball size. The accuracy and FLOPS figures are taken from the models' original publications. Each model is dated by when its associated publication first became available online. Image by Junyao Guo.
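One concrete way to constrain the hypothesis class in DNNs, in the spirit of the smaller architectures charted in Fig 2, is to replace standard convolutions with depthwise-separable ones, as MobileNet-style networks do. The sketch below (layer sizes chosen arbitrarily) just counts parameters to show the reduction; it is an illustration, not the paper's method.

```python
import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch, k = 64, 128, 3

# Standard 3x3 convolution: in_ch * out_ch * k * k weights (+ out_ch biases).
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

# Depthwise-separable alternative: a per-channel spatial convolution
# followed by a 1x1 pointwise convolution.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
)

print(num_params(standard), num_params(separable))
# roughly 73.9k vs 9.0k parameters for the same input/output shape
```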

Modifying optimization routines — The most significant of the algorithmic advances is the design of optimization routines specifically for resource-efficient model building, where resource constraints are incorporated during the model building (training) phase. Instead of limiting the model architecture beforehand, these approaches adapt the optimization routine to fit the resource constraints for any given model architecture (hypothesis class).

Resource-constrained model-centric optimization routines focus on improving the performance of models that will be quantized after training, whether through stochastic rounding, weight initialization, or the introduction of quantization error into gradient updates. Also prevalent are layer-wise training and techniques that trade computation for memory, both of which try to reduce the computational requirements associated with training DNNs. In certain cases, this approach can also dynamically modify the architecture to fit the resource constraints. Although this approach provides a wider choice of the class of models, the design process is still tied to a specific problem type (classification, regression, etc.) and depends on the selected method/loss function (e.g., linear or ridge regression for regression problems).
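A simplified sketch of one such idea (not any specific routine surveyed above) is to expose the network to its own quantization error during training: the forward pass uses quantized weights, while gradients flow through as if no quantization had happened, i.e., a straight-through estimator. The bit width and toy regression problem are illustrative choices.

```python
import torch

def fake_quantize(w, num_bits=8):
    """Quantize-dequantize w on a uniform grid, but let gradients pass
    straight through: w + (q - w).detach() has the value of q and the
    gradient of w."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-12
    q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()

torch.manual_seed(0)
w = torch.randn(10, requires_grad=True)
x, target = torch.randn(32, 10), torch.randn(32)

pred = x @ fake_quantize(w)          # forward pass sees quantized weights
loss = ((pred - target) ** 2).mean()
loss.backward()                      # gradients w.r.t. the full-precision w
print(loss.item(), w.grad.norm().item())
```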

Resource-constrained generic optimization routines such as Buckwild! and SWALP focus on reducing the resource footprint of model training by using low-precision arithmetic for gradient computations. An alternative line of work involves implementing fixed-point Quadratic Programs (QP), such as QSGD or QSVRG, for solving linear Model Predictive Control (MPC). Most of these algorithms modify fast gradient methods for convex optimization to obtain a suboptimal solution in a finite number of iterations under resource-constrained settings.
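In the spirit of these low-precision methods (a simplified sketch, not the actual Buckwild! or SWALP implementations), the key ingredient is stochastic rounding: values are rounded up or down at random with probability proportional to proximity, so the rounding is unbiased in expectation. The grid spacing, learning rate, and least-squares toy problem below are all illustrative.

```python
import numpy as np

def stochastic_round(x, step=2.0 ** -8):
    """Round x to a fixed-point grid of spacing `step`, rounding up or down
    at random so the result is unbiased in expectation."""
    scaled = x / step
    floor = np.floor(scaled)
    prob_up = scaled - floor                      # fractional part in [0, 1)
    rounded = floor + (np.random.rand(*x.shape) < prob_up)
    return rounded * step

# One low-precision SGD step on a least-squares toy problem.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 10)), rng.normal(size=64)
w = np.zeros(10)

grad = 2 * X.T @ (X @ w - y) / len(y)                     # full-precision gradient
w = stochastic_round(w - 0.01 * stochastic_round(grad))   # low-precision update
print(w[:3])
```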

Data Compression — Rather than constraining the model size/complexity, data compression approaches target building models on compressed data. The goal is to limit memory usage via reduced data storage and to limit computation through a fixed per-sample computation cost. A more generic approach involves adopting advanced learning settings that accommodate algorithms with smaller sample complexity. However, this is a broader research topic and is not limited to on-device learning.
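A simple illustration of training on compressed data (just one of many possible compression schemes, chosen here for brevity) is to project inputs to a lower dimension with a random projection before fitting a model, so per-sample storage and compute are fixed by the reduced dimension. The dimensions and synthetic data are arbitrary.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))                 # 1000-dimensional samples
y = (X[:, :5].sum(axis=1) > 0).astype(int)

projector = GaussianRandomProjection(n_components=64, random_state=0)
X_small = projector.fit_transform(X)             # per-sample storage: 64 floats, not 1000

clf = SGDClassifier(loss="log_loss").fit(X_small, y)
print(X_small.shape, clf.score(X_small, y))
```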

New protocols for data observation — Finally, novel approaches are possible that completely change the traditional data observation protocol (such as the availability of i.i.d. data in batch or online settings). These approaches are guided by an underlying resource-constrained learning theory that captures the interplay between resource constraints and the quality of the model in terms of its generalization capacity. Compared to the above approaches, this framework provides a generic mechanism to design resource-constrained algorithms for a wider range of learning problems, applicable to any method/loss function targeting that problem type.

Challenges
The major challenge in algorithms research is the proper software-centric characterization of hardware constraints and the appropriate use of this characterization for better metric designs. If hardware dependencies are not properly abstracted away, the same model and algorithm can have very different performance profiles on different hardware. While novel loss functions can take such dependencies into account, this is still a relatively new field of study. The assumption in many cases is that the resource budget available for training does not change, but that is rarely the case. Our everyday devices are often multi-tasking: checking emails, social media, messaging people, playing videos… the list goes on. Each of these apps and services is constantly vying for resources at any given moment. Taking this changing resource landscape into account is an important challenge for effective model training on the edge.
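A crude sketch of what such resource-awareness might look like at training time (the 10% budget policy, the per-sample size, and the thresholds are entirely made up for illustration) is to check the memory that is actually free before each step and scale the batch size accordingly; a real system would need something far more principled.

```python
import psutil

def adaptive_batch_size(max_batch=64, min_batch=4,
                        bytes_per_sample=4 * 224 * 224 * 3):
    """Pick a batch size from the memory that is free right now, leaving
    headroom for the other apps competing for the device's resources."""
    available = psutil.virtual_memory().available
    budget = 0.1 * available                 # assumed policy: use ~10% of free memory
    batch = int(budget // bytes_per_sample)
    return max(min_batch, min(max_batch, batch))

# Re-evaluated every step, so the batch shrinks when other apps grab memory.
for step in range(3):
    print(f"step {step}: batch size {adaptive_batch_size()}")
```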

Finally, improved methods for model profiling are needed to more accurately calculate an algorithm's resource consumption. Current approaches to such measurements are abstract and focus on applying software engineering principles such as asymptotic analysis, or on low-level measures like FLOPS or MACs (Multiply-Add Computations). None of these approaches gives a holistic picture of resource requirements, and in many cases they represent only an insignificant portion of the total resources required by the system during learning.
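To see why FLOPS or MACs alone understate the cost of training, here is a rough back-of-the-envelope estimate (the 4-bytes-per-value assumption, the Adam-style optimizer state, and the layer sizes are illustrative) of the memory a single training step needs beyond the forward pass: gradients, optimizer state, and stored activations.

```python
import torch.nn as nn

def training_memory_estimate(model, activations_per_sample, batch_size,
                             bytes_per_value=4, optimizer_states_per_param=2):
    """Rough bytes for one training step: weights + gradients + optimizer
    state (e.g. Adam keeps two extra values per parameter) + activations
    stored for backprop. Inference-style MAC counts miss most of this."""
    n_params = sum(p.numel() for p in model.parameters())
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    opt_state = n_params * bytes_per_value * optimizer_states_per_param
    acts = activations_per_sample * batch_size * bytes_per_value
    return weights + grads + opt_state + acts

model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))
est = training_memory_estimate(model, activations_per_sample=1024 + 512 + 10,
                               batch_size=32)
print(f"{est / 1e6:.1f} MB for one training step")
```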

Theory

Every learning algorithm is based on an underlying theory that guarantees certain aspects of its performance. Research in this area focuses mainly on Learnability — the development of frameworks to analyze the statistical aspects (i.e., error guarantees) of algorithms. While traditional machine learning theories underlie most current approaches, developing newer notions of learnability that include resource constraints will help us better understand and predict how algorithms will perform under resource-constrained settings. There are two broad categories of theories into which most of the existing resource-constrained algorithms can be divided:

Traditional Learning Theories — Most existing resource-constrained algorithms are designed following traditional machine learning theory (like PAC Learning Theory, Mistake Bounds, Statistical Query). A limitation of this approach is that such theories are built mainly for analyzing the error guarantees of the algorithm used for model estimation. The effect of resource constraints on the generalization capability of the algorithm is not directly addressed through such theories. For example, algorithms developed using the approach of reducing model complexity typically adopt a two-step approach. First, the size of the hypothesis class is constrained beforehand to those that use fewer resources. Next, an algorithm is designed guaranteeing the best-in-class model within that hypothesis class. What is missing in such frameworks is the direct interplay between the error guarantees and the resource constraints.
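As a deliberately simple illustration of this two-step pattern (a toy example, not from the paper): first fix a small hypothesis class by capping the depth of a decision tree, then run a standard procedure to find the best model within that class. Nothing in the procedure itself ties the error guarantee to the memory saved by the depth cap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Step 1: constrain the hypothesis class up front (shallow trees only).
# Step 2: pick the best-in-class model with a standard selection procedure.
best_depth, best_score = None, -1.0
for depth in (1, 2, 3):           # the resource "budget" is implicit in the cap
    score = cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=5
    ).mean()
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, round(best_score, 3))
```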

Resource-constrained learning theories — Newer learning theories try to overcome the drawbacks of traditional theories, especially since new research has shown that it may be impossible to learn a hypothesis class under resource-constrained settings. Most of the algorithms from earlier that assume new protocols for data observation fall into this category of resource-constrained theories. Typically, such approaches modify the traditional assumption of i.i.d. data being presented in a batch or streaming fashion and introduce a specific protocol of data observability that limits the memory/space footprint used by the approach. These theories provide a platform to utilize existing computationally efficient algorithms under memory-constrained settings to build machine learning models with strong error guarantees. Prominent resource-constrained learning theories include Restricted Focus of Attention (RFA), newer Statistical Query (SQ) based learning paradigms, and graph-based approaches that model the hypothesis class as a hypothesis graph. Branching programs instead represent the learning algorithm under memory constraints as a matrix (rather than a graph), where the stability of the matrix norm (an upper bound on its maximum singular value) is connected to the learnability of the hypothesis class with limited memory. Although such theory-motivated design provides a generic framework through which algorithms can be designed for a wide range of learning problems, to date, very few algorithms based on these theories have been developed.

Challenges
Perhaps the biggest drawback of theoretical research is that, while it is flexible enough to apply across classes of algorithms and hardware systems, it is limited by the inherent difficulty of such research and by the need to implement a theory in the form of an algorithm before its utility can be realized.

Conclusion

A future full of smart devices was the stuff of science fiction when we slipped the first iPhones into our pockets. Thirteen years later, devices have become much more capable and now promise the power of AI and ML right at our fingertips. However, these new-found capabilities are a facade propped up by the massive computational resources (data centers, compute clusters, 4G/5G networks, etc.) that bring AI and ML to life. But devices can only be truly powerful on their own when it is possible to sever the lifeline that extends between them and the cloud. And that requires the ability to train machine learning models on these devices rather than in the cloud.

Training ML models on a device has so far remained largely an academic pursuit, but with the increasing number of smart devices and improved hardware, there is growing interest in performing learning on the device itself. In industry, this interest is fueled mainly by hardware manufacturers promoting AI-specific chipsets that are optimized for certain mathematical operations, and by startups providing ad hoc solutions for certain niche domains, mostly in computer vision and IoT. From an AI/ML perspective, most of the activity lies in two areas: the development of algorithms that can train models under resource constraints, and the development of theoretical frameworks that provide guarantees about the performance of such algorithms.

At the algorithmic level, it is clear that current efforts are mainly targeted at either utilizing already lightweight machine learning algorithms or modifying existing algorithms in ways that reduce resource utilization. There are a number of challenges to address before we can consistently train models on the edge, including the need to decouple algorithms from the hardware and to design effective loss functions and metrics that capture resource constraints. Also important are an expanded focus on traditional as well as advanced ML algorithms with low sample complexity, and on dealing with situations where the resource budget is dynamic rather than static. Finally, the availability of an easy and reliable way to profile algorithm behavior under resource constraints would speed up the entire development process.

Learning theory for resource-constrained algorithms is focused on the un-learnability of an algorithm under resource constraints. The natural step forward is to identify techniques that can instead provide guarantees on the learnability of an algorithm and the associated estimation error. Existing theoretical techniques also mainly focus on the space (memory) complexity of these algorithms and not their compute requirements. Even in cases where an ideal hypothesis class can be identified that satisfies resource constraints, further work is needed to select the optimal model from within that class.

Despite these difficulties, the future for machine learning on the edge is exciting. Model sizes, even for deep neural networks, have been trending down. Major platforms such as Apple's Core ML/Create ML support the retraining of models on the device. And while the complexity and training regimens of models continue to grow, it is well within the realm of possibility that we will continue to see a push to offload computation from the cloud to the device for reasons of privacy and security, cost, latency, autonomy, and better personalization.

This article was written with contributions from Sauptik Dhar, Junyao Guo, Samarth Tripathi, Jason Liu, Vera Serdiukova, and Mohak Shah.

If you are interested in a more comprehensive survey of edge learning, read our full paper on this topic.
