Complex Web of AI

Phil J. Łaszkowicz
Towards Data Science
15 min read · Feb 27, 2020


As a follow-up to the previous articles on the origins of Swift for TensorFlow (S4TF) and the hardware powering modern AI, this article is part one of two focused on machine learning and the Web. Whereas the second part will focus on the Web as a data platform, this one looks at the emerging tooling for AI and how the Web is going to become a first-class platform connecting intelligence in a way that has not been possible previously.

Introductions

Although developments in AI tooling on the Web are not particularly recent, we’ll start in September 2019, with a workshop held in a hotel in Fukuoka, Japan, as part of the World Wide Web Consortium’s Technical Plenary / Advisory Committee (W3C TPAC) conference. In that meeting sat a group of industry experts with a unique perspective on the emerging technologies that have the potential to bring the Web and AI together into an unparalleled development platform for intelligent solutions.

Around the table sat a member of Microsoft’s Edge team; to his right, a member of Apple’s Safari development team; then an engineer from Intel (and the founder of the Machine Learning for the Web / WebNN W3C group); and two people from Google, including one of the main experts on TensorFlow.js.

The conversation covered a wide array of topics, including WebAssembly, hardware design, and compilation toolchains, along with several presentations on related technologies that will inevitably feed into the Machine Learning for the Web specifications.

Much of the material had been shared before, as part of a growing body of knowledge exchanged between the major browser vendors and a wider community of domain experts, and it demonstrates how seriously the evolution of the Web into a first-class AI platform is being taken.

Finding Harmony with Developers

The technologies underpinning this movement are not entirely focused on the Web, and nor should they be. It is essential that the experience and evolution that have already produced a common, ubiquitous set of machine learning tooling across other platforms are shared, so that the Web becomes just as easy to develop for.

In the same way that TensorFlow has multiple permutations focused on specific mediums (TensorFlow, TensorFlow Lite, TensorFlow.js, etc.), high-level machine learning tooling typically needs to take a broad view of requirements before narrowing to specific problem areas, with compatibility and generalisation imperative to simplifying the entry point for developers. Consider typical programming paradigms like functional programming (FP) or object-oriented programming (OOP): both are used across a variety of languages to give developers ease of adoption and a familiar set of approaches to problems.

Simplification of Low-to-High Level ML Tooling Separation

This simplification of the approach to machine learning is essential to making it accessible to all developers. Mobile development has been getting richer thanks to proprietary tooling from Apple (Core ML) and Google (ML Kit), both of which provide tools that enable solutions written in Python, with machine learning libraries like TensorFlow, to be ported with little effort to products written idiomatically in Swift and Kotlin respectively.

In April 2017 Google’s DeepMind team open-sourced Sonnet, a high-level abstraction over TensorFlow it had been developing internally, aiming to make it easier for other developers to benefit from a common development model whilst still using the highly-tuned and powerful features inherent in the TensorFlow ecosystem.
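
To give a sense of the kind of abstraction Sonnet provides, here is a minimal sketch using the current Sonnet 2 API; the layer sizes and input shape are arbitrary choices for illustration, not taken from any real model.

import sonnet as snt
import tensorflow as tf

# A reusable module defined once, with TensorFlow doing the heavy lifting underneath.
mlp = snt.nets.MLP([128, 10])
logits = mlp(tf.random.normal([8, 784]))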

Of course, most of these developer tools continue to be Python-based, but Python is hardly the best tool for writing machine learning models. It lacks static types, support for custom operations, and first-class concurrency, and it does not scale well: it is essentially still an interpreted, single-CPU language, which works poorly for scientific research, where C++ is still heavily used. In fact, Python in machine learning is generally just an abstraction over computational operations performed in C++, which need to be ported to each platform, taking hardware architecture into consideration along the way. This becomes increasingly complicated as we build more and more solutions on edge devices (such as smartphones), and as new operations are added to the ML libraries (which is not straightforward in Python due to the nature of the language). TensorFlow is currently seeing roughly a 15–20% increase in operations annually and already supports thousands: the task of making this ecosystem available for the Web is inevitably and unendingly complex.

Giving Models a Boost

This is where Swift for TensorFlow originally came in. The engineers at Google became so convinced of Swift’s potential as the first-class ML language of the future that they contributed back to the compilation toolchain, adding features like automatic differentiation to the LLVM-based stack. The goal is to let Swift express models that are safe, concurrent, and compiled, giving developers a familiar and intuitive experience along with compile-time checks and debugging during model development.

Simplification of Swift approach to target performance and tooling

However, the seemingly infinite libraries written in Python are not going to be migrated to Swift anytime soon and, more importantly, neither are the academic papers and teaching materials focused on C++ and Python. It’s essential that Swift supports importing Python modules and type interoperability (see the Swift snippet below) but, despite this, most research is going to be using Python for some time to come.

import Python

// Import a Python module from Swift and use it directly.
let np = Python.import("numpy")
print(np)
let zeros = np.zeros([2, 3])
print(zeros)

This lack of performance in Python is why, in March 2017, Google announced XLA (Accelerated Linear Algebra), a compiler for TensorFlow that allows each operation to be optimized for the target architecture, using JIT compilation, for CPUs, GPUs, and of course Google’s TPUs. XLA is part of the core of TensorFlow, so it’s available to every model developed in Python. To support additional architectures (for example, the growing range of NPUs on edge devices), XLA allows new backends to be added thanks to its use of LLVM IR.

TensorFlow Compiler Ecosystem
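
As a rough sketch of what opting in looks like from Python, the following asks TensorFlow to compile a function with XLA; the exact flag name has shifted between releases (experimental_compile in TF 2.1, jit_compile later), so treat it as indicative rather than definitive.

import tensorflow as tf

# Request XLA JIT compilation for this function so its ops are fused and
# optimized for the target backend (CPU, GPU, or TPU).
@tf.function(experimental_compile=True)
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)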

The fact that XLA is part of TensorFlow and powered by LLVM means it’s also available to other TensorFlow-based tooling and, in particular, can be used with Swift for TensorFlow. For interpreted languages like Python, however, JIT-based compilation of models is more of a stopgap for languages that were never built for machine learning than an approach for a modern, machine learning-ready language. If we really want to push the performance of machine learning in high-level modern languages, then the compilation toolchain needs to treat models as code.

Model as Code

One of the exciting developments in Swift for TensorFlow is the LLVM support for compiling models as code (and LLDB support for debugging models through tools like Jupyter).

Optimizing code during compilation is achieved through an intermediate representation (IR). Swift has this capability thanks to SIL (the Swift Intermediate Language), making it a very fast language and, combined with type safety and concurrency, one in which it is quick and easy to develop production-ready code with a minimal footprint. Swift for TensorFlow is about bringing that same modern, performant development experience to machine learning, and making machine learning accessible to every developer, no matter how deep or shallow the solution needs to be.

Swift achieves this with first-class language support for automatic differentiation (AD). The following is a common example showing how to import a C function and make it differentiable in Swift (yes, Swift can import C libraries as well as Python).

import Glibc

// A plain Swift function built on the C pow() exposed by Glibc.
func sillyExp(_ x: Float) -> Float {
    let 𝑒 = Float(M_E)
    print("Taking 𝑒(\(𝑒)) to the power of \(x)!")
    return pow(𝑒, x)
}

// Register a custom derivative: d/dx e^x = e^x, so the pullback scales by y.
@differentiating(sillyExp)
func sillyDerivative(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    let y = sillyExp(x)
    return (value: y, pullback: { v in v * y })
}

print("exp(3) =", sillyExp(3))
print("𝛁exp(3) =", gradient(of: sillyExp)(3))

This leads us to a world where the compilation toolchain is not just about optimizing human-readable instructions during conversion into executable programs, but one which fully supports optimizing machine learning operations into models for the target platform, safely and with a high level of abstraction.

This article isn’t about the state of Swift though, but the state of AI for the Web, so let’s go back to where we are currently and how the Python community is tackling the performance deficiencies.

When it comes to AD, the Python community has developed a variety of tools, most prominently the now defunct Autograd, developed at Harvard. Again, this is more of a patch over the deficiencies of Python than a major leap forward, but it’s part of an impressive ecosystem of Python optimization tools that become essential when you’re trying to squeeze the most out of your models.
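
Autograd’s appeal was how little it asked of the developer: ordinary NumPy-style code becomes differentiable by wrapping it in grad. A minimal sketch, adapted from the style of example in Autograd’s own documentation:

import autograd.numpy as np   # thinly wrapped NumPy
from autograd import grad

def tanh(x):
    y = np.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

grad_tanh = grad(tanh)        # returns a function computing d(tanh)/dx
print(grad_tanh(1.0))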

Autograd became obsolete when its core developers began contributing to JAX in 2017, an Autograd- and XLA-based tool that takes the performance of Python-based machine learning models even further. Although outsiders often view JAX as competing with Swift for TensorFlow, the reality is that both are likely to remain strong options for performant machine learning and will continue to co-exist happily thanks to the common ground they share.
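
The same grad-style interface carries over to JAX, with XLA compilation layered on top. A minimal sketch; the function, weights, and shapes are arbitrary illustrations rather than anything from a real model:

import jax
import jax.numpy as jnp

def predict(w, x):
    return jnp.tanh(jnp.dot(x, w))

def loss(w, x, y):
    return jnp.mean((predict(w, x) - y) ** 2)

# grad gives Autograd-style differentiation; jit hands the whole computation to XLA.
loss_grad = jax.jit(jax.grad(loss))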

Let’s Get Poetic for a Moment

Earlier in this article I mentioned the need for high-level abstractions to simplify the entry point for new developers. Sonnet was perfectly suited to abstracting TensorFlow, but DeepMind has since augmented its use of TensorFlow with the aforementioned JAX for the performance benefits it brings, so Sonnet no longer covers all of their use cases.

Haiku is DeepMind’s project for providing OOP-like concepts on top of JAX, and it essentially replaces Sonnet as a high-level API. It simplifies using JAX, providing a programming approach that is familiar yet still has access to the raw performance of the JAX library.

Right now, if you’re trying to develop models with both ease and performance, then combining Haiku and JAX is a strong approach, and it benefits from the maturity of the existing Python ecosystem.

Haiku is still alpha but, compared with the current state of Swift, it is mature enough to use today with just some mild caution.
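
As a flavour of what that looks like, here is a minimal sketch assuming Haiku’s alpha-era API (hk.transform, hk.nets.MLP), which may well change; the layer sizes and input shapes are arbitrary and purely illustrative.

import haiku as hk
import jax
import jax.numpy as jnp

def forward(batch):
    # A familiar, object-style module definition...
    mlp = hk.nets.MLP([256, 10])
    return mlp(batch)

# ...transformed into pure init/apply functions that JAX can jit and differentiate.
model = hk.transform(forward)
params = model.init(jax.random.PRNGKey(42), jnp.ones([1, 784]))
logits = model.apply(params, None, jnp.ones([1, 784]))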

So far I’ve covered the current state of AI development in terms of performance tooling and the developer experience (DX), but how does any of this relate to the Web?

Caught in the Web

In 2018 Google announced TensorFlow.js, the now-popular machine learning library for JavaScript and the Web. At the time it was a good way to learn machine learning, especially if you were a web developer, but not a great tool for pushing the boundaries of machine learning or AI research.

For example, TensorFlow.js lacked support for many operations, had no built-in browser support and, as a result, no real hardware support. It was a good hobby tool, but serious data scientists were not going to use it as a replacement for cloud-provided machine learning tools.

TensorFlow.js has progressed rapidly since then, and experimentation with it is growing but, compared with Core ML and ML Kit for mobile models, or TensorFlow Lite for edge devices, TensorFlow.js is not even close. At the time of the announcement I never expected it to be; that changed once I became involved in the Machine Learning for the Web group.

Let’s first discuss where TensorFlow.js sits, as there has been a lot of hyperbole, mostly from web developers, claiming it will make JavaScript (and the Web) the best place to develop and deliver models. This is far from the truth: TensorFlow.js is closer to TensorFlow Lite and Core ML, and aims to solve the same problems.

If you want to make your mobile or edge solutions smarter, more proactive, or more personalised, you don’t resort to sending every bit of data to your cloud ML service and having it throw instructions back. You run models on the target devices, utilizing the hardware provided (such as GPUs or neural processing units (NPUs)) and OS-embedded tooling to deliver computer vision, speech recognition, or personalization features to your end-users. There is no equivalent for the Web. It’s a platform without a production-ready ML ecosystem.

TensorFlow.js is about bringing the same tools that mobile and edge developers are already exploiting to the Web, so the end-user experience can be enhanced in much the same way. This inherently brings additional benefits to end-users, such as privacy-first solutions (no more over-zealous sending of user data to cloud services), and provides even more incentive for developers to build progressive web apps (PWAs) with full offline capability.

We’re not there with TensorFlow.js yet though, and that’s because of substantial roadblocks that will inevitably affect any attempt to bring machine learning on the Web to parity with mobile and edge devices.

Weaving the Complex Web

Let’s start by discussing how TensorFlow.js actually works. First, it has to implement the same operations as the other TensorFlow libraries. This means supporting thousands of operations, growing each year at a rate that is difficult to keep up with. And TensorFlow.js isn’t simply a translation of Python and C++ models into JavaScript (because that would be simple, right?); it also requires operation support from the hardware, via every browser.

That currently works in a fairly hacky way: WebGL is used to access the device’s GPU acceleration and run the instructions. That’s right, the Web’s GPU API, typically used for 3D transformations, is being re-purposed to run machine learning operations. It makes sense (GPUs have been doing this at a lower level for a long time), and it’s exciting to see it happen through the browser.

TensorFlow.js Architecture (courtesy of TensorFlow.js community)

Why does this seem so hacky then? Well, the world has moved on. Pushing GPUs to do ML arithmetic has given way to optimizations through compilation tooling (e.g. XLA via LLVM) and custom hardware in the form of neural processing units (including Google’s TPU and the various custom ARM NPUs running on the latest Android and iOS devices). It shouldn’t be necessary anymore to rely on WebGL hacks.

Let’s look at the tooling. Earlier we saw how XLA and LLVM optimize Python and Swift models by taking the operations and producing compiled machine code tuned for the target platform (hardware and software). With Swift this is accomplished through SIL, an intermediate representation (IR) of the Swift program that allows the program to be re-interpreted during the build process. Python has no equivalent, so it relies on just-in-time (JIT) optimisations. These are excellent boosters for Python models, but they are a lot of work and still fall short of a statically-typed, compiled language, due to the lack of guarantees about which branch of the operation tree will execute.

Swift Compiler Infrastructure

XLA may provide a 15% improvement on Python models, whereas Swift for TensorFlow can be 400% faster.

JavaScript looks a lot closer to Python than to Swift in that it is loosely typed and interpreted, so the XLA route looks like the logical path for browser-based ML. Except XLA produces machine code, and much of the benefit of TensorFlow.js is that it runs in the browser.

How can we compile JavaScript into optimized machine instructions the browser can understand? WebAssembly, right? Actually, no. Simply put, WebAssembly operates in its own memory space and cannot interact with your model. It will undoubtedly provide performance improvements for complex end solutions that use ML (TensorFlow.js has a WASM backend, and other projects like ONNX have their own WASM efforts), but these operate indirectly via the CPU or WebGL. They are generally faster than pure JavaScript, so they should be used whenever you’re developing models, but they still have a long way to go to match the potential of IR optimizations with native hardware support.

This is where the Web Neural Network API (WebNN) discussions came in. Making TensorFlow.js equal to TensorFlow Lite, Core ML, and ML Kit means browser vendors need to be on board; consider them to be at the same operating level as Android or iOS when it comes to model execution. This requires standardisation (which is never quick) and cooperation (which, even with the best intentions, can be complex).

Let’s remember, most machine learning development is done in an inefficient language (Python) and then converted to each target platform via tools like Core ML and TensorFlow Lite, meaning every developer has to know Python, regardless of what they’re developing for, if they want to make the most of machine learning. This is the case despite the fact that modern devices have built-in support for machine learning operations, which is not directly accessible from languages like Swift and Kotlin.
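
To make that conversion step concrete, here is a minimal sketch of the TensorFlow Lite path; the toy Keras model is purely a placeholder, and the Core ML route is analogous but goes through Apple’s coremltools package.

import tensorflow as tf

# A model authored and trained in Python...
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# ...then converted into a flatbuffer that ships inside a mobile app
# and runs on-device through the TensorFlow Lite runtime.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("model.tflite", "wb") as f:
    f.write(converter.convert())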

Tearing It Back Down

The XLA optimization tool is currently capable of optimizing models built with JAX, Julia, and PyTorch. It does this through another IR component called HLO (High-Level Optimizer), an extensible framework that supports backends for different architectures, with LLVM IR able to target a wide variety of CPUs and GPUs. Essentially, XLA HLO is another abstraction for IR-based optimizations.
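
If you’re curious what that IR looks like, JAX makes it reasonably easy to peek at the HLO it hands to XLA. A minimal sketch, assuming a JAX release from around this period (the exact helper has moved between versions):

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.tanh(x) ** 2)

# Trace the function and print the HLO text that XLA will optimize
# for whichever backend (CPU, GPU, TPU) it is compiled against.
print(jax.xla_computation(f)(jnp.ones(8)).as_hlo_text())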

Taking a step back, it’s clear that Swift SIL, XLA, and LLVM IR are all approaching the same problems. Rust similarly has MIR and HIR and lowers to LLVM IR, and other modern toolchains take comparable layered approaches. If we could combine the approaches of XLA and LLVM IR in a modern language, we would be able to develop models with C++-level performance, easily and safely. Swift for TensorFlow promises this, but what about other languages?

Various Compiler Infrastructure supported by LLVM IR

If we tried to develop a unified compiler toolchain that supported both a programming language and machine learning models, we’d quickly become unstuck. The dependence on hardware evolution, and the differences between numerical abstractions and ML development, mean the two problems evolve at entirely different scales. Abstracting to an intermediate representation for ML modelling is vital for this to work.

This is where MLIR, or multi-level intermediate representation, comes in. MLIR is Google’s attempt to create an extensible toolchain, in the spirit of LLVM IR, for the optimized compilation of models. MLIR currently supports various dialects, including TensorFlow IR, TensorFlow Lite, XLA HLO, and LLVM IR, and it provides a basis for adding further dialects in the future. Google donated MLIR to LLVM so it can benefit from the wealth of experience the LLVM community has in building powerful instruction-optimization infrastructure.

Adding XLA HLO to LLVM IR

A dialect capable of optimizing a common set of instructions for the browser may not be that far away, then. WebNN has been looking at the use cases to take this further, and at the essential operations needed to make it happen, without focusing too heavily on MLIR at this stage.

This needs to go beyond CPUs and GPUs though. In the future it’s not infeasible for devices to carry a range of NPUs for specific tasks, such as speech recognition or computer vision. Operating systems already make intelligent decisions about where to route specific instructions, but the Web still has only a naïve understanding of CPUs and GPUs.

Browser vendors are already working on this by allowing a preferred processor architecture to be specified for a set of instructions, with a fallback selected if it’s unavailable. This will enable NPUs to be used in the browser and, combined with optimized operations from MLIR, could mean near-native performance of machine learning models directly from web applications, finally giving them the same level of access to ML tooling that mobile devices have been benefiting from.

We’re still a long way from this because the Web is, in fact, not a platform; it’s an ecosystem. It’s easier to describe browsers as platforms (i.e. Chrome / Chromium, Firefox / Servo, Safari / WebKit, etc.) for web applications, but the interoperability of those standards relies on cooperation, and for machine learning it requires careful consideration of compiler technologies and the very different realities of hardware roadmaps.

This is an ongoing discussion, and is potentially the biggest change in the Web for some time. It will bring the promise of privacy to ML-powered web solutions, and a new era of web applications and portable products that are only just finding their way onto modern mobile devices.

The current way of optimizing models using interpreted IR is inefficient due to the possible permutations of graph routes. Machine learning is already power hungry, so reducing the energy consumed in executing models is an essential benchmark for any future ML tooling. That can be done by moving to C++, or by moving the majority of ML development to languages that provide similar performance with an easier point of entry, giving everyone access to tools for efficient model training and execution. The latter would be preferable.

MLIR and WebNN will improve performance, tooling, and options for machine learning on and off the Web, but there are already tools in place for high-level machine learning that most developers have yet to exploit. The common approach is to blame the algorithms for performance issues, when most popular libraries already support optimizations for specific architectures. Using features like XLA (and JAX) can already make a difference to reliable production performance of training and model execution, and using WebAssembly where possible will also provide advantages on the Web.

MLIR and WebNN are still evolving, but it’s imperative that developers learn not only the layer and algorithm APIs of the libraries they’re using, but also how to extract performance gains from the tooling provided. That means understanding the target hardware and knowing how to switch those optimizations on.

Although most of this article focused on TensorFlow, part of the effort going into the tools mentioned is about unification and commonality across the high-level tools, to simplify research and development gains. Projects like PyTorch and ONNX are heading in the same direction, and there’s a level of interoperability and compatibility that should make tool selection much easier for developers.
