
TensorFlow introduced PluggableDevice in mid-2021, which enables hardware manufacturers to seamlessly integrate their accelerators (e.g. GPUs, TPUs, NPUs) into the TensorFlow ecosystem. This allows users to enjoy accelerated training on non-CUDA devices with minimal code modification. More importantly, hardware manufacturers no longer have to fork and maintain their own version of TensorFlow (e.g. the AMD ROCm port) and can focus purely on the communication layer between TensorFlow and device-level operations. With the recent public release of macOS Monterey, Apple has added Metal support for the PluggableDevice architecture, so it is now possible to train TensorFlow models with the dedicated GPU (dGPU) on MacBook Pros and iMacs with ease (sort of).
In this mini-guide, I will walk through how to install `tensorflow-metal` to enable dGPU training on Intel MacBook Pros and iMacs. In addition, I train a simple CNN image classifier on my MacBook Pro, equipped with an AMD Radeon Pro 560X, to demonstrate the accelerated performance.
Create development environment
I personally prefer miniconda, but other environment managers such as anaconda and virtualenv should also work in a similar fashion.
We first create a new conda environment named `tf-metal` with Python 3.8:
conda create -n tf-metal python=3.8
We then activate the environment:
conda activate tf-metal
Install Metal enabled TensorFlow
We have to install two pip packages: [tensorflow-macos](https://pypi.org/project/tensorflow-macos/) and [tensorflow-metal](https://pypi.org/project/tensorflow-metal). Normally, you can simply run `pip install tensorflow-macos tensorflow-metal` and Bob’s your uncle. However, you might receive the following error, since both packages are built against the post-macOS 11 SDK:
ERROR: Could not find a version that satisfies the requirement tensorflow-macos (from versions: none)
ERROR: No matching distribution found for tensorflow-macos
To bypass the version compatibility issue, we need to set the environment variable `SYSTEM_VERSION_COMPAT=0` when running `pip install`:
SYSTEM_VERSION_COMPAT=0 pip install tensorflow-macos tensorflow-metal
Both packages should now be installed:
(tf-metal) ➜ ~ pip list
Package Version
----------------------- ---------
absl-py 0.15.0
astunparse 1.6.3
cachetools 4.2.4
certifi 2021.10.8
charset-normalizer 2.0.7
clang 5.0
flatbuffers 1.12
gast 0.4.0
google-auth 2.3.1
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.41.1
h5py 3.1.0
idna 3.3
keras 2.6.0
Keras-Preprocessing 1.1.2
Markdown 3.3.4
numpy 1.19.5
oauthlib 3.1.1
opt-einsum 3.3.0
pip 21.2.4
protobuf 3.19.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.7.2
setuptools 58.0.4
six 1.15.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow-estimator 2.6.0
tensorflow-macos 2.6.0
tensorflow-metal 0.2.0
termcolor 1.1.0
typing-extensions 3.7.4.3
urllib3 1.26.7
Werkzeug 2.0.2
wheel 0.37.0
wrapt 1.12.1
Check physical devices in TensorFlow
We can use `tf.config.list_physical_devices()` to check all available physical devices:
>>> import tensorflow as tf
>>>
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
We can see that, in the case of my 2018 MacBook Pro with the AMD Radeon Pro 560X dGPU, there are two physical devices: a `CPU` and a `GPU`.
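As an aside, if you ever want to force a CPU-only run for comparison, one way (assuming no tensors or ops have been created yet) is to hide the GPU with `tf.config.set_visible_devices`; a minimal sketch:

```python
import tensorflow as tf

# Hide all GPUs from TensorFlow so every subsequent op runs on the CPU.
# NOTE: this must be called before any tensors or ops are created.
tf.config.set_visible_devices([], "GPU")

# Only CPU devices should remain visible now.
print(tf.config.get_visible_devices())
```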
Similar to using a native device or CUDA device in TensorFlow, we can declare a variable or define operations to run on a specific device using the `with tf.device()` syntax:
>>> with tf.device('/GPU'):
... a = tf.random.normal(shape=(2,), dtype=tf.float32)
... b = tf.nn.relu(a)
...
2021-10-26 12:51:24.844280: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Metal device set to: AMD Radeon Pro 560X
systemMemory: 16.00 GB
maxCacheSize: 2.00 GB
2021-10-26 12:51:24.845013: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-10-26 12:51:24.845519: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
>>>
>>> a
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-1.6457689, -0.2130392], dtype=float32)>
>>> b
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 0.], dtype=float32)>
You can see from the print-out during initialization that the Metal device `AMD Radeon Pro 560X` is being set.
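To get a quick feel for the relative speed of the two devices, you can time the same operation on each. A minimal sketch (the matrix size and repeat count are illustrative choices of mine):

```python
import time
import tensorflow as tf

def time_matmul(device, n=2000, repeats=10):
    """Average the wall time of an n x n matmul on a device string, e.g. '/CPU:0'."""
    with tf.device(device):
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        tf.matmul(a, b)  # warm-up run (kernel setup, transfers)
        start = time.perf_counter()
        for _ in range(repeats):
            c = tf.matmul(a, b)
        _ = c.numpy()  # force the (possibly asynchronous) execution to finish
        return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('/CPU:0'):.4f}s per matmul")
# On a machine with a working Metal GPU:
# print(f"GPU: {time_matmul('/GPU:0'):.4f}s per matmul")
```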
Training a CNN classifier
To demonstrate the training performance of `tensorflow-metal` against vanilla `tensorflow` (i.e. on CPU), I have written a script that trains a simple CNN model on MNIST using RMSprop for 50 epochs. Note that I am using TensorFlow Datasets to download MNIST, so please run `pip install tensorflow_datasets` if you want to run the exact same code.
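For reference, a minimal sketch of such a training script might look like the following; the architecture and hyperparameters here are my assumptions, not necessarily those of the original script:

```python
import tensorflow as tf

def build_model():
    # A simple CNN for 28x28 grayscale MNIST digits (10 classes, logit outputs).
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

def main(epochs=50, batch_size=128):
    # Imported here because it is only needed for the data pipeline.
    import tensorflow_datasets as tfds

    def preprocess(image, label):
        return tf.cast(image, tf.float32) / 255.0, label

    train_ds, val_ds = tfds.load("mnist", split=["train", "test"], as_supervised=True)
    train_ds = train_ds.map(preprocess).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    val_ds = val_ds.map(preprocess).batch(batch_size).prefetch(tf.data.AUTOTUNE)

    model = build_model()
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_ds, validation_data=val_ds, epochs=epochs)
    return model

# To reproduce the experiment: main()
```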
The following are the training results and Activity Monitor screenshots of the CNN model trained with `tensorflow` (CPU) and `tensorflow-metal` (GPU).

![Activity Monitor screenshot and CNN training performance on CPU [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/10/1IjC95nGQ-Tq1teP9lPH2cw.png)

![Activity Monitor screenshot and CNN training performance on GPU [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/10/1SgLrIy1oJWspcI6HhHKPtA.png)
We can see that training with both `tensorflow` and `tensorflow-metal` achieved similar training and validation accuracy. Moreover, the CNN model takes on average 40ms/step on CPU compared to 19ms/step on GPU, a ~52% reduction in step time (roughly a 2× speedup). From the Activity Monitor screenshots, we can also see that the AMD Radeon Pro 560X dGPU is indeed being used by python3.8, with a GPU usage of ~56%.
Thanks to the TensorFlow PluggableDevice architecture, hardware developers can now enable non-CUDA accelerators to work with TensorFlow without having to fork or port the existing TensorFlow codebase. Based on my limited experiments, `tensorflow-metal` seems to work relatively well and seamlessly on Intel Macs with a dGPU. Nevertheless, the Metal plugin is still in an early phase of development, and there are known bugs (e.g. the Adam optimizer is currently not working) that may keep ML developers from switching to a `tensorflow-metal` workflow just yet. Hopefully, as more and more hardware manufacturers integrate their products with the PluggableDevice API, we will see better support and more options in AI hardware accelerators.
Other references on TensorFlow PluggableDevice and Apple Metal
- TensorFlow – PluggableDevice: Device Plugins for TensorFlow
- TensorFlow – GPU device plugins
- Apple – Tensorflow Plugin – Metal – Apple Developer