
Accelerated TensorFlow model training on Intel Mac GPUs

A mini-guide on how to train a TensorFlow model on a MacBook Pro dGPU via TensorFlow PluggableDevice

Photo by Nikolay Tarashchenko on Unsplash

TensorFlow introduced PluggableDevice in mid-2021, a mechanism that enables hardware manufacturers to seamlessly integrate their accelerators (e.g. GPUs, TPUs, NPUs) into the TensorFlow ecosystem. This allows users to enjoy accelerated training on non-CUDA devices with minimal modifications to their code. More importantly, hardware manufacturers no longer have to fork and maintain their own version of TensorFlow (e.g. the AMD ROCm port) and can focus purely on the communication layer between TensorFlow and device-level operations. With the recent public release of macOS Monterey, Apple has added Metal support for the PluggableDevice architecture, so it is now possible to train TensorFlow models with the dedicated GPU (dGPU) on MacBook Pros and iMacs with ease (sort of).

In this mini-guide, I will walk through how to install tensorflow-metal to enable dGPU training on Intel MacBook Pros and iMacs. I also train a simple CNN image classifier on my MacBook Pro, which is equipped with an AMD Radeon Pro 560X, to demonstrate the accelerated performance.

Create a development environment

I personally prefer miniconda, but other environment managers such as anaconda and virtualenv should also work in a similar fashion.

We first create a new conda environment named tf-metal with Python 3.8

conda create -n tf-metal python=3.8

We then activate the environment

conda activate tf-metal

Install Metal-enabled TensorFlow

We have to install the following pip packages: [tensorflow-macos](https://pypi.org/project/tensorflow-macos/) and [tensorflow-metal](https://pypi.org/project/tensorflow-metal). Normally, you could simply run pip install tensorflow-macos tensorflow-metal and Bob’s your uncle. However, you might receive the following error, since both packages are built against the macOS 11+ SDK:

ERROR: Could not find a version that satisfies the requirement tensorflow-macos (from versions: none)
ERROR: No matching distribution found for tensorflow-macos

To bypass the version compatibility issue, we need to set the environment variable SYSTEM_VERSION_COMPAT=0 when running pip install:

SYSTEM_VERSION_COMPAT=0 pip install tensorflow-macos tensorflow-metal

Both packages should now be installed:

(tf-metal) ➜  ~ pip list
Package                 Version
----------------------- ---------
absl-py                 0.15.0
astunparse              1.6.3
cachetools              4.2.4
certifi                 2021.10.8
charset-normalizer      2.0.7
clang                   5.0
flatbuffers             1.12
gast                    0.4.0
google-auth             2.3.1
google-auth-oauthlib    0.4.6
google-pasta            0.2.0
grpcio                  1.41.1
h5py                    3.1.0
idna                    3.3
keras                   2.6.0
Keras-Preprocessing     1.1.2
Markdown                3.3.4
numpy                   1.19.5
oauthlib                3.1.1
opt-einsum              3.3.0
pip                     21.2.4
protobuf                3.19.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
requests                2.26.0
requests-oauthlib       1.3.0
rsa                     4.7.2
setuptools              58.0.4
six                     1.15.0
tensorboard             2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorflow-estimator    2.6.0
tensorflow-macos        2.6.0
tensorflow-metal        0.2.0
termcolor               1.1.0
typing-extensions       3.7.4.3
urllib3                 1.26.7
Werkzeug                2.0.2
wheel                   0.37.0
wrapt                   1.12.1

Check physical devices in TensorFlow

We can use tf.config.list_physical_devices() to check all available physical devices:

>>> import tensorflow as tf
>>>
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

We can see that, in the case of my 2018 MacBook Pro with the AMD Radeon Pro 560X dGPU, there are two physical devices: a CPU and a GPU.

Similar to using a native device or CUDA device in TensorFlow, we can declare a variable or define operations to run on a specific device using the with tf.device() syntax:

>>> with tf.device('/GPU'):
...     a = tf.random.normal(shape=(2,), dtype=tf.float32)
...     b = tf.nn.relu(a)
...
2021-10-26 12:51:24.844280: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Metal device set to: AMD Radeon Pro 560X
systemMemory: 16.00 GB
maxCacheSize: 2.00 GB
2021-10-26 12:51:24.845013: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-10-26 12:51:24.845519: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
>>>
>>> a
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-1.6457689, -0.2130392], dtype=float32)>
>>> b
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 0.], dtype=float32)>

You can see from the print-out during initialization that the Metal device, the AMD Radeon Pro 560X, is being set.
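
If you want to double-check where individual operations actually end up, TensorFlow can also log device placement. This is an optional sanity check of my own, not part of the official tensorflow-metal instructions:

import tensorflow as tf

# Ask TensorFlow to log which device each operation runs on.
# (Ideally call this right after importing TensorFlow.)
tf.debugging.set_log_device_placement(True)

with tf.device('/GPU'):
    x = tf.random.normal(shape=(1000, 1000))
    y = tf.linalg.matmul(x, x)  # the log should show the Metal GPU device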

Training a CNN classifier

To demonstrate the training performance with tensorflow-metal against vanilla tensorflow (i.e. on the CPU), I have written a script that trains a simple CNN model on MNIST with RMSProp for 50 epochs (a sketch is shown below). Note that I am using TensorFlow Datasets to download MNIST, so please pip install tensorflow_datasets if you want to run the exact same code.
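
The sketch below shows roughly what such a script looks like. The exact architecture, batch size, and preprocessing are my own assumptions; only MNIST via TensorFlow Datasets, RMSprop, and the 50-epoch budget come from the description above.

import tensorflow as tf
import tensorflow_datasets as tfds

# Load MNIST via TensorFlow Datasets and normalize pixel values to [0, 1].
(ds_train, ds_test), ds_info = tfds.load(
    "mnist", split=["train", "test"], as_supervised=True, with_info=True
)

def preprocess(image, label):
    return tf.cast(image, tf.float32) / 255.0, label

ds_train = ds_train.map(preprocess).cache().shuffle(10_000).batch(128).prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(preprocess).cache().batch(128).prefetch(tf.data.AUTOTUNE)

# A simple CNN classifier (hypothetical architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# RMSprop is used because the Adam optimizer is not yet working with
# tensorflow-metal (see the closing notes).
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(ds_train, validation_data=ds_test, epochs=50)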

The following are the training results and Activity Monitor screenshots of the CNN model trained with tensorflow (CPU) and tensorflow-metal (GPU).

Activity Monitor screenshot and CNN training performance on CPU [image by author]
Activity Monitor screenshot and CNN training performance on GPU [image by author]

We can see that training with both tensorflow and tensorflow-metal achieved similar training and validation accuracy. Moreover, the CNN model takes on average 40 ms/step on the CPU compared to 19 ms/step on the GPU, a reduction of roughly 52% in time per step (about a 2.1x speedup). From the Activity Monitor screenshots, we can also see that the AMD Radeon Pro 560X dGPU is indeed being used by python3.8, with a GPU usage of around 56%.
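
As an aside, if you want to reproduce the CPU-only baseline from within the same tf-metal environment (my own suggestion, not necessarily how the comparison above was run), you can hide the GPU from TensorFlow before building the model:

import tensorflow as tf

# Hide the Metal GPU so that all operations fall back to the CPU.
# This must be called before any tensors, datasets, or models are created.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices())  # only the CPU device should remain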

Thanks to the TensorFlow PluggableDevice architecture, hardware developers can now enable non-CUDA accelerators to work with TensorFlow without needing to fork or port the existing TensorFlow codebase. Based on my limited experiments, tensorflow-metal seems to work relatively well and seamlessly on Intel Macs with a dGPU. Nevertheless, the Metal plugin is still in an early phase of development, and there are known bugs (e.g. the Adam optimizer is currently not working) that may keep ML developers from switching to a tensorflow-metal workflow just yet. Hopefully, as more hardware manufacturers integrate their products with the PluggableDevice API, we will see better support and more options in AI hardware accelerators.

