
Training Speed of TensorFlow in macOS Monterey

GPU training in M1 SoC comparing with results in Quadro RTX6000 and estimation in M1 Max SoC

Image by Author

[Update] Oct. 30, 2021

I uploaded a set of Python scripts and image data for the training benchmark to my GitHub. See the bottom of this article for more details.


Background

The new OS, macOS Monterey, has arrived! I had been waiting for it for a long time because GPU training (that is, faster computation) of TensorFlow/Keras models would be officially supported. This means that the deep learning code stored on my Windows workstation can now come alive, literally alive, on macOS machines as well.

But I do not intend to move my whole ecosystem to the Mac right now. I know that my MacBook Air M1 (which I use for work, study, fun, and personal things) is not built for such heavy workloads. Of course, I am quite satisfied with its power and efficiency as an entry-level laptop for daily use. Moreover, as described in my past article "Apple Neural Engine in M1 SoC Shows Incredible Performance in Prediction", the M1 demonstrates incredible prediction (inference) speed in the image segmentation tasks required for my research. It was impressive that the M1 SoC beat an NVIDIA Titan RTX in some cases.

However, unlike prediction, training tasks are complicated and heavy. Indeed, training on a hundred thousand images sometimes takes several days (not minutes or hours!) on the Windows workstation.

In this article, I show my trial of GPU training of a TensorFlow/Keras model on Apple’s M1 SoC. I compared the speed with that of a Quadro RTX6000 and estimated the performance of the M1 Max SoC.

Now, TensorFlow for macOS supports GPU training in Monterey!


Methods

Training tasks for image segmentation were run on the CPU and GPU of the M1 SoC. I followed the official installation steps for TensorFlow on macOS Monterey. For comparison, I ran the same code on my Windows workstation with a Quadro RTX6000, one of NVIDIA’s high-end GPUs. Additionally, I estimated how fast the M1 Max SoC would be. My estimate was made in a simple manner, dividing the M1’s computation time by 4, because the M1 has an 8-core GPU while the M1 Max has a 32-core one.
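The core-count scaling described above fits in a few lines. Note that this is a back-of-envelope assumption (perfectly linear scaling with GPU core count), not a measurement:

```python
# Back-of-envelope estimate used in this article: assume training time
# scales inversely with GPU core count (8-core M1 vs. 32-core M1 Max).
# This is an assumption, not a measurement.
def estimate_m1_max_time(m1_time_s: float,
                         m1_cores: int = 8,
                         m1_max_cores: int = 32) -> float:
    """Scale an M1 training time down by the ratio of GPU cores."""
    return m1_time_s * m1_cores / m1_max_cores

# e.g. 0.328 s/image measured on the M1 GPU scales to 0.082 s/image
print(estimate_m1_max_time(0.328))
```

Real workloads rarely scale perfectly with core count (memory bandwidth and scheduling overhead intervene), so this should be read as an optimistic upper bound on the speedup.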

Table 1. Characteristics and conditions / Image by Author

Table 1 shows the test conditions, and the following screen recording shows the training running on macOS Monterey.


Results

As expected, GPU training is two times faster than CPU training on the M1. The training takes 328 milliseconds per greyscale image of 512×512 pixels, which works out to about nine hours and six minutes for a hundred thousand images. The screen recording shows the M1 using the full range of its GPU power.
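The total-time figure follows from plain arithmetic on the per-image measurement:

```python
# 328 ms per 512x512 greyscale image, over 100,000 images
per_image_s = 0.328
total_s = int(per_image_s * 100_000)           # 32,800 seconds
hours, rem = divmod(total_s, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours} h {minutes} min {seconds} s")  # -> 9 h 6 min 40 s
```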

On the other hand, the RTX6000 shows its power in this field: computer graphics, CAD, numerical simulation, and deep learning. In particular, AMP (Automatic Mixed Precision), a computation technique that uses both 32- and 16-bit floating-point numbers for higher speed and lower memory consumption than single 32-bit precision, provides an effective gain. The M1 SoC has to accept a ten- to twenty-fold gap (I never mean to say it is slow).
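For reference, mixed precision can be turned on in Keras with a global policy. The snippet below is a minimal configuration sketch, not the exact training code used for this benchmark, and the layer choices are illustrative:

```python
import tensorflow as tf

# Enable mixed precision: computations run in float16, while variables
# are kept in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(512, 512, 1)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    # ... rest of the segmentation model ...
    # The output layer is forced to float32 so the loss is computed
    # in full precision.
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid", dtype="float32"),
])
```

On NVIDIA GPUs with Tensor Cores, such as the RTX6000, this policy is what unlocks the AMP speedup discussed above.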

The gap for the M1 Max is estimated to shrink by two to six times thanks to its 32-core GPU (this is my hopeful estimate).

The following tables show the actual results on the M1 SoC and my estimate for the M1 Max, respectively.

Table 2. Training speed in the first two epochs / Image by Author
Table 3. Estimation of training speed with 32-core GPU in M1 Max SoC / Image by Author

Discussion

I was relieved by the results. The M1 SoC is "only" ten to twenty times behind. This entry-level Apple silicon consumes 10 to 15 W of electricity, while the RTX6000 has an appetite for a maximum of 295 W!

In this context, this comparison may be meaningless.

However, for the M1 Max, this comparison is meaningful. The result (again, this is an estimate!) seems good for a laptop chip, but Apple says the new SoC is made for professional use. "Professional use" means that the M1 Max/Pro and other high-end chips are expected to serve the same, or at least similar, purposes in professional settings. A machine can never say "this is too heavy for me" at any moment in its workload. Such users will expect the same performance from it even if it is a laptop.

In this context, the M1 Max may not be a satisfactory choice for me.

TensorFlow for Windows and Linux machines has been developed for years, and NVIDIA GPUs with the CUDA library provide strong support for its performance. I think the TensorFlow implementation for Apple silicon has not been optimized enough yet. In fact, Core ML and Metal, Apple’s own APIs for high-performance computing, seem to use the CPU, GPU, and ANE (Apple Neural Engine) together for heavy computation. The recording above clearly shows that TensorFlow on the Mac uses the GPU only. I believe the integrated use of various kinds of cores is the specific advantage of Apple’s SoCs, so there is plenty of room left for optimization.


Conclusions

  • I can train my TensorFlow/Keras models on the GPU of the M1 SoC using my beloved MacBook Air. The training is two times faster than on its CPU.
  • However, the SoC has no magical power; its performance is bound by its wattage compared with the monster RTX6000.
  • The estimated result for the M1 Max SoC seems good for a laptop, but not enough for my use case: deep learning with a hundred thousand images.
  • The situation may change in the future, depending on how TensorFlow for macOS optimizes the distribution of tasks among cores.

Anyway, whatever the results, to be honest, I want a new M1 Max MacBook Pro! :>


[Update] Oct. 30, 2021

I uploaded a set of Python scripts and image data to my GitHub.

NOTE: The files on GitHub differ from the original code used in the article above because the originals contain research information.

The following Table 2+ shows the results obtained from this released set.

Table 2+. Training speed in the first two epochs / Image by Author
