In my previous article, I compared the M2 Max GPU with the Nvidia V100, P100, and T4 on MLP, CNN, and LSTM training. The results showed that M2 Max can perform very well, even exceeding the Nvidia GPUs on small model training. But as stated in that article:
[…] these metrics can only be considered for similar neural network types and depths as used in this test.
So this second part tests bigger models, focusing on CNNs only and comparing the M2 Max with the most powerful GPU previously tested: the Nvidia V100.
Another point considered in this test is memory management. While the Nvidia GPU loses a lot of time in memory transfers, the M2 Max GPU has direct access to the unified memory, so it does not require any delay before training the model. As the results in the previous article showed, this makes a big difference for small models trained on a small number of epochs, so we remove this effect for bigger models to compare the pure training time only.
For this purpose, we train models for ten epochs, but instead of using the total training time, we capture and average each epoch's training duration from the second epoch to the last one. This removes the initialization and memory transfer overhead, which is partially reflected in the first epoch.
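As an illustration of this averaging (with hypothetical durations, not measured values):

import numpy as np

# One duration per epoch for a 10-epoch run (hypothetical values, in ms);
# the first epoch, inflated by initialization, is discarded.
epoch_durations_ms = [4210, 3105, 3098, 3101, 3097, 3102, 3099, 3100, 3103, 3098]
avg_epoch_ms = np.mean(epoch_durations_ms[1:])  # average over epochs 2 to 10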
And the last, but nowadays most crucial point, is the energy consumed by the GPUs to train a big model. As we will show here, this is where M2 Max is a real game changer.
In this article, you will find the following tests:
- Training four custom CNNs ranging from 122,570 to 1,649,482 parameters on CIFAR-10¹ with batch sizes ranging from 32 to 1024
- Training a ResNet50 model on CIFAR-10 with batch sizes ranging from 32 to 1024
Then, in both cases, I will compare:
- the raw training performances (epoch duration in milliseconds)
- the energy consumption per epoch
- the energy efficiency ratio between the two GPUs
Setup
This section is a quick reminder of the setup from the previous article.
Let’s first compare the M2 Max specs to those of M1.

This M2 Max has 30 GPU cores, so its 10.7 TFLOPS figure is estimated from the 13.6 TFLOPS of the 38-GPU-core version (13.6 × 30 / 38 ≈ 10.7).
Here are the test environment and the TensorFlow versions used. Note that, compared to the previous article, Google Colab has upgraded TensorFlow to version 2.15.

The environment on the M2 Max was created using Miniforge. Only the following packages were installed:
conda install python=3.10
pip install tensorflow-macos==2.12
pip install tensorflow-metal==0.8.0
conda install pandas
To enable GPU usage, install the tensorflow-metal package distributed by Apple, which relies on TensorFlow PluggableDevices. Note that Metal acceleration is also available for PyTorch and JAX.
Apple says
With updates to Metal backend support, you can train a wider set of networks faster with new features like custom kernels and mixed-precision training.
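With this setup, a quick sanity check that the Metal plugin actually exposes the GPU to TensorFlow (a minimal sketch; the exact device name may differ between versions):

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
# Expected on Apple silicon: something like
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]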
Models
The training is conducted on four custom Convolutional Neural Networks (CNNs) and on the ResNet50 model.
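The models below reference train_images, the CIFAR-10 training set. A minimal loading sketch (the exact preprocessing used in the benchmark is not shown here; scaling to [0, 1] is only one common choice):

import tensorflow as tf

# CIFAR-10: 50,000 training images of shape (32, 32, 3), 10 classes
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0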
The four custom CNN models are defined as follows.
def create_models():
    models = []
    # model 0
    models.append(tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=train_images.shape[1:]),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ]))
    # model 1
    models.append(tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same', input_shape=train_images.shape[1:]),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ]))
    # model 2
    models.append(tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same', input_shape=train_images.shape[1:]),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
        tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ]))
    # model 3
    models.append(tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(64, 7, activation='relu', padding='same', input_shape=train_images.shape[1:]),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
        tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same'),
        tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10)
    ]))
    return models
The final layer uses no softmax activation since the categorical cross-entropy loss is defined with from_logits=True.
All models are compiled with the following optimizer and loss function:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
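A full benchmark pass then trains every model for ten epochs at each batch size and records the per-epoch durations. The author's benchmark script is not reproduced here; the following is a minimal sketch of such a loop, assuming the create_models function and the CIFAR-10 data defined above:

import time
import numpy as np
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    # records the wall-clock duration of each epoch
    def on_train_begin(self, logs=None):
        self.epoch_times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._start)

for batch_size in [32, 64, 128, 512, 1024]:  # batch sizes referenced in the results
    for i, model in enumerate(create_models()):
        model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])
        timer = EpochTimer()
        model.fit(train_images, train_labels, epochs=10,
                  batch_size=batch_size, callbacks=[timer], verbose=2)
        # drop the first epoch, which carries the initialization overhead
        print(f"model {i}, batch {batch_size}: "
              f"{np.mean(timer.epoch_times[1:]) * 1000:.0f} ms/epoch")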
The ResNet50 model is instantiated as follows:
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=10)
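To check the parameter counts reported below, count_params can be used on each model (a quick sketch, assuming the definitions above):

for i, m in enumerate(create_models()):
    print(f"model {i}: {m.count_params():,} parameters")
print(f"ResNet50: {model.count_params():,} parameters")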
The parameter counts for the custom CNNs and ResNet50 are:

Results
The results below are based on measurements that eliminate the overhead of memory transfer and initialization. This overhead gives the M2 Max a significant advantage when training for a small number of epochs, thanks to its unified memory architecture.
This comparison therefore more accurately reflects the expected performance difference for trainings that typically run for more than 50 epochs.
Custom CNN Performances
Let’s first compare the average training time per epoch of M2 Max and V100.

The average training times per epoch for models 0, 1, and 2 are very close, with the V100 always keeping a slight advantage.
For model 3, the V100 clearly beats the M2 Max.
Let’s now compare how many times V100 is faster than M2 Max.


V100 is always faster than M2 Max on every CNN model and every tested batch size, ranging from 1.24 to 2.28 times faster on average.
One might ask:
Model 0 is the same as the one used in the previous article, but the results are different here; why?
As explained at the beginning, the previous article considered the whole training time, including initialization and memory transfers, i.e. the duration of the whole fit function. It also considered only a very small number of epochs (5 epochs, although this overhead can give an advantage to M2 Max for up to 10 or 20 epochs). Here the measurement is more suitable for evaluating training over a larger number of epochs, and it also considers much larger models.
While the difference may appear disappointing, it is actually quite small. Furthermore, the results change significantly when we take energy consumption into account.
Custom CNN Energy consumption
When running the performance test, we also measure the GPU energy consumption (details at the end of the article).
The following chart shows the M2 Max GPU power draw in Watts for each model/batch size training combination. The first spike corresponds to model 0 with a batch size of 32, the fifth spike to model 0 with a batch size of 1024, and the last spike to model 3 with a batch size of 1024.

The following chart shows the V100 GPU power draw in Watts for each model/batch size training combination.

We can easily observe that the V100's power draw is always much higher than the M2 Max's, regardless of the training combination.
In our data, we can also observe the idle power draw when the GPU does nothing:
- V100: 24 W to 25 W
- M2 Max: 16 mW to 33 mW
So when doing nothing, just by having the computer running, the V100 draws about 24 W while the M2 Max draws about 24 mW on average, one thousand times less. We must also note that the V100 runs on a server with no GUI and no screen to manage, while the M2 Max drives a 5K Retina display.
After isolating the spikes and averaging their respective power draw, I computed the average energy consumption per epoch in Joules.
As a reminder, a Watt is a unit of power equal to one Joule per second. I don't use kWh because it is sometimes poorly understood (due to the "hour" in its name); moreover, kWh is mostly a billing unit, while the Joule is a physical unit that is easier to reason about here.
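The conversion itself is simple (hypothetical numbers, for illustration only):

avg_power_watts = 42.0      # average GPU power draw while training (W)
avg_epoch_seconds = 95.0    # average epoch duration (s)
energy_per_epoch_joules = avg_power_watts * avg_epoch_seconds  # 3990 J per epoch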
Here are the results of these measures.

We observe that:
- V100 is consuming much more energy than M2 Max in any case
- The smaller the batch size, the more significant the consumption difference is
- M2 Max shows a much more stable consumption regardless of the batch size compared to V100
And here is how many times V100 consumes more energy than M2 Max.

Depending on the model and batch size, V100 consumes between 2 and 14 times more energy than M2 Max. The bigger the model, the smaller the difference is.
The following table gives the exact ranges.

M2 Max consumes less energy, but we previously observed that V100 is also faster on these models. One can argue that consuming two times more energy per epoch but being three times faster makes the V100 globally more efficient.
So, I computed an energy efficiency ratio to determine how many times the M2 Max is more efficient than the V100. This is equivalent to "how many times the V100 is less efficient than the M2 Max", which is how it is represented in the plots for convenience.
Its formula is the following.

The bar chart below displays the ratio.

These results clearly show that M2 Max is much more efficient than V100.
- For models 0,1 and 2, M2 Max is always much more energy efficient than V100 for every batch size
- For a batch size of 32, M2 Max is always much more energy efficient than V100
- The only cases where V100 becomes slightly more efficient are model 3 with batch sizes of 512 and 1024
The following table gives the exact ranges.

On average, M2 Max is 1.21 to 6.84 times more efficient than V100 and can even be more than 11 times more efficient with model 0.
ResNet50
The previous results show that the gap between M2 Max and V100 in energy consumption per epoch and in efficiency decreases when the model size increases.
Let’s verify it by performing the same tests with ResNet50, a much bigger model with more than 23 million parameters.
Let’s first compare the average training time per epoch of M2 Max and V100.

The average training time per epoch is very close for both GPUs, with the V100 keeping a small advantage for batch sizes up to 128; the advantage becomes much more significant from a batch size of 512.
Let’s now compare how many times V100 is faster than M2 Max.

The following table gives the exact ranges.

V100 is always faster than M2 Max for every tested batch size, ranging from 1.13 to 3.27 times faster on average. But we also note that for the default TensorFlow batch size of 32 and the commonly used batch size of 64, the advantage of V100 is tiny.
Energy consumption
When running the performance test, we also measure the GPU energy consumption (details at the end of the article).
The following chart shows the M2 Max and V100 GPU power draw in Watts for each batch size.

It is not surprising that the V100 consumes more energy than the M2 Max. As the batch size increases, both GPUs draw more power, but since the training duration decreases at the same time, the overall energy consumed by a training run decreases.
Let’s now compare the Joules per epoch consumption.

The V100 consumes much more energy per epoch than the M2 Max for batch sizes up to 128. The difference becomes smaller for batch sizes of 512 and 1024.
And here is how many times V100 consumes more energy than M2 Max.

Depending on the batch size, V100 consumes between 1.5 and 7.5 times more energy than M2 Max.
The following table gives the exact ranges.

And now, let’s compare their energy efficiency.

These results clearly show that M2 Max is much more efficient than V100 up to a batch size of 128; the V100 then becomes more efficient for batch sizes of 512 and 1024.
The following table gives the exact ranges.

On average, M2 Max is 2.7 times more efficient than V100 and can even be more than 6.7 times more efficient.
Conclusion
The previous article compared GPU performance for the whole fit function, including model initialization and data loading, for small models trained over a few epochs. In that case, M2 Max was faster than V100. For day-to-day small experiments, M2 Max is a good option for training deep learning models.
The questions were: what happens with bigger models and a higher number of epochs? And what about energy consumption?
This article compared the performance and energy consumption of the M2 Max GPU and the V100 for bigger custom models and for ResNet50, fully trained on CIFAR-10. To consider only the pure GPU computing part instead of the complete fit function, we measured the duration of each epoch, removed the first one (which is always longer), and computed the average training time per epoch. This procedure removes the memory exchange effect and the difference between the CPUs of the two environments.
On the pure performance side, the results showed that:
- V100 is always faster than M2 Max
- The bigger the model, the bigger the difference, especially for batch sizes of 512 and 1024
- But surprisingly, when the model size increases a lot, the gap narrows: comparing custom model 3 (1.6M parameters) with ResNet50 (23M parameters) for batch sizes of 32 and 64, the V100 is only about 1.13 times faster on ResNet50, compared to 1.7 and 2.1 times faster on model 3
These results alone suggest that V100 is more interesting than M2 Max for training models, but the perspective changes once energy consumption is taken into account.
We measured the number of Joules per epoch consumed by each GPU for each model and batch size. We also computed an energy efficiency ratio to account for the training time saved in exchange for the higher consumption.
- M2 Max consumes between 1.5 and 14 times less energy than V100, depending on the model (including ResNet50) and the batch size. This difference decreases when the batch size increases.
- M2 Max is always much more efficient for the smallest models than V100.
- M2 Max is much more efficient than V100 up to a batch size of 128 on ResNet50.
- For custom models, on average, M2 Max is 1.21 to 6.84 times more efficient than V100 and can even be more than 11 times more efficient with model 0.
- For ResNet50, on average, M2 Max is 2.7 times more efficient than V100 and can even be 6.7 times more efficient.
M2 Max is a perfect option for a personal Green AI setup thanks to its very low energy consumption. It also stays silent and cool during training. An Intel/Nvidia configuration in the same room is a different story regarding noise and heat.
More generally, a new trend is emerging. For Green Supercomputing, ARM CPUs like the SiPearl Rhea are much more modern and a better deal than old Intel x86 chips for building scalar nodes. Even Nvidia is now entering the ARM world with Grace to offer a low-energy alternative.
So, as a conclusion:
- For people who need pure speed regardless of the impact on the planet, Nvidia will be faster, but only by a small margin compared to the difference in energy consumption.
- For people concerned with Eco-Friendly or Green AI, and who can wait a bit longer for the training to finish, ARM-based architectures like M2 Max are the way to go.
A final word about speed versus energy consumption: being faster is only essential when a training job (a single training or a hyper-parameter search) lasts days or weeks. But in these cases, in our research, we generally run our jobs on the Jean Zay supercomputer rather than on personal computers.
In most personal computer usage, model training lasts at most a few hours per job. So the question becomes: can I wait two hours instead of one for my results, knowing that the two-hour job consumes much less energy?
I’m more concerned with Eco-Friendly AI than with pure speed, so my choice is obvious: I’m going to ARM, with M2 Max for small training jobs, but also for inference on some production systems that don’t require GPUs (instances with Ampere M128-30), and maybe soon even for the jobs requiring a supercomputer (SiPearl), knowing that in this case the GPU nodes will still be Nvidia-based, at least for the moment.
How to capture GPU power consumption
Power consumption for Nvidia GPUs or Apple silicon can be captured through command-line tools.
I used the nvidia-smi command line for Nvidia, and for the M2 Max I used the macOS powermetrics command, which displays many detailed CPU and GPU statistics, including their power consumption.
It consists of calling the command every second and capturing the output to a file.
Nvidia power capture
For Nvidia, on Google Colab Pro, I first create a shell script to capture the V100 energy consumption every second:
# create shell file
!echo '#!/bin/bash' > capture_nvidia_smi.sh
!echo 'nvidia-smi --query-gpu=timestamp,power.draw --format=csv --loop-ms=1000 > v100_big_models_gpu_power.txt' >> capture_nvidia_smi.sh
!chmod u+x capture_nvidia_smi.sh
I run it in the background and retrieve its PID:
!nohup ./capture_nvidia_smi.sh &
!pgrep -a nvidia
Then, I start the whole training sequence:
!python v100-performance-benchmark-big-models.py | tee v100_performance_benchmark_big_models.txt
The tee command lets me capture the training output to a file (useful for computing the average epoch duration) while still displaying it in the notebook so I can monitor progress. Once the training is complete, I stop the power capture with the kill command, using the PID found with pgrep in the previous step.
!kill -9 795
The power capture file looks like this:
2023/12/17 18:49:17.547, 230.71 W
2023/12/17 18:49:18.548, 218.37 W
2023/12/17 18:49:19.550, 172.18 W
2023/12/17 18:49:20.551, 223.94 W
2023/12/17 18:49:21.553, 227.75 W
Then I parse it (one file per batch size) to create a dataframe with timestamps and Watts.
import pandas as pd

power_df_v100 = pd.read_csv(f'v100_resnet50_gpu_power_{batch_size}.txt')
power_df_v100.columns = ['timestamp', 'Watts']
power_df_v100['timestamp'] = pd.to_datetime(power_df_v100['timestamp'])
# strip the trailing " W" unit and convert to float
power_df_v100['Watts'] = power_df_v100['Watts'].apply(lambda x: float(x[:-2]))
M2 Max power capture
For the M2 Max, I first run this command, which captures the GPU power draw every second:
sudo powermetrics -s gpu_power -i1000 -n-1
Then, I run the whole training from a Python file, using tee as before.
At the end of the training, I stop the power capture from the shell with Ctrl-C. Here is an example of the captured output for the ResNet50 test (one file per batch size), where powermetrics reports the power in mW:
*** Sampled system activity (Sun Dec 17 12:56:33 2023 +0100) (1003.00ms elapsed) ***
**** GPU usage ****
GPU HW active frequency: 1398 MHz
GPU HW active residency: 100.00% (444 MHz: 0% 612 MHz: 0% 808 MHz: 0% 968 MHz: 0% 1110 MHz: 0% 1236 MHz: 0% 1338 MHz: 0% 1398 MHz: 100%)
GPU SW requested state: (P1 : 0% P2 : 0% P3 : 0% P4 : 0% P5 : 0% P6 : 0% P7 : 0% P8 : 100%)
GPU SW state: (SW_P1 : 0% SW_P2 : 0% SW_P3 : 0% SW_P4 : 0% SW_P5 : 0% SW_P6 : 0% SW_P7 : 0% SW_P8 : 0%)
GPU idle residency: 0.00%
GPU Power: 40697 mW
*** Sampled system activity (Sun Dec 17 12:56:34 2023 +0100) (1003.19ms elapsed) ***
**** GPU usage ****
GPU HW active frequency: 1398 MHz
GPU HW active residency: 99.98% (444 MHz: .02% 612 MHz: 0% 808 MHz: 0% 968 MHz: 0% 1110 MHz: 0% 1236 MHz: 0% 1338 MHz: 0% 1398 MHz: 100%)
GPU SW requested state: (P1 : 0% P2 : 0% P3 : 0% P4 : 0% P5 : 0% P6 : 0% P7 : 0% P8 : 100%)
GPU SW state: (SW_P1 : 0% SW_P2 : 0% SW_P3 : 0% SW_P4 : 0% SW_P5 : 0% SW_P6 : 0% SW_P7 : 0% SW_P8 : 0%)
GPU idle residency: 0.02%
GPU Power: 41935 mW
Then I simply parse it (one file per batch size) to create a dataframe with timestamps and Watts.
from datetime import datetime
import numpy as np
import pandas as pd

with open(f'm2_resnet50_gpu_power_{batch_size}.txt') as f:
    gpu_power_strings = f.readlines()

# timestamps come from the "Sampled system activity" lines, power (mW) from the "GPU Power" lines
gpu_power_timestamps = [str(datetime.strptime(x[33:53], '%b %d %H:%M:%S %Y'))
                        for x in gpu_power_strings if 'Sampled system activity' in x]
gpu_power_list_mW = [int(x[11:-4]) for x in gpu_power_strings if 'GPU Power' in x]
power_df_m2 = pd.DataFrame({
    'timestamp': gpu_power_timestamps,
    'Watts': np.array(gpu_power_list_mW) / 1000
})
Then, I filter the data to keep the highest-consumption zone, which corresponds to where the epochs are running.
The M2 Max is very stable, so averaging its consumption requires no adjustment: we can use a hard trigger level in Watts to identify when the GPU is training. The Nvidia GPU, however, is quite unstable, and the trigger level can change over time, especially when several trainings run in a row, as in the data capture for the four custom models.
The transition between low and high consumption can also be slower, so we capture many values slightly above our trigger level but much smaller than the highest levels, meaning the GPU is not yet running the training code. To remove the effect of these changing levels and transition phases, I only keep the values above the 0.3 quantile:
power_list_filtered = power_list[power_list>=np.quantile(power_list,0.3)]
This quantile method is specific to the custom models test: since the four models and all batch sizes were trained together in a single program, there is only one file to process. For ResNet50, training was carried out separately for each batch size, so the quantile method was unnecessary; instead, I manually selected the zone to capture and avoided the power transitions.
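Putting the last steps together, here is a minimal sketch of the filtering, averaging, and Joules computation on synthetic power samples (one value per second, in Watts; variable names and values are illustrative):

import numpy as np

power_watts = np.array([25, 26, 180, 210, 215, 205, 212, 30, 208, 214], dtype=float)
power_filtered = power_watts[power_watts >= np.quantile(power_watts, 0.3)]
avg_power_watts = power_filtered.mean()   # mean draw while training
epoch_seconds = 12.0                      # hypothetical average epoch duration
energy_per_epoch_joules = avg_power_watts * epoch_seconds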
You can now use the same method to capture your power consumption.
Sources
Images and Code: all images and code in this work are by the author unless explicitly stated otherwise.
Dataset Licences: CIFAR-10 is licensed under the MIT Licence, as per this paper, but generally its licence is unknown, as found on paperswithcode.
[1] Krizhevsky, Alex. Learning Multiple Layers of Features from Tiny Images. (2009).