Compute cost and environmental impact of Deep Learning

A framework to measure the cost in dollars and emissions of carbon dioxide pounds

Roza Chojnacka
Towards Data Science


The Holyoke Dam produces hydroelectric power for the Massachusetts Green High Performance Computing Center. Photo by American Public Power Association on Unsplash

Compute used to train state-of-the-art deep learning models continues to grow exponentially, exceeding the rate of Moore’s Law by a wide margin. But how much does it cost in dollars, and what is its environmental impact? In this article, a summary of papers [1] and [2], I describe a method to approximate both.

Goals

  • Estimate the cloud compute cost and carbon emissions of training well known NLP models.
  • Estimate the cloud compute cost and carbon emissions of the full R&D required for a new state of the art DL model.
  • Compare carbon emissions to common benchmarks, such as the average emissions of one year of human life or of a car’s lifetime.

Non-goals

  • We don’t cover the cost of serving DL models to users, which depends on traffic and the geographical regions served. We also don’t cover networking costs, data storage costs, or researchers’ salaries.
  • Power and compute efficiency depend on the type of accelerator. Developing more efficient hardware (such as the TPUs offered on Google Cloud) is an active area of research.

Energy use estimation

Total power draw can be estimated as the sum of CPU, memory, and GPU power draw. The authors of [1] measured it by training each model using its open-source code, sampling and averaging CPU, memory, and GPU draw with commonly available tools such as nvidia-smi and the Intel power management interface; g is the number of GPUs used.

The time required to train the models was taken from the original papers.

1.58 is the Power Usage Effectiveness (PUE) coefficient (the 2018 average reported for data centers globally). This multiplier factors in all the other energy required to run the model, mainly cooling. It can be optimized; for example, Google reported a PUE of 1.1 in its data centers.

We divide by 1,000 to convert from watt-hours to kWh.

Equation to compute the power consumption during training. Image from [2]
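The estimate described above can be sketched as a small Python function. The function name and parameter names are my own; the structure follows the formula from [1]: multiply the summed average power draws by the training time and the PUE, then convert to kWh.

```python
def training_energy_kwh(hours, cpu_watts, dram_watts, gpu_watts, num_gpus, pue=1.58):
    """Estimate the total energy (kWh) consumed by a training run.

    Power draws are average watts sampled during training
    (e.g. via nvidia-smi for the GPU and the Intel power
    management interface for CPU and DRAM). pue defaults to
    1.58, the 2018 global data-center average.
    """
    total_watts = cpu_watts + dram_watts + num_gpus * gpu_watts
    # Divide by 1000 to convert watt-hours to kilowatt-hours.
    return pue * hours * total_watts / 1000
```

For example, a 10-hour run on 4 GPUs drawing 250 W each, plus 100 W of CPU and 50 W of DRAM, consumes about 1.58 × 10 × 1,150 / 1,000 ≈ 18.2 kWh.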

CO2 estimation

The U.S. Environmental Protection Agency (EPA) reports the average CO2 produced for power consumed in the U.S. as 0.954 pounds per kilowatt-hour (EPA 2018). Since major cloud providers have a higher share of renewable energy, we’ll cut this number in half. In Google’s case it would be close to 0, since Google is a heavy renewable-energy user and also offsets carbon by purchasing green energy.
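Converting energy to emissions is then a single multiplication. This sketch uses the EPA grid average above and a discount factor for a cloud provider's renewable share; the 0.5 default mirrors the halving suggested in the text, and the function and parameter names are my own.

```python
EPA_LBS_CO2_PER_KWH = 0.954  # U.S. grid average, EPA 2018

def co2_pounds(kwh, renewable_discount=0.5):
    """Estimate pounds of CO2 emitted for kwh of energy consumed.

    renewable_discount scales the grid average for cloud
    providers' renewable share: 1.0 for the raw U.S. grid,
    0.5 for a typical cloud provider, near 0 for Google.
    """
    return kwh * EPA_LBS_CO2_PER_KWH * renewable_discount
```

So a 1,000 kWh training run on the raw U.S. grid would emit roughly 954 pounds of CO2, or about half that on a typical cloud provider.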

Renewable energy consumption in countries vs cloud providers

Cost estimation

Pre-emptible and on-demand cost of GPU and TPU use per hour from Google. TPUs cost more per hour but are more cost- and power-efficient for workloads suited to that hardware.
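The dollar cost follows the same pattern: hours × accelerators × hourly price. The prices in the usage example below are illustrative placeholders, not quotes from Google's pricing page; check current cloud pricing for real figures.

```python
def cloud_cost_usd(hours, num_accelerators, price_per_hour):
    """Estimate the cloud compute cost of a training run in dollars.

    price_per_hour is the per-accelerator hourly rate; pre-emptible
    instances are typically several times cheaper than on-demand,
    at the risk of the job being interrupted.
    """
    return hours * num_accelerators * price_per_hour

# Illustrative comparison with hypothetical hourly rates:
on_demand = cloud_cost_usd(hours=100, num_accelerators=8, price_per_hour=2.48)
preemptible = cloud_cost_usd(hours=100, num_accelerators=8, price_per_hour=0.74)
```

The pre-emptible run costs roughly a third of the on-demand one under these assumed rates, which is why cost estimates in [1] are reported as ranges.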

Results

The authors of [1] chose well-known deep learning models: Transformer, ELMo, BERT, and GPT-2, as well as Evolved Transformer, which uses Neural Architecture Search (NAS) to train many candidate models and find the best configuration. NAS can approximate the compute cost of the full research-and-development cycle required to find a new state-of-the-art model.

The 5 models used in this study, with high-level architecture, hardware used, model size, and training time. Image from [2]
CO2 emissions and monetary cost of training some well-known deep learning models. TPUs are far more cost efficient (in the case of NAS, $44k–$146k vs. a couple of million dollars) and likely far more energy friendly.
Model training CO2 consumption vs common benchmarks. Image from [2]

Parting thoughts

Modern ML training (and inference, not covered in this article) results in substantial carbon emissions and monetary cost. Financial analysts such as ARK Invest predict the deep learning market cap to grow from $2 trillion in 2020 to $30 trillion in 2037 [3].

In this article I wanted to provide a framework to quantify emissions and cost. When Trevor Noah asked Greta Thunberg what people can do about climate change [4], the most important thing she suggested was to inform oneself, understand the situation, and finally push for a political movement.

Personally, I have worked on making models smaller to fit on mobile devices (with the added benefit of reduced latency!), training multi-task and multi-language models, reporting model memory size and FLOPS (which can easily be profiled in TensorFlow [7]), and killing jobs that don’t show promising results in the early stages of training.

References

[1] Strubell, E., Ganesh, A., & McCallum, A. (2020). Energy and Policy Considerations for Modern Deep Learning Research. Proceedings of the AAAI Conference on Artificial Intelligence, 34(09), 13693–13696. https://doi.org/10.1609/aaai.v34i09.7123

[2] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. https://arxiv.org/abs/1906.02243

[3] https://research.ark-invest.com/hubfs/1_Download_Files_ARK-Invest/White_Papers/ARK%E2%80%93Invest_BigIdeas_2021.pdf

[4] https://youtu.be/rhQVustYV24?t=303

[5] http://www.cis.upenn.edu/~bcpierce/papers/carbon-offsets.pdf

[6] https://cloud.google.com/blog/topics/sustainability/sharing-carbon-free-energy-percentage-for-google-cloud-regions

[7] https://www.tensorflow.org/api_docs/python/tf/compat/v1/profiler/ProfileOptionBuilder
