
Towards Green AI: How to Make Deep Learning Models More Efficient in Production

From Academia to Industry: Finding the best trade-off between predictive performance and inference runtime for sustainability in Machine…

The Kaggle Blueprints

Making s’mEARTHs at the GPU bonfire (Image hand-drawn by the author)

This article was originally published on Kaggle as an entry to the "2023 Kaggle AI Report" competition on July 5th, 2023, in which it won 1st place in the category "Kaggle competitions". As it reviews Kaggle competition writeups, it is a special edition of "The Kaggle Blueprints" series.

Introduction

"I think we’re at the end of the era where it’s going to be these, like, giant, giant models. […] We’ll make them better in other ways.", said Sam Altman, CEO of OpenAI, shortly after their release of GPT-4. This statement surprised many, as GPT-4 is estimated to be ten times larger (1.76 trillion parameters) than its predecessor, GPT-3 (175 billion parameters).

"I think we’re at the end of the era where it’s going to be these, like, giant, giant models. […] We’ll make them better in other ways." – Sam Altman

In 2019, Strubell et al. [1] estimated that training a natural language processing (NLP) pipeline, including tuning and experimentation, produces around 35 tonnes of carbon dioxide equivalent, more than twice the average U.S. citizen’s annual consumption.

Let’s put this into perspective: It was reported that information technologies produced 3.7% of global CO2 emissions in 2019. That’s more than global aviation (1.9%) and shipping (1.7%) combined!

Deep Learning models have been pushing state-of-the-art performance across different fields. These performance gains are often the result of larger models. But building bigger models requires more computation in both the training and the inference stage, and more computation requires bigger hardware and more electricity. This, in turn, emits more CO2 and leads to a bigger carbon footprint, which is bad for the environment.

The eye-opening paper by Strubell et al. [1] led to the birth of a new research field called "Green AI" in 2020. The term was coined by Schwartz et al. [2] to describe research "that yields novel results without increasing computational cost and ideally reducing it". Since then, many papers have emerged that aim to reduce the carbon footprint of AI, especially of Deep Learning models.

In an effort towards reducing the carbon footprint of Deep Learning models, Kaggle, a platform for Data Science competitions, has introduced an "Efficiency Prize" to some of their competitions:

We are hosting a second track that focuses on model efficiency, because highly accurate models are often computationally heavy. Such models have a stronger carbon footprint and frequently prove difficult to utilize in real-world […] contexts.

Although carbon footprint is produced along the entire lifecycle of a Deep Learning model, Kaggle can’t influence the competitors’ actions at every stage. But by evaluating submissions on runtime in addition to their predictive performance, Kaggle can encourage competitors to build more efficient solutions to reduce the carbon footprint, at least in the inference stage.

At the beginning of this year, a survey was released claiming that the Green AI research field is reaching a level of maturity and that it is now time to "port the numerous promising academic results to industrial practice" to evaluate the techniques outside of laboratory settings [7].

Although Kaggle can’t be used as a direct proxy for industry practices, it is a perfect place to test new techniques outside of laboratory settings. Thus, this article’s prompt is:

What has the community learned over the past two years of balancing model performance and inference time in Kaggle competitions with an Efficiency Prize to reduce the carbon footprint of Deep Learning models in production?


We will first review the promising academic results in the "Green AI" literature. Then, we will review which of them have already been adopted and proven successful by the Kaggle community by examining writeups from winners of the Efficiency Prize.

Background

Carbon footprint is generated along the entire Machine Learning Operations (MLOps) lifecycle. Since 2021, a handful of surveys [3, 4, 5, 6, 7] have categorized the techniques to reduce carbon footprint by the MLOps stage in which it is produced. While the stages differ slightly between surveys, the three main stages are model design, development, and inference [3, 4, 5, 6]. Other categories cover data storage and usage [3, 5, 6] or hardware [4].

In 2019, both AWS [21] and NVIDIA [22] estimated that roughly 90% of the machine learning workload is inference. Given the context of the Efficiency Prize, this article’s prompt focuses on improving efficiency in the inference stage.

Note that model design also significantly impacts efficiency in the inference stage. However, model design tends to be specific to the use case, so analyzing which design techniques help reduce the carbon footprint at inference would require more data. To keep this analysis concise, we will therefore focus on model-agnostic techniques to reduce carbon footprint in the inference stage. Future work is encouraged to analyze model design once the Efficiency Prize has been introduced to a wider range of competitions (see Discussion).

The taxonomy proposed by Xu et al. [3] in 2021 has also been adopted in a recent survey [6]. Thus, we will structure our analysis based on this taxonomy as well. The following techniques aim to reduce model size to gain latency improvements, because model size directly impacts latency:

  • Pruning reduces the size of a neural network by removing redundant elements, such as weights, filters, channels, or even layers, without losing performance. It was first proposed by LeCun et al. [8] in 1989 (a minimal code sketch of pruning and low-rank factorization follows this list).
  • Low-rank factorization reduces the complexity of convolutional or fully connected layers in neural networks by factorizing a weight matrix into two matrices of lower dimensions (matrix/tensor decomposition).
  • Quantization reduces the model size by reducing the precision of the weights and activations (usually from 32-bit floating point values to 8-bit integers). This reduction in precision can lead to a small performance loss. You can quantize a model either during training or after training, and quantization can be static or dynamic. For this article, you only need to know that dynamic quantization is more accurate than static quantization but also requires more computation.
  • Knowledge Distillation is a technique to distill the knowledge of a large-scale, high-performing model or ensemble (teacher network) into a compact neural network (student network) by using additional data pseudo-labeled by the teacher to train the student. The pseudo labels are also called soft labels and contain so-called dark knowledge, which helps the student network learn to mimic the teacher network. This idea was proposed by Hinton et al. [9] in 2015.
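
To make the first two ideas concrete, below is a minimal, illustrative sketch of unstructured magnitude pruning and low-rank factorization in PyTorch. The toy model, the 30% sparsity level, and the chosen rank are assumptions for illustration only; they are not taken from any of the writeups discussed later.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning: zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Low-rank factorization: approximate a Linear layer's weight matrix with the
# product of two smaller matrices obtained via truncated SVD.
def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data                       # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = Vh[:rank, :]             # (rank, in_features)
    second.weight.data = U[:, :rank] * S[:rank]  # (out_features, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return nn.Sequential(first, second)

model[0] = low_rank_factorize(model[0], rank=16)

In this sketch, replacing the first layer’s 64×128 weight matrix with a 16×128 and a 64×16 factor reduces its weight parameters from 8,192 to 3,072.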

Methodology & Data Collection

To evaluate which promising academic results have already been tried and proven outside of laboratory settings, we will review the solution writeups of Kaggle competitions with an Efficiency Prize.

Unfortunately, the original competition dataset of Kaggle writeups doesn’t cover all efficiency writeups because some teams write separate writeups for the Efficiency Prize. Thus, we decided to create a custom dataset for this competition.

  1. Identify all Kaggle competitions with an Efficiency Prize.
  2. Identify the top 10 teams in the Efficiency Prize according to the Efficiency Leaderboard notebooks.
  3. Manually collect the writeup of every top efficiency solution by searching the discussion forums and leaderboards. If a team wrote two writeups, select the one for the Efficiency Prize (e.g. Standard Leaderboard writeup vs. Efficiency Leaderboard writeup from the same team).

Out of the 50 top-10 efficiency solutions (10 teams in each of the 5 competitions), we were able to collect 25 available writeups.

The 25 collected writeups are distributed across the five competitions with an Efficiency Prize.

Results

This section reviews the collected writeups from top solutions in the Kaggle Efficiency Prize to see whether the techniques for reducing inference costs (pruning, low-rank factorization, quantization, and knowledge distillation) were used and whether they were successful.

For each technique, we manually reviewed each of the 25 writeups to answer the following questions:

  1. Did competitors experiment with the technique? How was it applied?
  2. Was the technique effective?
  3. Is there any indication as to why the technique was effective/ineffective?

Pruning

Only one [20] of the 25 top efficiency writeups mentioned pruning.

However, they couldn’t report any success [20]:

Beginning to explore prunning as well as hard quantization showed that the performance loss would be significant (which is OK in production but not in a competition) so we sticked with a simple TF Lite conversion.

What’s interesting about this writeup is its mention that the observed performance loss would have been acceptable in production but not in this competition setting. This is an indicator that the criteria for an acceptable trade-off may differ between a Kaggle competition and industry (see Discussion).

Low-Rank Factorization

None of the 25 top efficiency writeups mentioned low-rank factorization.

Because there is no mention in the writeups, we don’t know whether no competitors tried it or whether some experimented with it but found it impractical.

Because low-rank factorization was mentioned in a discussion post in the Feedback Prize – English Language Learning competition, we can assume that some competitors were aware of this technique. However, a recent survey [3] already concluded that low-rank factorization was computationally complicated and less effective in reducing computational cost and inference time than other compression methods.

Quantization

Quantization was first mentioned in the Feedback Prize – English Language Learning competition, where it appeared in over half of the writeups but was reported not to have been effective. In the Learning Equality – Curriculum Recommendations competition, quantization was mentioned in only one writeup, but there it was reported as successful.

In the Feedback Prize – English Language Learning competition, two writeups [13, 17] specifically reported that they tried dynamic quantization, which resulted in performance loss and no runtime improvements:

Because quantizing nn.Linear layers directly affects the output of the model, quantizing layers is not suitable for regression tasks.

In the Learning Equality – Curriculum Recommendations competition, a competitor reported that post-training dynamic quantization was successful, increasing the inference speed with only a slight performance drop [19]:

If using qint8 also on the Feed Forward part of the transformer on the intermediate up sample and output layer, the score drop is even higher so I ended up in only using qint8 on the attention layer.

The competitor’s solution [19] shows that quantizing individual layers is almost as easy as quantizing an entire neural network:

# Code snippet adapted from https://github.com/KonradHabel/learning_equality/blob/master/eval_cpu.py
# `model` is the trained transformer and `config` its model configuration (both loaded earlier in the script).
import torch
from torch.quantization import quantize_dynamic

# Dynamically quantize the embedding block with float16 weights ...
model.transformer.embeddings = quantize_dynamic(model.transformer.embeddings, None, dtype=torch.float16)

for i in range(config.num_hidden_layers):
    # ... quantize only the attention sub-layers with 8-bit integer (qint8) weights ...
    model.transformer.encoder.layer[i].attention = quantize_dynamic(model.transformer.encoder.layer[i].attention, None, dtype=torch.qint8)
    # ... and keep the feed-forward sub-layers at float16.
    model.transformer.encoder.layer[i].intermediate = quantize_dynamic(model.transformer.encoder.layer[i].intermediate, None, dtype=torch.float16)
    model.transformer.encoder.layer[i].output = quantize_dynamic(model.transformer.encoder.layer[i].output, None, dtype=torch.float16)

Thus, the success of quantization depends on which layers in the neural network are quantized.

Knowledge Distillation

Knowledge distillation was mentioned in nine of the 25 writeups, across three of the five competitions.

Note: Knowledge distillation is often mentioned in combination with pseudo-labeling. Because many competitors used pseudo labels to retrain their existing model, we only focus on writeups that explicitly mentioned using pseudo labels to train a new, smaller model.

Interestingly, all writeups mentioning knowledge distillation reported it as effective. Many competitors even reported knowledge distillation to have a high impact [13, 14, 16, 17] on the Efficiency Prize with only minimal performance losses [10, 16, 19].

A competitor described their knowledge distillation process, which they successfully applied in both the Feedback Prize – Predicting Effective Arguments [10] and the Feedback Prize – English Language Learning [15] competitions, as follows [10] (a minimal code sketch of the final step follows the list):

  1. We generate pseudo labels data for the previous Feedback competition by using our large ensemble […].
  2. We also generate out-of-fold pseudo labels for our given train data
  3. Now we combine these 2 datasets with soft pseudo labels together and train a single new model without any original labels
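
Below is a minimal, illustrative sketch of the final step: training a single small student model on the teacher ensemble’s soft pseudo labels. The dummy tensors, the toy student architecture, and all hyperparameters are assumptions for illustration; they are not taken from the writeups.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for tokenized text and the teacher ensemble's soft labels.
input_ids = torch.randint(0, 30_000, (1024, 128))          # 1024 samples, 128 tokens each
soft_labels = torch.softmax(torch.randn(1024, 3), dim=-1)   # teacher class probabilities

loader = DataLoader(TensorDataset(input_ids, soft_labels), batch_size=32, shuffle=True)

# Toy student network standing in for a small transformer.
student = torch.nn.Sequential(
    torch.nn.Embedding(30_000, 64),
    torch.nn.Flatten(),
    torch.nn.Linear(128 * 64, 3),
)
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)

for batch_inputs, batch_soft in loader:
    logits = student(batch_inputs)
    # Train against the teacher's soft probabilities instead of the original labels,
    # so the student learns to mimic the teacher.
    loss = F.cross_entropy(logits, batch_soft)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()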

Despite its simple implementation, we need to note that knowledge distillation depends on the availability of additional data. Many competitors [10, 13, 15] mentioned curating a dataset from previous competitions in the Feedback Prize competition series for pseudo-labeling.

Interestingly, knowledge distillation was applied successfully in the NLP competitions, while no success was reported in the other two competitions (a computer vision and a tabular data competition). However, those competitions also lacked similar previous competitions that could have provided additional data for pseudo-labeling.

Discussion

This section discusses what the community has learned about balancing model performance and inference time in Kaggle competitions with an Efficiency Prize to reduce the carbon footprint of Deep Learning models in production. Specifically, we discuss whether the promising academic results proved practical outside of laboratory settings.

What was the impact of Kaggle competitions with an Efficiency Prize on the broader ML community, and did the ML community benefit from them?

Verdecchia et al. [7] claimed that the Green AI research field is reaching a level of maturity and that it is now time to "port the numerous promising academic results to industrial practice" to evaluate their practicality outside of laboratory settings.

For this purpose, we reviewed solution writeups of Kaggle competitions with an Efficiency Prize. We saw that the Efficiency Prize encouraged competitors to experiment with different techniques to reduce carbon emissions in the inference stage.

The analysis showed that the Kaggle community has tried many of the academic techniques. The community could not confirm that pruning and low-rank factorization achieve a good trade-off between efficiency and performance. However, careful application of quantization and knowledge distillation proved practical, thanks to their simple implementation and effectiveness.

Did these competitions drive forward any important advancements in the field of ML?

Although the focus of this analysis was on evaluating the four techniques pruning, low-rank factorization, quantization, and knowledge distillation, we came across different techniques competitors found effective in reducing runtime without sacrificing predictive performance, such as converting a model to ONNX format [15, 16, 18].
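
As an illustration of this kind of technique, the sketch below exports a model to ONNX and runs it with ONNX Runtime on CPU. The toy model, input shape, and file name are placeholders and are not taken from the cited writeups.

import torch
import onnxruntime as ort

model = torch.nn.Linear(128, 3).eval()   # stands in for the trained model
dummy_input = torch.randn(1, 128)

# Export the PyTorch model to the ONNX format with a dynamic batch dimension.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Run inference with ONNX Runtime on CPU.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy_input.numpy()})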

Thus, a next step would be to analyze the competition writeups from the opposite perspective: which techniques has the Kaggle community established to improve the inference speed of Deep Learning models?

What are the limitations of the impact of these competitions?

Carbon footprint is produced along the entire MLOps lifecycle. In this analysis, we specifically focused on the techniques the literature categorizes under the inference stage. However, how we design a model already impacts the carbon emissions of a solution at inference, so model design should be analyzed in future work as well.

Additionally, although proven effective, knowledge distillation requires training a large-scale teacher network before the knowledge is distilled into a smaller network. Thus, while knowledge distillation reduces the carbon emissions in the inference stage, it must be noted that this technique produces additional carbon emissions in the training stage.

Thus, while the Efficiency Prize helps evaluate techniques to reduce the carbon footprint of a solution at inference, we need a more holistic approach to encourage competitors to reduce carbon emissions during a competition to move towards Green AI.

Conclusion

This analysis aimed to learn if the academically promising techniques to reduce the carbon footprint in the inference stage of the MLOps lifecycle were effectively applied in Kaggle competitions with an Efficiency Prize.

For this purpose, we first reviewed the available literature and found that recent surveys categorize the techniques to reduce carbon emissions at inference into four main categories: pruning, low-rank factorization, quantization, and knowledge distillation. Then we analyzed a custom dataset of Efficiency Prize writeups.

We found that the Kaggle community has tried many of the academic proposals:

  • Pruning was reported to be ineffective at achieving a satisfying trade-off.
  • Low-rank factorization was not mentioned in any of the top 10 writeups. We assume this technique may be too computationally complicated to be practical.
  • Quantization was reported successful in only one case, where the competitor did not quantize the entire model but carefully selected which layers to quantize and to what precision.
  • Knowledge distillation was shown to be effective in NLP competitions where additional data was available from similar previous competitions.

We conclude that the Kaggle community has helped quickly evaluate the practicality of promising academic results from the Green AI literature outside of laboratory settings.

Thus, Kaggle should continue the Efficiency Prize to move towards Green AI. As a next step, Kaggle could add it to all competitions and, ideally, make it the primary metric rather than a separate track.


Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.


Find me on LinkedIn, Twitter, and Kaggle!

References

Dataset

[A] arXiv.org submitters. (2023). arXiv Dataset. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6141267

License: CC0: Public Domain

[B] iamleonie (2023). Kaggle Efficiency Writeups in Kaggle Datasets.

License: CC BY-SA 4.0

Image References

If not otherwise stated, all images are created by the author. See the Kaggle Notebook for the code.

Literature

[1] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

[2] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM, 63(12), 54–63.

[3] Xu, J., Zhou, W., Fu, Z., Zhou, H., & Li, L. (2021). A survey on green deep learning. arXiv preprint arXiv:2111.05193.

[4] Menghani, G. (2021). Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 55(12), 1–37.

[5] Wu, C. J., et al. (2022). Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4, 795–813.

[6] Mehlin, V., Schacht, S., & Lanquillon, C. (2023). Towards energy-efficient Deep Learning: An overview of energy-efficient approaches along the Deep Learning Lifecycle. arXiv preprint arXiv:2303.01980.

[7] Verdecchia, R., Sallou, J., & Cruz, L. (2023). A systematic review of Green AI. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1507.

[8] LeCun, Y., Denker, J., & Solla, S. (1989). Optimal brain damage. Advances in neural information processing systems, 2.

[9] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Kaggle Writeups

[10] Team Hydrogen (2022). 1st place writeup in Feedback Prize – Predicting Effective Arguments

[11] Now You See Me (2022). 2nd place efficiency writeup in Feedback Prize – Predicting Effective Arguments

[12] Darjeeling Tea (2022). 5th place efficiency writeup in Feedback Prize – Predicting Effective Arguments

[13] Team Turing (2022). 1st place efficiency writeup in Feedback Prize – English Language Learning

[14] RUN OUT OF IDEAS💨 (2022). 2nd place efficiency writeup in Feedback Prize – English Language Learning

[15] Psi (2022). 5th place efficiency writeup in Feedback Prize – English Language Learning

[16] william.wu (2022). 6th place efficiency writeup in Feedback Prize – English Language Learning

[17] ktm (2022). 7th place efficiency writeup in Feedback Prize – English Language Learning

[18] Shobhit Upadhyaya (2022). 9th place efficiency writeup in Feedback Prize – English Language Learning

[19] Konni (2023). 2nd place efficiency writeup in Learning Equality – Curriculum Recommendations

[20] French Touch (2023). 5th place efficiency writeup in Predict Student Performance from Game Play

Web

[21] Barr, J. (2019). Amazon EC2 Update – Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing in AWS News Blog (accessed 16. July, 2023)

[22] Leopold, G. (2019). AWS to Offer Nvidia’s T4 GPUs for AI Inferencing in HPC Wire (accessed 16. July, 2023)

