
Artificial Intelligence is a Supercomputing problem

AI practitioners cannot shy away from our responsibility

SUPERCOMPUTING FOR ARTIFICIAL INTELLIGENCE – 01

Marenostrum Supercomputer – Barcelona Supercomputing Center (image from BSC)

[This post will be used in the master course Supercomputers Architecture at UPC Barcelona Tech with the support of the BSC]

The next generation of Artificial Intelligence applications imposes new and demanding computing requirements. What do the computer systems that support Artificial Intelligence look like? How did we get here? Who has access to these systems? What is our responsibility as Artificial Intelligence practitioners?

It’s an exciting time for Artificial Intelligence. At the Barcelona Supercomputing Center we have impressive scientific data analysis systems in genomics, bioinformatics, and astronomy, among many others: systems that can do amazing things we didn’t think were possible a few years ago.

We are also moving very fast in general-purpose applications. For example, in video analytics, our research group at UPC & BSC obtains valuable video object segmentation results with referring expressions: given a video and a linguistic phrase, we show how to generate binary masks for the object to which the phrase refers.

Image by Author based on images from RefVOS: Referring Expressions For VOS, M. Bellver et al.

Trigger for the AI explosion

The question is, why now? Artificial Intelligence has been around since the middle of the last century. John McCarthy coined the term Artificial Intelligence in the 1950s and was one of its founding fathers, along with Marvin Minsky. In 1958, Frank Rosenblatt built a prototype neural network, which he called the Perceptron. Moreover, the key ideas of Deep Learning neural networks for computer vision were already known in 1989, and the fundamental algorithms of Deep Learning for time series, such as LSTM, had already been developed by 1997, to give some examples. So why is the Artificial Intelligence boom happening now?

Let’s try to find out what triggered the AI explosion. Oriol Vinyals suggests in a recent tweet that datasets play an important role:

Clearly, the availability of big datasets has contributed to algorithmic efficiency in Deep Learning, which has doubled every 16 months over a period of 7 years:

Image Source: https://openai.com/blog/ai-and-efficiency/

This means that operations required to train a classifier to AlexNet-level performance on ImageNet have decreased by a factor of 44x between 2012 and 2019.

In this excellent presentation at MIT, Oriol Vinyals also points in this direction to the important contributions of big companies and universities through open-source projects such as TensorFlow, PyTorch, MXNet, and so on. These DL frameworks give us access to a vast amount of essentially state-of-the-art components, allowing researchers to focus on the core algorithmic ideas instead of paying too much attention to implementation details, which helps accelerate progress in algorithms.

Big datasets and open-source DL frameworks play an important role in creating "big" algorithms. But the current excitement is due to another crucial component, which was not present before 2012, when AlexNet won ImageNet. What else, besides data and algorithms, became available?

I don’t mean to contradict Oriol Vinyals’ point; he is the boss in this field!!! and a good friend of our research group! 😉

The answer is BIG Computers. COMPUTING POWER is a key component of the progress of Artificial Intelligence. Nowadays, Deep Learning or Reinforcement Learning is the result of mixing these three components:

Image by Author

How has computing evolved to meet the needs of artificial intelligence?

Have a look at this graphic from OpenAI that has become very popular:

Image by Author (data source)

Since 2012, the amount of computation required (or available) to generate artificial intelligence models has increased exponentially (The Y-axis is a logarithmic axis).

A petaflop/s-day (pfs-day) consists of performing 10¹⁵ operations per second for one day, or a total of about 10²⁰ operations.
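A quick back-of-the-envelope check of that unit (plain arithmetic, nothing framework-specific):

```python
# One petaflop/s sustained for a day: 10^15 operations/second * 86,400 seconds.
ops_per_second = 1e15
seconds_per_day = 24 * 60 * 60
print(f"{ops_per_second * seconds_per_day:.2e} operations")  # ~8.64e19, i.e. on the order of 10^20
```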

During this period, the computing requirements for training the largest models have grown by more than 300,000x: the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time.

The end of Moore’s Law

Let’s go back a bit and see how computing has evolved. Much of the improvement in computer performance comes from decades of miniaturization of computer components. All of you have heard about Moore’s Law. Right?

In 1975, Intel co-founder Gordon Moore predicted the regularity of this miniaturization trend, now called Moore’s Law, which, until recently, doubled the number of transistors on computer chips every two years.

Original paper: Moore, G. Progress in digital integrated electronics. In Proceedings of the International Electron Devices Meeting (Washington, D.C., Dec. 1975). IEEE, New York, 1975, 11–13.

Although Moore’s Law held for many decades, it began to slow sometime around 2000 and by 2018 showed a roughly 15-fold gap between Moore’s prediction and current capability (of processors created by companies like Intel). The current expectation is that the gap will continue to grow as CMOS technology approaches fundamental limits!

Sadly, just when we need much faster machines for Deep Learning, Moore’s law began to slow down!

In reality, another important observation from the computer architecture community accompanied Moore’s Law: Dennard scaling, a projection made by Robert Dennard stating that as transistor density increased, power consumption per transistor would drop, so the power per mm² of silicon would remain nearly constant. Since the computational capability of a mm² of silicon was increasing with each new generation of technology according to Moore’s Law, computers would become more energy efficient. However, the Dennard scaling projection began to slow significantly in 2007, and its benefits disappeared around 2010.

With the end of Dennard scaling, increasing the number of cores on a chip meant that power also increased at about the same rate. But the energy that goes into a processor must also be removed as heat, so multi-core processors are limited by their ability to dissipate it.

In short, the result of all these observations can be summarized in the following graph, based on the original "Growth of computer performance" chart created by Hennessy and Patterson:

Image by Author (data source)

In this graph, we can see that in the 1980s and 90s, when all these laws and observations were alive and well, we were turning transistors into faster computers, doubling performance about every 18 months.

What good times! How we miss them! Now performance roughly doubles only every 20 years. In summary, we went from a factor of 2 every 18 months to a factor of about 1.05 every 18 months.

In a very general way, taking an idea from a talk by Professor Ion Stoica of Berkeley at the Ray Summit, we can visually (and approximately) represent the impact of this computing performance growth on the previous graph. As can be seen, bearing in mind that the Y-axis is logarithmic, it in no way keeps up with the needs of AI algorithms.

Image by Author

Well, while Moore’s Law may have ended, the demand for increased compute has not. So a question arises: without Moore’s Law, how do we get faster machines?

What about specialized hardware?

To address this challenge, computer architects have focused their attention on building domain-specific processors that trade generality for performance. The idea behind this is: "Don’t try to do everything, but do a few things exceptionally well." Companies have raced to build specialized processors, such as NVIDIA’s GPUs and Google’s TPUs:

Image Source: Google and NVIDIA

What do we mean by "doing a few things exceptionally well"? For instance, GPUs contain hundreds of Tensor Cores that operate on 4×4 matrices, which greatly accelerates the basic operations of Deep Learning, such as multiplying the data matrix by the weight matrix and then adding the bias.
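As a small illustration, here is a minimal PyTorch sketch of exactly that operation, data matrix times weight matrix plus bias. It assumes a CUDA GPU with Tensor Cores (such as a V100) is available and falls back to the CPU otherwise; the matrix sizes are arbitrary placeholders. Running the matmul under mixed precision is what makes it eligible for Tensor Cores:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1024, 1024, device=device)  # mini-batch of input activations
w = torch.randn(1024, 1024, device=device)  # weight matrix
b = torch.randn(1024, device=device)        # bias vector

# Mixed precision (float16 on GPU) lets the hardware route the matmul through Tensor Cores.
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    y = x @ w + b                           # the core Deep Learning operation
print(y.shape, y.dtype)
```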

But in the end, specialized hardware is not enough. While accelerators like GPUs and TPUs bring more computational power to the table, they essentially help to prolong Moore’s Law further into the future, not to fundamentally increase the rate of improvement.

In a very general way, using the same OpenAI graph, we can visually represent the performance improvement that specialized architectures offer over CPUs. But, as can be seen, it still does not keep up with the needs of Deep Learning and Reinforcement Learning applications:

Image by Author

Parallelism: more than one domain-specific processor

Maybe we can put multiple domain-specific processors to work together? Let’s look at a concrete example we have at BSC, based on 4 GPUs that can work in parallel:

Image Source: https://bsc.es

This server, provided by IBM, has two POWER9 CPUs and four NVIDIA V100 GPUs. Now, how can we use these resources to improve computational speed? In the case of Deep Learning, there are typically two parallelism approaches to accomplish this purpose:

  • Model Parallelism
  • Data Parallelism

In the first approach, the different layers of the network are distributed across different devices; in the second, every GPU holds the same model but processes a separate piece of the data, a separate portion of the mini-batch.

Image by Author

Model parallelism is very useful when we have a large model that might not fit in a single GPU memory.

Data parallelism, however, is what most practitioners typically use to scale up the training process of a Deep Learning model, because their data set is so large that completing a single epoch on a single GPU can take a very long time: hours, days, or even weeks.

So when it is possible to split the data set to speed up training, we do it, as long as the model can tolerate a larger batch size.

We can use frameworks such as TensorFlow or PyTorch to program multi-GPU training. To parallelize the training of the model, you only need to wrap it with [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) in PyTorch or with [tf.distribute.MirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) in TensorFlow. Very easy!
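To make this concrete, here is a minimal sketch of the TensorFlow route (a toy dense network on MNIST; the layer sizes, batch size, and number of epochs are arbitrary choices for the example, not a recommendation). MirroredStrategy replicates the model on every GPU visible to the process and splits each batch among the replicas:

```python
import tensorflow as tf

# One replica of the model per visible GPU; gradients are all-reduced across replicas.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The model and optimizer must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Keras takes care of splitting each (global) batch across the replicas.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, batch_size=256, epochs=2)
```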

TensorFlow and PyTorch require a software stack with several software layers installed in the execution environment as Python packages. Libraries such as [cuDNN](https://developer.nvidia.com/cudnn) from NVIDIA also help us squeeze all the power out of the accelerators, for example by using the Tensor Cores I mentioned.

For instance, when I execute Deep Learning code on our supercomputer in Barcelona, I need to load all the modules listed here with the module load command:

$ module load python/3.7.4_ML cudnn/7.6.4 cuda/10.2 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 gcc/8.3.0

This is a huge and very important world, to which I am partly dedicated, and it is often transparent to Deep Learning users like you.

But in the end, with 4 GPUs, it is not possible to meet the needs of the challenges that arise in Deep Learning or Reinforcement Learning:

Image by Author

And the number of GPUs we can place in a single server is very limited; in the case of the POWER9 servers we are talking about, we can reach a maximum of 6 GPUs.

Multi-server: Distributed Computing

The way companies address this is to put many of these servers together! And this is what we did at BSC with this research platform, where 54 servers are linked together with an InfiniBand network over optical fiber.

Image by Author

These servers run Linux as their operating system, and each is composed of two POWER9 processors and four NVIDIA V100 GPUs with 512 GB of main memory, which means we have more than two hundred GPUs in total.

InfiniBand is an industry-standard interconnect between servers that allows the local memory of one server to be accessed quickly from remote servers.

In this new scenario, we need an extension of the software stack to deal with multiple distributed GPUs during the neural network training process. There are other options, but in our research group at BSC we decided to use [Horovod](https://towardsdatascience.com/distributed-deep-learning-with-horovod-2d1eea004cb2), from Uber. Horovod plugs into TensorFlow, PyTorch, and MXNet.

[Horovod](https://towardsdatascience.com/distributed-deep-learning-with-horovod-2d1eea004cb2) uses the Message Passing Interface (MPI) to communicate among the processes executed in a distributed fashion. MPI is a programming model ubiquitously present in any supercomputer for communicating processes executed on different servers. Horovod also uses NVIDIA’s NCCL2 library to manage data communication between the GPUs within a server.

To speed up training, Horovod uses the data parallelism training model introduced before. That is to say, all workers train on different data, all workers have the same copy of the model, and neural network gradients are exchanged among them.
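As a rough sketch of what this looks like in code, here is a minimal Horovod example with its Keras binding. The layer sizes, the random stand-in data, and the script name in the launch command (train.py) are placeholders for illustration. Each process drives one GPU, trains on its own shard of the data, and averages gradients with the other workers at every step:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# One process per GPU, launched e.g. with: horovodrun -np 4 python train.py
hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Every worker holds the same replica of the model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so that gradients are averaged across workers (via NCCL/MPI) at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

# Each worker trains on a different shard of the data (data parallelism).
x = np.random.rand(60000, 784).astype("float32")
y = np.random.randint(0, 10, size=(60000,))
shard_x, shard_y = x[hvd.rank()::hvd.size()], y[hvd.rank()::hvd.size()]

# Broadcast the initial weights from rank 0 so all replicas start identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(shard_x, shard_y, batch_size=256, epochs=2,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```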

The combination of parallelism and distribution strategies is what has satisfied the growing demand for computing that the Artificial Intelligence community has generated during these years.

New big computers: Monsters of Supercomputing

This mixture of hardware and software techniques allows the creation of true supercomputing monsters. For instance, Google has computing infrastructures with hundreds of TPUs that can be put together to collaborate on the challenges that arise in the Deep Learning and Reinforcement Learning community.

Image Source: https://cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer

A recent paper from Google presents a multilingual translation model with 600 billion parameters. To get an idea of the magnitude of the problem, we can compare it with the famous GPT-3, the third-generation language prediction model created by OpenAI, which "only" has 175 billion parameters.

In this case, we are talking about computing requirements equivalent to 22 years with 1 TPU. In the paper, the authors measure performance in TPU-years, an interesting metric! This means that if we only had one TPU available, it would take us 22 years to do the training.

In this case, Google distributed the training over 2048 TPUs and achieved results in only 4 days.
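A quick sanity check of that arithmetic, using only the numbers from the two paragraphs above and assuming ideal scaling across the TPUs:

```python
# 22 TPU-years spread over 2048 TPUs working in parallel (ideal scaling assumed).
tpu_years, num_tpus = 22, 2048
print(f"{tpu_years * 365 / num_tpus:.1f} days")  # ~3.9 days, consistent with the ~4 days reported
```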

Detailed information about the system architecture of these TPU based infrastructures from Google can be found here.

UPDATE 15/04/2021 – Research from Microsoft, NVIDIA, and Stanford University studies how to scale training to models with one trillion parameters. See the paper Efficient Large-Scale Language Model Training on GPU Clusters.

New big algorithms: Deep Reinforcement Learning

In the previous section we considered a Deep Learning problem as an example of today’s Artificial Intelligence applications that eagerly and quickly consume computing. But the cutting-edge applications in the Artificial Intelligence arena are actually based on Deep Reinforcement Learning models, which require vast amounts of computing.

If you want an introduction to Reinforcement Learning, you can find here a series of posts that covers the basic concepts of Reinforcement Learning and Deep Learning needed to get started in Deep Reinforcement Learning.

A few years ago, foundational algorithms such as DQN were designed to consume few hardware resources; for instance, 1 CPU + 1 GPU was enough. If we follow the timeline, the initial versions of distributed Reinforcement Learning (RL) algorithms required only a few more CPUs, such as the Asynchronous Advantage Actor-Critic method (A3C), which works very well with a few CPUs and a single GPU.

However, if we take a closer look at the most recent evolution of Reinforcement Learning algorithms, we see that they have required ever more computing resources. For instance, two years ago a large-scale distributed RL architecture named IMPALA was designed to take advantage of hundreds of CPUs.

The architecture of current Distributed Reinforcement Learning Agents is usually separated into actors and learners. This is the case of IMPALA:

Image source: https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

The actors, typically executed on CPUs, iterate between taking steps in the environment and running inference on the model to predict the next action. After collecting a sufficient number of observations, an actor sends a trajectory of observations and actions to the learner. The learner then optimizes the model and sends its parameters back to the actors, and each actor updates the parameters of its inference model. In this algorithm, the learner trains the model on GPUs using input from distributed inference on hundreds of CPU machines.
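To make the actor/learner split more tangible, here is a toy, self-contained sketch of the pattern using threads and a queue on a single machine. The random "environment", the linear "model", and every name in it are invented for illustration; this is not the real IMPALA implementation, only the shape of the data flow:

```python
import queue
import threading
import numpy as np

trajectory_queue = queue.Queue(maxsize=100)   # actors -> learner
weights = {"w": np.zeros(4)}                  # shared model parameters
weights_lock = threading.Lock()

def actor(actor_id, steps_per_trajectory=16):
    """Runs inference with a local copy of the model and ships trajectories to the learner."""
    rng = np.random.default_rng(actor_id)
    while True:
        with weights_lock:
            local_w = weights["w"].copy()     # actor keeps its own copy of the parameters
        trajectory = []
        for _ in range(steps_per_trajectory):
            obs = rng.normal(size=4)          # fake environment observation
            action = float(obs @ local_w)     # "inference" with the local copy
            trajectory.append((obs, action))
        trajectory_queue.put((actor_id, trajectory))

def learner(updates=10):
    """Consumes trajectories, updates the model, and publishes new parameters."""
    for step in range(updates):
        actor_id, trajectory = trajectory_queue.get()
        grad = np.mean([obs for obs, _ in trajectory], axis=0)
        with weights_lock:
            weights["w"] += 0.01 * grad       # stand-in for a real gradient update
        print(f"update {step}: trajectory from actor {actor_id}")

for i in range(4):
    threading.Thread(target=actor, args=(i,), daemon=True).start()
learner()
```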

A more recent distributed Reinforcement Learning method, SEED from DeepMind, can use more than two hundred TPUs. Amazing, right?! It allows what we consider truly massively scaled Reinforcement Learning.

Image by Jordi Torres

Okay, without going into detail, let’s say that the SEED Reinforcement Learning architecture is designed to solve some drawbacks of the IMPALA method. In this case, neural network inference is done centrally by the learner on TPUs (not in the actors as in IMPALA), enabling accelerated inference and avoiding the data-transfer bottleneck by ensuring that the model parameters and state are kept local.

While the actors send observations to the learner at every environment step, latency is kept low thanks to a very efficient network library, gRPC (comparable in purpose to the MPI I mentioned before). This makes it possible to achieve up to a million queries per second on a single machine.

Image source: https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

In summary, the learner can be scaled to thousands of cores (e.g., up to 2048 TPUs), and the number of actors can be scaled to thousands of machines to fully utilize the learner, making it possible to train at millions of frames per second. Impressive, right?

Artificial Intelligence: Supercomputing power is the real enabler!

We can conclude that computing is responding to the needs of the Artificial Intelligence community, allowing us to solve the proposed models. My thesis in this publication is that COMPUTING POWER is the real enabler, or, if you prefer, a key component of the progress of Artificial Intelligence, when we mix these three components: BIG DATA, BIG ALGORITHMS, and BIG COMPUTERS.

What drove changes in effective compute over this period? OpenAI split the AI and Compute trend into Moore’s Law and increased spending/parallelization, as well as progress in algorithmic efficiency:

Image Source: Measuring the Algorithmic Efficiency of Neural Networks, Danny Hernandez and Tom B. Brown, OpenAI.

The authors estimate a 7.5 million times increase in the effective training compute available to the largest AI experiments between 2012 and 2018.

And it seems that computing will continue to respond to the needs of the Deep Learning and Reinforcement Learning community, allowing them to solve the required models.

Imagine that Google needs more computing power for a new Reinforcement Learning algorithm; the only thing Google has to do is aggregate more parallel and distributed servers. And that’s all!

For example, a few months ago Google broke AI performance records in the industry-standard MLPerf benchmark, whose models are chosen to be representative of cutting-edge machine learning workloads.

In this case, the only thing Google needed to do was aggregate more servers. The resulting system includes 4096 TPUs and hundreds of CPU host machines connected via an ultra-fast interconnect. In total, this system delivers over 430 PFLOPs of peak performance.

Image Source: https://cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer

It seems that for now, adding servers allows us to respond to the needs of AI models. Easy, Right? Well, it’s not like that!

Something to think over

Before finishing my publication, let me assign you some "homework" to do on your own. Do you commit to doing them? I hope so!

Who can own and pay for these supercomputers?

After reading the previous section, an important question arises: how large is the computing bill for solving these challenges? Have you thought about that?

For instance, according to the following tweet, the estimated cost of training GPT-3, the transformer language model that uses Deep Learning to produce human-like text, which I mentioned earlier, is nearly 12 million dollars on the public cloud.

UPDATE 29/11/2020: A new paper from Tencent is another demonstration of the power of scale, using for training a cluster with 250,000 CPU cores and 2,000 NVIDIA V100 GPUs.

Maybe you have heard about the Top500 list, the list of the fastest computers in the world, issued twice a year, in June and November, during the two main supercomputing conferences in the world (SC and ISC).

In general, these are supercomputers hosted by public institutions, with their peak performance in Teraflops (10¹² operations per second) shown in the following table. For instance, the Top 1 has a peak performance of about 500,000 Teraflops.

TFlop/s = 1 000 000 000 000 numerical operations per second.

Top 10 from the Top500 list, June 2020 (Image by Jordi Torres)

Now, Marenostrum 4, the supercomputer hosted in Barcelona, in the chapel of Torre Girona at UPC university campus, occupies position 38 in this list, not bad for us! (virtual visit).

The Google system mentioned before, which includes 4096 TPUs and hundreds of CPU host machines connected via an ultra-fast interconnect, delivers over 430,000 TFLOPs of peak performance: close to the number one in the world (according to the June 2020 list), and far ahead of the second and the rest!

To create AI, we need supercomputers. Who can own and pay for these supercomputers? Only nation-states and multinational corporations?

Artificial Intelligence Carbon Footprint

Last week, the Spanish newspaper La Vanguardia published this article: "The digital world is the third polluter on the planet".

Image by Jordi Torres and Júlia Torres

Research by the University of Massachusetts also warns about the unsustainable costs of Artificial Intelligence. The authors estimate that the carbon cost of training a common NLP model is comparable to the emissions of 125 round-trip flights between New York and Beijing.

These numbers are relative, though, because organizations can power their computing facilities with renewable energy sources and thereby reduce their carbon footprint. For instance, Iceland’s energy comes entirely from renewable geothermal and hydroelectric power, and its national grid is modern and reliable; this means that the Artificial Intelligence systems housed there run on cleaner energy.

But even if these numbers are inflated, the exponential growth of Artificial Intelligence’s computing needs makes it difficult to believe that we can power supercomputing with green energy alone in the short term.

At present, the vast majority of Artificial Intelligence research on algorithms is focused on achieving the highest levels of accuracy, without much concern for computational or energy efficiency. But as the world’s attention shifts to climate change, shouldn’t the field of Artificial Intelligence take note of its carbon footprint?

We cannot shy away from our responsibility

Artificial Intelligence is definitely permeating society, much like electricity did. What should we expect? The future we will "invent" is a choice we make jointly, not something that just happens.

This is good! For instance, genetics and genomics look for mutations and links to disease in the information contained in DNA, and with the help of Artificial Intelligence, body scans can spot diseases early and predict the health issues people might face based on their genetics.

But as with most things in life, where there is light, there is shadow. Artificial Intelligence algorithms propagate gender biases, and AI systems have monitored citizens at large without their informed consent, among many other bad things!

We must mull over the imminent adoption of Artificial Intelligence and its impact. If we go on building Artificial Intelligence without regard for our responsibility to prevent its misuse, we can never expect to see Artificial Intelligence help humanity prosper.

All of us who are working, or want to work, on these topics cannot shy away from our responsibility; otherwise, we will regret it in the future.

Thank you for reading this publication!

Acknowledgment: Many thanks to Juan Luis Domínguez, Alvaro Jover Alvarez, Miquel Escobar Castells and Raul Garcia Fuentes for their contributions to the proofreading of this document.


Emerging Technologies for Artificial Intelligence Research Group at BSC-CNS

Our research group at Barcelona Supercomputing Center and UPC Barcelona Tech is doing research on this topic.

Motivation

Real-world challenges, e.g., image processing in sectors such as health or banking, among others, are driving fundamental research towards the creation of novel large deep and reinforcement learning models. However, the creation of these new models is only part of the solution to these challenges. Their training processes require massive amounts of computation and execution time, and scaling up large deep and reinforcement learning models on today’s parallel and distributed infrastructures has become a significant challenge, as it requires great multidisciplinary expertise in machine learning and supercomputing. In general, these are two areas of research that so far have not gone together, and an effort is now required to provide joint solutions for the parallelization of these algorithms, which demands not only reprogramming them but also knowing how to use parallel and distributed resources efficiently. As Rich Sutton, one of the leading researchers in Reinforcement Learning, recently put it, "general methods augmented by massive computation are the most effective". Our research group aims to introduce solutions that bridge these two research worlds. The AI revolution is not only about new mathematical models; it is about how to take advantage of the unprecedented opportunities that HPC offers for next-generation deep and reinforcement learning methods.

Our latest paper in this area

"Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills" presented in the 37th International Conference on Machine Learning (ICML2020). The paper presents a novel paradigm for unsupervised skill discovery in Reinforcement Learning. It is the last contribution of @vcampos7, one of our Ph.D. students co-advised with@DocXavi. This paper is co-authored with @alexrtrott, @CaimingXiong, @RichardSocher from Salesforce Research.

About BSC and UPC

The [Barcelona](https://en.wikipedia.org/wiki/Barcelona) Supercomputing Center (BSC) is a public research center located in Barcelona. It hosts MareNostrum, a 13.7 Petaflops supercomputer, which also includes clusters of emerging technologies. In June 2017, it ranked 13th in the world.

The Polytechnic University of Catalonia (Universitat Politècnica de Catalunya), currently referred to as BarcelonaTech, and commonly known as UPC, is the largest engineering university in Catalonia, Spain. It also offers programs in other disciplines such as mathematics and architecture.


Content of this series:

SUPERCOMPUTING FOR ARTIFICIAL INTELLIGENCE

  1. Artificial Intelligence is a Supercomputing problem
  2. Using Supercomputers for Deep Learning Training
  3. Scalable Deep Learning on Parallel and Distributed Infrastructures
  4. Train a Neural Network on multi-GPU with TensorFlow
  5. Distributed Deep Learning with Horovod

[Deep Reinforcement Learning Explained – Jordi TORRES.AI](https://torres.ai/deep-reinforcement-learning-explained-series/)

