When to Run Code on CPU and Not GPU: Typical Cases

How to choose the hardware to optimize the computation for your use case

Robert Kwiatkowski
Towards Data Science


Photo by Francesco Vantini on Unsplash

With rapidly advancing technologies like Artificial Intelligence (AI), Machine Learning (ML), Internet of Things (IoT), Virtual Reality (VR) and sophisticated numerical simulations, the demand for computational power is reaching unprecedented heights. In most real-life cases, it’s not only the computational power that counts but also factors like the hardware size and power consumption.

When designing a system, depending on the business and technical requirements, you can choose from various computational components such as:

  • Integrated Circuits (ICs)
  • Microcontrollers (MCs)
  • Central Processing Units (CPUs)
  • Graphics Processing Units (GPUs)
  • Specialized chips like Tensor Processing Units (TPUs)

Although the landscape of computing is ever-evolving, two integral components have revolutionized the way we process data and execute complex tasks: the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU). Both computing powerhouses are instrumental in propelling advancements in various fields:

  • Artificial intelligence (e.g. ChatGPT)
  • Scientific simulations (e.g. Finite Element Methods, CFD)
  • Gaming
  • Visual effects

Therefore, understanding the unique capabilities and performance characteristics of CPUs and GPUs is crucial to harnessing their full potential and optimizing the entire system for the business requirements.

When talking about reducing the cost of running computations, you should consider the following aspects:

  • Hardware Cost
  • Power Consumption
  • Performance Efficiency
  • Maintenance and Upgrades

Regarding purchase cost, CPUs are more affordable than mid- and high-end GPUs. GPUs, however, consume more energy than CPUs due to their higher core counts and memory bandwidth. So if you are designing a simple IoT device powered by a 5V battery, you will probably focus on low power consumption and low hardware cost, in which case a CPU (or even an IC or MC) would be the best choice.

There is also the option of using public cloud resources from providers like Google Cloud, Amazon Web Services, or Microsoft Azure, where you pay only for the usage time. This is the best choice if you are designing a web service, or if your workload requires high computational power but your hardware is constrained by size or by a need for remote access and control. However, in mass-produced devices like smartphones or smartwatches, hardware cost and power consumption still play a decisive role.

Now, when considering computational costs, the total expense is usually a function of the time required to complete the task end to end. A good example is the training of a neural network (NN), which is a very computationally intensive task in itself. From an end-to-end perspective, there are additional steps such as data preparation and experiment tracking (for hyperparameter optimization), which create extra overhead. During development, however, training is still the bottleneck, so the most popular ML frameworks (e.g. PyTorch, Keras) support GPU computation to tackle this problem. This is a classic case for utilizing the capabilities of a GPU: because of its underlying implementation, the training of an NN is ideal for massive parallelization. Inference, on the other hand (after the model is trained), can often be done on a CPU or even a microcontroller. After training, the bottleneck may shift to data preparation (memory access) or I/O operations, and for both of these CPUs are often more suitable. That is why there are even dedicated processors for such tasks (e.g. Intel Atom® x7000E Series processors). So we arrive at two different optimal solutions, depending on the environment: GPU for development and CPU for production.
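
As a rough illustration of this split, here is a minimal PyTorch sketch (the network architecture, dummy data, and hyperparameters are placeholders, not taken from any particular project): the model is trained on the GPU when one is available and then moved to the CPU for inference.

    import torch
    import torch.nn as nn

    # Use the GPU for training if one is available; fall back to the CPU otherwise.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # A small placeholder feed-forward network, used only for illustration.
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Training loop: the model and the (dummy) data live on the GPU if available.
    for _ in range(100):
        x = torch.randn(256, 64, device=device)          # dummy batch
        y = torch.randint(0, 10, (256,), device=device)  # dummy labels
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Inference: the trained model can be moved to the CPU for deployment.
    model = model.to("cpu").eval()
    with torch.no_grad():
        prediction = model(torch.randn(1, 64))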

As you can see, while GPUs excel at heavy parallel processing, there are several situations where CPUs outperform them from an end-to-end perspective. It depends on the nature of the algorithms and on the business requirements. If you are a software developer or a system designer/architect, knowing these situations is crucial to delivering the optimal solution.

In this article, we will explore some of these scenarios.

Single-threaded recursive algorithms

There are algorithms that by design are not subject to parallelization: recursive algorithms. In recursion, the current value depends on the previous values. One simple but clear example is the algorithm for calculating Fibonacci numbers; an exemplary implementation is shown below. In this case, it is impossible to break the chain of calculations and run its parts in parallel.
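
For instance, a straightforward recursive implementation in Python might look like this:

    def fibonacci(n: int) -> int:
        # fibonacci(n) is built from fibonacci(n - 1) and fibonacci(n - 2),
        # so the values are produced through a chain of dependent calls.
        if n < 2:
            return n
        return fibonacci(n - 1) + fibonacci(n - 2)

    print(fibonacci(10))  # 55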

Another example of such an algorithm is the recursive calculation of a factorial (see below).
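
A simple recursive Python implementation could look as follows:

    def factorial(n: int) -> int:
        # factorial(n) needs factorial(n - 1) first, so each step has to
        # wait for the previous one to finish.
        if n <= 1:
            return 1
        return n * factorial(n - 1)

    print(factorial(5))  # 120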

Memory-Intensive Tasks

There are tasks where memory access time, not the computation itself, is the bottleneck. CPUs usually have larger caches (fast memory close to the cores) and lower memory latency than GPUs, which allows them to excel at manipulating frequently accessed data. A simple example is the element-wise addition of large arrays.

However, in many cases, popular frameworks (like PyTorch) will perform such calculations faster on a GPU by moving the objects to the GPU’s memory and parallelizing the operations under the hood.

We can create a process where we initialize the arrays in RAM and then move them to the GPU for the calculations. The additional overhead of transferring the data can make the end-to-end processing time longer than running the computation directly on the CPU.
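
A rough sketch of that pattern, assuming PyTorch and a CUDA-capable GPU (the array size is an arbitrary placeholder), could look like this:

    import time

    import numpy as np
    import torch

    # Two large arrays initialized in RAM.
    a = np.random.rand(10_000_000).astype(np.float32)
    b = np.random.rand(10_000_000).astype(np.float32)

    # Element-wise addition directly on the CPU.
    start = time.perf_counter()
    c_cpu = a + b
    print(f"CPU: {time.perf_counter() - start:.4f} s")

    # The same operation on the GPU, including the cost of copying the data
    # from RAM to VRAM and bringing the result back.
    start = time.perf_counter()
    a_gpu = torch.from_numpy(a).to("cuda")
    b_gpu = torch.from_numpy(b).to("cuda")
    c_gpu = (a_gpu + b_gpu).cpu()
    print(f"GPU incl. transfer: {time.perf_counter() - start:.4f} s")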

That’s when we usually use so-called CUDA-enabled arrays, in this case PyTorch tensors allocated directly in GPU memory. You only have to make sure that your GPU can handle this amount of data. To give you an overview: typical, popular GPUs have 2–6 GB of VRAM, while high-end ones have up to 24 GB of VRAM (GeForce RTX 4090).
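
For instance, still assuming PyTorch and a CUDA-capable card, the tensors can be created directly in VRAM so that no host-to-device copy is needed (the sizes are again placeholders):

    import torch

    # Allocate the tensors directly in VRAM instead of building them in RAM first.
    a = torch.rand(10_000_000, device="cuda")
    b = torch.rand(10_000_000, device="cuda")

    # The addition runs in parallel across the GPU cores; no host-to-device
    # copy is needed because the data already lives on the GPU.
    c = a + b
    torch.cuda.synchronize()  # wait for the kernel to finish before using the result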

Other Non-parallelizable Algorithms

There is a group of algorithms that are not recursive but still cannot be parallelized. Some examples are:

  • Gradient Descent — used in optimization tasks and machine learning
  • Hash-chaining — used in cryptography

Gradient descent cannot be parallelized in its vanilla form because it is a sequential algorithm: every iteration (called a step) depends on the result of the previous one. There are, however, studies on how to implement it in a parallel or distributed manner.
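
As a minimal illustration of this sequential dependence, here is a vanilla gradient descent loop for f(x) = x² (the learning rate, starting point, and number of steps are arbitrary placeholders):

    def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
        x = x0
        for _ in range(steps):
            # Every step uses the position produced by the previous step,
            # so the iterations cannot run independently of one another.
            x = x - learning_rate * grad(x)
        return x

    # Minimize f(x) = x**2, whose gradient is 2 * x; the result approaches 0.
    print(gradient_descent(lambda x: 2 * x, x0=5.0))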

You can find an example implementation of the hash-chaining algorithm here: https://www.geeksforgeeks.org/c-program-hashing-chaining/
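
In the cryptographic sense, a hash chain applies a hash function repeatedly, so each link depends on the digest of the previous one and the chain has to be computed strictly in order. A minimal Python sketch (the seed and chain length are placeholders):

    import hashlib

    def hash_chain(seed: bytes, length: int) -> bytes:
        digest = seed
        for _ in range(length):
            # Each link is the SHA-256 hash of the previous one, so the chain
            # cannot be computed out of order.
            digest = hashlib.sha256(digest).digest()
        return digest

    print(hash_chain(b"example seed", 1000).hex())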

Small tasks

Another case where CPUs are a better choice is when the data size is very small. In such situations, the overhead of transferring data between RAM and GPU memory (VRAM) can outweigh the benefit of GPU parallelism, because small data fits into the very fast CPU cache, as mentioned in the section on memory-intensive tasks above.

Also, some tasks are simply so small that, although the calculations could be run in parallel, the benefit would not be visible to the end user. In such cases, running on a GPU only generates additional hardware-related costs.

That’s why GPUs are not commonly used in IoT. Typical IoT tasks include:

  • capturing sensor data and sending it onwards
  • activating other devices (lights, alarms, motors, etc.) after detecting a signal

However, GPUs are still used in this field for so-called edge-computing tasks. These are situations where you have to acquire and process data directly at its source instead of sending it over the Internet for heavy processing. A good example is the iFACTORY developed by BMW.

Tasks with a low level of parallelization

There are numerous use cases where you have to run code in parallel, but thanks to the speed of modern CPUs it is enough to parallelize the process across the cores of a multi-core CPU. GPUs excel in situations where you need massive parallelization (hundreds or thousands of parallel operations). In cases where you find that, for example, a 4x or 6x speed-up is enough, you can reduce costs by running the code on a CPU, with each process on a different core. Consumer CPUs are typically offered with between 2 and 18 cores (e.g. the Intel Core i9-9980XE Extreme Edition processor with 18 cores).
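
A minimal sketch of this multi-core approach using Python’s standard multiprocessing module (the worker function, data, and process count are placeholders):

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Placeholder for a moderately heavy, independent piece of work.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]  # split the work into 4 parts

        # Four worker processes, roughly one per core, can give close to a
        # 4x speed-up for CPU-bound work like this.
        with Pool(processes=4) as pool:
            results = pool.map(process_chunk, chunks)

        print(sum(results))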

Summary

Overall, the rule of thumb when choosing between CPU and GPU is to answer these main questions:

  1. Can a CPU handle the entire task within the required time?
  2. Can my code be parallelized?
  3. Can I fit all the data into GPU memory? If not, does moving it introduce heavy overhead?

To answer these questions, it’s crucial to understand well both how your algorithms work and what the business requirements are now, as well as how they may change in the future.
