
If you search the internet for "Building a PC", you will find thousands of game lovers who build their own PCs with passion. What about data lovers? Why don’t they do the same? Mostly because they are in love with their laptops, cloud computing services (like AWS EC2), or their employers’ computing servers. Although cloud computing services and on-premise servers can be good options for large projects, for many smaller projects (like your personal side projects), building a PC is a good long-term investment.
For this reason, I did extensive research on the available hardware options and tried to pick the best affordable (and, of course, compatible) components. The art of building a PC (for both data scientists and gamers) is not in connecting wires or installing devices (which is, by the way, super easy). What matters is finding hardware components that are matched in performance, so that none of them creates a bottleneck. The worst outcome is to waste money on an expensive component whose full performance you cannot use because another component is holding it back.
In this article, I will show you how to build an affordable PC for your next Data Science side project. I’ll go through all the necessary components and try to explain which options give THE BEST BANG FOR YOUR BUCK.
GPU
Let’s start with the most important component for data scientists (especially those who work on Deep Learning projects). As you guessed correctly, I am talking about the GPU. This component is so important that I dedicated a whole article to it. You can read my article here. As I suggested in that article, the NVIDIA GeForce RTX 2060 is an affordable option to start with. The GeForce RTX 2060 is built on Turing (one of the latest GPU architectures developed by NVIDIA) and has 240 Tensor cores (which make DL training much faster than on GTX GPUs). The only issue with the RTX 2060 is its memory size. I would like to expand on this issue.
During training, GPU memory (also called VRAM) stores the following types of data (source):
- Neural Network Parameters: The weights and biases of the network.
- Optimizer’s Variables: Per-algorithm intermediate variables (e.g. momentums).
- Intermediate Calculations: Values from the forward pass that are temporarily stored in GPU memory and then used in the backward pass.
- Workspace: Temporary memory for local variables of kernel implementations.
Some state-of-the-art deep learning models need more than 8GB of VRAM (source). However, simpler models fit in 8GB of VRAM or less. The GeForce RTX 2060 has 6GB of GDDR6 memory. As I said, that is enough for many small projects (which I suppose is your interest here), but it might not be enough for larger projects.
Again, since the goal is an affordable PC, I still suggest the GeForce RTX 2060, which you can buy for $330 as of November 2020.
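
To get a feel for how quickly the items in the list above add up, here is a minimal sketch of a lower-bound VRAM estimate. It assumes 32-bit floats and an Adam-style optimizer that keeps two extra values per parameter, and it deliberately ignores intermediate calculations and workspace, which depend on batch size and architecture. The function name and the two example model sizes are hypothetical.

```python
def estimate_training_vram_gb(num_params, bytes_per_value=4, optimizer_slots=2):
    """Rough lower bound (in GB) for weights, gradients and optimizer variables.

    Assumptions: 32-bit values and an Adam-style optimizer with two extra
    values per parameter. Activations and workspace are NOT included.
    """
    copies = 1 + 1 + optimizer_slots  # weights + gradients + optimizer variables
    return num_params * bytes_per_value * copies / 1e9


# Two hypothetical model sizes, for illustration only.
for label, num_params in [("25M parameters", 25_000_000),
                          ("340M parameters", 340_000_000)]:
    print(f"{label}: at least {estimate_training_vram_gb(num_params):.2f} GB")
```

Under these assumptions the larger example already needs over 5GB before any activations are stored, which is why 6GB of VRAM can become tight for bigger models.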

CPU
Before choosing a CPU for our data science PC, we must think about how much this component matters in machine learning, especially for deep learning algorithms. Most machine learning algorithms (including Neural Networks) are embarrassingly parallel (as opposed to inherently sequential methods). Therefore, a GPU speeds up these kinds of problems far more than a CPU does. Especially during Deep Learning training (which is probably the most time-consuming part of a project), the CPU normally does not contribute to the computation directly. Its most important computational contribution in such projects is pre-processing. As long as we pre-process the whole dataset up front (rather than each batch separately), CPU performance is unlikely to become a bottleneck during training. Therefore, we don’t need a high-end CPU for our Data Science PC.
When it comes to good performance at a good price, AMD is the winner in my opinion. For our system, Intel CPUs are overkill and overpriced. My recommendation is to go with the AMD Ryzen 5 3600 ($199 as of November 2020). It has 6 cores and 12 threads, which makes it more than enough for any small to medium-scale deep learning project.
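
The idea of pre-processing the whole dataset once, instead of repeating the work for every batch, can be sketched in plain Python as follows. The toy data, the `normalize` step, and the function names are all hypothetical stand-ins; a real project would use its own pre-processing and a deep learning framework for the training step.

```python
def normalize(values):
    """Scale a list of numbers to the range [0, 1] (the CPU-side pre-processing)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [(v - lo) / span for v in values]


def batches(data, batch_size):
    """Yield consecutive slices of already pre-processed data."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


raw = [3, 7, 11, 15, 19, 23]   # toy stand-in for a dataset
prepared = normalize(raw)      # done once, before training starts

for epoch in range(2):
    for batch in batches(prepared, batch_size=4):
        pass  # a real training step would send `batch` to the GPU here
```

Because `normalize` runs once outside the epoch loop, the CPU does no repeated work while the GPU is training.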

RAM
When you use a GPU for deep learning, training data is loaded from disk into RAM and then sent to the graphics card’s memory (VRAM). Therefore, a very large amount of RAM will neither increase your training speed nor let you train with larger batches. In fact, you only need slightly more RAM than VRAM to keep RAM from becoming a bottleneck for your VRAM. I suggest playing it safe and buying at least twice as much RAM as you have VRAM. In our case, since we are using an RTX 2060 with 6GB of graphics memory, that means at least 12GB, so a 16GB DDR4 kit is the natural choice. My recommendation is Corsair Vengeance LPX 16GB (2x8GB) DDR4 DRAM 3200MHz ($70 as of November 2020).
NOTE: If you are working on genetic data science projects or any project with a large data set and heavy pre-processing, I advise you to increase the memory to at least 32GB.
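
The rule of thumb above (about twice the VRAM, rounded up to a size you can actually buy) can be written as a small helper. This is only a sketch: the list of kit sizes is an assumption about what is commonly sold, and the function name is hypothetical.

```python
# Assumed list of commonly sold RAM kit sizes, in GB.
COMMON_KIT_SIZES_GB = (8, 16, 32, 64, 128)


def recommend_ram_gb(vram_gb, factor=2):
    """Return the smallest common kit size that is at least factor * VRAM."""
    target = vram_gb * factor
    for size in COMMON_KIT_SIZES_GB:
        if size >= target:
            return size
    return COMMON_KIT_SIZES_GB[-1]  # cap at the largest size in the list


print(recommend_ram_gb(6))  # RTX 2060 with 6GB of VRAM -> 16
```

For data-heavy projects like the ones in the note, you could pass a larger `factor` rather than change the rule itself.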

Motherboard
Normally, your choice of motherboard depends on your choice of CPU. Fewer chipsets are compatible with AMD CPUs than with Intel CPUs. MSI, GIGABYTE, and ASUS are big names in reliable motherboards (including boards with AMD chipsets). To keep my cost down, I chose the MSI B450M Pro-VDH Max. This motherboard is fully compatible with the parts that I chose for my PC. As of November 2020, it costs about $85. If you are planning to upgrade your GPU in the future, keep in mind that B450 boards might not support your next GPU. NVIDIA recently announced that B450 motherboards will still support its latest GeForce RTX 30 series, but there is no guarantee that this will hold for the next generation of NVIDIA GPUs.

Storage Disks
Data science projects can be very large and need a lot of disk space. At the same time, we need fast storage that does not act as a bottleneck. We currently have three different types of storage disks:
- HDD: These are the most affordable option for data storage. You can buy terabytes of HDD storage for as low as ~$25 per TB. The biggest drawback of an HDD is its read and write speed. Typical read/write speeds are 80-200 MB/sec, which is mainly due to the mechanical design of this kind of device.
- SSD: Solid State Drives provide much faster data transfer than HDDs. SATA SSDs typically reach about 550 MB/sec for reads and writes, a ceiling set by the SATA III interface. They are more expensive than HDDs, but their prices have dropped significantly in the last few years.
- NVMe (Non-Volatile Memory Express): These are the fastest and most expensive storage drives for PCs. Typical models provide around 3500/3300 MB/sec read and write speeds, and some newer NVMe drives reach up to 5000/4400 MB/sec.
To satisfy both budget and performance, people usually go with two kinds of drives. We normally store inactive data on an HDD and save money by buying a smaller SSD or NVMe drive. Remember to install the operating system and your code on the SSD or NVMe drive. Also, move your data from the HDD to the SSD or NVMe drive before running your code.
I highly recommend the Seagate BarraCuda 2TB ($55 as of November 2020) as your HDD (or you may buy the 4TB version for $99). I also recommend buying an NVMe drive instead of a SATA SSD, since the price difference is smaller than the performance difference. For NVMe, my recommendation is the Samsung 970 EVO Plus SSD 500GB M.2 ($90 as of November 2020). Samsung is a trusted brand in the SSD world, so I prefer this drive over cheaper options. Also, remember that this NVMe drive needs a motherboard with an M.2 slot. The MSI B450 motherboard that I recommended earlier has one.
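
To see why it pays to move active data off the HDD, here is a back-of-the-envelope sketch that converts the sequential read speeds quoted above into load times. The 100GB dataset size is hypothetical, and real throughput is usually lower than these peak figures, especially with many small files.

```python
def load_time_seconds(dataset_gb, read_mb_per_sec):
    """Time to read a dataset once at a given sequential speed (1 GB = 1000 MB)."""
    return dataset_gb * 1000 / read_mb_per_sec


dataset_gb = 100  # hypothetical dataset size

for label, speed in [("HDD at 200 MB/sec", 200),
                     ("NVMe at 3500 MB/sec", 3500)]:
    print(f"{label}: about {load_time_seconds(dataset_gb, speed):.0f} seconds")
```

Even against the fastest HDD figure, the NVMe drive reads the same data more than ten times faster under these assumptions.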

Fans
It might not seem obvious, but fans are super important for this kind of PC. Deep learning training is a heavy workload for the GPU that normally lasts hours, if not days. Therefore, there is a good chance that thermal throttling will happen along the way. When the GPU takes on a heavy workload, it generates a lot of heat. If your PC’s cooling system cannot cool the GPU (or CPU) fast enough to keep temperatures within a safe range, the GPU (or CPU) slows itself down to cool off and survive. Thermal throttling can happen frequently if your PC does not provide good ventilation and cooling for the GPU and CPU. To avoid it, the best and most affordable solution is to install as many fans as you can. It is not necessary, but I decided to go with RGB fans to give my system a little bit of vibe as well as good ventilation. I recommend the upHere RGB Case Fan (pack of 5), which is about $40 as of November 2020.
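
If you want to check whether your cooling is keeping up, one option is to poll `nvidia-smi` during training. The sketch below assumes an NVIDIA driver is installed and `nvidia-smi` is on your PATH; the 80°C warning threshold and all function names are hypothetical, so adjust them for your card.

```python
import subprocess

# Ask nvidia-smi for the GPU temperature (°C) and SM clock (MHz) as plain CSV.
QUERY = ["nvidia-smi",
         "--query-gpu=temperature.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]

WARN_TEMP_C = 80  # hypothetical warning threshold, not an official limit


def parse_status(line):
    """Turn one CSV line such as '71, 1680' into (temperature, sm_clock)."""
    temp, clock = (int(field.strip()) for field in line.split(","))
    return temp, clock


def read_gpu_status():
    """Return one (temperature, sm_clock) tuple per GPU. Requires nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_status(line) for line in out.splitlines() if line.strip()]


def is_running_hot(temp_c, limit_c=WARN_TEMP_C):
    """True when the temperature is at or above the warning threshold."""
    return temp_c >= limit_c
```

Calling `read_gpu_status()` every few seconds while a model trains gives you a simple log. A temperature that keeps climbing while the SM clock drops is a sign that throttling may be happening.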

PSU (Power Supply Unit)
This component provides power to all the previously mentioned components, and a lack of power can cause slowdowns or permanent damage to them. You must provide enough power, plus a safety margin, to make sure your PC can run at its maximum capacity. PSUs come in different wattages, and to calculate the necessary power for your PC, I recommend using https://pcpartpicker.com/list/.
For the system that I recommended, we need less than 450W of power. My recommendation for the PSU is the Corsair CV Series, CV450, 450 Watt ($55 as of November 2020).
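
The "enough power plus a safety margin" rule can be sketched as a simple sum. The GPU and CPU figures below are the manufacturers' rated values as far as I know (160W board power for the RTX 2060, 65W TDP for the Ryzen 5 3600). The 75W for everything else and the 30% margin are rough assumptions, so treat the result as a sanity check and not as a substitute for a proper calculator.

```python
# Estimated power draw per part, in watts.
ESTIMATED_DRAW_W = {
    "GPU (RTX 2060)": 160,       # rated board power
    "CPU (Ryzen 5 3600)": 65,    # rated TDP
    "Everything else": 75,       # rough assumption: board, RAM, drives, fans
}


def required_psu_watts(draw_by_part, margin_pct=30):
    """Total estimated draw plus a safety margin given in percent."""
    total = sum(draw_by_part.values())
    return total * (100 + margin_pct) / 100


needed = required_psu_watts(ESTIMATED_DRAW_W)
print(f"Suggested minimum PSU: {needed:.0f} W")
print("450 W PSU is enough:", needed <= 450)
```

Under these assumptions the build lands comfortably below the 450W unit recommended above.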

Case
There are only two considerations for this component. 1) It must have enough room to accommodate all the hardware components. 2) It must provide good ventilation. My recommendation is to go with a mid-tower case like the NZXT H510 ($70 as of November 2020).

Conclusion
I used this system for two months to make sure it works as expected before suggesting it to anyone else. During those two months, there was only one situation in which I noticed that my system needed an upgrade. As mentioned, if you are working on big data with heavy pre-processing, like DNA data, you need more memory. I was working on a genetic data science project and felt that my system needed more RAM. Therefore, I upgraded my RAM to 32GB, and it is working perfectly.

The whole system costs less than $1000. For me, it was one of the best $1000 that I have spent and invested in myself and my personal projects. Let me know what you think, especially if you have a different opinion.
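
As a quick check on the total, the prices quoted in this article (all as of November 2020, with the 2TB HDD and the 16GB RAM kit) can be added up in a few lines. The dictionary and variable names are just for illustration.

```python
# Prices in USD as quoted in this article (November 2020).
BUILD_PRICES_USD = {
    "GPU: GeForce RTX 2060": 330,
    "CPU: AMD Ryzen 5 3600": 199,
    "RAM: Corsair Vengeance LPX 16GB": 70,
    "Motherboard: MSI B450M Pro-VDH Max": 85,
    "HDD: Seagate BarraCuda 2TB": 55,
    "NVMe: Samsung 970 EVO Plus 500GB": 90,
    "Fans: upHere RGB (pack of 5)": 40,
    "PSU: Corsair CV450": 55,
    "Case: NZXT H510": 70,
}

total = sum(BUILD_PRICES_USD.values())
print(f"Total: ${total}")  # Total: $994
```

Note that upgrading to 32GB of RAM or the 4TB HDD would push the total above this figure.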
