
How to Set Up a Multi-GPU Linux Machine for Deep Learning in 2024

Super-fast setup of CUDA and PyTorch in minutes!

DEEP LEARNING WITH MULTIPLE GPUS

Image by Author: Multi-GPU machine (cartoon)

As Deep Learning models (especially LLMs) keep getting bigger, the need for more GPU memory (VRAM) is ever-increasing for developing them and using them locally. Building or obtaining a multi-GPU machine is just the first part of the challenge. Most libraries and applications only use a single GPU by default. Thus, the machine also needs to have appropriate drivers along with libraries that can leverage the multi-GPU setup.
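For example, in PyTorch (which we install below), CUDA operations land on a single default device unless you target other GPUs explicitly. A minimal illustration, runnable once the setup below is complete:

import torch

# Unless told otherwise, CUDA tensors are created on the current device (cuda:0 by default)
x = torch.randn(1024, 1024, device="cuda")

# Other GPUs must be targeted explicitly, e.g. the fourth GPU:
y = torch.randn(1024, 1024, device="cuda:3")

print(x.device, y.device)  # prints: cuda:0 cuda:3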

This story provides a guide on how to set up a multi-GPU (Nvidia) Linux machine with important libraries. This will hopefully save you some time on experimentation and get you started on your development.

At the end, links are provided to popular open-source libraries that can leverage the multi-GPU setup for Deep Learning.

Target

Set up a Multi-GPU Linux system with necessary libraries such as CUDA Toolkit and PyTorch to get started with Deep Learning 🤖. The same steps also apply to a single GPU machine.

We will install 1) CUDA Toolkit, 2) Miniconda and 3) PyTorch to get started with Deep Learning using frameworks such as exllamaV2 and torchtune.

©️ All the libraries and information mentioned in this story are open-source and/or publicly available.

Getting Started

Image by Author: Output of the nvidia-smi command on a Linux Machine with 8 Nvidia A10G GPUs

Check the number of GPUs installed in the machine using the nvidia-smi command in the terminal. It should print a list of all the installed GPUs, as shown above. If the list is incomplete or the command does not work, first install the Nvidia drivers for your version of Linux, then confirm that nvidia-smi lists every GPU in your machine.

Follow this page to install Nvidia Drivers if not done already:

How to install the NVIDIA drivers on Ubuntu 22.04 (Source: linuxconfig.org)

Step-1 Install CUDA Toolkit

💡 Check for any existing CUDA folder at /usr/local/cuda-xx; if one exists, a version of CUDA is already installed. If you already have the desired CUDA toolkit installed (check with the nvcc --version command in your terminal), please skip to Step-2.

Check the CUDA version needed for your desired PyTorch library: Start Locally | PyTorch (we are installing CUDA 12.1).

Go to CUDA Toolkit 12.1 Downloads | NVIDIA Developer to obtain Linux commands to install CUDA 12.1 (choose your OS version and the corresponding "deb (local)" installer type).

Options selected for Ubuntu 22 (Source: developer.nvidia.com)

The terminal commands for the base installer will appear according to your chosen options. Copy-paste and run them in your Linux terminal to install the CUDA toolkit. For example, for x86_64 Ubuntu 22, run the following commands by opening the terminal in the downloads folder:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

⚠️ While installing the CUDA toolkit, the installer may prompt for a kernel update. If a pop-up appears in the terminal asking to update the kernel, press esc to cancel it. Do not update the kernel during this stage; it may break your Nvidia drivers ☠️.

Restart the Linux machine after the installation. The nvcc command will still not work; you first need to add the CUDA installation to your PATH. Open the .bashrc file using the nano editor.

nano /home/$USER/.bashrc

Scroll to the bottom of the .bashrc file and add these two lines:

 export PATH="/usr/local/cuda-12.1/bin:$PATH"
 export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"

💡 If you install a different CUDA version in the future, change cuda-12.1 in these two lines to the matching cuda-xx folder, 'xx' being your CUDA version.

Save the changes and close the nano editor:

 To save the changes, press the following on your keyboard:

 ctrl + o             --> save 
 enter or return key  --> accept changes
 ctrl + x             --> close editor

Close and reopen the terminal. Now the nvcc --version command should print the installed CUDA version in your terminal.

Step-2 Install Miniconda

Before we install PyTorch, it is better to install Miniconda and then install PyTorch inside a Conda environment. It is also handy to create a new Conda environment for each project.

Open the terminal in the Downloads folder and run the following commands:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

# initialize conda
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

Close and re-open the terminal. Now the conda command should work.

Step-3 Install PyTorch

(Optional) Create a new conda environment for your project. You can replace <environment-name> with a name of your choice; I usually name it after my project. 💡 Use the conda activate <environment-name> command before working on your project and conda deactivate when you are done.

conda create -n <environment-name> python=3.11

# activate the environment
conda activate <environment-name>

Install the PyTorch library for your CUDA version. The following command is for CUDA 12.1, which we installed (at the time of writing, the default PyPI wheels target CUDA 12.1):

pip3 install torch torchvision torchaudio

The above command is taken from the PyTorch installation guide: Start Locally | PyTorch. If the default wheels target a different CUDA version when you read this, the selector on that page will show the matching install command.

(Source: pytorch.org)

After PyTorch installation, check the number of GPUs visible to PyTorch in the terminal.

python

>>> import torch
>>> print(torch.cuda.device_count())
8

This should print the number of GPUs installed in the system (8 in my case), and it should match the number of GPUs listed by the nvidia-smi command.
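To go a step further, you can print each GPU's name and total memory as a quick sanity check, using standard PyTorch calls:

import torch

# Print the model name and total VRAM of every GPU visible to PyTorch
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")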

Voilà! You are all set to start working on your Deep Learning projects that leverage multiple GPUs 🥳.
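Before diving into larger projects, a quick way to confirm that the GPUs can actually communicate with each other is a tiny distributed all-reduce. Below is a minimal sketch using PyTorch's built-in torch.distributed package with the NCCL backend (the filename ddp_check.py is just an example); launch it with torchrun:

# ddp_check.py - minimal multi-GPU communication check (example filename)
# Launch with: torchrun --nproc_per_node=8 ddp_check.py
# (set --nproc_per_node to your GPU count)
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a tensor of ones; all_reduce sums them in place
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all-reduce result = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If every rank prints your GPU count (8.0 in my case), collective communication across all the GPUs is working.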

What Next? Get started with Deep Learning Projects that leverage your Multi-GPU setup (LLMs)

  1. 🤗 To get started, you can clone a popular model from Hugging Face:

meta-llama/Meta-Llama-3-8B · Hugging Face

  2. 💬 For inference with LLMs, clone and install exllamav2 in a separate environment. It uses all your GPUs for faster inference. (Check my medium page for a detailed tutorial.)

GitHub – turboderp/exllamav2: A fast inference library for running LLMs locally on modern…

  3. 👨‍🏫 For fine-tuning or training, you can clone and install torchtune. Follow the instructions to either fully fine-tune or LoRA fine-tune your models, leveraging all your GPUs. (Check my medium page for a detailed tutorial.)

GitHub – pytorch/torchtune: A Native-PyTorch Library for LLM Fine-tuning

Conclusion

This guide walks you through the machine setup needed for multi-GPU deep learning. You can now start working on any project that leverages multiple GPUs, such as torchtune, for faster development!

Stay tuned for more detailed tutorials on exllamaV2 and torchtune.

