How to set up a GPU VM with Ubuntu 18 on Oracle Cloud

Luigi Saetta
Towards Data Science
6 min read · Oct 1, 2019


Introduction

If you want to develop a Deep Learning model with a good chance of success, you need two things:

  • lots of data, for the training of the model
  • enough computational power, in the form of GPUs

A Deep Learning model is commonly based on a deep (many-layered) neural network. The algorithm used for training is usually some form of “back-propagation”, which requires many “tensor” operations. For this kind of operation, a GPU is much more effective than a CPU, since it can execute with a degree of parallelism that you can’t achieve even with a modern CPU. As an example, an Nvidia P100 has 3584 cores and is capable of 5.3 TeraFLOPS.
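To make the point about tensor operations concrete, here is a minimal, CPU-only NumPy sketch (illustrative only; the layer sizes are made up) of the kind of matrix multiplication that dominates a dense layer’s forward pass, and that a GPU parallelizes far better than a CPU:

```python
import numpy as np

# A dense layer's forward pass is essentially one matrix multiplication:
# activations (batch x inputs) times weights (inputs x outputs).
batch, n_in, n_out = 64, 1024, 512
x = np.random.rand(batch, n_in).astype(np.float32)
w = np.random.rand(n_in, n_out).astype(np.float32)

# Back-propagation repeats operations like this, at scale, for every layer
# and every batch -- which is why GPU parallelism matters so much.
y = x @ w
print(y.shape)  # (64, 512)
```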

As a result, if you adopt modern GPUs you will be able to reduce the time needed to train your model by at least one order of magnitude, compared to the time needed using CPUs.

But GPUs are still expensive. Moreover, development in this area is fast, and your GPU can quickly become outdated.

Therefore, a Cloud-based environment, where you pay only for what you use, is nowadays the best option. As an example, on Oracle Cloud you can use a VM with one Nvidia P100 (16 GB), 12 OCPUs and 72 GB of RAM (the VM.GPU2.1 shape) for about $30/day. And you can get shapes with more GPUs if needed.

Having said that, correctly setting up the environment for TensorFlow is not the simplest of tasks, and you risk not being able to make full use of your GPU. I did some research online and found that the documentation is not perfect. For this reason, I decided to write this note on how to set up an Ubuntu 18 environment for TensorFlow with a GPU.

The environment.

As I said, I’m focusing on setting up a VM in Oracle Cloud Infrastructure (OCI) and I want to use these components:

  • Ubuntu 18.04 LTS
  • Anaconda Python distribution
  • TensorFlow
  • Jupyter Notebook

What is a little bit complicated is the correct alignment between the OS libraries, the Nvidia drivers for the GPU, the CUDA toolkit version and the TensorFlow version. If all these things are not correctly aligned, you risk ending up with an environment where your TensorFlow program runs correctly, but without using the GPU and with much longer execution times. Not exactly what you want.

I have documented here the simplest series of steps I have found so far. Honestly, I started writing this as a personal note; then I decided it was probably worth sharing.

The VM.

From the OCI Console, the settings I have chosen for the creation of the VM are the following:

  • Shape: VM.GPU2.1 (1 GPU, 12 OCPU)
  • OS: Ubuntu 18.04
  • Public IP: Yes
  • Boot volume: 100 GB of disk space

and I have added a public key, to be able to connect remotely using ssh.

The creation of the VM is fast. Good.

VM OS setup.

The first thing to do is to update the list of packages available:

sudo apt update

Then, since we want to use Jupyter Notebook, we need to open port 8888 in the on-board firewall. The recommended way is the following:

sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 8888 -j ACCEPT
sudo service netfilter-persistent save

Do not use ufw: you risk wiping some settings needed to connect the VM to the storage.

After this, we need to add an inbound network security rule, to enable connection to port 8888 from the browser:

  1. Log into the OCI console.
  2. Under Core Infrastructure, go to Networking and then Virtual Cloud Networks.
  3. Select the right cloud network.
  4. Under Resources, click on Security Lists and then the security list that you’re interested in.
  5. Under Resources, click Ingress Rules and then Add Ingress Rule.
  6. Enter 0.0.0.0/0 for the Source CIDR, TCP for the IP Protocol, and 8888 for the destination port.
  7. At the bottom of the page, click Add Ingress Rules.

At this point, we need to install the right Nvidia drivers. This is a crucial step, on which I lost some time. The easiest way I have found, for Ubuntu 18, is the following:

sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall

After the (mandatory) reboot, you can check that the drivers are correctly installed using the following command (sample output shown); the output also reports the drivers’ version.

nvidia-smi

Mon Sep 30 13:34:03 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    24W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The “nvidia-smi” command will also be useful later, to check that the GPU is actually used during the training process.

SW installation.

The next step is to install the Anaconda Python distribution.

At the moment of writing, the “July 2019” version is the latest available, but you should check on the Anaconda site.

wget https://repo.continuum.io/archive/Anaconda3-2019.07-Linux-x86_64.sh
bash Anaconda3-2019.07-Linux-x86_64.sh -b
echo -e 'export PATH="$HOME/anaconda3/bin:$PATH"' >> $HOME/.bashrc
source ~/.bashrc

It is worthwhile to do a final update of the Anaconda distribution, with the commands:

conda update -n base -c defaults conda
conda init bash
source ~/.bashrc

At this point, I get version 4.7. The last two commands enable the use of the “conda activate” command.

Next, we need to create a new “conda” virtual environment, which we’re going to call “gpu”:

conda create --name gpu
conda activate gpu

Then, we can install Python in the newly created “gpu” env:

conda install python

The next important step is the TensorFlow installation. Here, we want to install the version supporting GPUs. It is important to use conda, and not pip, for the installation, since conda ensures that all dependencies are correctly installed and met.

conda install tensorflow-gpu

Next, Jupyter installation:

conda install nb_conda
jupyter notebook --generate-config

After that, you need to add the following lines to Jupyter’s configuration, in the /home/ubuntu/.jupyter/jupyter_notebook_config.py file:

c = get_config()
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888

In this way, you enable connections to Jupyter from outside the VM (from any IP address).

Then, since you want connections only through SSL, you can generate a self-signed certificate using OpenSSL:

openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout jupyter-key.key -out jupyter-cert.pem

After all these steps, you can start Jupyter using the following command

jupyter notebook --certfile=jupyter-cert.pem --keyfile=jupyter-key.key

(Optionally, you could prefix the command with nohup, and append &, to keep it running after you log out.)

The browser will give you a security warning, because the certificate is self-signed; since you know what you’re doing, you can continue right away.

Test your GPU.

Now, it is finally time to check that everything is OK. Before starting to train a complex Deep Learning model, we want to be sure that TensorFlow is going to use the GPU. So, we test using TensorFlow.

In a Notebook cell type:

import tensorflow as tf

with tf.Session() as sess:
    devices = sess.list_devices()
devices

In the output, you should see the GPU in the list of available devices.

Then, as a final check (I found it on StackOverflow):

import tensorflow as tf

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

You want to see the output of the first “print” statement.

As final checks:

  • have a look at the logs produced by Jupyter;
  • run your model and, at the same time, execute the “nvidia-smi” command; you should see a “Volatile GPU-Util” greater than 0.
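If you want to automate that last check, here is a small sketch (a hypothetical helper of my own, assuming the Nvidia drivers are installed so that nvidia-smi is on the PATH) that uses nvidia-smi’s CSV query mode:

```python
import subprocess

def parse_gpu_utilization(csv_output):
    # With --format=csv,noheader,nounits, nvidia-smi prints one bare
    # percentage per GPU, one per line, e.g. "87"
    return [int(line.strip()) for line in csv_output.strip().splitlines()]

def gpu_utilization():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_gpu_utilization(out)

if __name__ == "__main__":
    # While a training run is active, the values should be greater than 0
    print(gpu_utilization())
```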

Conclusion.

If you want to experiment seriously in the Deep Learning field, you need GPUs. There are several ways to set up a Linux environment with a GPU: in this note, I have documented how to set up an environment with the latest LTS Ubuntu version, on Oracle OCI.

I have also documented how to make sure that TensorFlow is actually using the GPU.


Born in the wonderful city of Naples, but living in Rome. Always curious about new technologies and new things, especially in the AI field.