Building our own GPU server for training deep learning models with TensorFlow

This post shares our experience setting up our deep learning server. Thanks to NVIDIA for the two Titan X Pascal cards, and thanks to the Maria de Maeztu Research Program for the machine! 🙂 The text is divided into two parts: bringing the pieces together, and installing TensorFlow. Let’s start!

Part 1: Bringing the pieces together

I have to say that it is not as hard as it seems. Basically: buy as much disk and RAM as you need, and consider the following tips.

Tip when choosing motherboard and CPU: check the number of PCI Express lanes your CPU and motherboard support. Ideally, you want a machine where each PCIe slot gets 16 lanes; for our two GPUs, that means 2x 16 PCIe lanes. However, most machines only support 2x 8 PCIe lanes, which is not a bad deal since, according to Tim Dettmers, having 8 lanes per card should only decrease performance by “0–10%” (for two GPUs). Also note that many machines only allow 1x 16 and 1x 4 PCIe lanes simultaneously, due to the limitations of the CPU or the motherboard, which might compromise the speed of one of the GPUs. For that reason, it is important to check the specifications of both the CPU and the motherboard. +info: link.
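To put the lane counts in perspective, here is a rough sketch of the theoretical one-way bandwidth per slot. It assumes PCIe 3.0 (8 GT/s per lane with 128b/130b encoding), which is our assumption about the hardware generation and not something measured on this machine. The function name is made up for illustration.

```python
# Theoretical one-way PCIe bandwidth per slot (a sketch, not a benchmark).
# Assumption: PCIe 3.0, i.e. 8 GT/s per lane with 128b/130b encoding.
GIGATRANSFERS_PER_SECOND = 8.0
ENCODING_EFFICIENCY = 128.0 / 130.0


def pcie3_bandwidth_gb_per_s(lanes):
    """Return the theoretical one-way bandwidth in GB/s for `lanes` lanes."""
    # Each transfer carries one bit per lane; divide by 8 to get bytes.
    return lanes * GIGATRANSFERS_PER_SECOND * ENCODING_EFFICIENCY / 8.0


if __name__ == "__main__":
    for lanes in (16, 8, 4):
        print("x%d: %.2f GB/s" % (lanes, pcie3_bandwidth_gb_per_s(lanes)))
```

These are upper bounds. How much the narrower slots hurt in practice depends on how often your training loop moves data between the host and the GPU.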

Tip when choosing a power supply: consider buying a 1000W power supply. Depending on how many cards you wish to install in your computer, you might need more or less power. To give you an idea, each GPU consumes approximately 250W (with peaks of 300W), and the remaining components might consume another 250W.
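The arithmetic behind this tip can be sketched in a few lines. The wattages are the rough estimates quoted above, not measurements, and the function names are hypothetical.

```python
# Rough power budget from the figures quoted above (estimates, not measurements).
GPU_PEAK_W = 300          # peak draw per GPU
REST_OF_SYSTEM_W = 250    # CPU, disks, fans and everything else


def peak_draw_w(num_gpus):
    """Estimated worst-case draw in watts for a machine with `num_gpus` GPUs."""
    return num_gpus * GPU_PEAK_W + REST_OF_SYSTEM_W


def headroom_w(psu_w, num_gpus):
    """Watts left over (negative means the power supply is too small)."""
    return psu_w - peak_draw_w(num_gpus)


if __name__ == "__main__":
    for num_gpus in (1, 2, 3, 4):
        print("%d GPU(s): peak %dW, headroom on a 1000W supply: %dW"
              % (num_gpus, peak_draw_w(num_gpus), headroom_w(1000, num_gpus)))
```

With these estimates, two GPUs peak at 850W, which fits a 1000W supply, while three would peak at 1150W.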

Due to the PCIe lane and power supply constraints, we concluded that having more than two GPUs is possibly too much for a single (cheap) machine.

This is our final list of components:

  • 2x Titan X Pascal (GPUs)
  • Intel Core i7 7700 (CPU)
  • Asus Prime Z270-A (motherboard)
  • Power supply of 1000W
  • 32GB RAM
  • 250GB SSD
  • 12TB HDD
  • Big case

Now you just need to assemble the computer!

Part 2: Install GPU accelerated TensorFlow

We will be focusing on how to install TensorFlow, but most of the following steps should be similar for other deep learning frameworks. Before anything else, install your favorite Linux distro on this computer (we installed the latest Ubuntu).

Install nvidia drivers:

$ sudo add-apt-repository ppa:graphics-drivers
$ sudo apt-get update
$ sudo apt-get install nvidia-390

Tip: the number -390 might change. It denotes the driver version, so just pick the latest one available.

We always reboot after installing the nvidia drivers.

Check which version of TensorFlow you wish to install

Check the releases page of the TensorFlow GitHub repository to find out which versions of CUDA and cuDNN are supported by the TensorFlow version you are about to install. Otherwise, you will be in trouble when importing TensorFlow! CUDA and cuDNN are the libraries used to accelerate computations on nvidia GPU cards.

For example, in our case, we will be installing TensorFlow 1.5.0, whose prebuilt binaries are built against CUDA 9 and cuDNN 7. Therefore, we had better download exactly these versions of CUDA and cuDNN, not CUDA 9.1 or cuDNN 6!
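One way to keep yourself honest is to write the requirement down and check the versions you downloaded against it. The following is a minimal sketch: the table holds only the single combination stated above (reading "CUDA 9" as 9.0), and the dictionary and function names are hypothetical.

```python
# Known-good combination taken from the text: TensorFlow 1.5.0 prebuilt
# binaries target CUDA 9.0 and cuDNN 7. For other TensorFlow versions,
# fill in the table from the release notes; nothing else is assumed here.
REQUIREMENTS = {"1.5.0": {"cuda": "9.0", "cudnn": "7"}}


def matches(required, installed):
    """True if `installed` starts with the dotted components of `required`."""
    wanted = required.split(".")
    return installed.split(".")[:len(wanted)] == wanted


def is_compatible(tf_version, cuda_version, cudnn_version):
    """Check downloaded CUDA/cuDNN versions against the table above."""
    required = REQUIREMENTS[tf_version]
    return (matches(required["cuda"], cuda_version)
            and matches(required["cudnn"], cudnn_version))


if __name__ == "__main__":
    print(is_compatible("1.5.0", "9.0.176", "7.0.5.15"))
    print(is_compatible("1.5.0", "9.1.85", "7.0.5.15"))
```

Comparing dotted components instead of raw string prefixes avoids accidentally treating a version such as 9.01 as 9.0.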

Install CUDA

Go to nvidia’s webpage and follow the instructions for downloading and installing. We generally download the .deb file, but other installation options are available. If your TensorFlow version requires an older version of CUDA, click the ‘Legacy releases’ button to download previous versions of CUDA.

Install CUDA (we opted to also install the cuBLAS Patch Update):

$ sudo dpkg -i cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
$ sudo dpkg -i cuda-repo-ubuntu1704-9-0-local-cublas-performance-update_1.0-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-9-0-local-cublas-performance-update-1/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get install cuda

We always reboot after installing CUDA.

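Tip: type nvidia-smi into your shell to check that the driver is working and that it sees all your GPUs. To check the CUDA toolkit itself, run nvcc --version (you may need to add CUDA's bin directory to your PATH first).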
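If you prefer to run this check from a script, nvidia-smi can print a machine-readable list of GPUs. The sketch below parses that list; the helper names are hypothetical, and it assumes nvidia-smi is on your PATH.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: "<name>, <total memory>".
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of (name, memory) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so commas inside a GPU name do not matter.
        name, memory = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, memory))
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed GPU list."""
    return parse_gpu_csv(subprocess.check_output(QUERY, universal_newlines=True))


if __name__ == "__main__":
    try:
        for name, memory in list_gpus():
            print("%s: %s" % (name, memory))
    except (OSError, subprocess.CalledProcessError) as error:
        print("nvidia-smi is not available or failed: %s" % error)
```

On our machine we would expect two entries, one per card. If the list is shorter than the number of cards you installed, check the drivers and the physical connections first.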

Install cuDNN

Download cuDNN from nvidia’s website (you will have to register, but it is very quick). Remember to download the cuDNN version that your TensorFlow version requires, and install it!

$ sudo dpkg -i libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb
$ sudo dpkg -i libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb
$ sudo dpkg -i libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb

Note that we install cuDNN from .deb files, but some others prefer installing it from a .tgz.

Install TensorFlow

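It’s very simple. Note that this installs the latest release; to get the exact version you checked above, append it to the package name (e.g., tensorflow-gpu==1.5.0):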

$ pip install tensorflow-gpu

If all went well, you should now be able to import tensorflow from your python shell!
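A successful import does not guarantee that TensorFlow actually found the GPUs. As a quick sanity check, you can list the devices TensorFlow sees. This is a sketch: gpu_names is a hypothetical helper, while device_lib.list_local_devices is the TensorFlow call doing the real work.

```python
def gpu_names(devices):
    """Return the names of all devices whose type is "GPU"."""
    return [device.name for device in devices if device.device_type == "GPU"]


if __name__ == "__main__":
    try:
        # Importing here keeps the helper above usable without TensorFlow.
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not installed in this environment.")
    else:
        names = gpu_names(device_lib.list_local_devices())
        print("TensorFlow sees %d GPU(s): %s" % (len(names), names))
```

If the list comes back empty, the usual suspects are a CPU-only TensorFlow package or a CUDA/cuDNN version mismatch.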

The mnist examples from the TensorFlow repository are very handy for trying out your GPUs. Just download one example and run it:

$ CUDA_VISIBLE_DEVICES=0 python mnist_softmax.py
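The CUDA_VISIBLE_DEVICES variable restricts which GPUs a process can see, so the command above runs the example on the first card only. The same can be done from inside a Python script, as long as the variable is set before TensorFlow is imported. A minimal sketch, where select_gpus is a hypothetical helper:

```python
import os


def select_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA.

    Call this before importing TensorFlow; an empty list hides all GPUs.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)


# Use only the first GPU, like CUDA_VISIBLE_DEVICES=0 on the command line.
select_gpus([0])

# import tensorflow as tf  # import only after the variable is set
```

This is handy when two people share the machine: each one can pin their job to a different card.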

+ Some additional resources: How to build a deep learning server from scratch? or the TensorFlow installation guide.
+ You could also visit Exxact for their NVIDIA deep learning GPU solutions. They provide deep learning GPU workstations and servers pre-installed with the latest deep learning platforms such as Caffe, TensorFlow, Torch, and more.

Bonus track: I just want this computer as a server!

Install OpenSSH server to access the machine via ssh:

$ sudo apt-get install openssh-server

And disable your graphical interface (an easy way to keep Xorg from running on your GPUs):

$ sudo systemctl isolate multi-user.target

Re-enable it with:

$ sudo systemctl isolate graphical.target

You can disable it permanently by running the following command:

$ sudo systemctl set-default multi-user.target

Re-enable it with:

$ sudo systemctl set-default graphical.target