GPU workstation configuration

Hi All,

We are about to buy a new workstation for SPA data processing. If we would like to configure, for example, an 8-GPU system, what is considered more optimal, both in general and for processing multiple jobs on different projects at the same time:

  1. Consolidate the 8 GPUs into one 4U server
    or
  2. Distribute the system as one simple master node plus 2 x 4-GPU workers, or even 4 x 2-GPU workers.

If it’s better to distribute across multiple nodes, is a low-latency network such as InfiniBand necessary to connect them, or is 10/25/40Gb Ethernet sufficient?

Many thanks!

I would plan this all based on your filesystem. For instance, are you going with a network-attached filesystem? If you are, I’d distribute the GPUs across different machines. You’ll have a bit more resilience to crashes (if a job crashes it won’t take down the master). You’ll also get slightly better scaling in many configurations - 8 GPUs in one machine is really tough from a memory bandwidth point of view.

As far as interconnects go - again, this probably comes down to the filesystem more than anything. 40 Gb/s Ethernet is roughly 5 GB/s of throughput. Can your filesystem sustain 5 GB/s? How about to multiple clients simultaneously? Are you caching on each GPU node, or are you going to try to get the network-attached filesystem fast enough that you don’t need a cache?
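
For a quick sanity check on those numbers, here is a minimal Python sketch (my own illustration, not from this thread) that converts a nominal link speed to approximate usable GB/s and does a crude sequential read against a test file to see what the filesystem actually delivers. The file path, test-file size, and overhead factor are placeholder assumptions; a dedicated tool like fio will give you a more rigorous answer.

```python
import time

def link_speed_gbps_to_gbps_bytes(gbps: float, protocol_overhead: float = 0.95) -> float:
    """Convert a nominal link speed in Gb/s to approximate usable GB/s.

    The 0.95 overhead factor is a rough placeholder for protocol overhead.
    """
    return gbps / 8.0 * protocol_overhead

def sequential_read_gbytes_per_sec(path: str, block_size: int = 64 * 1024 * 1024) -> float:
    """Read a large file once and report average throughput in GB/s.

    Note: the OS page cache can inflate this number; use a test file larger
    than RAM (or drop caches first) for a more honest measurement.
    """
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9

if __name__ == "__main__":
    for gbps in (10, 25, 40):
        print(f"{gbps} GbE usable: ~{link_speed_gbps_to_gbps_bytes(gbps):.1f} GB/s")
    # Hypothetical path to a large test file on the shared filesystem:
    # print(f"Measured: {sequential_read_gbytes_per_sec('/scratch/test_file'):.2f} GB/s")
```

Run the same read test from several worker nodes at once to see how the aggregate holds up, since that is the scenario multiple concurrent jobs will create.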