Multi-Instance GPU (MIG) support (NVIDIA A100)

tarek · August 13, 2021, 12:03pm

Hi all,

I am wondering if anyone has experience running CryoSPARC on virtual GPUs as provided by MIG.
So far I have not been able to get CryoSPARC to communicate with the MIG instances.

Best,
Tarek

qitsweauca · September 13, 2021, 2:12pm

Hi @tarek

Have you assigned a unique ID to each MIG device using its UUID,
and then added the environment variables?

Cheers,
qitsweauca

tarek · September 13, 2021, 6:12pm

Yes.

user@gpu:~$ nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-90134985-26d8-39db-81bd-61f9a31864fe)
  MIG 2g.10gb Device 0: (UUID: MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/3/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/4/0)
  MIG 2g.10gb Device 2: (UUID: MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/5/0)
GPU 1: A100-PCIE-40GB (UUID: GPU-0d407aac-ca34-f203-ab5b-b57b784d5074)
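
For anyone who wants to collect the MIG UUIDs from a listing like the one above without copying them by hand, here is a minimal Python sketch. The function name `parse_mig_devices` and the regular expression are illustrative assumptions, not part of CryoSPARC or any NVIDIA tool, and the pattern is written against the output format shown in this post only; other driver versions may print the listing differently.

```python
import re

# Sample text copied from the `nvidia-smi -L` listing in this thread.
SAMPLE = """\
GPU 0: A100-PCIE-40GB (UUID: GPU-90134985-26d8-39db-81bd-61f9a31864fe)
  MIG 2g.10gb Device 0: (UUID: MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/3/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/4/0)
  MIG 2g.10gb Device 2: (UUID: MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/5/0)
GPU 1: A100-PCIE-40GB (UUID: GPU-0d407aac-ca34-f203-ab5b-b57b784d5074)
"""

# Hypothetical pattern: matches only the indented "MIG <profile> Device <n>" lines.
MIG_LINE = re.compile(
    r"^\s+MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s+(\S+)\)\s*$"
)


def parse_mig_devices(listing):
    """Return one dict per MIG device found in an `nvidia-smi -L` style listing."""
    devices = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            profile, index, uuid = match.groups()
            devices.append({"profile": profile, "index": int(index), "uuid": uuid})
    return devices


if __name__ == "__main__":
    for device in parse_mig_devices(SAMPLE):
        print(device["index"], device["profile"], device["uuid"])
```

On a live machine the `SAMPLE` string could be replaced by the captured output of `nvidia-smi -L`; the non-MIG `GPU 1` line is skipped because it does not match the pattern.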

And here is what `cryosparcw gpulist` reports with all three instances listed in CUDA_VISIBLE_DEVICES:

user@gpu:~$ CUDA_VISIBLE_DEVICES=MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/3/0,MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/4/0,MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/5/0 /srv/public/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
  Detected 1 CUDA devices.

   id           pci-bus  name
   ---------------------------------------------------------------
       0      0000:01:00.0  A100-PCIE-40GB MIG 2g.10gb
   ---------------------------------------------------------------

Am I missing something?

tarek · September 30, 2021, 6:50am

I just found out that this is a current limitation of CUDA 11: a CUDA application can only enumerate a single MIG instance. It seems we will have to wait for future CUDA releases. From the NVIDIA MIG User Guide:

### [CUDA Device Enumeration](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-visible-devices)

MIG supports running CUDA applications by specifying the CUDA device on which the application should be run. With CUDA 11, only enumeration of a single MIG instance is supported.

CUDA applications treat a CI and its parent GI as a single CUDA device. CUDA is limited to use a single CI and will pick the first one available if several of them are visible. To summarize, there are two constraints:

1. CUDA can only enumerate a single compute instance
2. CUDA will not enumerate non-MIG GPU if any compute instance is enumerated on any other GPU

Note that these constraints may be relaxed in future NVIDIA driver releases for MIG.
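
Since the constraints quoted above apply per CUDA application, one pattern that stays within them is to start a separate process for each MIG instance, with each process seeing exactly one UUID in `CUDA_VISIBLE_DEVICES`. The Python sketch below illustrates only that pattern. The helper names `env_for_instance` and `run_per_instance` are hypothetical, the probe command is a placeholder that just echoes the variable, and nothing in this thread confirms that CryoSPARC can register instances this way.

```python
import os
import subprocess
import sys


def env_for_instance(mig_uuid, base_env=None):
    """Copy the environment and expose exactly one MIG instance to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def run_per_instance(mig_uuids, command):
    """Run `command` once per MIG UUID (sequentially) and collect its stdout."""
    results = {}
    for uuid in mig_uuids:
        proc = subprocess.run(
            command,
            env=env_for_instance(uuid),
            capture_output=True,
            text=True,
            check=True,
        )
        results[uuid] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # UUIDs copied from the `nvidia-smi -L` output earlier in this thread.
    uuids = [
        "MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/3/0",
        "MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/4/0",
        "MIG-GPU-90134985-26d8-39db-81bd-61f9a31864fe/5/0",
    ]
    # Placeholder command: prints the variable as seen by each child process.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    for uuid, seen in run_per_instance(uuids, probe).items():
        print(uuid, "->", seen)
```

The processes run one after another here to keep the sketch short; a real launcher would more likely use `subprocess.Popen` so that the instances work in parallel.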

michael.gajhede · November 14, 2023, 3:14pm

Has anyone gotten MIG to work with CUDA 12?
All the best,
Michael