Feature Request: NVIDIA Multi-Instance GPU (MIG) Support

We have some NVIDIA A100 GPUs (80 GB) and we would like to use the MIG support of those cards to accelerate CryoSPARC Live workloads by allowing more concurrent jobs.

Unfortunately, CryoSPARC does not currently support MIG because it uses CUDA to discover the available MIG instances, and CUDA can only enumerate a single MIG instance (typically the first one). Note: specific MIG instances can be targeted by setting the CUDA_VISIBLE_DEVICES environment variable to the MIG ID (i.e. MIG-xxxx).
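
For illustration, targeting a single MIG instance from a shell looks roughly like this (the UUID and the script name are placeholders; real IDs come from nvidia-smi, see the enumeration sketch further below):

# Expose exactly one MIG instance to CUDA-based applications started from this shell.
export CUDA_VISIBLE_DEVICES=MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
# Any CUDA process launched now enumerates a single device backed by that MIG slice.
python some_cuda_workload.py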

As far as we can tell from looking into this, CryoSPARC needs two changes to support MIG:

1.) If MIG support is enabled on the worker node, enumerate the MIG instances a different way (nvidia-smi instead of CUDA/numba), or alternatively allow the operator to statically define the list of MIG instances (see the sketch after this list). Currently, CryoSPARC stores GPUs in the database with a numeric slot ID; this would have to be extended so the MIG ID can be stored as a string.
2.) When MIG support is in use, set CUDA_VISIBLE_DEVICES to the ID of the selected MIG instance. The CUDA application itself should not require any changes, as it would simply discover the MIG instance exposed via the CUDA_VISIBLE_DEVICES environment variable.
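
A minimal sketch of the nvidia-smi-based enumeration proposed in 1.), assuming nvidia-smi is available on the worker (the commented output is illustrative, not from a real host):

# List physical GPUs and their MIG instances; each MIG entry carries a MIG-<UUID> identifier.
nvidia-smi -L
#   GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-...)
#     MIG 3g.40gb  Device 0: (UUID: MIG-...)
#     MIG 2g.20gb  Device 1: (UUID: MIG-...)

# Extract just the MIG IDs, e.g. to populate a static list of schedulable "GPU" slots.
nvidia-smi -L | grep -o 'MIG-[^)]*'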

One thing to keep in mind is that a MIG instance only supports a single job/process at a time, so this would have to be taken into account when scheduling jobs.

@timeu For CryoSPARC Live (and non-Live CryoSPARC jobs that use no more than one GPU), you may want to try

  1. configuring the host(s) with MIG-partitioned A100 devices as compute nodes of a cluster that is controlled by a workload manager that supports, and is configured for, NVIDIA MIG and cgroup-based resource isolation, such as SLURM (a rough configuration sketch follows this list).
  2. connecting the suitably configured cluster to CryoSPARC.
  3. queuing CryoSPARC Live workloads or single-GPU CryoSPARC jobs to the cluster lane.
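
As a rough sketch of the SLURM side, the pieces involved look like this (the option names are real SLURM settings, but the node name, profile names and counts are illustrative and site-specific):

# slurm.conf (excerpt): declare the GPU GRES and the node's MIG slices as typed GPUs
GresTypes=gpu
NodeName=gpu-node01 Gres=gpu:3g.40gb:1,gpu:2g.20gb:2

# cgroup.conf: confine each job to the devices it was actually allocated
ConstrainDevices=yes

# gres.conf: either let SLURM autodetect MIG devices via NVML, or list them explicitly
AutoDetect=nvml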

Please can you update this forum topic with your findings.

Hello. Having implemented your suggestions, I would like to share some of our findings.

  • We set up a SLURM cluster on which we configured the MIG devices (SLURM's 'MultipleFiles' GRES config did not work as expected, but nvidia/hpc/slurm-mig-discovery on GitLab led to success; a job-request sketch follows this list)
  • We were able to set up and connect the cluster without much issue, and the system has been running smoothly since :slight_smile:
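
Once the MIG slices are exposed as typed GRES, a batch job can request a specific slice. A minimal sketch, assuming the profile names from our layout (2g.20gb / 3g.40gb); the job name and resource values are arbitrary:

#!/usr/bin/env bash
#SBATCH --job-name=mig-test
#SBATCH --gres=gpu:2g.20gb:1      # request one 2g.20gb MIG slice as a typed GRES
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# With cgroup device isolation, only the allocated MIG instance is visible here.
nvidia-smi -L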

One thing we were not very happy with (although it is not directly related to this topic) is the cryosparcm cluster validate utility, which kept failing during our initial testing. The issue, it seems, was that the validation failed to properly populate the j2 templates in the submission script, while (as we later found out) everything worked fine on the real system. For example, the SLURM example submit script would render as follows:

#!/usr/bin/env bash

#SBATCH --job-name cryosparc__
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:0
#SBATCH --mem=0G
#SBATCH --comment="created by "
#SBATCH --output=/__slurm.out
#SBATCH --error=/__slurm.err

[...]

We had assumed that the validation utility would provide default values for all basic variables, but it seems defaults must be provided throughout the whole file. We did not look further into this, however.
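
For anyone hitting the same issue, one workaround to consider (we have not verified it in depth) is to give every template variable an explicit Jinja2 default() in cluster_script.sh, so validation never renders an empty value. The variable names below follow the stock CryoSPARC SLURM example and the default values are arbitrary:

#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ project_uid|default("PX") }}_{{ job_uid|default("JX") }}
#SBATCH --cpus-per-task={{ num_cpu|default(1) }}
#SBATCH --gres=gpu:{{ num_gpu|default(1) }}
#SBATCH --mem={{ ram_gb|default(8)|int }}G
#SBATCH --comment="created by {{ cryosparc_username|default("unknown") }}"
#SBATCH --output={{ job_dir_abs|default("/tmp") }}/{{ project_uid|default("PX") }}_{{ job_uid|default("JX") }}_slurm.out
#SBATCH --error={{ job_dir_abs|default("/tmp") }}/{{ project_uid|default("PX") }}_{{ job_uid|default("JX") }}_slurm.err

{{ run_cmd|default("echo validation-only run") }}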

If there are further questions regarding the way we set this up, I will gladly do my best to answer.


Thanks for this update. Please can you describe the configuration and capabilities of your current setup:

  • SLURM version and installation method
  • in case you compiled SLURM yourself, the version of any CUDA libraries used
  • nvidia driver version
  • have you attempted to run any multi-GPU CryoSPARC jobs

  • SLURM 24.05.4, deployed using Ansible following the official installation instructions
  • we built from RPM
  • Driver Version: 565.57.01, CUDA Version: 12.7
  • We have not yet run multi-GPU jobs, but plan to test this in the near future. I will update you once we have some data here.

Quick update on the multi-GPU job:
Attached to a Live session, we have a 2D Classification job running right now that is using 3 GPUs. It is using multiple MIG devices in the SLURM job, and nvidia-smi shows the job's process on all 3 devices.

Thanks @ebirn. Are the 3 "GPU"s all MIG “sub-devices” on the same “physical” GPU device? Would you be willing to share nvidia-smi output from the GPU host showing

  • device model
  • MIG configuration

Sure. In this case it was 2 physical devices, with 1+2 MIGs:

$ nvidia-smi
Fri Dec 13 16:22:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:12:00.0 Off |                    0 |
|  0%   55C    P0             84W /  300W |     281MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  |   00000000:49:00.0 Off |                   On |
| N/A   48C    P0             67W /  300W |     567MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          On  |   00000000:89:00.0 Off |                   On |
| N/A   56C    P0             81W /  300W |     563MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A40                     On  |   00000000:C2:00.0 Off |                    0 |
|  0%   56C    P0             91W /  300W |     281MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  1    2   0   0  |             229MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 2MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    3   0   1  |             170MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 2MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    4   0   2  |             170MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 2MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    2   0   0  |             229MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 2MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    3   0   1  |             170MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 2MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    4   0   2  |             166MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 2MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    385126      C   python                                        272MiB |
|    1    2    0     385965      C   python                                        180MiB |
|    1    3    0     385123      C   python                                        136MiB |
|    1    4    0     385120      C   python                                        136MiB |
|    2    2    0     385965      C   python                                        180MiB |
|    2    3    0     385118      C   python                                        136MiB |
|    2    4    0     385965      C   python                                        130MiB |
|    3   N/A  N/A    385117      C   python                                        272MiB |
+-----------------------------------------------------------------------------------------+
# The 2D class job is on physical IDs 1,2 
$ nvidia-smi | grep 385965
|    1    2    0     385965      C   python                                        180MiB |
|    2    2    0     385965      C   python                                        180MiB |
|    2    4    0     385965      C   python                                        130MiB |

We use nvidia-mig-manager (GitHub: NVIDIA/mig-parted, the MIG Partition Editor for NVIDIA GPUs) to restore the MIG profiles after reboot; the active config profile is shown below. The machine has 2x A100 and 2x A40 GPUs, and only the A100s are MIG-capable:

  krios:
    - devices: [1,2]
      mig-enabled: true
      mig-devices:
        "2g.20gb": 2
        "3g.40gb": 1