Feature Request: NVIDIA Multi-Instance GPU (MIG) Support

We have some NVIDIA A100 GPUs (80 GB) and would like to use the MIG support of those cards to accelerate CryoSPARC Live workloads by allowing more concurrent jobs.

Unfortunately, CryoSPARC currently doesn't support MIG because it uses CUDA to discover the available devices, and CUDA can only enumerate a single MIG instance (typically the first one). Note: a specific MIG instance can be targeted by setting the CUDA_VISIBLE_DEVICES environment variable to that instance's MIG ID (i.e. MIG-xxxx).
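As a minimal sketch of that note (the MIG UUID below is a hypothetical placeholder, and numba is used only because CryoSPARC's GPU discovery goes through numba/CUDA):

```python
import os

# Hypothetical placeholder; list the real MIG UUIDs with `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Import numba's CUDA bindings only after setting the variable, so it is
# in place before the first CUDA call creates a context.
from numba import cuda

cuda.detect()  # should report a single device: the selected MIG instance
```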

As far as we can tell from looking into this, Cryosparc needs 2 changes to support MIG:

1.) If MIG support is enabled on the worker node, use a different way to enumerate the MIG instances (nvidia-smi instead of cuda/numba), or alternatively allow the operator to statically define the list of MIG instances. Currently CryoSPARC stores the GPUs of a worker with a numeric slot ID in the database; this would have to be extended so that the MIG ID can be stored as a string (a sketch of both steps follows this list).
2.) When MIG support is in use, set CUDA_VISIBLE_DEVICES to the ID of the selected MIG instance. The CUDA application itself should not require any changes, as it would simply discover the MIG instance exposed via the CUDA_VISIBLE_DEVICES environment variable.
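A rough sketch of what the enumeration in 1.) could look like, assuming `nvidia-smi -L` is available on the worker (the regex just pulls the MIG UUIDs out of its listing; this is not CryoSPARC code):

```python
import re
import subprocess

def list_mig_uuids() -> list[str]:
    """Return the MIG device UUIDs reported by `nvidia-smi -L`."""
    out = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    # MIG device lines look like:
    #   MIG 1g.10gb Device 0: (UUID: MIG-xxxxxxxx-...)
    return re.findall(r"\(UUID:\s*(MIG-[^)]+)\)", out)

if __name__ == "__main__":
    for uuid in list_mig_uuids():
        print(uuid)  # each UUID could then be stored and used for CUDA_VISIBLE_DEVICES
```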

One thing to keep in mind is that a MIG instance only supports a single job/process at a time, so this would have to be taken into account when scheduling jobs.
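This is not how CryoSPARC's scheduler works internally; it is only a hypothetical illustration of the bookkeeping that constraint implies (at most one outstanding job per MIG UUID):

```python
class MigPool:
    """Hypothetical bookkeeping: at most one job per MIG instance."""

    def __init__(self, mig_uuids: list[str]):
        self.busy: dict[str, str] = {}   # MIG UUID -> job id
        self.free: set[str] = set(mig_uuids)

    def acquire(self, job_id: str) -> str | None:
        """Hand out a free MIG instance, or None if all are occupied."""
        if not self.free:
            return None
        uuid = self.free.pop()
        self.busy[uuid] = job_id
        return uuid

    def release(self, uuid: str) -> None:
        """Mark a MIG instance as free again once its job finishes."""
        self.busy.pop(uuid, None)
        self.free.add(uuid)
```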

@timeu For CryoSPARC Live (and non-Live CryoSPARC jobs that use no more than one GPU), you may want to try

  1. configuring the host(s) with MIG-partitioned A100 devices as compute nodes of a cluster controlled by a workload manager, such as Slurm, that supports and is configured for NVIDIA MIG and cgroup-based resource isolation.
  2. connecting the suitably configured cluster to CryoSPARC.
  3. queuing CryoSPARC Live workloads or single-GPU CryoSPARC jobs to the cluster lane (a quick verification sketch follows this list).
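As a quick sanity check once the cluster lane is set up (a standalone script, not part of CryoSPARC), a job submitted through the workload manager can confirm that it was handed exactly one MIG instance:

```python
import os

from numba import cuda

# A workload manager configured for MIG (e.g. Slurm with MIG GRES and
# cgroup-based device isolation) is expected to hand each job a single
# MIG instance via CUDA_VISIBLE_DEVICES.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

cuda.detect()  # should find exactly one device inside the job
```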

Please update this forum topic with your findings.