We have some NVIDIA A100 GPUs (80GB) and we would like to use the MIG support of those cards to accelerate CryoSPARC Live workloads by allowing more concurrent jobs.
Unfortunately, CryoSPARC does not currently support MIG, because it uses CUDA to discover the available devices, and CUDA can currently enumerate only a single MIG instance (typically the first one). Note: Specific MIG instances can be targeted by setting the CUDA_VISIBLE_DEVICES environment variable to the MIG ID (e.g. MIG-xxxx).
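To illustrate the note about CUDA_VISIBLE_DEVICES, here is a minimal Python sketch of how a launcher could pin a child process to one MIG instance. The helper names `mig_env` and `launch_on_mig` are hypothetical and are not part of CryoSPARC; `MIG-xxxx` is the same placeholder as above.

```python
import os
import subprocess


def mig_env(mig_uuid, base_env=None):
    """Return a copy of the environment that exposes only one MIG instance."""
    # Copy so the parent process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_on_mig(cmd, mig_uuid):
    """Start cmd (a list of arguments) with only the given MIG instance visible."""
    return subprocess.Popen(cmd, env=mig_env(mig_uuid))
```

For example, `launch_on_mig(["python", "job.py"], "MIG-xxxx")` would start a (hypothetical) `job.py` that sees only that instance as its CUDA device.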
As far as we can tell from looking into this, CryoSPARC needs two changes to support MIG:
1.) If MIG is enabled on the worker node, use a different method to enumerate the MIG instances (nvidia-smi instead of CUDA/numba), or alternatively allow the operator to statically define the list of MIG instances. Currently CryoSPARC stores the GPUs with a numeric slot ID in the database. This would have to be extended so that the MIG ID can be stored as a string.
2.) When MIG is in use, set CUDA_VISIBLE_DEVICES to the ID of the selected MIG instance. The CUDA application should not require any changes, as it would discover the MIG instance that was enabled via the CUDA_VISIBLE_DEVICES environment variable.
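As a rough sketch of change 1, the MIG IDs could be collected by parsing the output of `nvidia-smi -L`, with an optional operator-supplied list taking precedence. This assumes the listing prints one line per MIG device containing `(UUID: MIG-...)`; the exact format may differ between driver versions, and the function names below are hypothetical, not CryoSPARC code.

```python
import re
import subprocess

# Assumed line shape: "  MIG 3g.40gb     Device  0: (UUID: MIG-...)"
_MIG_LINE = re.compile(
    r"^\s*MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s*(MIG-[^)\s]+)\)"
)


def parse_mig_uuids(listing):
    """Extract MIG UUID strings from the text printed by `nvidia-smi -L`."""
    uuids = []
    for line in listing.splitlines():
        match = _MIG_LINE.match(line)
        if match:
            uuids.append(match.group(3))
    return uuids


def discover_mig_instances(static_list=None):
    """Return MIG IDs as strings, preferring an operator-defined list."""
    if static_list:
        return list(static_list)
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_mig_uuids(result.stdout)
```

The returned strings are what would need to be stored in place of the numeric slot ID, and what would later be passed to CUDA_VISIBLE_DEVICES.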
One thing to keep in mind is that MIG supports only a single job/process per MIG instance, so this would have to be taken into account when scheduling jobs.
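If each MIG instance is to run at most one job at a time, the scheduler could treat every MIG ID as an exclusive slot. The following is only a sketch of that idea; `MigSlotPool` is a hypothetical name and does not reflect how the CryoSPARC scheduler is actually implemented.

```python
class MigSlotPool:
    """Tracks which job, if any, currently owns each MIG instance."""

    def __init__(self, mig_ids):
        # Map each MIG ID (a string) to the owning job ID, or None if free.
        self._owner = {mig_id: None for mig_id in mig_ids}

    def acquire(self, job_id):
        """Reserve a free MIG instance for job_id; return its ID or None."""
        for mig_id, owner in self._owner.items():
            if owner is None:
                self._owner[mig_id] = job_id
                return mig_id
        return None  # every instance is busy, so the job has to wait

    def release(self, mig_id):
        """Mark a MIG instance as free again once its job has finished."""
        self._owner[mig_id] = None
```

A job would only be started when `acquire` returns an ID, and that ID would then be used as the value of CUDA_VISIBLE_DEVICES for the job's process.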