We have some NVIDIA A100 GPUs (80 GB) and would like to use the cards' MIG support to accelerate CryoSPARC Live workloads by allowing more concurrent jobs.
Unfortunately, CryoSPARC currently does not support MIG because it uses CUDA to discover the available devices, and CUDA can only enumerate a single MIG instance (typically the first one). Note: a specific MIG instance can be targeted by setting the CUDA_VISIBLE_DEVICES env variable to the MIG ID (i.e. MIG-xxxx).
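To illustrate the targeting mechanism: a launcher can pin a child CUDA process to one MIG instance purely through its environment. The MIG UUID below is a placeholder, not a real device; real IDs come from `nvidia-smi -L`.

```python
import os
import subprocess
import sys

# Hypothetical placeholder; a real value would be taken from `nvidia-smi -L`.
mig_id = "MIG-00000000-aaaa-bbbb-cccc-000000000000"

# Copy the parent environment and restrict CUDA visibility to one MIG slice.
env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = mig_id

# The child process (a CUDA job in practice) sees only that MIG instance,
# which it enumerates as device 0. Here we just echo the variable back.
proc = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env, capture_output=True, text=True,
)
print(proc.stdout.strip())  # prints the MIG ID the child was pinned to
```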
As far as we can tell from looking into this, CryoSPARC needs two changes to support MIG:
1.) If MIG support is enabled on the worker node, use a different way to enumerate the MIG instances (nvidia-smi instead of CUDA/numba), or alternatively allow the operator to statically define the list of MIG instances. Currently, CryoSPARC stores a node's GPUs with a numeric slot ID in the database; this would have to be extended to allow storing the MIG ID as a string.
2.) If MIG support is enabled, set CUDA_VISIBLE_DEVICES to the ID of the selected MIG instance. The CUDA application itself should not require any changes, as it would discover the MIG instance exposed via the CUDA_VISIBLE_DEVICES env variable.
One thing to keep in mind is that a MIG instance supports only a single job/process at a time, so this would have to be taken into account when scheduling jobs.
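A rough sketch of the enumeration change in point 1: parse the output of `nvidia-smi -L` rather than relying on CUDA device enumeration. The sample text below mimics the format nvidia-smi prints on a MIG-enabled A100 host; on a real worker it would be captured with `subprocess.check_output(["nvidia-smi", "-L"])`.

```python
import re

# Sample `nvidia-smi -L` output from a MIG-enabled A100 (UUIDs are fake).
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 3g.40gb     Device  1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""

# Match only MIG UUIDs, not the parent GPU's UUID.
MIG_RE = re.compile(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)")

def enumerate_migs(smi_output: str) -> list[str]:
    """Return the MIG instance UUIDs found in `nvidia-smi -L` output."""
    return MIG_RE.findall(smi_output)

print(enumerate_migs(SAMPLE))
# Each UUID identifies one MIG instance. Since a MIG instance runs only
# one process at a time, a scheduler must hand each UUID to at most one job.
```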
@timeu For CryoSPARC Live (and non-Live CryoSPARC jobs that use no more than one GPU), you may want to try:
1.) configuring the host(s) with MIG-partitioned A100 devices as compute nodes of a cluster controlled by a workload manager that supports and is configured for NVIDIA MIGs and cgroup-based resource isolation, such as Slurm.
2.) connecting the suitably configured cluster to CryoSPARC.
3.) queuing CryoSPARC Live workloads or single-GPU CryoSPARC jobs to the cluster lane.
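For step 3, the cluster lane's submission template could request one MIG-backed GPU slice per job. The fragment below is only a sketch: the partition name is an assumption, and the Jinja2 variables shown (`project_uid`, `job_uid`, `num_cpu`, `ram_gb`, `run_cmd`) follow the pattern of CryoSPARC's example cluster_script.sh and should be checked against your installation.

```shell
#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu            # assumed partition name
#SBATCH --gres=gpu:1               # one MIG-backed GRES slice per job
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --mem={{ ram_gb }}G

# With cgroup isolation, Slurm exports CUDA_VISIBLE_DEVICES for the
# allocated MIG device, so the worker process sees only that slice.
{{ run_cmd }}
```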
Could you please update this forum topic with your findings?
Hello. After implementing your suggestions, I would like to update you on some of our findings.
We set up a Slurm cluster and configured the MIG devices for it (Slurm's 'MultipleFiles' GRES config did not work as expected, but nvidia / hpc / slurm-mig-discovery · GitLab led to success).
We were able to set up and connect the cluster without much issue, and the system has been running smoothly since.
One thing we were not very happy with (although this is not directly related to this topic) is the cryosparcm cluster validate utility, which kept failing during our initial testing. The issue, it seems, was that the validation failed to properly populate the j2 templates in the submission script, while (as we later found out) submission worked fine on the real system. For example, the Slurm example submit script would render the following:
We had assumed that the validation utility would provide default values for all basic variables, but it seems defaults must be provided throughout the whole file. We did not look into this further, however.
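One way to make such a template render even when the caller supplies no values is Jinja2's `default` filter, applied per variable; the variable name below is illustrative and we have not verified this against cryosparcm cluster validate:

```shell
#SBATCH --gres=gpu:{{ num_gpu|default(1) }}
```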
If there are further questions regarding the way we set this up, I will gladly do my best to answer.
Quick update on the multi-GPU job:
Attached to a Live session, we currently have a 2D Classification job running that uses 3 GPUs. It uses multiple MIG devices in the Slurm job, and nvidia-smi shows the job's process on all 3 devices.
Thanks @ebirn. Are the 3 "GPU"s all MIG “sub-devices” on the same “physical” GPU device? Would you be willing to show nvidia-smi output on the GPU host to show