A bug I’ve found with v4.7.1-cuda12 is that it won’t use all SXM GPUs, even when “Use all GPUs” is selected: by default it only uses one. You have to disable the “All GPUs” slider and manually specify 8.
The install process detects all 8, but the web UI forces you to always specify 8 manually. PCIe cards don’t have this issue; it specifically affects DGX and HGX nodes with SXM cards.
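For anyone hitting the same thing, a quick way to cross-check what the driver sees versus what CryoSPARC has registered (assuming a standard node install; adjust for your setup) is:

```
# On the GPU node: list the GPUs visible to the driver (should show all 8 SXM cards)
nvidia-smi -L

# On the CryoSPARC master: show the GPUs registered for each scheduler target
cryosparcm cli "get_scheduler_targets()"
```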
On the benchmark test, if you select “Benchmark all available GPUs”, it always uses only a single GPU. However, if you specify the GPUs manually, it will use all 8.
One thing to note is that when “Benchmark all available GPUs” is enabled, CryoSPARC only shows one GPU allocated in the job card, even though the job is actually benchmarking all of them. For example, this job has all GPUs enabled:
The job’s event log should show that all GPUs are being benchmarked. This discrepancy is a limitation of the current system. Could you verify whether the job’s event log also shows only one GPU?
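If it helps, something like the following should pull the relevant logs from the command line (replace P99/J199 with your actual project and job IDs; the eventlog subcommand is only available in more recent v4 releases):

```
# Stream the job's standard output log
cryosparcm joblog P99 J199

# Print the job's event log and filter for GPU-related lines
cryosparcm eventlog P99 J199 | grep -i gpu
```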
You’re right, it does actually benchmark every GPU. Is there any information on when the UI will be fixed?
Another somewhat related issue is that when the benchmark moves on to the next GPU, it doesn’t release the previous one. CryoSPARC holds on to each previously benchmarked card until all 8 are done. Wouldn’t it make more sense to reserve all 8 from the start, since the idle GPUs are being held until the end anyway?
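In case it’s useful, this is easy to observe by polling nvidia-smi on the node while the benchmark job runs:

```
# Poll per-GPU utilization and memory every 5 seconds while the benchmark runs;
# cards whose pass has already finished will still show memory in use if they
# haven't been released
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5
```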
Please can you provide details of the use case where you found the current implementation limiting?
For Benchmark jobs on both an affected node and an unaffected node, please can you post each of the following:
1. the output of the nvidia-smi command on the GPU node
2. a screenshot of the job’s appearance in the UI
3. the output of the following commands on the CryoSPARC master host:
csprojectid=P99 # replace with actual project ID
csjobid=J199 # replace with actual job ID
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'status', 'params_spec')"
Please can you provide details of the use case where you found the current implementation limiting?
In scenarios where organizations aren’t using a cluster scheduler, if CryoSPARC starts an 8-GPU job but doesn’t reserve all 8 GPUs at once, another application (like RELION or AlphaFold) would see those other GPUs as free. This would result in a conflict once CryoSPARC tries to use a GPU that another application has since claimed.
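As a stopgap on shared non-cluster nodes, one option (not a fix, and the indices below are just an example) is to hide the GPUs CryoSPARC will use from the other workloads via CUDA_VISIBLE_DEVICES:

```
# Example only: expose GPUs 4-7 to the other workload so CryoSPARC can claim 0-3;
# adjust the indices to match how the node is actually partitioned
export CUDA_VISIBLE_DEVICES=4,5,6,7
# then launch the other application (RELION, AlphaFold, etc.) from this shell
```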
I no longer have access to that specific CryoSPARC instance (as that was only for troubleshooting), but here is the output of nvidia-smi: