Thanks @kevinj for sharing these details.
I can see how setting CUDA_VISIBLE_DEVICES inside a Slurm script can fail to broker GPU resources correctly between multiple cluster jobs on the same host. May I suggest
- updating the CryoSPARC cluster script templates to remove any code block like the following (a sketch of the trimmed template follows after this list):
available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z "$available_devs" ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
- instead isolating GPU resources between jobs using Slurm cgroup settings (ConstrainDevices). For an example, please see Slurm, GPU, CGroups, ConstrainDevices - #3 by dchin - Discussion Zone - ask.CI. A sketch of the relevant Slurm settings is also included at the end of this post.
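For illustration, here is a minimal sketch of what the trimmed submission template might look like once the device-scanning block is removed. It assumes the stock cluster_script.sh style of template variables ({{ num_cpu }}, {{ num_gpu }}, {{ ram_gb }}, {{ job_log_path_abs }}, {{ run_cmd }} and similar); the partition name and memory arithmetic are site-specific placeholders, not a definitive template:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu            # placeholder: use your site's GPU partition
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

# No manual GPU selection here: with --gres and ConstrainDevices=yes,
# Slurm exposes only the allocated GPUs and sets CUDA_VISIBLE_DEVICES itself.
{{ run_cmd }}
```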
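On the Slurm side, the key pieces are cgroup-based task containment plus a GPU GRES definition. A rough sketch of the relevant excerpts, assuming a hypothetical 4-GPU node named gpu-node01 (exact node definitions, CPU/memory values and device paths will differ per site):

```
# slurm.conf (excerpt)
GresTypes=gpu
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
NodeName=gpu-node01 Gres=gpu:4 CPUs=32 RealMemory=256000 State=UNKNOWN

# gres.conf on gpu-node01
Name=gpu File=/dev/nvidia[0-3]

# cgroup.conf
ConstrainDevices=yes
```

With ConstrainDevices=yes, each job can only access the /dev/nvidia* devices Slurm allocated to it via --gres, and Slurm sets CUDA_VISIBLE_DEVICES for the job accordingly, so concurrent CryoSPARC jobs on the same node no longer need to probe nvidia-smi to avoid each other.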