CUDA_ERROR_NO_DEVICE - but only when AF2 is running!

Thanks @kevinj for sharing these details.

I can see how setting CUDA_VISIBLE_DEVICES inside a Slurm script can fail to broker GPU resources between multiple cluster jobs on the same host: two jobs that start at nearly the same time can each see the same GPU as idle before either has launched a compute process on it. May I suggest

  • updating the CryoSPARC cluster script templates to remove any code block like the following:
```bash
available_devs=""
for devidx in $(seq 0 15); do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]]; then
        if [[ -z "$available_devs" ]]; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
```
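Instead, the cluster script can request GPUs through Slurm's generic resource (GRES) mechanism and let the scheduler assign devices. A minimal sketch of such a template header, assuming the cluster's slurm.conf defines a `gpu` GRES and that the template variables (`{{ num_gpu }}`, `{{ num_cpu }}`, `{{ run_cmd }}`) are the ones your template already uses:

```shell
#!/usr/bin/env bash
# Ask Slurm for the job's GPUs. With GRES (and cgroup confinement, if
# configured) Slurm sets CUDA_VISIBLE_DEVICES for the job itself, so
# concurrent jobs on the same host cannot grab the same device.
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --cpus-per-task={{ num_cpu }}

# Deliberately do NOT set CUDA_VISIBLE_DEVICES here: overriding the
# value Slurm assigned reintroduces the race between jobs.
{{ run_cmd }}
```

Removing the nvidia-smi polling loop and deferring to the scheduler avoids the time-of-check/time-of-use race entirely, since Slurm tracks GPU allocations across all jobs rather than inferring availability from running processes.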