Thanks @kevinj for sharing these details.
I can see how setting CUDA_VISIBLE_DEVICES inside a Slurm script can fail to broker GPU resources correctly between multiple cluster jobs on the same host. May I suggest
- updating the CryoSPARC cluster script templates to remove any code block like the following (a sketch of the trimmed template follows after this list):
available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z "$available_devs" ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
- instead isolating GPU resources between jobs using Slurm cgroup settings (ConstrainDevices). For an example, please see Slurm, GPU, CGroups, ConstrainDevices - #3 by dchin - Discussion Zone - ask.CI. A sketch of the relevant Slurm settings is also included at the end of this post.
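For illustration, here is a minimal sketch of what the trimmed submission template might look like once the device-scanning block is removed. It assumes the stock cluster_script.sh style of template variables ({{ num_cpu }}, {{ num_gpu }}, {{ ram_gb }}, {{ job_log_path_abs }}, {{ run_cmd }} and similar); the partition name and memory arithmetic are site-specific placeholders, not a definitive template:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu            # placeholder: use your site's GPU partition
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

# No manual GPU selection here: with --gres and ConstrainDevices=yes,
# Slurm exposes only the allocated GPUs and sets CUDA_VISIBLE_DEVICES itself.
{{ run_cmd }}
```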
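On the Slurm side, the key pieces are cgroup-based task containment plus a GPU GRES definition. A rough sketch of the relevant excerpts, assuming a hypothetical 4-GPU node named gpu-node01 (exact node definitions, CPU/memory values and device paths will differ per site):

```
# slurm.conf (excerpt)
GresTypes=gpu
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
NodeName=gpu-node01 Gres=gpu:4 CPUs=32 RealMemory=256000 State=UNKNOWN

# gres.conf on gpu-node01
Name=gpu File=/dev/nvidia[0-3]

# cgroup.conf
ConstrainDevices=yes
```

With ConstrainDevices=yes, each job can only access the /dev/nvidia* devices Slurm allocated to it via --gres, and Slurm sets CUDA_VISIBLE_DEVICES for the job accordingly, so concurrent CryoSPARC jobs on the same node no longer need to probe nvidia-smi to avoid each other.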