Hmm, I have this in our cluster_script.sh, but I’m still seeing two jobs land on the same GPU; I confirmed this with nvidia-smi while both jobs were running.
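For context, this is roughly what I run on the node to confirm it (the exact invocation may have varied a bit, but it was along these lines):

nvidia-smi --query-gpu=index,uuid --format=csv,noheader
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader

and both jobs’ processes show up against the same GPU UUID.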
Here is cluster_script.sh:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p defq
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/out.txt
#SBATCH -e {{ job_dir_abs }}/err.txt
available_devs=""
for devidx in $(seq 0 15);
do
if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
if [[ -z “$available_devs” ]] ; then
available_devs=$devidx
else
available_devs=$available_devs,$devidx
fi
fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
{{ run_cmd }}
Based on this, I still don’t see why two jobs would land on the same GPU. What else can I check?
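For what it’s worth, I’m planning to add some logging near the top of the script, before the loop overrides CUDA_VISIBLE_DEVICES, so I can compare what Slurm allocated with what the loop ends up picking. A rough sketch (SLURM_JOB_GPUS in particular may or may not be populated, depending on the gres configuration):

echo "node: $(hostname)"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_JOB_GPUS: ${SLURM_JOB_GPUS:-unset}"                        # GPU indices Slurm says it allocated, if exported
echo "CUDA_VISIBLE_DEVICES from Slurm: ${CUDA_VISIBLE_DEVICES:-unset}"  # what Slurm set before the loop overrides it
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader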