Greetings.
I’ve gone through a few changes and a rollback recently, outlined in this thread:
Now that things are up and running again, we are finding that our jobs are stacking up on a single node. This is odd, as we are using the same cluster_script.sh as before, with only a slight modification to load the path to nvidia-smi.
Declarations in our SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p defq
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/out.txt
#SBATCH -e {{ job_dir_abs }}/err.txt
(The usual section beginning with available_devs="" follows, so I will not paste it here to save space.)
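(For anyone reading without the stock template handy: that section is essentially the loop below, which probes each GPU with nvidia-smi and exports CUDA_VISIBLE_DEVICES. This is a from-memory sketch rather than a copy of our script, and the PATH line is a hypothetical stand-in for our nvidia-smi path change.)

# sketch of the stock available_devs section; the PATH value is illustrative only
export PATH=/usr/bin:$PATH
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
{{ run_cmd }}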
I can confirm that, outside of cryosparc, SLURM is working normally. For instance, if I start a few interactive sessions like:
srun --time=1:00:00 --gres=gpu:2 --pty /bin/bash
These srun test jobs land on a succession of different nodes, as would be expected given the resources available on each.
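(For reference, this is roughly how I check where those test jobs land; the squeue format string is just one way to display the node list, and <jobid> is a placeholder.)

squeue -u $USER -o "%.12i %.9P %.10j %.8T %R"   # %R shows the allocated node(s) or the pending reason
scontrol show job <jobid> | grep -i nodelist    # per-job check of the assigned node(s)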
However, cryosparc jobs continue to land on the first node in the queue, even if there are no resources available on that node.
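(If it helps with troubleshooting, this is the sort of check I can run on one of the stacked cryosparc jobs to confirm that the templated -n/--gres/--mem values actually reach SLURM; exact field names vary a bit between SLURM versions, and <jobid> is a placeholder.)

scontrol show job <jobid> | egrep 'NumCPUs|MinMemory|TresPerNode|Gres|NodeList'   # verify requested CPUs, memory, GPUs, and assigned node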