Greetings.
I’ve gone through a few changes and a rollback recently, outlined in this thread:
Now that things are up and running again, we are finding that our jobs are stacking up on a single node. This is odd, as we are using the same cluster_script.sh as before, with only a slight modification to load the path to nvidia-smi.
Declarations in our SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p defq
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/out.txt
#SBATCH -e {{ job_dir_abs }}/err.txt
(The usual section beginning with available_devs="" follows, so I will not paste it here to save space.)
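(For anyone reading without the stock template handy: that section is essentially the loop below, which probes each GPU with nvidia-smi and exports CUDA_VISIBLE_DEVICES. This is a from-memory sketch rather than a copy of our script, and the PATH line is a hypothetical stand-in for our nvidia-smi path change.)

# sketch of the stock available_devs section; the PATH value is illustrative only
export PATH=/usr/bin:$PATH
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs
{{ run_cmd }}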
I can confirm that, outside of cryosparc, SLURM is working normally. For instance, if I start a few interactive sessions like:
srun --time=1:00:00 --gres=gpu:2 --pty /bin/bash
These srun test jobs land on a succession of different nodes, as would be expected given the resources available on each.
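(For reference, this is roughly how I check where those test jobs land; the squeue format string is just one way to display the node list, and <jobid> is a placeholder.)

squeue -u $USER -o "%.12i %.9P %.10j %.8T %R"   # %R shows the allocated node(s) or the pending reason
scontrol show job <jobid> | grep -i nodelist    # per-job check of the assigned node(s)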
However, cryosparc jobs continue to land on the first node in the queue, even if there are no resources available on that node.
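(If it helps with troubleshooting, this is the sort of check I can run on one of the stacked cryosparc jobs to confirm that the templated -n/--gres/--mem values actually reach SLURM; exact field names vary a bit between SLURM versions, and <jobid> is a placeholder.)

scontrol show job <jobid> | egrep 'NumCPUs|MinMemory|TresPerNode|Gres|NodeList'   # verify requested CPUs, memory, GPUs, and assigned node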