Slurm settings for cryoSPARC v2


#1

Hi, we are installing cryoSPARC v2 on our clusters. Our cluster uses Slurm to assign and submit jobs to nodes. However, we found that with the default settings we cannot start cryoSPARC jobs across multiple nodes. I have already connected the cluster settings to the cryoSPARC v2 master, but when we use the lane, we get an error: "Command '['sbatch', '/home/test/cryosparc2_project2/P2/J6/queue_sub_script.sh']' returned non-zero exit status 1"
Here are our cryoSPARC v2 Slurm configuration files:
cluster_info.json:
{
    "name" : "slurm2",
    "worker_bin_path" : "/cm/shared/apps/cryosparc2/cryosparc2_worker/bin/cryosparcw",
    "cache_path" : "/ssd/cryosparc2_cache",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}
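One thing worth checking: forum software tends to replace straight quotes with curly ones, but the file on disk must be plain-ASCII JSON. A quick sanity sketch, assuming python3 is on PATH (its stdlib parser rejects curly quotes outright); the heredoc simply mirrors the file as posted, with quotes corrected:

```shell
# Sanity-check that cluster_info.json is valid JSON.
# The heredoc mirrors the posted file (straight quotes only).
cat > /tmp/cluster_info.json <<'EOF'
{
    "name" : "slurm2",
    "worker_bin_path" : "/cm/shared/apps/cryosparc2/cryosparc2_worker/bin/cryosparcw",
    "cache_path" : "/ssd/cryosparc2_cache",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}
EOF
# json.tool exits non-zero on any syntax error (including curly quotes).
python3 -m json.tool /tmp/cluster_info.json > /dev/null && echo "cluster_info.json: valid JSON"
```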
cluster_script.sh:
#!/usr/bin/env bash

## cryoSPARC cluster submission script template for SLURM
##
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
##     Note: the code will use this many GPUs starting from dev id 0;
##     the cluster scheduler or this script have the responsibility
##     of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##     using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n 2
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p defq
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}
#SBATCH -e {{ job_dir_abs }}

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

{{ run_cmd }}
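In case it helps anyone debugging this template: the device-selection loop is ordinary bash, so it can be exercised off-cluster by shadowing nvidia-smi with a stub on PATH. This is only an illustrative sketch; the stub and its "busy" devices 0 and 2 are made up, not real cluster state:

```shell
#!/usr/bin/env bash
# Put a fake nvidia-smi on PATH that pretends devices 0 and 2 are busy.
stubdir=$(mktemp -d)
cat > "$stubdir/nvidia-smi" <<'EOF'
#!/usr/bin/env bash
idx=""
while [ $# -gt 0 ]; do
  case "$1" in
    -i) idx=$2; shift 2 ;;
     *) shift ;;
  esac
done
# Busy devices report a compute PID; idle ones print nothing.
if [ "$idx" = 0 ] || [ "$idx" = 2 ]; then echo "12345"; fi
EOF
chmod +x "$stubdir/nvidia-smi"
PATH="$stubdir:$PATH"

# Same selection logic as the template, over 4 fake devices.
available_devs=""
for devidx in $(seq 0 3); do
  if [ -z "$(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader)" ]; then
    if [ -z "$available_devs" ]; then
      available_devs=$devidx
    else
      available_devs=$available_devs,$devidx
    fi
  fi
done
echo "CUDA_VISIBLE_DEVICES=$available_devs"
```

With devices 0 and 2 "busy", this prints `CUDA_VISIBLE_DEVICES=1,3`.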


#2

The cluster settings look similar to ours. Can you post the job output? Or try submitting the script with sbatch yourself from a shell to identify the error.
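As an even cheaper first pass, `bash -n` parses a script without executing it, which catches quoting or copy-paste damage before Slurm is involved at all. A sketch (the heredoc stands in for the real queue_sub_script.sh; point bash -n at your actual generated file):

```shell
# Stand-in for the generated submission script.
cat > /tmp/queue_sub_script.sh <<'EOF'
#!/usr/bin/env bash
available_devs=""
for devidx in $(seq 0 15); do
  echo "checking device $devidx"
done
EOF

# -n = parse only; a non-zero exit here means a shell syntax error.
bash -n /tmp/queue_sub_script.sh && echo "syntax OK"
```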


#3

(A note on formatting: the script's comment lines start with ##, which this forum renders as bold, so they may display oddly.)
Here is the output from cryoSPARC v2:
====================== Cluster submission script: ========================

#!/usr/bin/env bash
## cryoSPARC cluster submission script template for SLURM
##
## Available variables:
## /cm/shared/apps/cryosparc2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J7 --master_hostname headnode --master_command_core_port 39002 > /home/test/cryosparc2_project2/P2/J7/job.log 2>&1 - the complete command string to run the job
## 48 - the number of CPUs needed
## 8 - the number of GPUs needed.
##     Note: the code will use this many GPUs starting from dev id 0;
##     the cluster scheduler or this script have the responsibility
##     of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##     using the correct cluster-allocated GPUs.
## 120.0 - the amount of RAM needed in GB
## /home/test/cryosparc2_project2/P2/J7 - absolute path to the job directory
## /home/test/cryosparc2_project2/P2 - absolute path to the project dir
## /home/test/cryosparc2_project2/P2/J7/job.log - absolute path to the log file for the job
## /cm/shared/apps/cryosparc2/cryosparc2_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P2 --job J7 --master_hostname headnode --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P2 - uid of the project
## J7 - uid of the job
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_P2_J7
#SBATCH -n 48
#SBATCH --gres=gpu:8
#SBATCH -p defq
#SBATCH --mem=120000MB
#SBATCH -o /home/test/cryosparc2_project2/P2/J7
#SBATCH -e /home/test/cryosparc2_project2/P2/J7

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

/cm/shared/apps/cryosparc2/cryosparc2_worker/bin/cryosparcw run --project P2 --job J7 --master_hostname headnode --master_command_core_port 39002 > /home/test/cryosparc2_project2/P2/J7/job.log 2>&1

==========================================================================

-------- Submission command: sbatch /home/test/cryosparc2_project2/P2/J7/queue_sub_script.sh
Failed to launch! 1
And here is the output from running `sbatch queue_sub_script.sh` directly:
sbatch: error: Batch job submission failed: Requested node configuration is not available


#4

I commented out the line
#SBATCH --gres=gpu:8
and the job could launch.
However, the job then only runs on the headnode and is never assigned to the other nodes.
When I queue a job to the other nodes' lane, it launches and shows PD (pending), but then the job fails in Slurm.
Help?


#5

It looks like your gres or mem requests exceed what the cluster's nodes offer. Are you sure you have nodes with 8 GPUs each? Also, the memory available to Slurm may be less than the total installed memory, so check that 120 GB is actually allocatable on a single node.

Your cluster administrator could probably help here, since this question is more about Slurm than about cryoSPARC.
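For reference, the template line `#SBATCH --mem={{ (ram_gb*1000)|int }}MB` converts the GB request into decimal megabytes, so the 120.0 GB job asks for 120000 MB; a node's RealMemory (as reported by `scontrol show node`) must be at least that for the job to be eligible. A small sketch of the arithmetic (the RealMemory figure below is a hypothetical example, not a measured value):

```shell
# Same conversion the Jinja expression (ram_gb*1000)|int performs.
ram_gb=120.0
requested_mb=$(awk -v g="$ram_gb" 'BEGIN { printf "%d", g*1000 }')
echo "#SBATCH --mem=${requested_mb}MB"

# Hypothetical RealMemory for a "128 GB" node after OS reservations;
# substitute the value scontrol reports for your nodes.
real_memory_mb=128931
if [ "$requested_mb" -le "$real_memory_mb" ]; then
  echo "request fits on node"
else
  echo "request exceeds node memory"
fi
```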