Our CryoSPARC software runs on a Slurm cluster. The cluster comprises 5 GPU servers, each equipped with 8 NVIDIA RTX 3090 graphics cards. CryoSPARC users do not submit jobs directly to a specific GPU server; instead, they submit them to the cluster, and the Slurm cluster management software distributes each job to a GPU server for computation. Overall the setup works well, but we have encountered the following two issues:
The first issue is that CryoSPARC-Slurm tends to spread jobs across multiple GPU servers, so every server ends up running something and, in our experience, can no longer accept new jobs even though most of its resources are still free. For example, suppose there are 5 CryoSPARC jobs and each job needs only 2 GPUs. Since each GPU server has 8 GPUs, one server could in principle host 4 such jobs, so jobs 1-4 could all go to the first server and the 2nd to 5th servers would remain available for other work. Instead, CryoSPARC-Slurm currently assigns the 5 jobs to all 5 GPU servers, leaving every server with 6 idle GPUs that cannot be given new jobs. This scheduling behaviour does not make full use of the computational resources and leaves a large amount of hardware sitting idle. Is there an issue with our software configuration?
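For reference, one Slurm-side setting that can cause exactly this kind of spreading is the CR_LLN option of SelectTypeParameters, which makes Slurm prefer the least-loaded node for every new job. The excerpt below is only a sketch of what to look for, assuming select/cons_tres is in use; it is not a copy of our actual slurm.conf.

# slurm.conf excerpt (a sketch of settings worth checking, not our real file)
SelectType=select/cons_tres
# CR_LLN tells Slurm to place each new job on the least-loaded node, which
# spreads small GPU jobs across all servers; without it, the default best-fit
# placement tends to fill one node before moving on to the next.
SelectTypeParameters=CR_Core_Memory

# Quick check on the running cluster:
scontrol show config | grep -i SelectTypeParameters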
The second issue is that when a job is submitted and no computational resources are immediately available, CryoSPARC runs squeue to check the queue status, but this produces the "Unrecognized option" error quoted further below. Why is this happening? For reference, our cluster lane configuration is as follows:
[{'cache_path': '/scratch/CryoSPARC_Cache',
  'cache_quota_mb': None,
  'cache_reserve_mb': 10000,
  'custom_var_names': [],
  'custom_vars': {},
  'desc': None,
  'hostname': 'HY_slurm',
  'lane': 'HY_slurm',
  'name': 'HY_slurm',
  'qdel_cmd_tpl': '/opt/slurm/24.11.0/bin/scancel {{ cluster_job_id }}',
  'qinfo_cmd_tpl': '/opt/slurm/24.11.0/bin/sinfo',
  'qstat_cmd_tpl': '/opt/slurm/24.11.0/bin/squeue -j {{ cluster_job_id }}',
  'qstat_code_cmd_tpl': '/opt/slurm/24.11.0/bin/squeue -j {{ cluster_job_id }} --format=%T | sed -n 2p',
  'qsub_cmd_tpl': '/opt/slurm/24.11.0/bin/sbatch {{ script_path_abs }}',
  'script_tpl': '#!/bin/sh\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed.\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n## {{ job_type }} - CryoSPARC job type\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --partition=HY\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n\n{{ run_cmd }}\n\n',
  'send_cmd_tpl': '{{ command }}',
  'title': 'HY_slurm',
  'tpl_vars': ['cluster_job_id', 'command', 'run_args', 'job_uid', 'cryosparc_username', 'worker_bin_path', 'ram_gb', 'job_creator', 'num_gpu', 'job_dir_abs', 'project_uid', 'job_type', 'project_dir_abs', 'num_cpu', 'run_cmd', 'job_log_path_abs'],
  'type': 'cluster',
  'worker_bin_path': '/hy003/Software/CryoSPARC_Yunlab/cryosparc_worker/bin/cryosparcw'}]
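For completeness, the configuration dump above can be reproduced at any time with the CryoSPARC command-line interface (run on the master node; this is the standard CLI call, assuming the cryosparc_master environment is on PATH):

# Prints the scheduler lane configuration shown above
cryosparcm cli "get_scheduler_targets()"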
However, 'qstat_code_cmd_tpl': '/opt/slurm/24.11.0/bin/squeue -j {{ cluster_job_id }} --format=%T | sed -n 2p' actually comes from the CryoSPARC v4.6.2 package itself, not from our own script. If we unpack the package and grep for squeue:

tar zxfv cryosparc_master.tar.gz
cd cryosparc_master/bin
grep squeue *

we see
cryosparcm: 'qstat_cmd_tpl' : 'squeue -j {{ cluster_job_id }}',
cryosparcm: 'qstat_code_cmd_tpl': 'squeue -j {{ cluster_job_id }} --format=%T | sed -n 2p',
I am not sure what I should do with this line in cryosparcm.
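My guess (not something I have confirmed in the CryoSPARC documentation) is that CryoSPARC passes this template to squeue without going through a shell, so the | and the sed part arrive as literal arguments and squeue rejects the pipe character as an unknown option. The command itself is fine when typed at a shell prompt; here it is with a hypothetical job id 12345:

# Run through a shell, the pipe is interpreted: squeue prints a STATE header
# plus the job state, and sed -n 2p keeps only the second line (the state).
/opt/slurm/24.11.0/bin/squeue -j 12345 --format=%T | sed -n 2p
# expected output for a running job:
# RUNNING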
By the way, although the "squeue: error: Unrecognized option: |" message is annoying, it does not seem to interfere with the calculations. I think the first issue is the real problem.
I think both issues have now been resolved. They seem to have stemmed from bad cluster_info.json and cluster_script.sh files. I tried the following, and it solved all the problems.
I deleted the old installation and re-installed CryoSPARC v4.6.2.
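For anyone hitting the same problems, the lane re-connection step looks roughly like the sketch below. The lane name, paths, and command templates are copied from the configuration dump above; the working directory name is made up, and cluster_script.sh is the SLURM template already shown in that dump (the #SBATCH block ending in {{ run_cmd }}).

# A sketch only: create the two lane files and register the lane
mkdir -p ~/HY_slurm_lane && cd ~/HY_slurm_lane

cat > cluster_info.json <<'EOF'
{
    "name": "HY_slurm",
    "worker_bin_path": "/hy003/Software/CryoSPARC_Yunlab/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/scratch/CryoSPARC_Cache",
    "cache_reserve_mb": 10000,
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "/opt/slurm/24.11.0/bin/sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "/opt/slurm/24.11.0/bin/squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "/opt/slurm/24.11.0/bin/scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "/opt/slurm/24.11.0/bin/sinfo"
}
EOF

# cluster_script.sh must also be present in this directory before connecting.

# Register (or update) the lane; cryosparcm reads both files from the current directory.
cryosparcm cluster connect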