Error populating num_gpu variable in Extensive Validation job


We have a CryoSPARC master instance configured to submit jobs to our SLURM cluster. We tested a simple multi-GPU motion correction job using the EMPIAR dataset, and it completed successfully. However, when we run the Extensive Validation job, the num_gpu parameter is populated as 0 no matter what value I provide for "Number of GPUs to use" in the Job Builder menu. Please help.

This is a fresh installation, version v4.3.1.


Welcome to the forum @smedury.
Did you run Extensive Validation in Run Mode "Testing" or "Benchmark"?
Could you please post the output of the command

cryosparcm cli "get_scheduler_targets()"


Thanks for the response. I tried running in Testing mode first then also tried the Benchmark mode. I received the same error in both modes.

Here’s the output:

$ cryosparcm cli "get_scheduler_targets()" | grep num_gp
[{'cache_path': '/home/cryosparc_user/non_ssd_cache/', 'cache_quota_mb': None, 'cache_reserve_mb': 100000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'accre_non_gpu', 'lane': 'accre_non_gpu', 'name': 'accre_non_gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --account=csb\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --mem={{ (ram_gb*1000)|int }}M\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'accre_non_gpu', 'tpl_vars': ['run_cmd', 'cluster_job_id', 'project_uid', 'ram_gb', 'job_log_path_abs', 'num_cpu', 'job_uid', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparcuser/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/csbtmp/cryosparc/ssd_data/', 'cache_quota_mb': None, 'cache_reserve_mb': 100000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'accre gpu', 'lane': 'accre gpu', 'name': 'accre gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --partition=a6000x4\n#SBATCH --account=csb_gpu_acc\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --mem={{ (ram_gb*1000)|int }}M\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --gres-flags=enforce-binding\n\nsrun echo $CUDA_VISIBLE_DEVICES\nsrun {{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'accre_gpu', 'tpl_vars': ['run_cmd', 'cluster_job_id', 'project_uid', 'ram_gb', 'num_gpu', 'job_log_path_abs', 'num_cpu', 'job_uid', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparcuser/cryosparc_worker/bin/cryosparcw'}]
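As an aside, since `get_scheduler_targets()` prints a Python-literal list, you can check programmatically which lanes actually forward `num_gpu` to their script template. The helper below is my own sketch, not part of CryoSPARC:

```python
import ast

# Hypothetical helper: given the text printed by
#   cryosparcm cli "get_scheduler_targets()"
# (a Python-literal list of target dicts, as shown above),
# list the lanes whose templates receive num_gpu.
def lanes_with_num_gpu(raw: str):
    targets = ast.literal_eval(raw)
    return [t["lane"] for t in targets if "num_gpu" in t.get("tpl_vars", [])]

# Minimal sample mirroring the two lanes above (fields trimmed)
sample = str([
    {"lane": "accre_non_gpu",
     "tpl_vars": ["run_cmd", "ram_gb", "num_cpu"]},
    {"lane": "accre gpu",
     "tpl_vars": ["run_cmd", "ram_gb", "num_cpu", "num_gpu"]},
])
print(lanes_with_num_gpu(sample))  # → ['accre gpu']
```

In this output, only the `accre gpu` lane lists `num_gpu` among its `tpl_vars`, which matches the two-lane setup described.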

That output is as expected. If the Extensive Validation job failed, what error messages did you observe in the

  • Event Log
  • job log (under Metadata|Log)?

If the Extensive Validation job did not fail and validation spawned a Patch Motion Correction job, could you please post the relevant files from the motion correction job's directory?


Thanks for the response. Our SLURM configuration won't allow GPU:0 in the GPU partition, which is why we established two separate lanes, GPU and non-GPU, in our instance.

Our job submission fails with the following error.

sbatch: error: You need to request at least one GPU for jobs in the a6000x4 partition
sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

I just resubmitted this job to the non-GPU partition to see whether it works there, but I was also hoping to benchmark with GPUs if possible. Is there any workaround for forcing the num_gpu value in the Extensive Validation job?

I see. Why not try (I have not tried this myself) defining script_tpl such that

#SBATCH --partition=

is set depending on {{ num_gpu }}, using syntax similar to Using job_type in slurm if statement.
Alternatively, I have also read that the --partition= parameter accepts a comma-separated list, but I do not know whether merely including a partition that is valid for the job's resource specification in such a list would let you avoid the submission error.
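For the first option, a minimal (untested) sketch of a GPU-aware script_tpl, keeping your a6000x4 partition for GPU jobs and using a placeholder name for the CPU-only partition, might look like this; Jinja evaluates the {% if %} before the script is handed to sbatch, so only one --partition line survives:

```shell
#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --nodes=1
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task={{ num_cpu }}
{%- if num_gpu > 0 %}
#SBATCH --partition=a6000x4
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
{%- else %}
{# "production" is a placeholder for your site's CPU-only partition #}
#SBATCH --partition=production
{%- endif %}

srun {{ run_cmd }}
```

With something like this, a single lane could serve both GPU and CPU-only jobs, since jobs that request zero GPUs would never emit a --gres line in the GPU partition.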


Defining script_tpl with an if condition worked.

I didn’t notice this example before: CryoSPARC Cluster Integration Script Examples - CryoSPARC Guide

Thanks again for all the help.