Lanes added after update do not submit jobs. Cluster configuration

Dear all,

I have an issue with our cluster configuration. We updated to version 4.4.1 a while ago and everything was running smoothly. We recently got some new GPU nodes that I wanted to add, so I adapted my template cluster_info.json and cluster_script.sh and executed cryosparcm cluster connect as always, which resulted in the following error (full error via PM):

TypeError: add_scheduler_target_cluster() got an unexpected keyword argument 'transfer_cmd_tpl'

After deleting the (apparently deprecated) line

"transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"

cryosparcm cluster connect ran successfully.
However, jobs queued to the new lane are no longer submitted to the cluster.
Can somebody explain the behavior or is there a workaround for the new version?
Thanks a lot!
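For reference, this is how the lane was (re)connected, run from the directory that contains both template files (a sketch; the path is just a placeholder):

cd /path/to/cluster_templates
cryosparcm cluster connect
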
Here are the cluster template files:
cluster_info.json

{
    "name" : "node:palma.uni-muenster.de",
    "worker_bin_path" : "placeholder",
    "cache_path" : "placeholder",
    "send_cmd_tpl" : "ssh -i ~/.ssh/key username@placeholder.de {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}
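
With the deprecated "transfer_cmd_tpl" line removed, the file that cryosparcm cluster connect accepted looks like this (same placeholders as above):

{
    "name" : "node:palma.uni-muenster.de",
    "worker_bin_path" : "placeholder",
    "cache_path" : "placeholder",
    "send_cmd_tpl" : "ssh -i ~/.ssh/key username@placeholder.de {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo"
}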

cluster_script.sh

#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ job_creator }}_{{ project_uid }}_{{ job_uid }}
#SBATCH -t {{ time_usr }}
#SBATCH -n {{ num_cpu*4 }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu2080,gpu3090,gpua100,gpuv100,gputitanrtx
#SBATCH --mem={{ (ram_gb|float * ram_gb_multiplier|float)|int }}G
#SBATCH -o {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.out
#SBATCH -e {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.err

# Load the site's environment modules (toolchain release and CUDA)
ml 2021b
ml CUDA/11.6.0

# Collect the indices of GPUs that currently have no running compute processes
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
# Restrict the job to the idle GPUs collected above
export CUDA_VISIBLE_DEVICES=$available_devs

{{ run_cmd }}

Best regards

Can you please post some additional information:

  1. the output of the command
    cryosparcm cli "get_scheduler_targets()"
    
    and indicate the names of the lanes that function and malfunction, respectively.
  2. Are the project directories available (under matching paths) on the new GPU nodes?
  3. Post any messages related to the jobs’ submission failures from logs or the UI, such as from
    • jobs’ Event Logs
    • command_core log
      cryosparcm log command_core
    • slurm logs

Possibly unrelated to your immediate concern of jobs not being submitted:

Are you sure these modules are needed for (and do not interfere with) CryoSPARC jobs? CryoSPARC v4.4 includes its own CUDA dependencies.

This code block may fail to properly assign/restrict GPU resources to jobs. For a more robust alternative, consider CUDA_ERROR_NO_DEVICE - but only when AF2 is running! - #9 by wtempel.


Dear wtempel,

thank you for your fast reply and the great questions, which helped me figure it out myself.
The issue was the custom variables, which was not obvious at first but was pointed out by the command_core log.
After updating, the defaults for the custom variables were gone because I had not hardcoded them.
So I changed

#SBATCH --mem={{ (ram_gb|float * ram_gb_multiplier|float)|int }}G 

to

#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G

and now it is working again.
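
The same |default pattern should work for the other custom variables in the template as well, for example (a sketch; the default value shown is only an illustration, not the value from our setup):

#SBATCH -t {{ time_usr|default("48:00:00") }}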

I think they are not needed anymore; I will remove them now.

Thank you for the hint. So far there have been no problems. However, Slurm properly assigns GPU resources, which probably makes the code block unnecessary anyway. I have removed it now.
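
After these changes, our cluster_script.sh boils down to roughly the following (a sketch; it assumes the module lines really are no longer needed and uses an illustrative default walltime):

#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ job_creator }}_{{ project_uid }}_{{ job_uid }}
#SBATCH -t {{ time_usr|default("48:00:00") }}
#SBATCH -n {{ num_cpu*4 }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu2080,gpu3090,gpua100,gpuv100,gputitanrtx
#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}G
#SBATCH -o {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.out
#SBATCH -e {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.err

{{ run_cmd }}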
Thanks a lot!

Edit: I noticed that it is not straightforward to set up a multi-user cluster integration for CryoSPARC. We figured out a system where each user can submit CryoSPARC jobs under their own Slurm user while sharing a single master instance. In case it is interesting for others, I could write a short documentation on how we set everything up. Just let me know.

Best regards


Thanks for this kind offer, @mruetter. I think this would be interesting.