Lanes added after update do not submit jobs (Cluster configuration)

Dear all,

I have an issue with our cluster configuration. We updated to version 4.4.1 a while ago and everything was running smoothly. We recently got some new GPU nodes that I wanted to add, so I adapted my template cluster_info.json and cluster_script.sh and executed cryosparcm cluster connect as always, which resulted in the following error (full error via PM):

TypeError: add_scheduler_target_cluster() got an unexpected keyword argument 'transfer_cmd_tpl'

After deleting the (apparently deprecated) line

"transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"

cryosparcm cluster connect ran successfully.
However, jobs are no longer submitted to the cluster when using the new lane.
Can somebody explain this behavior, or is there a workaround for the new version?
Thanks a lot!
Here are the cluster template files:
cluster_info.json

{
    "name" : "node:palma.uni-muenster.de",
    "worker_bin_path" : "placeholder",
    "cache_path" : "placeholder",
    "send_cmd_tpl" : "ssh -i ~/.ssh/key username@placeholder.de {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}

cluster_script.sh

#!/usr/bin/env bash
#SBATCH --job-name cryosparc_{{ job_creator }}_{{ project_uid }}_{{ job_uid }}
#SBATCH -t {{ time_usr }}
#SBATCH -n {{ num_cpu*4 }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu2080,gpu3090,gpua100,gpuv100,gputitanrtx
#SBATCH --mem={{ (ram_gb|float * ram_gb_multiplier|float)|int }}G             
#SBATCH -o {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.out
#SBATCH -e {{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}.err

ml 2021b
ml CUDA/11.6.0

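# The loop below scans devices 0-15 with nvidia-smi and exposes only GPUs that
# currently have no running compute processes to CryoSPARC via CUDA_VISIBLE_DEVICES.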
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

{{ run_cmd }}

Best regards

Can you please post the following additional information:

  1. the output of the command
    cryosparcm cli "get_scheduler_targets()"
    
    and indicate the names of the lanes that function and malfunction, respectively (a quick way to list just the lane names is sketched right after this list).
  2. Are the project directories available (under matching paths) on the new GPU nodes?
  3. Post any messages related to the jobs’ submission failures from logs or the UI, such as from
    • jobs’ Event Logs
    • command_core log
      cryosparcm log command_core
    • slurm logs
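
Regarding item 1: the lane/target names can be pulled out of that output with something like the following (a rough sketch; the exact output format of get_scheduler_targets() may differ between CryoSPARC versions):

cryosparcm cli "get_scheduler_targets()" | grep -o "'name': '[^']*'"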

Possibly unrelated to your immediate concern of jobs not being submitted:

Are you sure these modules are needed for (and do not interfere with) CryoSPARC jobs? CryoSPARC v4.4 includes its own CUDA dependencies.

This code block may fail to properly assign/restrict GPU resources to jobs. For a more robust alternative, consider CUDA_ERROR_NO_DEVICE - but only when AF2 is running! - #9 by wtempel.
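
In broad strokes, the alternative is to rely on the GPU assignment Slurm already makes for the allocation rather than recomputing CUDA_VISIBLE_DEVICES in the script. A minimal sketch of the general idea, assuming Slurm is configured with GPU GRES/cgroup isolation (this is not necessarily what the linked post proposes):

#SBATCH --gres=gpu:{{ num_gpu }}

# Do not recompute CUDA_VISIBLE_DEVICES here; Slurm exports it for the GPUs it allocated.
echo "Slurm-assigned GPUs: ${CUDA_VISIBLE_DEVICES:-<not set>}"

{{ run_cmd }}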


Dear wtempel,

thank you for your fast reply and the great questions, which helped me figure this out myself.
The issue was the custom script variables, which was not obvious at first but was indicated by the command_core log.
After updating, the defaults for these variables were lost because I had not hardcoded them.
So I changed

#SBATCH --mem={{ (ram_gb|float * ram_gb_multiplier|float)|int }}G 

to

#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G

and now it is working again.
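
For anyone who wants to sanity-check such an expression outside CryoSPARC, the behaviour of the default filter can be reproduced with a small test (this assumes python3 with the jinja2 package is available; the values are made up):

python3 - <<'EOF'
from jinja2 import Template
t = Template("{{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}")
print(t.render(ram_gb=24))                        # multiplier undefined -> falls back to 1 -> 24
print(t.render(ram_gb=24, ram_gb_multiplier=2))   # multiplier defined -> 48
EOF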

I think the modules are not needed anymore; I will remove them now.

Thank you for the hint. So far there have been no problems. However, Slurm properly assigns GPU resources, which probably makes the code block unnecessary in the future. I have removed it now.
Thanks a lot!

Edit: I noticed that it is not straightforward to set up a multi-user cluster integration for CryoSPARC. We figured out a setup where each user submits CryoSPARC jobs under their own Slurm user while everybody shares a single master instance. In case it is interesting for others, I could write up a short documentation of how we set everything up. Just let me know.

Best regards


Thanks for this kind offer @mruetter. I think this would be interesting.

I’m interested in learning more about how you set up a multi-user CryoSPARC cluster integration using a single master instance. I’ve seen your previous comment about this, and I was wondering if you’ve had a chance to write that documentation yet. If so, I’d be grateful if you could share it! :grinning:

Can you give us a detailed description of how you set it up so that multiple users can use the same instance and the corresponding user can be displayed?

Hi Tosyl and ltf,

sorry, I forgot about that / it slipped too far down my current to-do list. I can give you a short description now and a detailed one later (probably by the end of next week).

So, we installed the CryoSPARC master instance on a local server in our basement.
For computing, we use our HPC cluster, which is located in another building for security reasons.
There we have 40 PB of cloud storage (long term) and ~3 PB of scratch storage (short to mid term), to which all of our GPU nodes have rwx access.

  • Basically, all CryoSPARC users from our group need an HPC account and a CryoSPARC account with the same username.

  • The master instance and the HPC cluster share the same Unix service account (identical name, group, and ID).

  • All CryoSPARC users are members of the same Unix group (fruits).

  • The sudoers file needs to be edited to allow the master instance account (boss) to submit jobs on behalf of its group members:
    boss ALL=(fruits) NOPASSWD: /usr/bin/sbatch, /usr/bin/scancel, /usr/bin/squeue, /usr/bin/sinfo

  • Scratch and cloud spaces are mounted via a Samba export on our master instance under the same paths as on our HPC cluster.

  • The project directories are assigned to the users via chown (for quota accounting) but grant rwx access to the boss account (so the master instance can access them).

  • To submit jobs in the name of the users, cluster_info.json needs to be edited to:

"qsub_cmd_tpl" : "sudo -u {{ job_creator }} sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "sudo -u {{ job_creator }} squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "sudo -u {{ job_creator }} scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sudo -u {{ job_creator }} sinfo"

where job_creator is the Unix username on the HPC cluster, which matches the CryoSPARC username (not the Unix account under which the master instance runs).
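To make this concrete: for a job queued by the CryoSPARC user alice in project P1, job J42 (user, project, and path are made-up examples), the command that actually gets executed looks roughly like

sudo -u alice sbatch /mnt/cloud/cryosparc_projects/P1/J42/queue_sub_script.sh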
I hope this already gives you an idea and helps with your setup.
Best,
Max


Thank you for sharing, but I believe the sudoers configuration requires more than that.

Because if you submit a job this way, sudo only retains a few common environment variables and paths (such as /usr/bin), and it then complains that the Slurm configuration file cannot be found. How did you do it?

Can you explain your question a bit more? We have Slurm running on our HPC cluster, but not on our management server, since it is not necessary there.


Because in /etc/sudoers, only the environment variables specified in the file are kept.

There were no problems with the previous steps. I also added the environment variables of our test cluster's Slurm to the sudoers file and submitted a job, but the Slurm configuration file still cannot be found, as shown in the figure.
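
For reference, sudoers whitelists environment variables per user with env_keep and controls the search path with secure_path. A sketch of the mechanism only, with variable names and paths that are assumptions and must be adapted to your site:

Defaults:boss env_keep += "SLURM_CONF"
Defaults:boss secure_path = "/usr/local/slurm/bin:/usr/bin:/bin"

Whether this resolves the missing slurm.conf depends on how Slurm is installed on the submit host.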

Using sudo in the cluster script template somewhat deviates from the intended operation of CryoSPARC under a non-privileged account (1, 2).
Please weigh any security implication carefully before deciding whether to use sudo within the script template.

I totally agree. This configuration should only be used if you know what you are doing. However, it does not grant root rights; the allowed commands should be limited and applied only to a small group.
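
For illustration, one way to express such a restriction in sudoers is via a command alias (a sketch reusing the example account and group names from above):

Cmnd_Alias CRYOSPARC_SLURM = /usr/bin/sbatch, /usr/bin/scancel, /usr/bin/squeue, /usr/bin/sinfo
boss ALL=(fruits) NOPASSWD: CRYOSPARC_SLURM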