Cluster option not available since v4.4

hi all,

We were running v4.2 on our cluster without issue, then upgraded to v4.4, which required upgrading our GPU nodes to Debian 12. With the fresh install of v4.4 we set everything up as usual: using a GPU node as a "head node" and then running "cryosparcm cluster connect" in the folder containing the .json and .sh files to configure the cluster, the same way we have always done it. However, when queuing jobs there is no cluster option in the lane list, so the cluster is not being connected. Has something changed in v4.4 in how cluster access is set up?
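For reference, this is roughly the sequence we run (the path below is a placeholder for our actual config directory):

# On the cryoSPARC master/head node, from the directory that holds the
# cluster .json config and the .sh submission script:
cd /path/to/cluster_config        # placeholder path
cryosparcm cluster connect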

thanks

Jesse Hansen

Hi @orangeboomerang,

Could you please:

  1. re-run cluster connect and paste the output
  2. as an admin user, check the Instance tab in the UI to see whether the cluster is listed there (a command-line cross-check is sketched right after this list)
  3. as an admin user, check the Lane restrictions tab in the Admin panel to see whether the cluster is hidden from some users
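
If it is easier from a terminal, the registered scheduler targets can also be listed on the master (assuming cryosparcm is on that machine's PATH); a connected cluster should appear in this output alongside any node-type workers:

# list all scheduler targets (lanes/nodes/clusters) known to the master
cryosparcm cli "get_scheduler_targets()"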

hi,

  1. When the admin of our cluster runs cryosparcm cluster connect, he gets different output than I do. Below is the output he obtains. Note that gpu136 is the cryoSPARC head node. To me, it looks like cluster connect is connecting only the head node (gpu136) rather than reading the cluster_submit.sh and config .json file to connect the cluster.

Note that I’ve replaced actual server names with (path).


---------------------------------------------------------------
CRYOSPARC CONNECT --------------------------------------------
---------------------------------------------------------------
Attempting to register worker gpu136.(path).local to command gpu136.(path).local:63002
Connecting as unix user cryoschurgrp
Will register using ssh string: cryogrp@gpu136.(path).local
If this is incorrect, you should re-run this command with the flag --sshstr <ssh string>
---------------------------------------------------------------
Connected to master.
---------------------------------------------------------------
Current connected workers:
gpu136.(path).local
---------------------------------------------------------------
Worker will be registered with 56 CPUs.
Autodetecting available GPUs...
Detected 10 CUDA devices.
id  pci-bus  name
---------------------------------------------------------------
 0        4  NVIDIA GeForce GTX 1080 Ti
 1        5  NVIDIA GeForce GTX 1080 Ti
 2        6  NVIDIA GeForce GTX 1080 Ti
 3        7  NVIDIA GeForce GTX 1080 Ti
 4        8  NVIDIA GeForce GTX 1080 Ti
 5       11  NVIDIA GeForce GTX 1080 Ti
 6       12  NVIDIA GeForce GTX 1080 Ti
 7       13  NVIDIA GeForce GTX 1080 Ti
 8       14  NVIDIA GeForce GTX 1080 Ti
 9       15  NVIDIA GeForce GTX 1080 Ti
---------------------------------------------------------------
All devices will be enabled now.
This can be changed later using --update
---------------------------------------------------------------
Worker will be registered with SSD cache location /(path)/v4.4_PORT63000
---------------------------------------------------------------
Autodetecting the amount of RAM available...
This machine has 515.86GB RAM.
---------------------------------------------------------------
ERROR: This hostname is already registered! Remove it first.

However, I have also been given access to gpu136 and I am generally responsible for making user accounts, restarting cryoSPARC, etc., so I guess that also makes me an admin of some sort. When I try the command, I get the error below:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "(path)/cryosparc_v4.4_gpu62Master_gpu136Worker_PORT63000_LIC_92d54c20/cryosparc_master/cryosparc_compute/cluster.py", line 36, in connect
    target = cli.add_scheduler_target_cluster(**cluster_info)
  File "(path)/cryosparc_v4.4_gpu62Master_gpu136Worker_PORT63000_LIC_92d54c20/cryosparc_master/cryosparc_tools/cryosparc/command.py", line 121, in func
    raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://gpu136.(path).local:63002, code 400) Encountered ServerError from JSONRPC function "add_scheduler_target_cluster" with params {'name': 'slurmcluster', 'worker_bin_path': '(path)/cryosparc_v4.2.1_gpu62Master_gpu118Worker_PORT58000_LIC_00613eea/cryosparc_worker/bin/cryosparcw', 'cache_path': '(path)/v4.2.1_PORT58000', 'send_cmd_tpl': '{{ command }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'transfer_cmd_tpl': 'scp {{ src_path }} loginnode:{{ dest_path }}', 'script_tpl': '#!/bin/bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed.\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n##\n## What follows is a simple SLURM script:\n\n\n#SBATCH --job-name cs_{{ num_gpu }}_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem=40000MB\n#SBATCH -o (path)/cryosparc_slurm_outputs/output_{{ project_uid }}_{{ job_uid }}.txt\n#SBATCH -e (path)/cryosparc_slurm_outputs/error_{{ project_uid }}_{{ job_uid }}.txt\n#SBATCH --exclude=gpu145\n#SBATCH --time=240:00:00\n#SBATCH --partition=gpu\n#SBATCH --constraint=bookworm # debian12\n\necho $available_devs\necho $CUDA_HOME\necho "$(hostname)"\necho $SLURM_TMPDIR\n\n/usr/bin/nvidia-smi\n\nmodule list\n\nexport CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"\n\n{{ run_cmd }}\n\n'}:
ServerError: add_scheduler_target_cluster() got an unexpected keyword argument 'transfer_cmd_tpl'
Traceback (most recent call last):
  File (path)/cryosparc_v4.4_gpu62Master_gpu136Worker_PORT63000_LIC_92d54c20/cryosparc_master/cryosparc_command/commandcommon.py", line 195, in wrapper
    res = func(*args, **kwargs)
TypeError: add_scheduler_target_cluster() got an unexpected keyword argument 'transfer_cmd_tpl'

So for me it basically echoes the full cluster_submit.sh template and then fails with the errors above.
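
As a side note, the traceback complains that add_scheduler_target_cluster got an unexpected keyword argument 'transfer_cmd_tpl', which made me suspect the expected .json fields had changed. If your cryosparcm build provides it (I am not certain of the exact usage, so check cryosparcm's own cluster help), something like the following should write out a current-format example cluster_info.json and script to compare against:

# assumed subcommand and cluster-type argument; verify with cryosparcm's help output
cryosparcm cluster example slurm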

  2. The Instance tab has only one entry, “Lane default (node)”, which is the gpu136 node with the 10 GPUs.

  3. There are no lane restrictions: “Admin user, no lane restrictions applied”.

thanks again for your help.

Jesse

Here is the .json config file:

{
    "name" : "slurmcluster",
    "worker_bin_path" : "/nfs/scistore14/schurgrp/cryoschurgrp/cryosparc_v4.2.1_gpu62Master_gpu118Worker_PORT58000_LIC_00613eea/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "/ssdpool/cryoschurgrp/v4.2.1_PORT58000",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}

and the cluster_submit.sh:

#SBATCH --job-name cs_{{ num_gpu }}_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem=40000MB
#SBATCH -o (path)/cryosparc_slurm_outputs/output_{{ project_uid }}_{{ job_uid }}.txt
#SBATCH -e (path)/cryoschurgrp/cryosparc_slurm_outputs/error_{{ project_uid }}_{{ job_uid }}.txt
#SBATCH --exclude=gpu145
#SBATCH --time=240:00:00
#SBATCH --partition=gpu
#SBATCH --constraint=bookworm

echo $available_devs
echo $CUDA_HOME
echo "$(hostname)"
echo $SLURM_TMPDIR

/usr/bin/nvidia-smi

module list

export CRYOSPARC_SSD_PATH="${SLURM_TMPDIR}"

{{ run_cmd }}
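
For context on how these templates are used: the command templates in the .json map directly onto plain SLURM commands once cryoSPARC fills in the job-specific values. Roughly (the job id and script path below are made-up placeholders):

sbatch /path/to/P3/J2/cluster_script.sh   # from qsub_cmd_tpl:  sbatch {{ script_path_abs }}
squeue -j 1234567                         # from qstat_cmd_tpl: squeue -j {{ cluster_job_id }}
scancel 1234567                           # from qdel_cmd_tpl:  scancel {{ cluster_job_id }}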

Okay, I actually made progress here. The JSON format apparently changed between the two versions: the new cluster_info.json no longer has a “transfer_cmd_tpl” field. So I copied over the new .json and cluster_submit.sh, ran cluster connect again, and it worked. I now see the cluster when submitting jobs!

However, when I submit a job it fails with:

Cluster script submission for P3 J2 failed with exit code 255
ssh: Could not resolve hostname loginnode: Name or service not known

Edit: figured it out. The solution was to change this line in the .json config:
"send_cmd_tpl": "ssh loginnode {{ command }}",
to this:
"send_cmd_tpl": "{{ command }}",

Problem resolved!
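
For reference, our working cluster_info.json now looks roughly like this (server paths replaced with placeholders; the worker_bin_path presumably has to point at the v4.4 worker install rather than the old v4.2.1 one; this is just our config, not an official template):

{
    "name" : "slurmcluster",
    "worker_bin_path" : "(path)/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "(path)/v4.4_PORT63000",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo"
}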

hopefully this helps someone one day!

thanks

Jesse
