cryoSPARC Cluster Installation

Hi all,

Over the weekend I have been trying to get cryoSPARC working with our local cluster. Right now, jobs are submitted via the login node, but nothing happens after that.

The cluster nodes have a separate worker folder and are running CUDA 10.2.

Here is the current configuration:

{
    "name": "nogales",
    "title": "nogales",
    "worker_bin_path": "/cryosparc/worker/cryosparc_cluster/cryosparc_worker/bin",
    "send_cmd_tpl": "ssh whale {{ command }}",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "sinfo",
    "cache_path": "/volatile/cryosparc",
    "cache_quota_mb": null,
    "cache_reserve_mb": 10000
}
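
For completeness, this is roughly how the lane was registered (on the master node, as the cryosparc user; the directory path below is just a placeholder for wherever cluster_info.json and cluster_script.sh live):

# Register / update the "nogales" cluster lane:
cd /path/to/cluster_config_dir    # placeholder: directory holding cluster_info.json and cluster_script.sh
cryosparcm cluster connect        # reads both files from the current working directory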

#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }}            - the complete command string to run the job
## {{ num_cpu }}            - the number of CPUs needed
## {{ num_gpu }}            - the number of GPUs needed. 
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## {{ ram_gb }}             - the amount of RAM needed in GB
## {{ job_dir_abs }}        - absolute path to the job directory
## {{ project_dir_abs }}    - absolute path to the project dir
## {{ job_log_path_abs }}   - absolute path to the log file for the job
## {{ worker_bin_path }}    - absolute path to the cryosparc worker command
## {{ run_args }}           - arguments to be passed to cryosparcw run
## {{ project_uid }}        - uid of the project
## {{ job_uid }}            - uid of the job
## {{ job_creator }}        - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*1000)|int }}MB             
#SBATCH -o {{ job_dir_abs }}
#SBATCH -e {{ job_dir_abs }}

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

srun {{ run_cmd }}
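
As a quick sanity check of the GPU-detection loop above, something like this should list only the idle GPUs when run on one of the compute nodes (assuming the partition really is called gpu; adjust as needed):

# Interactive check of the nvidia-smi query on a GPU node (partition name assumed):
srun -p gpu --gres=gpu:1 --pty bash -c '
    for devidx in $(seq 0 15); do
        if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader 2>/dev/null) ]]; then
            echo "GPU $devidx appears idle"
        fi
    done'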

Here is the actual submission script that gets generated, along with the corresponding error:

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /cryosparc/worker/cryosparc_cluster/cryosparc_worker/bin run --project P36 --job J100 --master_hostname albakor --master_command_core_port 39002 > /cryosparc/projects/abhiram/P36/J100/job.log 2>&1             - the complete command string to run the job
## 12            - the number of CPUs needed
## 2            - the number of GPUs needed. 
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## 32.0             - the amount of RAM needed in GB
## /cryosparc/projects/abhiram/P36/J100        - absolute path to the job directory
## /cryosparc/projects/abhiram/P36    - absolute path to the project dir
## /cryosparc/projects/abhiram/P36/J100/job.log   - absolute path to the log file for the job
## /cryosparc/worker/cryosparc_cluster/cryosparc_worker/bin    - absolute path to the cryosparc worker command
## --project P36 --job J100 --master_hostname albakor.qb3.berkeley.edu --master_command_core_port 39002           - arguments to be passed to cryosparcw run
## P36        - uid of the project
## J100            - uid of the job
## abhiram        - name of the user that created the job (may contain spaces)
## achintangal@berkeley.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc_P36_J100
#SBATCH -n 12
#SBATCH --gres=gpu:2
#SBATCH -p gpu
#SBATCH --mem=32000MB             
#SBATCH -o /cryosparc/projects/abhiram/P36/J100
#SBATCH -e /cryosparc/projects/abhiram/P36/J100

available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

srun /cryosparc/worker/cryosparc_cluster/cryosparc_worker/bin run --project P36 --job J100 --master_hostname albakor  --master_command_core_port 39002 > /cryosparc/projects/abhiram/P36/J100/job.log 2>&1 


==========================================================================
==========================================================================
-------- Submission command: 
ssh whale.qb3.berkeley.edu sbatch /cryosparc/projects/abhiram/P36/J100/queue_sub_script.sh
-------- Cluster Job ID: 
650
-------- Queued on cluster at 2021-04-07 17:32:08.466015
-------- Job status at 2021-04-07 17:32:08.824638
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               650       gpu cryospar cryospar PD       0:00      1 (None)
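
If the job keeps sitting in PD, these standard SLURM commands (run through the same ssh hop) should show why:

# Check why job 650 is pending (run wherever squeue/scontrol work, e.g. on whale):
scontrol show job 650           # full job record, including the Reason= field
squeue -j 650 -o '%i %T %R'     # job id, state, and pending reason / node list
sinfo -p gpu                    # state of the nodes in the gpu partition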


cryosparcm joblog p36 j100 
Traceback (most recent call last):
  File "/opt/cryosparc-v2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/cryosparc-v2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_compute/client.py", line 86, in <module>
    print(eval("cli."+command))
  File "<string>", line 1, in <module>
  File "/opt/cryosparc-v2/cryosparc2_master/cryosparc_compute/client.py", line 59, in func
    assert False, res['error']
AssertionError: {'code': 500, 'data': None, 'message': "OtherError: argument of type 'NoneType' is not iterable", 'name': 'OtherError'}
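
Side note: the UIDs appear uppercase everywhere else in this post, so I also want to rule out a case mismatch in the joblog call:

# Retry with the UIDs exactly as shown in the web app:
cryosparcm joblog P36 J100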

I’d appreciate any pointers on this. Once I am back home today, I will try manually submitting the script as the cryosparc user and see what’s going on.
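
Roughly what I plan to try tonight (as the cryosparc user, using the generated script and paths shown above):

# Manual submission test as the cryosparc user:
ssh whale.qb3.berkeley.edu
sbatch /cryosparc/projects/abhiram/P36/J100/queue_sub_script.sh
squeue -u "$USER"                                        # confirm the job actually starts running
tail -f /cryosparc/projects/abhiram/P36/J100/job.log     # watch the cryoSPARC job log for errors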

Thanks!


Hi @achintangal
Do you have multiple queues/partitions for submitting jobs (e.g. one queue for GPUs, one for high-RAM nodes, etc.)?
Could you run sinfo -N and post the output here?
A quick and dirty way to work around this is to add an extra line to your cluster_script.sh naming the node you want the job to run on (see the sketch below), then re-run cryosparcm cluster connect. If you want to use different nodes for different jobs, you can also add a new lane per node. I use PBS, and we have multiple queues for every node; making lanes for individual nodes worked for me.
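
For SLURM, that extra line would look something like this (the node name is just a placeholder):

# In cluster_script.sh, pin the job to a specific node, then re-run cryosparcm cluster connect:
#SBATCH --nodelist=gpu-node-01    # placeholder node name; use a name from sinfo -N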

Hope this helps.
Arunabh