cryoSPARC v2 cluster install

Hi,

Thank you for the release of cryoSPARC v2. I have been looking forward to this release for a long time.

However, I am having a small problem with the cluster setup.
I have come as far as installing the master and the cluster worker, but when I try to connect the master using cluster_info.json and cluster_script.sh, it fails with:

On master:

-bash-4.2$ cryosparcm cluster connect
Traceback (most recent call last):
  File "<string>", line 5, in <module>
  File "/opt/bioxray/programs/cryosparc2/cryosparc2_master/deps/anaconda/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/opt/bioxray/programs/cryosparc2/cryosparc2_master/deps/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/opt/bioxray/programs/cryosparc2/cryosparc2_master/deps/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/bioxray/programs/cryosparc2/cryosparc2_master/deps/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 5 column 5 (char 142)

This is on CentOS 7.4 and the master node is non-GPU.
My cluster files look like this:

-bash-4.2$ cat cluster_info.json
{
    "name" : "EMCC",
    "worker_bin_path" : "/opt/bioxray/programs/cryosparc2/cryosparc2_worker/bin/cryosparcw",
    "cache_path" : ""
    "send_cmd_tpl" : "ssh loginnode {{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} loginnode:{{ dest_path }}"
}

-bash-4.2$ cat cluster_script.sh
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
##
## What follows is a simple SLURM script:

#SBATCH --job-name cryosparc2{{ project_uid }}{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH -o {{ job_dir_abs }}
#SBATCH -e {{ job_dir_abs }}

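# Collect the indices of GPUs that currently have no running compute
# processes (as reported by nvidia-smi) into a comma-separated list.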
available_devs=""
for devidx in $(seq 0 15);
do
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
        if [[ -z "$available_devs" ]] ; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

{{ run_cmd }}

What am I doing wrong?

Cheers,
Jesper

Okay, I found the fix myself: adding a "," after "cache_path" : "" in cluster_info.json.

Now it at least loads the config into "Compute Configuration" on the web page.
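
For anyone hitting the same ValueError: Python's bundled JSON checker points straight at the offending line and column, so it is a quick way to validate cluster_info.json before running the connect command (this is just the standard json.tool module, nothing cryoSPARC-specific):

-bash-4.2$ python -m json.tool cluster_info.json
Expecting , delimiter: line 5 column 5 (char 142)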

But it still fails to submit jobs.
I assume it puts an sbatch command in front of the run script, but I cannot see where I can tell it to do this?

Cheers,
Jesper

In cluster_info.json, the following line tells cryoSPARC how to submit a job:
"send_cmd_tpl" : "ssh loginnode {{ command }}"

It logs in to the login node and submits the job there. However, if you can submit jobs directly from the master node, you can modify this line to "send_cmd_tpl" : "{{ command }}", as in the snippet below.
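
That is, the relevant line in cluster_info.json simply becomes (the surrounding lines are unchanged):

    "send_cmd_tpl" : "{{ command }}",

With this, cryoSPARC runs sbatch/squeue/scancel directly on the master instead of going through ssh.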

It works in my hands.

–Xing

Thanks for the input, Xing, but that did not work for me.
It looks like the job is being run directly on the master, since the master is listed as the worker resource in use. So the job is not even being submitted through SLURM.

Cheers,
Jesper

Just remember to select the right "lane" (the cluster lane instead of other worker lanes) when you "create" and "queue" the new job.

Hmm, that is exactly what I already do.

I have also uninstalled everything and reinstalled it, just to double-check.
SLURM works fine from the master node, and I can submit without any problems outside cryoSPARC v2.
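
(For reference, a submission test as simple as this works from the master node outside cryoSPARC, using only standard SLURM commands:

-bash-4.2$ sbatch --wrap="hostname"
-bash-4.2$ squeue -u $USER

so the scheduler side itself is fine.)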

I did the whole installation a third time and now it seems to work…
Jobs are now being queued in SLURM.

Thanks again Xing.

Cheers,
Jesper