Thanks very much - I did think that this might be an issue with port access, but I'm curious why it would only cause problems for CryoSPARC Live. Does Live use different ports from the rest of the processing? I am attaching the output from the requested commands below:
1: I'm not sure which file this text is stored in - it is visible in the GUI as the reason for job failure in the brief window between a job failing and being re-submitted in a Live session.
2:
hostname -f
cryosparc
cryosparcm status | grep -v LICENSE
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------
CryoSPARC process status:
app RUNNING pid 90184, uptime 3 days, 0:47:32
app_api RUNNING pid 90203, uptime 3 days, 0:47:30
app_api_dev STOPPED Not started
command_core RUNNING pid 90099, uptime 3 days, 0:47:56
command_rtp RUNNING pid 90132, uptime 3 days, 0:47:43
command_vis RUNNING pid 90128, uptime 3 days, 0:47:45
database RUNNING pid 89989, uptime 3 days, 0:48:04
----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------
global config variables:
export CRYOSPARC_MASTER_HOSTNAME="cryosparc.cosmic"
export CRYOSPARC_DB_PATH="/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=30268
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_MONGO_EXTRA_FLAGS=""
export CRYOSPARC_FORCE_HOSTNAME=true
cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu', 'lane': 'cosmic gpu', 'name': 'cosmic gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -c {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*2000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu large', 'lane': 'cosmic gpu large', 'name': 'cosmic gpu large', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -c {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu_large
#SBATCH --mem=180000
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu large', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic cpu', 'lane': 'cosmic cpu', 'name': 'cosmic cpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -c {{ num_cpu }}
#SBATCH -p cpu
#SBATCH --mem={{ (ram_gb*2000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic cpu', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu live', 'lane': 'cosmic gpu live', 'name': 'cosmic gpu live', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={%if num_cpu == 0%}1{%else%}{{ num_cpu }}{%endif%}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu_large
#SBATCH --mem=180000
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu live', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}]
curl http://cryosparc.cosmic:30273
Hello World from cryosparc real-time processing manager.
host cryosparc.cosmic
the host command is not available on the CryoSPARC master - this is a minimal install in a Kubernetes container
ps -eo pid,ppid,cmd | grep cryosparc_
89875 1 python /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master/supervisord.conf
89989 89875 mongod --auth --dbpath /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_database --port 30269 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
90099 89875 python -c import cryosparc_command.command_core as serv; serv.start(port=30270)
90128 89875 python -c import cryosparc_command.command_vis as serv; serv.start(port=30271)
90132 89875 python -c import cryosparc_command.command_rtp as serv; serv.start(port=30273)
90203 89875 /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
95255 95141 grep cryosparc_
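
For what it's worth, reading the ps output against CRYOSPARC_BASE_PORT=30268, the services sit at fixed offsets: the database on 30269 (base+1), command_core on 30270 (base+2), command_vis on 30271 (base+3), and command_rtp - the real-time processing manager that the curl above reaches - on 30273 (base+5); I assume the web app sits on the base port itself. If regular jobs only need to reach command_core on 30270 while Live jobs also need command_rtp on 30273 (which I assume is the case), a port-specific firewall rule could explain why only Live is affected. A rough sketch of a plain TCP check of all of these ports (the list is just what I read off the ps output above):

# ports taken from the ps output above; 30268 (the base port) is assumed to be the web app
for p in 30268 30269 30270 30271 30273; do
  # plain TCP connect test using bash's /dev/tcp, with a 5 second timeout
  if timeout 5 bash -c "</dev/tcp/cryosparc.cosmic/$p" 2>/dev/null; then
    echo "port $p: open"
  else
    echo "port $p: closed or filtered"
  fi
done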
3:
This time the job was submitted to a compute node called linux1008; I have seen the same issue regardless of which node the job was submitted to.
hostname -f
linux1008
host cryosparc.cosmic
cryosparc.cosmic has address 10.220.0.8
cryosparc.cosmic has address 10.220.0.7
cryosparc.cosmic has address 10.220.0.6
curl http://cryosparc.cosmic:30273
curl: (7) Failed to connect to cryosparc.cosmic port 30273: Connection refused
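
One thing I notice is that cryosparc.cosmic resolves to three different addresses from the compute node, so it may be worth testing both the regular command_core port (30270) and the Live port (30273) against each address individually from linux1008, using the same /dev/tcp test as above. Something along these lines (just a sketch - the IPs and ports are taken from the output above), which I am happy to run and post:

# from linux1008: try command_core (30270) and command_rtp (30273)
# against each address that cryosparc.cosmic resolves to
for ip in 10.220.0.6 10.220.0.7 10.220.0.8; do
  for p in 30270 30273; do
    if timeout 5 bash -c "</dev/tcp/$ip/$p" 2>/dev/null; then
      echo "$ip:$p open"
    else
      echo "$ip:$p closed or filtered"
    fi
  done
done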