Thanks very much - I did think that this might be an issue with port access, but I'm curious why it would only cause problems for CryoSPARC Live. Does Live use different ports from the rest of the processing? I am attaching the output from the requested commands below:
1: I'm not sure which file this text is stored in - it is visible in the GUI as the reason for job failure in the brief window between a job failing and being re-submitted in a Live session.
2:
hostname -f
cryosparc
cryosparcm status | grep -v LICENSE
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master
Current cryoSPARC version: v4.4.1
----------------------------------------------------------------------------
CryoSPARC process status:
app RUNNING pid 90184, uptime 3 days, 0:47:32
app_api RUNNING pid 90203, uptime 3 days, 0:47:30
app_api_dev STOPPED Not started
command_core RUNNING pid 90099, uptime 3 days, 0:47:56
command_rtp RUNNING pid 90132, uptime 3 days, 0:47:43
command_vis RUNNING pid 90128, uptime 3 days, 0:47:45
database RUNNING pid 89989, uptime 3 days, 0:48:04
----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------
global config variables:
export CRYOSPARC_MASTER_HOSTNAME="cryosparc.cosmic"
export CRYOSPARC_DB_PATH="/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=30268
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_MONGO_EXTRA_FLAGS=""
export CRYOSPARC_FORCE_HOSTNAME=true
cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu', 'lane': 'cosmic gpu', 'name': 'cosmic gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -c {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu
#SBATCH --mem={{ (ram_gb*2000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu large', 'lane': 'cosmic gpu large', 'name': 'cosmic gpu large', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -c {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu_large
#SBATCH --mem=180000
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu large', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic cpu', 'lane': 'cosmic cpu', 'name': 'cosmic cpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH -c {{ num_cpu }}
#SBATCH -p cpu
#SBATCH --mem={{ (ram_gb*2000)|int }}MB
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic cpu', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'cosmic gpu live', 'lane': 'cosmic gpu live', 'name': 'cosmic gpu live', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## {{ run_cmd }} - the complete command string to run the job
## {{ num_cpu }} - the number of CPUs needed
## {{ num_gpu }} - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## {{ ram_gb }} - the amount of RAM needed in GB
## {{ job_dir_abs }} - absolute path to the job directory
## {{ project_dir_abs }} - absolute path to the project dir
## {{ job_log_path_abs }} - absolute path to the log file for the job
## {{ worker_bin_path }} - absolute path to the cryosparc worker command
## {{ run_args }} - arguments to be passed to cryosparcw run
## {{ project_uid }} - uid of the project
## {{ job_uid }} - uid of the job
## {{ job_creator }} - name of the user that created the job (may contain spaces)
## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={%if num_cpu == 0%}1{%else%}{{ num_cpu }}{%endif%}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH -p gpu_large
#SBATCH --mem=180000
#SBATCH -o {{ job_dir_abs }}/slurm.out
#SBATCH -e {{ job_dir_abs }}/slurm.err
export PYCUDA_CACHE_DIR=/tmp/cudacache_{{ job_uid }}_$$
mkdir -p $PYCUDA_CACHE_DIR
export LD_LIBRARY_PATH=/mnt/service/software/packages/gcc/gcc-5.5.0/lib/x86_64-linux-gnu/:/mnt/service/software/packages/gcc/gcc-5.5.0/lib:/mnt/service/software/packages/gcc/gcc-10.3.0/lib64:/mnt/service/software/packages/cuda/cuda-11.4.1/lib64:/mnt/service/software/lib
{{ run_cmd }}
rm -r $PYCUDA_CACHE_DIR
', 'send_cmd_tpl': '{{ command }}', 'title': 'cosmic gpu live', 'tpl_vars': ['run_cmd', 'ram_gb', 'project_uid', 'run_args', 'job_uid', 'worker_bin_path', 'job_log_path_abs', 'job_creator', 'cryosparc_username', 'num_cpu', 'num_gpu', 'command', 'project_dir_abs', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_worker/bin/cryosparcw'}]
curl http://cryosparc.cosmic:30273
Hello World from cryosparc real-time processing manager.
host cryosparc.cosmic
the host command is not available on the CryoSPARC master - this is a minimal install in a Kubernetes container
ps -eo pid,ppid,cmd | grep cryosparc_
89875 1 python /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master/supervisord.conf
89989 89875 mongod --auth --dbpath /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_database --port 30269 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
90099 89875 python -c import cryosparc_command.command_core as serv; serv.start(port=30270)
90128 89875 python -c import cryosparc_command.command_vis as serv; serv.start(port=30271)
90132 89875 python -c import cryosparc_command.command_rtp as serv; serv.start(port=30273)
90203 89875 /mnt/beegfs/software/structural_biology/release/cryosparc/ocms0072/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
95255 95141 grep cryosparc_
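
For what it's worth, reading the ps output against CRYOSPARC_BASE_PORT=30268, the services sit at fixed offsets: the database on 30269 (base+1), command_core on 30270 (base+2), command_vis on 30271 (base+3), and command_rtp - the real-time processing manager that the curl above reaches - on 30273 (base+5); I assume the web app sits on the base port itself. If regular jobs only need to reach command_core on 30270 while Live jobs also need command_rtp on 30273 (which I assume is the case), a port-specific firewall rule could explain why only Live is affected. A rough sketch of a plain TCP check of all of these ports (the list is just what I read off the ps output above):

# ports taken from the ps output above; 30268 (the base port) is assumed to be the web app
for p in 30268 30269 30270 30271 30273; do
  # plain TCP connect test using bash's /dev/tcp, with a 5 second timeout
  if timeout 5 bash -c "</dev/tcp/cryosparc.cosmic/$p" 2>/dev/null; then
    echo "port $p: open"
  else
    echo "port $p: closed or filtered"
  fi
done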
3:
This time the job was submitted to a compute node called linux1008; I have seen the same issue regardless of which node the job was submitted to.
hostname -f
linux1008
host cryosparc.cosmic
cryosparc.cosmic has address 10.220.0.8
cryosparc.cosmic has address 10.220.0.7
cryosparc.cosmic has address 10.220.0.6
curl http://cryosparc.cosmic:30273
curl: (7) Failed to connect to cryosparc.cosmic port 30273: Connection refused
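
One thing I notice is that cryosparc.cosmic resolves to three different addresses from the compute node, so it may be worth testing both the regular command_core port (30270) and the Live port (30273) against each address individually from linux1008, using the same /dev/tcp test as above. Something along these lines (just a sketch - the IPs and ports are taken from the output above), which I am happy to run and post:

# from linux1008: try command_core (30270) and command_rtp (30273)
# against each address that cryosparc.cosmic resolves to
for ip in 10.220.0.6 10.220.0.7 10.220.0.8; do
  for p in 30270 30273; do
    if timeout 5 bash -c "</dev/tcp/$ip/$p" 2>/dev/null; then
      echo "$ip:$p open"
    else
      echo "$ip:$p closed or filtered"
    fi
  done
done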