Stuck in queue forever

@jhzhu We appreciate your efforts in gathering debugging information. Unfortunately, we could not identify the cause of the problem from the logs. It is possible that some additional job(s) that produce required lower-level inputs did not complete. We suggest starting processing from scratch in a new project.

OK. I started a new project to test, using just the “Extensive Validation” workflow. I still have the same problem.

@jhzhu We unfortunately do not know what is causing this problem. I understand that currently, cryosparc_master services

  1. run within their own slurm job allocation
  2. launch additional slurm jobs,

which, taken together, constitute two “layers” of workload management. I wonder whether simplifying workload management would help in either diagnosing or circumventing the problem.

You could try:
Alternative 1. running cryosparc_master processes outside slurm

  • CryoSPARC jobs would be submitted to slurm

Alternative 2. running cryosparc_master processes as a slurm job

  • would need to ensure that cryosparc_master processes would not be interrupted by slurm
  • would need to ensure that cryosparc_master processes are running on a GPU node
  • would need to configure CryoSPARC in single workstation mode
  • would run jobs on the same host as cryosparc_master processes
  • could simplify setup by setting CRYOSPARC_MASTER_HOSTNAME, CRYOSPARC_HOSTNAME_CHECK and the
    cryosparcw connect --worker parameter all to localhost, as sketched below. (These settings are incompatible with a CryoSPARC instance that has worker nodes in addition to the cryosparc_master node.)
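
A minimal sketch of the localhost-only setup for Alternative 2 (the port number and cache path below are placeholders, not values from any specific installation):

# in cryosparc_master/config.sh
export CRYOSPARC_MASTER_HOSTNAME="localhost"
export CRYOSPARC_HOSTNAME_CHECK="localhost"

# one-time worker registration on the same node, run from the worker installation
/path/to/cryosparc_worker/bin/cryosparcw connect \
    --worker localhost --master localhost --port 39000 \
    --ssd /path/to/local/cache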

@jhzhu Did you try any of the suggestions here? Did any of them work on the cluster, or did you find a better solution for this after 6 months?

We observed the same issues on our cluster. It only happened in versions after v4.4.

However, interestingly, from our observations so far, a “sleeping” queued job will be awakened in two situations:

  • Once, right after freshly restarting the master with cryosparcm restart, but not on a second attempt.
  • If multiple parallel jobs are queued together, cloning and submitting one of them will activate the others.

@wtempel Thanks for the previous suggestions, but unfortunately they’re not feasible in our cluster environment. I’m just curious why it worked before v4.4 but not in versions after that. Thanks for any updates or follow-up on this issue, as we, and probably many other users, depend heavily on a cluster environment.

Welcome to the forum @morganyang422.

Do your CryoSPARC master processes run outside cluster workload management, that is, CryoSPARC jobs are submitted to a cluster-type scheduler lane, but commands like cryosparcm are not run “inside” a cluster job?
Please can you post the output of these commands on the computer where CryoSPARC master processes run:

ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock
ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo

Thanks @wtempel for the quick response. And sorry for any delay, but I had to finish a presentation this morning. To answer your questions:
“Do your CryoSPARC master processes run outside cluster workload management, that is, CryoSPARC jobs are submitted to a cluster-type scheduler lane, but commands like cryosparcm are not run “inside” a cluster job?”
No, given the size of our CryoSPARC user pool on the cluster, it’s not feasible for us to set up dedicated “login nodes” to host the CryoSPARC master processes. Both our CryoSPARC master and worker jobs have to run on compute nodes (as cluster jobs). I think most HPC sites would probably want CryoSPARC to run that way in a cluster environment, and it worked just fine before v4.4.

“Please can you post the output of these commands on the computer where CryoSPARC master processes run:”
“ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock”

/tmp/cryosparc-supervisor-e29cccc7c796955a404d28cb9d299db7.sock
/tmp/mongodb-39023.sock

“ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo”

1908613       1   Oct 30 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
1916547       1   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/supervisord.conf
1916662 1916547   Oct 30 mongod --auth --dbpath /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_database --port 39023 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1916775 1916547   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916793 1916775   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916819 1916547   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916821 1916547   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916834 1916821   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916835 1916819   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916898 1916547   Oct 30 /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
2037000 2036287 15:35:29 grep --color=auto -e cryosparc -e mongo

Let me know if there is anything else I can do to help with the troubleshooting here. THANKS

Interesting. In this case, may I ask you to run those commands again, along with a few additional commands:

ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock
ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo
grep HOSTNAME /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/config.sh
/gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/bin/cryosparcm cli "get_scheduler_targets()"
env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES

Thanks again for the speedy reply @wtempel. Yes, sure:

$ ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock
/tmp/cryosparc-supervisor-e29cccc7c796955a404d28cb9d299db7.sock
/tmp/mongodb-39023.sock
$ ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo
1908613       1   Oct 30 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
1916547       1   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/supervisord.conf
1916662 1916547   Oct 30 mongod --auth --dbpath /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_database --port 39023 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1916775 1916547   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916793 1916775   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916819 1916547   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916821 1916547   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916834 1916821   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916835 1916819   Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916898 1916547   Oct 30 /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
2042350 1910777 19:16:54 grep --color=auto -e cryosparc -e mongo
$ grep HOSTNAME /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/config.sh
export CRYOSPARC_MASTER_HOSTNAME="cn1637"
$ /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/bin/cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {'gpu_type': 'v100x', 'lscratch': '460', 'num_hours': '5', 'ram_gb_multiplier': '1'}, 'desc': None, 'hostname': 'slurm_auto', 'lane': 'slurm_auto', 'name': 'slurm_auto', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ gpu_type }}:{{ num_gpu }}\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm_auto', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': '/lscratch/25547743', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {'gpu_type': 'v100x', 'lscratch': '460', 'num_hours': '5'}, 'desc': None, 'hostname': 'slurm_custom', 'lane': 'slurm_custom', 'name': 'slurm_custom', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("p100")) }}:{{ num_gpu }}\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm_custom', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {}, 'desc': None, 'hostname': 'multi_gpu', 'lane': 'multi_gpu', 'name': 'multi_gpu', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("p100")) }}:{{ num_gpu }}\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'multi_gpu', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {}, 'desc': None, 'hostname': 'single_gpu', 'lane': 'single_gpu', 'name': 'single_gpu', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("p100")) }}:1\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'single_gpu', 'tpl_vars': ['lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {}, 'desc': None, 'hostname': 'mig_gpu', 'lane': 'mig_gpu', 'name': 'mig_gpu', 'qdel_cmd_tpl': '/usr/local/slurmdev/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurmdev/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurmdev/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("a100_1g.10g")) }}:1\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'mig_gpu', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'}]
$ env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
### Nothing was shown by this command, as slurm is provided as a module and loaded when needed:
$ env |  grep slurm
__LMOD_REF_COUNT_PATH=/usr/local/apps/cryosparc/sbin:1;/data/yangr3/apps/dutree/bin:1;/usr/local/slurm/bin:1;/usr/local/bin:2;/usr/X11R6/bin:1;/usr/local/jdk/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1;/usr/local/mysql/bin:1;/home/yangr3/.local/bin:1;/home/yangr3/bin:1
MANPATH=/usr/local/slurm/share/man:/usr/local/lmod/lmod/8.7/share/man:
PATH=/data/yangr3/apps/cryosparc/cryosparc_master/bin:/data/yangr3/apps/cryosparc/cryosparc_worker/bin:/data/yangr3/apps/cryosparc/cryosparc_master/bin:/data/yangr3/apps/cryosparc/cryosparc_worker/bin:/usr/local/apps/cryosparc/sbin:/data/yangr3/apps/dutree/bin:/usr/local/slurm/bin:/usr/local/bin:/usr/X11R6/bin:/usr/local/jdk/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/mysql/bin:/home/yangr3/.local/bin:/home/yangr3/bin
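
For reference, the script_tpl of the slurm_auto target above would render, for a hypothetical one-GPU job with 4 CPUs and 24 GB of requested RAM (project P1, job J10, and the custom_vars shown), to roughly this submission script:

#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J10
#SBATCH --output=/path/to/project/J10/output.txt
#SBATCH --error=/path/to/project/J10/error.txt
#SBATCH --mem=24GB
#SBATCH --time=5:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=gpu
#SBATCH --gres=lscratch:460,gpu:v100x:1
# {{ run_cmd }} expands to the cryosparcw invocation that CryoSPARC generates for the job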

Thanks for posting this information.
Regarding the command

env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES

I failed to provide the proper context.
Please can you run the following commands, specifying in place of <slurm_job_id> the slurm job id under which the CryoSPARC master processes are running:

srun --overlap --jobid=<slurm_job_id> env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
scontrol listpids <slurm_job_id>

Thanks for the info, the result of the suggested commands:

“srun --overlap --jobid=39697325 env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES”

SLURM_JOB_CPUS_PER_NODE=4
SLURM_MEM_PER_NODE=32768
SLURM_NPROCS=1
SLURM_CPUS_ON_NODE=4

“scontrol listpids 39697325”

PID JOBID STEPID LOCALID GLOBALID
1906880 39697325 batch 0 0
1908613 39697325 batch - -
3033299 39697325 batch - -
3048305 39697325 batch - -
-1 39697325 extern 0 0
1906874 39697325 extern - -

Thanks for posting this information. I expected to see PIDs matching your earlier ps output in Stuck in queue forever - #33 by morganyang422, but the PIDs now are different. What is the output of the command

srun --overlap --jobid=39697325 ps -o pid,start,command 1906880 1908613 3033299 3048305 1906874

Thanks, but I couldn’t get info from pid 3048305:

“srun --overlap --jobid=39697325 ps -o pid,start,command 1906880 1908613 3033299 3048305 1906874”

PID STARTED COMMAND
1906874 Oct 30 sleep 100000000
1906880 Oct 30 /bin/bash /var/spool/slurm/slurmd/job39697325/slurm_script
1908613 Oct 30 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
3033299 00:02:25 sleep 86314

Job 39697325 may not be the slurm job we are interested in. Is there a slurm job under which the gunicorn, mongod and supervisord processes mentioned in Stuck in queue forever - #31 by morganyang422 are running? If so, please can you post the outputs of the commands (with the relevant <slurm_job_id>)

srun --overlap --jobid=<slurm_job_id> env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
scontrol listpids <slurm_job_id>

Thanks for the suggestion, but I double-checked: there is no other slurm job that runs the master processes. Only one slurm job is launched for each CryoSPARC master instance. Will keep in touch with any updates.

Please can you share the slurm command and, if applicable, script that launches a CryoSPARC instance on your cluster.

Sure, it’s a set of scripts, but I hope this is helpful:

#!/bin/bash
#SBATCH -c 4 --mem=32g --constraint=x2695 --partition=norm --job-name=cryosparc_master --signal=B:TERM@1200 --output=/home/%u/cryosparc_master.%j.out --comment="cryosparc_master" --time=10-00:00:00

cleanup() {
    ${CRYOSPARC_HOME}/libexec/shutdown.sh >> ${CRYOSPARC_BASE}/sbatch.log

# if all goes well and smoothly, the job will exit with EC=0 and will be marked as COMPLETED
    exit 0
}    
trap cleanup 1 2 3 4 5 6 7 8 9 10 SIGINT SIGTERM

# Set up the environment
source ${CRYOSPARC_HOME}/libexec/enable_proxy.sh
source ${CRYOSPARC_HOME}/libexec/enable_waits.sh
source ${CRYOSPARC_BASE}/config.cnf

# Is cryosparc installed?
[[ -f ${CRYOSPARC_BASE}/cryosparc_master/bin/cryosparcm ]] || { echo "CryoSPARC not installed in ${CRYOSPARC_BASE}"; exit 1; }
[[ -f ${CRYOSPARC_BASE}/cryosparc_master/config.sh ]] || { echo "CryoSPARC not installed in ${CRYOSPARC_BASE}"; exit 1; }
[[ -w ${CRYOSPARC_BASE}/cryosparc_master/config.sh ]] || { echo "CryoSPARC config.sh not writable ${CRYOSPARC_BASE}"; exit 1; }

# Is cryosparc already running?
[[ -f ${CRYOSPARC_BASE}/cryosparc_master/run/supervisord.pid ]] && ps -p $(cat ${CRYOSPARC_BASE}/cryosparc_master/run/supervisord.pid) >& /dev/null && { echo "CryoSPARC already running!"; exit 1; }

# Start up the server
${CRYOSPARC_HOME}/libexec/startup.sh >> ${CRYOSPARC_BASE}/sbatch.log

# Establish tunnel from biowulf to this node
/usr/local/slurm/libexec/tunnelcommand_tcp.sh --cryosparc

# Wait until done
wait_until_done ${SLURM_JOB_ID}
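
(The helper scripts are site-specific and not shown here; for context, wait_until_done essentially keeps the batch script alive so the cleanup trap can run when slurm sends TERM near the time limit. A hypothetical sketch of such a helper, not the actual enable_waits.sh, might be:)

# Hypothetical sketch only; keeps the batch script alive in a signal-interruptible
# way so the TERM sent by --signal=B:TERM@1200 can reach the cleanup trap.
wait_until_done() {
    local jobid="$1"   # slurm job id, available for optional status polling
    sleep 100000000 &  # long-running background sleep...
    wait $!            # ...so "wait" returns as soon as a trapped signal arrives
}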

@morganyang422 Please can you email us a compressed copy of the file

/gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/run/command_core.log

I will send you a direct message with the relevant email address.

@morganyang422 Please can you check if requesting a larger number of CPUs (8 or more) can help with this issue?

@wtempel Thanks for the suggestion. I tested with 16 CPUs and 32 GB memory; it still didn’t solve the problem. The maximum CPU usage reported by our job monitoring tools is 2. Thanks for any further suggestions or comments.

Thanks for trying that. Another thing to try: what is the output of the command

cat /sys/kernel/mm/transparent_hugepage/enabled

? If the output shows
[always] madvise never
you may want to test whether disabling transparent_hugepage (applying the never setting) resolves the problem.
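
For reference, the never setting is commonly applied as sketched below; it requires root on the compute node, so on a shared cluster this would likely need to go through your administrators:

# apply immediately (does not persist across reboots)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# verify
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: always madvise [never]

To make the setting persistent, the kernel boot parameter transparent_hugepage=never can be added to the node’s boot configuration.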