Stuck in queue forever

I followed your instructions and sent you the log file by email. J70 still waited in the queue.

After turning off the logging and running cryosparcm restart, J70 started running.

Thanks for sending the logs. Can you please also post

  1. outputs of the commands
    cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
    cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
    
  2. the expanded Inputs section of job J70.
cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d956481a656dde2f4a856', 'completed_at': 'Fri, 19 Jan 2024 21:54:13 GMT', 'created_at': 'Tue, 09 Jan 2024 18:50:12 GMT', 'job_type': 'restack_particles', 'launched_at': 'Fri, 19 Jan 2024 21:49:33 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:31 GMT', 'started_at': 'Fri, 19 Jan 2024 21:50:10 GMT', 'uid': 'J69'}
cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d95ca81a656dde2f4e484', 'completed_at': None, 'created_at': 'Tue, 09 Jan 2024 18:51:54 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Fri, 19 Jan 2024 21:59:22 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:37 GMT', 'started_at': 'Fri, 19 Jan 2024 22:01:23 GMT', 'uid': 'J70'}

Interesting. Please can you also run this command:

cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"

cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '6596bf229906ac8299b73c58', 'completed_at': 'Tue, 02 Jan 2024 03:58:58 GMT', 'created_at': 'Tue, 02 Jan 2024 03:12:59 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Tue, 02 Jan 2024 03:13:53 GMT', 'project_uid': 'P16', 'queued_at': 'Tue, 02 Jan 2024 03:13:52 GMT', 'started_at': 'Tue, 02 Jan 2024 03:14:06 GMT', 'uid': 'J53'}
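For reference, the difference between queued_at and launched_at in the outputs above shows how long each job sat in the queue. A quick Python sketch (timestamps copied verbatim from the outputs posted in this thread) makes the contrast explicit: J53 and J69 launched within seconds, while J70 waited close to ten minutes.

```python
from datetime import datetime

# Format used by the get_job outputs above (GMT timestamps).
FMT = "%a, %d %b %Y %H:%M:%S %Z"

# (queued_at, launched_at) pairs copied from the posted outputs.
jobs = {
    "J53": ("Tue, 02 Jan 2024 03:13:52 GMT", "Tue, 02 Jan 2024 03:13:53 GMT"),
    "J69": ("Fri, 19 Jan 2024 21:49:31 GMT", "Fri, 19 Jan 2024 21:49:33 GMT"),
    "J70": ("Fri, 19 Jan 2024 21:49:37 GMT", "Fri, 19 Jan 2024 21:59:22 GMT"),
}

for uid, (queued, launched) in jobs.items():
    wait = datetime.strptime(launched, FMT) - datetime.strptime(queued, FMT)
    print(f"{uid}: waited {wait.total_seconds():.0f} s in the queue")
```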

@jhzhu We appreciate your efforts in gathering debugging information. Unfortunately, we could not identify the cause of the problem from the logs. It is possible that some additional job(s) whose lower-level inputs were required did not complete. We suggest starting processing from scratch in a new project.

OK. I started a new project to test, using just the “Extensive Validation” workflow. I still have the same problem.

@jhzhu We unfortunately do not know what is causing this problem. I understand that currently, cryosparc_master services

  1. run within their own slurm job allocation
  2. launch additional slurm jobs,

which, taken together, constitute two “layers” of workload management. I wonder whether simplifying workload management would help in either diagnosing or circumventing the problem.

You could try:
Alternative 1. running cryosparc_master processes outside slurm

  • CryoSPARC jobs would be submitted to slurm

Alternative 2. running cryosparc_master processes as a slurm job

  • would need to ensure that cryosparc_master processes would not be interrupted by slurm
  • would need to ensure that cryosparc_master processes are running on a GPU node
  • would need to configure CryoSPARC in single workstation mode
  • would run jobs on the same host as cryosparc_master processes
  • could simplify setup by setting CRYOSPARC_MASTER_HOSTNAME, CRYOSPARC_HOSTNAME_CHECK, and the
    cryosparcw connect --worker parameter all to localhost. (These settings are incompatible with a CryoSPARC instance that has worker nodes in addition to the cryosparc_master node.)
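A minimal sketch of what an Alternative 2 submission script could look like. This is an illustration, not a tested recipe: the partition name "gpu", the install paths under $HOME, and the time limit are all assumptions you would need to adapt to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=cryosparc_master
#SBATCH --partition=gpu          # assumption: your GPU partition may be named differently
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --time=7-00:00:00        # long enough that slurm does not interrupt the master
#SBATCH --no-requeue             # avoid slurm requeueing/restarting the master mid-run

# Single workstation mode: master and worker on the same host.
export CRYOSPARC_MASTER_HOSTNAME=localhost
export CRYOSPARC_HOSTNAME_CHECK=localhost

# Assumed install path; adjust to your installation.
~/cryosparc_master/bin/cryosparcm start

# One-time worker connection on this host (normally done once during setup):
# ~/cryosparc_worker/bin/cryosparcw connect --worker localhost --master localhost

# Keep the allocation alive while the master services run.
sleep infinity
```

Note that if the allocation ends (time limit, node failure), the master services stop with it, which is why the time limit and --no-requeue settings matter here.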