Followed your instructions and sent you the log file by email. J70 still waited in the queue.
After turning off the logging and a cryosparcm restart, J70 started running.
Thanks for sending the logs. Please can you also post the outputs of these commands:
cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
cryosparcm cli "get_job('P16', 'J69', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d956481a656dde2f4a856', 'completed_at': 'Fri, 19 Jan 2024 21:54:13 GMT', 'created_at': 'Tue, 09 Jan 2024 18:50:12 GMT', 'job_type': 'restack_particles', 'launched_at': 'Fri, 19 Jan 2024 21:49:33 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:31 GMT', 'started_at': 'Fri, 19 Jan 2024 21:50:10 GMT', 'uid': 'J69'}
cryosparcm cli "get_job('P16', 'J70', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '659d95ca81a656dde2f4e484', 'completed_at': None, 'created_at': 'Tue, 09 Jan 2024 18:51:54 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Fri, 19 Jan 2024 21:59:22 GMT', 'project_uid': 'P16', 'queued_at': 'Fri, 19 Jan 2024 21:49:37 GMT', 'started_at': 'Fri, 19 Jan 2024 22:01:23 GMT', 'uid': 'J70'}
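For reference, the gap between J70 being queued and launched can be computed from the timestamps above, e.g. with GNU date (a quick sketch; assumes GNU coreutils):
queued='Fri, 19 Jan 2024 21:49:37 GMT'
launched='Fri, 19 Jan 2024 21:59:22 GMT'
echo $(( $(date -d "$launched" +%s) - $(date -d "$queued" +%s) )) seconds   # 585 seconds, i.e. just under 10 minutes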
Interesting. Please can you also run this command:
cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
cryosparcm cli "get_job('P16', 'J53', 'job_type', 'created_at', 'queued_at', 'launched_at', 'started_at', 'completed_at')"
{'_id': '6596bf229906ac8299b73c58', 'completed_at': 'Tue, 02 Jan 2024 03:58:58 GMT', 'created_at': 'Tue, 02 Jan 2024 03:12:59 GMT', 'job_type': 'nonuniform_refine_new', 'launched_at': 'Tue, 02 Jan 2024 03:13:53 GMT', 'project_uid': 'P16', 'queued_at': 'Tue, 02 Jan 2024 03:13:52 GMT', 'started_at': 'Tue, 02 Jan 2024 03:14:06 GMT', 'uid': 'J53'}
@jhzhu We appreciate your efforts in gathering debugging information. Unfortunately, we could not identify the cause of the problem from the logs. It is possible that some additional job(s) whose lower level inputs were required did not complete. We suggest starting processing from scratch in a new project.
OK. I started a new project to test, just using the “Extensive Validation” workflow. Still have the same problem.
@jhzhu We unfortunately do not know what is causing this problem. I understand that currently, cryosparc_master services themselves run under slurm while CryoSPARC jobs are also submitted to slurm, which, taken together, constitutes two “layers” of workload management. I wonder whether simplifying workload management would help in either diagnosing or circumventing the problem.
You could try:
Alternative 1: running cryosparc_master processes outside slurm.
Alternative 2: running cryosparc_master processes as a slurm job, such that:
- cryosparc_master processes would not be interrupted by slurm
- cryosparc_master processes are running on a GPU node
- cryosparc_master processes run with CRYOSPARC_MASTER_HOSTNAME, CRYOSPARC_HOSTNAME_CHECK and the cryosparcw connect --worker parameter all set to localhost (see the sketch below). (These settings are incompatible for a CryoSPARC instance with worker nodes in addition to the cryosparc_master node.)
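For illustration, the localhost-based setup of Alternative 2 might look roughly like the following sketch; the base port, paths and ssd cache location are placeholders that must match your installation:
# in cryosparc_master/config.sh
export CRYOSPARC_MASTER_HOSTNAME="localhost"
export CRYOSPARC_HOSTNAME_CHECK="localhost"
# then, from inside the same slurm job on the GPU node, (re)connect the worker against localhost
/path/to/cryosparc_worker/bin/cryosparcw connect --worker localhost --master localhost --port 39000 --ssdpath /path/to/local/ssd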
@jhzhu Did you try any of the suggestions here? Do any of them work on the cluster, or did you find a better solution for this after 6 months?
We observed the same issues on our cluster. It only happened in versions after v4.4.
However, interestingly, from our observation so far the “sleeping” queued job will be awakened under two situations:
@wtempel Thanks for the previous suggestions, but unfortunately they’re not feasible under our cluster environment. I’m just curious why it worked before v4.4 but not in versions after that. Thanks for any updates or follow-up on this issue, as we, and perhaps many other users, depend heavily on cluster environments.
Welcome to the forum @morganyang422.
Do your CryoSPARC master processes run outside cluster workload management, that is, CryoSPARC jobs are submitted to a cluster-type scheduler lane, but commands like cryosparcm are not run “inside” a cluster job?
Please can you post the output of these commands on the computer where CryoSPARC master processes run:
ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock
ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo
Thanks @wtempel for the quick response. And sorry for any delay, but I had to finish a presentation this morning. To answer your questions:
“Do your CryoSPARC master processes run outside cluster workload management, that is, CryoSPARC jobs are submitted to a cluster-type scheduler lane, but commands like cryosparcm are not run “inside” a cluster job?”
No. Given the size of our CryoSPARC user pool on the cluster, it’s not feasible for us to set up dedicated “login nodes” to host CryoSPARC master processes. Both our CryoSPARC master and worker jobs have to run on compute nodes (as cluster jobs). I think most HPC sites would want CryoSPARC to run that way in a cluster environment. And it worked just fine before v4.4.
“Please can you post the output of these commands on the computer where CryoSPARC master processes run:”
“ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock”
/tmp/cryosparc-supervisor-e29cccc7c796955a404d28cb9d299db7.sock
/tmp/mongodb-39023.sock
“ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo”
1908613 1 Oct 30 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
1916547 1 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/supervisord.conf
1916662 1916547 Oct 30 mongod --auth --dbpath /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_database --port 39023 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1916775 1916547 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916793 1916775 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916819 1916547 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916821 1916547 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916834 1916821 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916835 1916819 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916898 1916547 Oct 30 /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
2037000 2036287 15:35:29 grep --color=auto -e cryosparc -e mongo
Let me know if there is anything else I can do to help with the troubleshooting here. THANKS
Interesting. In this case, may I ask you to run those commands again, plus a few additional commands:
ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock
ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo
grep HOSTNAME /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/config.sh
/gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/bin/cryosparcm cli "get_scheduler_targets()"
env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
Thanks again for the speedy reply @wtempel. Yes, sure:
$ ls -1 /tmp/mongo*.sock /tmp/cryosparc*.sock
/tmp/cryosparc-supervisor-e29cccc7c796955a404d28cb9d299db7.sock
/tmp/mongodb-39023.sock
$ ps -eo pid,ppid,start,command | grep -e cryosparc -e mongo
1908613 1 Oct 30 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
1916547 1 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/supervisord.conf
1916662 1916547 Oct 30 mongod --auth --dbpath /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_database --port 39023 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
1916775 1916547 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916793 1916775 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:39024 cryosparc_command.command_core:start() -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916819 1916547 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916821 1916547 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916834 1916821 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:39027 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916835 1916819 Oct 30 python /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:39025 -c /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/gunicorn.conf.py
1916898 1916547 Oct 30 /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
2042350 1910777 19:16:54 grep --color=auto -e cryosparc -e mongo
$ grep HOSTNAME /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/config.sh
export CRYOSPARC_MASTER_HOSTNAME="cn1637"
$ /gpfs/gsfs12/users/yangr3/apps/cryosparc/cryosparc_master/bin/cryosparcm cli "get_scheduler_targets()"
[{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {'gpu_type': 'v100x', 'lscratch': '460', 'num_hours': '5', 'ram_gb_multiplier': '1'}, 'desc': None, 'hostname': 'slurm_auto', 'lane': 'slurm_auto', 'name': 'slurm_auto', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ gpu_type }}:{{ num_gpu }}\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm_auto', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': '/lscratch/25547743', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {'gpu_type': 'v100x', 'lscratch': '460', 'num_hours': '5'}, 'desc': None, 'hostname': 'slurm_custom', 'lane': 'slurm_custom', 'name': 'slurm_custom', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("p100")) }}:{{ num_gpu }}\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm_custom', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {}, 'desc': None, 'hostname': 'multi_gpu', 'lane': 'multi_gpu', 'name': 'multi_gpu', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("p100")) }}:{{ num_gpu }}\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'multi_gpu', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {}, 'desc': None, 'hostname': 'single_gpu', 'lane': 'single_gpu', 'name': 'single_gpu', 'qdel_cmd_tpl': '/usr/local/slurm/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurm/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurm/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("p100")) }}:1\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'single_gpu', 'tpl_vars': ['lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': None, 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['lscratch', 'ram_gb_multiplier', 'gpu_type', 'num_hours'], 'custom_vars': {}, 'desc': None, 'hostname': 'mig_gpu', 'lane': 'mig_gpu', 'name': 'mig_gpu', 'qdel_cmd_tpl': '/usr/local/slurmdev/bin/scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': "/usr/local/slurmdev/bin/sinfo --format='%.8N %.6D %.10P %.6T %.14C %.5c %.6z %.7m %.7G %.9d %20E'", 'qstat_cmd_tpl': '/usr/local/bin/sjobs --squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': '/usr/local/slurmdev/bin/sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_dir_abs }}/output.txt\n#SBATCH --error={{ job_dir_abs }}/error.txt\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}GB\n#SBATCH --time={{ (num_hours|default(4))|int }}:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --cpus-per-task=1\n{%- if num_gpu == 0 %}\n#SBATCH --partition=norm\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }}\n{%- else %}\n#SBATCH --partition=gpu\n#SBATCH --gres=lscratch:{{ (lscratch|default(100))|int }},gpu:{{ (gpu_type|default("a100_1g.10g")) }}:1\n{%- endif %}\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'mig_gpu', 'tpl_vars': ['num_gpu', 'lscratch', 'command', 'project_uid', 'run_cmd', 'ram_gb_multiplier', 'num_cpu', 'job_uid', 'gpu_type', 'num_hours', 'ram_gb', 'cluster_job_id', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/data/yangr3/apps/cryosparc/cryosparc_worker/bin/cryosparcw'}]
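For context, when a job is queued to the slurm_auto lane above, its script_tpl renders to an sbatch script roughly like the one below. This is a hypothetical rendering for a 1-GPU, 4-CPU job with ram_gb=24 (P1/J10 are placeholder uids) and the lane’s custom_vars (gpu_type=v100x, lscratch=460, num_hours=5, ram_gb_multiplier=1); {{ job_dir_abs }} and {{ run_cmd }} are filled in by CryoSPARC per job and are left unexpanded here:
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J10
#SBATCH --output={{ job_dir_abs }}/output.txt
#SBATCH --error={{ job_dir_abs }}/error.txt
#SBATCH --mem=24GB
#SBATCH --time=5:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=gpu
#SBATCH --gres=lscratch:460,gpu:v100x:1
{{ run_cmd }}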
$ env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
### Nothing was shown by this command, as slurm is run as a module and loaded only when needed:
$ env | grep slurm
__LMOD_REF_COUNT_PATH=/usr/local/apps/cryosparc/sbin:1;/data/yangr3/apps/dutree/bin:1;/usr/local/slurm/bin:1;/usr/local/bin:2;/usr/X11R6/bin:1;/usr/local/jdk/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1;/usr/local/mysql/bin:1;/home/yangr3/.local/bin:1;/home/yangr3/bin:1
MANPATH=/usr/local/slurm/share/man:/usr/local/lmod/lmod/8.7/share/man:
PATH=/data/yangr3/apps/cryosparc/cryosparc_master/bin:/data/yangr3/apps/cryosparc/cryosparc_worker/bin:/data/yangr3/apps/cryosparc/cryosparc_master/bin:/data/yangr3/apps/cryosparc/cryosparc_worker/bin:/usr/local/apps/cryosparc/sbin:/data/yangr3/apps/dutree/bin:/usr/local/slurm/bin:/usr/local/bin:/usr/X11R6/bin:/usr/local/jdk/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/mysql/bin:/home/yangr3/.local/bin:/home/yangr3/bin
Thanks for posting this information.
Regarding the command env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES, I failed to provide the proper context.
Please can you run these commands, specifying for each of them, in place of <slurm_job_id>, the slurm job id under which the CryoSPARC master processes are running:
srun --overlap --jobid=<slurm_job_id> env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
scontrol listpids <slurm_job_id>
Thanks for the info. The results of the suggested commands:
“srun --overlap --jobid=39697325 env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES”
SLURM_JOB_CPUS_PER_NODE=4
SLURM_MEM_PER_NODE=32768
SLURM_NPROCS=1
SLURM_CPUS_ON_NODE=4
“scontrol listpids 39697325”
PID JOBID STEPID LOCALID GLOBALID
1906880 39697325 batch 0 0
1908613 39697325 batch - -
3033299 39697325 batch - -
3048305 39697325 batch - -
-1 39697325 extern 0 0
1906874 39697325 extern - -
Thanks for posting this information. I expected to see PIDs matching your earlier ps output in Stuck in queue forever - #33 by morganyang422, but the PIDs now are different. What is the output of the command:
srun --overlap --jobid=39697325 ps -o pid,start,command 1906880 1908613 3033299 3048305 1906874
Thanks, but I couldn’t get info from pid 3048305:
“srun --overlap --jobid=39697325 ps -o pid,start,command 1906880 1908613 3033299 3048305 1906874”
PID STARTED COMMAND
1906874 Oct 30 sleep 100000000
1906880 Oct 30 /bin/bash /var/spool/slurm/slurmd/job39697325/slurm_script
1908613 Oct 30 /bin/bash /usr/local/apps/cryosparc/libexec/auto_archive.sh
3033299 00:02:25 sleep 86314
Job 39697325 may not be the slurm job we are interested in. Is there a slurm job under which the gunicorn, mongod and supervisord processes mentioned in Stuck in queue forever - #31 by morganyang422 are running? If so, please can you post the outputs of the commands (with the relevant <slurm_job_id>):
srun --overlap --jobid=<slurm_job_id> env | grep SLURM_ | grep -e NPROC -e CPUS -e MEM -e TRES
scontrol listpids <slurm_job_id>
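If it is unclear which slurm job (if any) those processes belong to, one way to check, assuming your cluster uses slurm’s cgroup-based task tracking, is to look for a job_<id> component in the cgroup path of one of the processes, for example the supervisord PID from your earlier output:
grep -o 'job_[0-9]*' /proc/1916547/cgroup | sort -u
Alternatively, squeue -u $USER -o '%A %j %N %S' lists your running jobs with their nodes and start times, which can be compared against the process start times in the ps output.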
Thanks for the suggestion, but I double-checked: there is no other slurm job running the master processes. Only one slurm job is launched per CryoSPARC master instance. Will keep in touch with any updates.
Please can you share the slurm command and, if applicable, script that launches a CryoSPARC instance on your cluster.