I ran into a similar problem. In my case it seems to be related to GPU overheating. It occurs more often in jobs with intensive GPU use, such as local refinement after symmetry expansion, or 2D classification when only a single GPU is used. When I monitored the GPU temperature with nvidia-smi -l, the GPU stopped working once the temperature reached 87 C, and the job hung forever. If I limit the GPU power to about 2/3 of its maximum and re-run the local refinement, it gets through.
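As an illustration of what I mean (the 250 W cap and device index 0 are illustrative values; check the valid power range for your card first):

```shell
# Poll temperature and power draw every 5 s while a job runs
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit --format=csv -l 5

# Cap the power limit to roughly 2/3 of the board maximum; the 250 W value
# and device index 0 are illustrative. Valid min/max limits are shown by:
#   nvidia-smi -q -d POWER
sudo nvidia-smi -pm 1          # persistence mode, so the limit sticks
sudo nvidia-smi -i 0 -pl 250   # set power limit in watts
```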
There’s also a NU-Refine job that shows the same symptoms now.
Job log is stalled at:
[2024-10-08 4:11:42.90] [CPU: 11.60 GB] Starting particle processing for split A..
[2024-10-08 4:11:43.13] [CPU: 11.60 GB] batch 44 of 189
CPU% is 0, with short periodic bursts of 0.7 (as noted above, I suspect these are the heartbeat responses).
RSS is at 13 GB; there is still a large amount of free memory on the node.
Interestingly, the virtual memory held by the idle process is extremely large at almost 15 TB; I cannot say whether this is normal behaviour.
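In case it helps anyone checking the same numbers, this is the sort of inspection I mean (using the shell's own PID as a stand-in for the worker process PID):

```shell
# Resident (RSS) and virtual (VSZ) memory of a process, in KiB.
# $$ is this shell; substitute the cryosparc worker PID in practice.
ps -o pid,rss,vsz,pcpu,cmd -p $$

# The same figures straight from /proc (VmRSS / VmSize, in kB)
grep -E 'VmRSS|VmSize' /proc/$$/status
```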
In this instance, there were some stack trace entries in the job log when sending SIGABRT:
========= sending heartbeat at 2024-10-08 09:49:51.518028
Received SIGABRT (addr=000000000007669d)
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f0568879953]
/lib64/libpthread.so.0(+0x12cf0)[0x7f05726a4cf0]
/lib64/libc.so.6(syscall+0x1d)[0x7f0571b7c9bd]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(event_wait+0x3f)[0x7f0568879f1f]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(iosys_wait_for_completion+0x19)[0x7f0568885739]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(wrap_iosys_wait_for_completion+0x51)[0x7f05688a7c21]
python(+0x1445a6)[0x56014de465a6]
python(_PyObject_MakeTpCall+0x26b)[0x56014de3fa6b]
python(_PyEval_EvalFrameDefault+0x54a6)[0x56014de3b9d6]
python(_PyFunction_Vectorcall+0x6c)[0x56014de46a2c]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/engine/newengine.cpython-310-x86_64-linux-gnu.so(+0x7f06b)[0x7f053677d06b]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/ctf_refinement/run_local.cpython-310-x86_64-linux-gnu.so(+0x2929d)[0x7f05353bb29d]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xcd14)[0x7f0572d43d14]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/ctf_refinement/run_local.cpython-310-x86_64-linux-gnu.so(+0x19455)[0x7f05353ab455]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xcd14)[0x7f0572d43d14]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/refine/newrun.cpython-310-x86_64-linux-gnu.so(+0x1b0ee)[0x7f0566ad80ee]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/refine/newrun.cpython-310-x86_64-linux-gnu.so(+0x2248ba)[0x7f0566ce18ba]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x20e91)[0x7f0572d57e91]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x12c31)[0x7f0572d49c31]
python(_PyEval_EvalFrameDefault+0x4c12)[0x56014de3b142]
python(+0x1d7c60)[0x56014ded9c60]
python(PyEval_EvalCode+0x87)[0x56014ded9ba7]
python(+0x20812a)[0x56014df0a12a]
python(+0x203523)[0x56014df05523]
python(PyRun_StringFlags+0x7d)[0x56014defd91d]
python(PyRun_SimpleStringFlags+0x3c)[0x56014defd75c]
python(Py_RunMain+0x26b)[0x56014defc66b]
python(Py_BytesMain+0x37)[0x56014decd1f7]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f0571b7dd85]
python(+0x1cb0f1)[0x56014decd0f1]
rax fffffffffffffffc rbx 0000560156a8e3d8 rcx 00007f0571b7c9bd rdx 0000000000000001
rsi 0000000000000000 rdi 0000560156a8e3d8 rbp 00007ffe083adb20 rsp 00007ffe083adaf8
r8 0000000000000000 r9 0000000000000000 r10 0000000000000000 r11 0000000000000246
r12 0000560156a8e3d0 r13 000056014e4bf640 r14 0000000000000000 r15 0000000000000000
01 f0 ff ff 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84
00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c
8b 4c 24 08 0f 05
--> 48 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66
0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 0b 00 00 00 0f 05 48 3d 01 f0 ff ff 73 01 c3 48
8b 0d 6d 54 38 00 f7 d8
Hope it helps!
-René
We are also seeing the same problem, even with transparent hugepages disabled. This abnormal stall seems to occur only in 2D classification, not in other job types. When the stall happened, we checked job.log and found the following messages:
“cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 6 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected”
No problem, and sorry for the long output: we have a handful of cluster lanes that existed before custom cluster lane variables (in this case, ram_gb_multiplier) could be set and were kept as sensible defaults, as well as special lanes for CryoSPARC Live, jobs needing large cache space, and jobs using a different cache location.
$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a', 'lane': 'MaRC3a', 'name': 'MaRC3a', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## {{ ram_gb_multiplier }} - custom memory multiplier\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem-per-cpu={{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}MB \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH 
--partition=owner_synmikro,normal_gpu\n#SBATCH --constraint="[ssd_2tb|ssd_8tb]"\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\nexport OMP_NUM_THREADS=$SLURM_NTASKS\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'ram_gb_multiplier', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_CSLive', 'lane': 'MaRC3a_CSLive', 'name': 'MaRC3a_CSLive', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparclive_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb*3000)|int }}MB \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro\n#SBATCH --reservation=cryosparclive\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_CSLive', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': 
'/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_largemem', 'lane': 'MaRC3a_largemem', 'name': 'MaRC3a_largemem', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem-per-cpu={{ (ram_gb*3125)|int }}MB \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH 
--partition=owner_synmikro,normal_gpu\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_largemem', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_smallmem', 'lane': 'MaRC3a_smallmem', 'name': 'MaRC3a_smallmem', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb*2000)|int }}MB \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_smallmem', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': 
'/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch_shared/cryosparkuser', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_sharedcache', 'lane': 'MaRC3a_sharedcache', 'name': 'MaRC3a_sharedcache', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}MB \n#SBATCH -o {{ job_dir_abs 
}}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_sharedcache', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'ram_gb_multiplier', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_largessd', 'lane': 'MaRC3a_largessd', 'name': 'MaRC3a_largessd', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. 
\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## {{ ram_gb_multiplier }} - custom memory multiplier\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }},ssdexcl:1\n#SBATCH --mem-per-cpu={{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}MB \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --constraint=ssd_8tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n if [[ -z "$available_devs" ]] ; then\n available_devs=$devidx\n else\n available_devs=$available_devs,$devidx\n fi\n fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_largessd', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 
'ram_gb_multiplier', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
Additional information: streaming 2D classification in CryoSPARC Live is running normally so far.
I had the same problem in Version 4.6.0, but downgrading to 4.5.3 solved it.
Me too, downgrading to 4.5.3 solved it.
We also have problems in v4.6.0 in some projects with 2D classification jobs that run “forever”. The only common feature of these problematic 2D classification jobs that I’ve noticed so far is a large number of apparently random noise 2D classes. However, the user of one of these projects told me that 2D classification finished quickly before v4.6.0.
Therefore, I’m also considering downgrading to v4.5.3. I still hope that the developers will soon come up with a patch that solves this issue.
I confirm the bug.
Our users just use a workaround: split the particles into smaller sets and run them separately.
@dirk @maxim Do you submit your CryoSPARC jobs to a cluster? If so, please can you
- post the output of the command
cryosparcm cli "get_scheduler_targets()"
- indicate the name of the relevant cluster scheduler lane (as configured inside CryoSPARC)
Yes, the users submit jobs to a SLURM queuing system with three queues (2gpu, 4gpu, 6gpu), with the number of available GPUs per SLURM node in the queue name. The abnormally long jobs were running on the 4gpu and 6gpu queues. The longest 2D classification with v4.6.0 ran on the 4gpu queue for ~120 h with fewer than 1 million particles.
Here is the output of the get_scheduler_targets command:
$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scratch', 'cache_quota_mb': 3400000, 'cache_reserve_mb': 10240, 'custom_var_names': ['slurmnode'], 'custom_vars': {}, 'desc': None, 'hostname': '4gpu', 'lane': '4gpu', 'name': '4gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash -e\n#\n# cryoSPARC script for SLURM submission with sbatch\n#\n# 24-01-2022 Dirk Kostrewa Original file\n# 23-05-2023 Dirk Kostrewa Added cluster submission script variable "slurmnode"\n# 01-03-2024 Dirk Kostrewa cgroups: no CUDA_VISIBLE_DEVICES, no --gres-flags=enforce-binding\n# 06-03-2024 Dirk Kostrewa Double memory allocation in "--mem="\n\n#SBATCH --partition=4gpu\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mail-type=NONE\n#SBATCH --mail-user={{ cryosparc_username }}\n#SBATCH --nodelist={{ slurmnode }}\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm 4gpu', 'tpl_vars': ['job_log_path_abs', 'cluster_job_id', 'project_uid', 'ram_gb', 'cryosparc_username', 'num_gpu', 'num_cpu', 'run_cmd', 'command', 'slurmnode', 'job_uid'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': '/scratch', 'cache_quota_mb': 3400000, 'cache_reserve_mb': 10240, 'custom_var_names': ['slurmnode'], 'custom_vars': {}, 'desc': None, 'hostname': '6gpu', 'lane': '6gpu', 'name': '6gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash -e\n#\n# cryoSPARC script for SLURM submission with sbatch\n#\n# 24-01-2022 Dirk Kostrewa Original file\n# 23-05-2023 Dirk Kostrewa Added cluster submission script variable "slurmnode"\n# 01-03-2024 Dirk Kostrewa cgroups: no CUDA_VISIBLE_DEVICES, no --gres-flags=enforce-binding\n# 06-03-2024 Dirk Kostrewa Double memory allocation in "--mem="\n\n#SBATCH --partition=6gpu\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mail-type=NONE\n#SBATCH --mail-user={{ cryosparc_username }}\n#SBATCH --nodelist={{ slurmnode }}\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm 6gpu', 'tpl_vars': ['job_log_path_abs', 'cluster_job_id', 'project_uid', 'ram_gb', 'cryosparc_username', 'num_gpu', 'num_cpu', 'run_cmd', 'command', 'slurmnode', 'job_uid'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw'},
{'cache_path': '/scratch', 'cache_quota_mb': 3400000, 'cache_reserve_mb': 10240, 'custom_var_names': ['slurmnode'], 'custom_vars': {}, 'desc': None, 'hostname': '2gpu', 'lane': '2gpu', 'name': '2gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/bin/bash -e\n#\n# cryoSPARC script for SLURM submission with sbatch\n#\n# 24-01-2022 Dirk Kostrewa Original file\n# 23-05-2023 Dirk Kostrewa Added cluster submission script variable "slurmnode"\n# 01-03-2024 Dirk Kostrewa cgroups: no CUDA_VISIBLE_DEVICES, no --gres-flags=enforce-binding\n# 06-03-2024 Dirk Kostrewa Double memory allocation in "--mem="\n\n#SBATCH --partition=2gpu\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task={{ num_cpu }}\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mail-type=NONE\n#SBATCH --mail-user={{ cryosparc_username }}\n#SBATCH --nodelist={{ slurmnode }}\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'slurm 2gpu', 'tpl_vars': ['job_log_path_abs', 'cluster_job_id', 'project_uid', 'ram_gb', 'cryosparc_username', 'num_gpu', 'num_cpu', 'run_cmd', 'command', 'slurmnode', 'job_uid'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw'}]
@dirk Thanks for sharing the scheduler target information. Please can you also let us know
- the output of the command
cat /sys/kernel/mm/transparent_hugepage/enabled
on your cluster worker nodes
- whether CPU resources available to jobs are constrained according to the
#SBATCH --cpus-per-task=
parameter
We are still looking into this. One other thing anyone encountering this issue could do to help us is to share your /etc/slurm/slurm.conf and /etc/slurm/cgroup.conf files (or the corresponding files, if located elsewhere). They can be DM’d to me or posted in this thread. There may be some dependence on certain details of the cluster configuration.
Thanks
Dear wtempel and hsnyder,
- The output of the command “cat /sys/kernel/mm/transparent_hugepage/enabled” is: [always] madvise never
I do not plan to change this, since cryoSPARC worked well with this setting in v4.5.3, and no user who reported changing this parameter had any success with it.
- I use “#SBATCH --cpus-per-task={{ num_cpu }}” in our cluster_script.sh and “ConstrainCores=yes” in cgroup.conf; again, this worked well with v4.5.3.
I will send our SLURM configuration files in a separate direct message to hsnyder.
Meanwhile, the complaints of our users force me to downgrade to v4.5.3. When the issues with the current v4.6.0 have been solved, I will upgrade cryoSPARC again.
Best regards,
Dirk
For troubleshooting job stalls such as those described in this topic, we still recommend testing whether the stall is resolved by disabling THP.
It is unclear to us at this time whether the new IO subsystem introduced in v4.6.0 is affected by THP differently than earlier implementations of the IO subsystem.
For stalls observed for GPU-accelerated CryoSPARC jobs on a CryoSPARC instance where
- the job was queued to a workload manager like slurm,
- CPU resources available to the job are actually restricted to the resources requested, and,
- the stall is not resolved by disabling THP:
one may try to increase the number of requested CPUs. We received feedback that the increase resolved the stall for jobs submitted with a modified script template.
For example, one might replace a line
#SBATCH --cpus-per-task={{ num_cpu }}
in the current template
with
{% set increased_num_cpu = 8 -%}
#SBATCH --cpus-per-task={{ [1, num_cpu, [increased_num_cpu*num_gpu, increased_num_cpu]|min]|max }}
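To spell out the arithmetic: the Jinja expression above evaluates max(1, num_cpu, min(increased_num_cpu*num_gpu, increased_num_cpu)). A plain-shell sketch of the same computation, with illustrative job sizes:

```shell
# Shell equivalent of the Jinja expression:
#   max(1, num_cpu, min(increased_num_cpu*num_gpu, increased_num_cpu))
# The job sizes below are illustrative.
num_cpu=4
num_gpu=2
increased_num_cpu=8

floor=$(( increased_num_cpu * num_gpu ))
if [ "$floor" -gt "$increased_num_cpu" ]; then floor=$increased_num_cpu; fi

cpus=$num_cpu
if [ "$floor" -gt "$cpus" ]; then cpus=$floor; fi
if [ "$cpus" -lt 1 ]; then cpus=1; fi

echo "cpus-per-task: $cpus"   # cpus-per-task: 8
```

So a 2-GPU job that originally asked for 4 CPUs gets bumped to 8, while a job already requesting more than increased_num_cpu keeps its original request.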
This recommendation is based on user feedback. Please be aware that we have not yet reproduced stalls that were resolved by an increase of the number of requested CPUs beyond {{ num_cpu }}. Our recommendation to increase the number of requested CPUs is subject to change, pending user feedback and our own testing. Please report your observations regarding the --cpus-per-task= (or equivalent) parameter in this thread.
Special considerations may apply to GPU-accelerated VMs in the cloud that are used with a cluster workload manager like slurm. A suitable cloud-based VM may have (just) the required number of virtual cores, but custom VM and/or workload manager settings may or may not be required in order for those virtual cores to be “recognized” for the purpose of CPU allocations.
Disabling THP requires a reboot of all GPU servers in our SLURM cluster, and this would affect other software users as well.
Anyway, I had to downgrade cryoSPARC to v4.5.3 in order to restore a working cryoSPARC environment for our more than 30 users.
Best regards,
Dirk
Just out of curiosity/naiveté, why does switching transparent_hugepage from madvise to never require a reboot in your cluster environment? Are the worker systems PXE booting an immutable environment? Thanks in advance.
A reboot is only required if you want to completely get rid of THP, which is what I would have done (see here and jump to “To disable THP at run time”).
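For reference, the run-time toggle referred to there looks roughly like this (a sketch using the standard sysfs paths; root is required, and the change does not survive a reboot):

```shell
# Show the current mode; the bracketed entry is the active one,
# e.g. "[always] madvise never"
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable THP at run time (lost on reboot; to fully get rid of it, also make
# the setting persistent, e.g. via the kernel command line
# transparent_hugepage=never)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```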
Best regards,
Dirk
If you’re currently set to madvise, you can add the following line to your cryosparc_master/config.sh and cryosparc_worker/config.sh, which may help:
export NUMPY_MADVISE_HUGEPAGE=0
This tells numpy not to request huge pages (which it does by default). Note that this will not work if the system-wide THP setting is always. We are considering making this the default in the future, given the number of users who have had hugepage-related problems.
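As a sketch of where that line goes (the /opt/cryosparc install prefix is illustrative; adjust it to your own layout):

```shell
# Append the setting to both config files; restart CryoSPARC afterwards.
# The install prefix /opt/cryosparc is illustrative.
echo 'export NUMPY_MADVISE_HUGEPAGE=0' >> /opt/cryosparc/cryosparc_master/config.sh
echo 'export NUMPY_MADVISE_HUGEPAGE=0' >> /opt/cryosparc/cryosparc_worker/config.sh
```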
This is not true. Many users in this thread reported that disabling THP did not fix it, which led to the discovery of another category of stalls related to SLURM. Of all the stall reports related to v4.6, most have been resolved for users who disabled THP. We have also reproduced the THP-related stalls internally; they are very reliably reproducible.
My recommendation is still that the first thing anyone experiencing stalls should try is disabling THP: either system-wide or, if that is undesirable and the system-wide setting is madvise, via the environment variable I mentioned above.