CryoSPARC v4.6.0 2D job never finishes

I ran into a similar problem. In my case it seems to be related to GPU overheating. It occurs more often in jobs that use the GPU intensively, such as local refinement after symmetry expansion, or 2D classification running on a single GPU. When I monitored the GPU temperature with nvidia-smi -l, I saw that once the GPU reached 87 °C it stopped working and the job hung forever. If I limit the GPU power to about 2/3 of its maximum and re-run the local refinement, it gets through.
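For anyone who wants to try the same mitigation, this is a minimal sketch of the monitoring and power-capping commands. The 300 W maximum is a placeholder for illustration; on a real node you would read it from `power.max_limit`, and setting the limit requires root.

```shell
# Watch temperature and power draw (refresh every 5 s):
# nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit --format=csv -l 5

# Cap the power limit to roughly 2/3 of the board's maximum.
# On a real node, MAX_W would come from:
#   nvidia-smi --query-gpu=power.max_limit --format=csv,noheader,nounits
MAX_W=300                 # placeholder value, adjust for your GPU
CAP_W=$(( MAX_W * 2 / 3 ))
echo "Setting power limit to ${CAP_W} W"
# sudo nvidia-smi -i 0 -pl "$CAP_W"   # requires root; uncomment to apply
```

Note that the cap persists only until the driver is reloaded unless persistence mode is enabled.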

There’s also a NU-Refine job that shows the same symptoms now.
Job log is stalled at:

[2024-10-08 4:11:42.90] [CPU:  11.60 GB] Starting particle processing for split A..
[2024-10-08 4:11:43.13] [CPU:  11.60 GB] batch 44 of 189

CPU% is 0, with short periodic bursts of 0.7% (as noted above, I suspect these are the heartbeat responses).
RSS is at 13 GB, and there is still a large amount of free memory on the node.
Interestingly, the virtual memory held by the idle process is extremely large at almost 15 TB; I cannot say whether this is normal behaviour.
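For others wanting to collect the same numbers from a stalled worker, this is a rough sketch; the PID here is just the current shell so the commands run as-is, but you would substitute the stalled cryosparc worker's PID.

```shell
# Substitute the PID of the stalled cryosparc worker process.
PID=$$    # current shell used here only so the example is runnable

# One-shot view of CPU%, resident set size (RSS) and virtual size (VSZ), in kB:
ps -o pid,stat,pcpu,rss,vsz -p "$PID"

# Kernel-level view of the same (Linux only; guarded in case /proc is absent):
[ -r "/proc/$PID/status" ] && grep -E 'VmRSS|VmSize' "/proc/$PID/status" || true

# To make CryoSPARC print the stacktrace shown below, send SIGABRT:
# kill -ABRT "$PID"    # this terminates the job; uncomment deliberately
```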

In this instance, there were some stacktrace entries in the job log when sending the SIGABRT:

========= sending heartbeat at 2024-10-08 09:49:51.518028
Received SIGABRT (addr=000000000007669d)
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f0568879953]
/lib64/libpthread.so.0(+0x12cf0)[0x7f05726a4cf0]
/lib64/libc.so.6(syscall+0x1d)[0x7f0571b7c9bd]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(event_wait+0x3f)[0x7f0568879f1f]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(iosys_wait_for_completion+0x19)[0x7f0568885739]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/core.so(wrap_iosys_wait_for_completion+0x51)[0x7f05688a7c21]
python(+0x1445a6)[0x56014de465a6]
python(_PyObject_MakeTpCall+0x26b)[0x56014de3fa6b]
python(_PyEval_EvalFrameDefault+0x54a6)[0x56014de3b9d6]
python(_PyFunction_Vectorcall+0x6c)[0x56014de46a2c]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/engine/newengine.cpython-310-x86_64-linux-gnu.so(+0x7f06b)[0x7f053677d06b]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/ctf_refinement/run_local.cpython-310-x86_64-linux-gnu.so(+0x2929d)[0x7f05353bb29d]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xcd14)[0x7f0572d43d14]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/ctf_refinement/run_local.cpython-310-x86_64-linux-gnu.so(+0x19455)[0x7f05353ab455]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0xcd14)[0x7f0572d43d14]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/refine/newrun.cpython-310-x86_64-linux-gnu.so(+0x1b0ee)[0x7f0566ad80ee]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/refine/newrun.cpython-310-x86_64-linux-gnu.so(+0x2248ba)[0x7f0566ce18ba]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x20e91)[0x7f0572d57e91]
/home/cryosparkuser/cryosparc/cryosparc_worker/cryosparc_compute/run.cpython-310-x86_64-linux-gnu.so(+0x12c31)[0x7f0572d49c31]
python(_PyEval_EvalFrameDefault+0x4c12)[0x56014de3b142]
python(+0x1d7c60)[0x56014ded9c60]
python(PyEval_EvalCode+0x87)[0x56014ded9ba7]
python(+0x20812a)[0x56014df0a12a]
python(+0x203523)[0x56014df05523]
python(PyRun_StringFlags+0x7d)[0x56014defd91d]
python(PyRun_SimpleStringFlags+0x3c)[0x56014defd75c]
python(Py_RunMain+0x26b)[0x56014defc66b]
python(Py_BytesMain+0x37)[0x56014decd1f7]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f0571b7dd85]
python(+0x1cb0f1)[0x56014decd0f1]
rax fffffffffffffffc  rbx 0000560156a8e3d8  rcx 00007f0571b7c9bd  rdx 0000000000000001  
rsi 0000000000000000  rdi 0000560156a8e3d8  rbp 00007ffe083adb20  rsp 00007ffe083adaf8  
r8  0000000000000000  r9  0000000000000000  r10 0000000000000000  r11 0000000000000246  
r12 0000560156a8e3d0  r13 000056014e4bf640  r14 0000000000000000  r15 0000000000000000  
01 f0 ff ff 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84
00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c
8b 4c 24 08 0f 05
-->   48 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66
0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 0b 00 00 00 0f 05 48 3d 01 f0 ff ff 73 01 c3 48
8b 0d 6d 54 38 00 f7 d8

Hope it helps!
-René

@sittr Please can you post the output of the command

cryosparcm cli "get_scheduler_targets()"

We are seeing the same problem, even with transparent hugepages disabled. The hang seems to occur only in 2D classification, not in other job types. When we checked job.log after a hang, we found the following:

cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 6 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
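For anyone checking their own stalled jobs, the warnings can be counted directly in job.log with grep. This sketch writes a stand-in log to a temp file so it runs as-is; point JOB_LOG at the real job directory instead.

```shell
# Stand-in for <project_dir>/<job_uid>/job.log so this runs anywhere:
JOB_LOG=$(mktemp)
printf '%s\n' \
  'NumbaPerformanceWarning: Grid size 6 will likely result in GPU under-utilization' \
  'UserWarning: Cannot manually free CUDA array; will be freed when garbage collected' \
  > "$JOB_LOG"

# Count occurrences of each warning; both are generally benign on their own.
grep -c 'NumbaPerformanceWarning' "$JOB_LOG"
grep -c 'Cannot manually free CUDA array' "$JOB_LOG"
```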

No problem - sorry for the long output. We have a handful of cluster lanes that predate the ability to set custom cluster lane variables (in this case, ram_gb_multiplier) and were kept with sensible defaults, as well as special lanes for CryoSPARC Live, jobs needing large cache space, and jobs using a different caching location.

$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a', 'lane': 'MaRC3a', 'name': 'MaRC3a', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## {{ ram_gb_multiplier }} - custom memory multiplier\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ 
num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem-per-cpu={{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --constraint="[ssd_2tb|ssd_8tb]"\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\nexport OMP_NUM_THREADS=$SLURM_NTASKS\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'ram_gb_multiplier', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_CSLive', 'lane': 'MaRC3a_CSLive', 'name': 'MaRC3a_CSLive', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparclive_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb*3000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro\n#SBATCH --reservation=cryosparclive\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_CSLive', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 
'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_largemem', 'lane': 'MaRC3a_largemem', 'name': 'MaRC3a_largemem', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - 
cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem-per-cpu={{ (ram_gb*3125)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_largemem', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_smallmem', 'lane': 'MaRC3a_smallmem', 'name': 'MaRC3a_smallmem', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of 
GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb*2000)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_smallmem', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 
'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch_shared/cryosparkuser', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_sharedcache', 'lane': 'MaRC3a_sharedcache', 'name': 'MaRC3a_sharedcache', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --mem={{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --prefer=ssd_2tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_sharedcache', 'tpl_vars': 
['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'ram_gb_multiplier', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'MaRC3a_largessd', 'lane': 'MaRC3a_largessd', 'name': 'MaRC3a_largessd', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## {{ ram_gb_multiplier }}  - custom memory multiplier\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --nodes=1\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }},ssdexcl:1\n#SBATCH --mem-per-cpu={{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}MB             \n#SBATCH -o {{ job_dir_abs }}/job.out\n#SBATCH -e {{ job_dir_abs }}/job.err\n#SBATCH --partition=owner_synmikro,normal_gpu\n#SBATCH --constraint=ssd_8tb\n\navailable_devs=""\nfor devidx in $(seq 0 3);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\nexport LC_ALL=C\n\n{{ run_cmd }}\n\n', 
'send_cmd_tpl': '{{ command }}', 'title': 'MaRC3a_largessd', 'tpl_vars': ['worker_bin_path', 'run_args', 'project_dir_abs', 'ram_gb', 'num_cpu', 'project_uid', 'job_log_path_abs', 'job_uid', 'job_dir_abs', 'num_gpu', 'cryosparc_username', 'cluster_job_id', 'ram_gb_multiplier', 'job_creator', 'run_cmd', 'command'], 'type': 'cluster', 'worker_bin_path': '/home/cryosparkuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
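As a sanity check of the template above, the `--mem-per-cpu` Jinja expression `{{ (ram_gb|float*1000*(ram_gb_multiplier|default(1))|float)|int }}` can be reproduced in plain shell. The values 24 and the default multiplier of 1 are example inputs only; in practice ram_gb is supplied by CryoSPARC at submission time.

```shell
# Example inputs; ram_gb is normally filled in by CryoSPARC.
RAM_GB=24
MULT=${RAM_GB_MULTIPLIER:-1}   # mirrors Jinja's |default(1)

# ram_gb * 1000 * multiplier, truncated to an integer, as in the template:
MEM_MB=$(awk -v r="$RAM_GB" -v m="$MULT" 'BEGIN { printf "%d", r * 1000 * m }')
echo "--mem-per-cpu=${MEM_MB}MB"
```

Running this with RAM_GB_MULTIPLIER unset should print `--mem-per-cpu=24000MB`, confirming the lane requests 1000 MB per GB requested by the job.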

Additional information: streaming 2D classification in CryoSPARC Live is running normally so far.

I had the same problem in Version 4.6.0, but downgrading to 4.5.3 solved it.

Me too, downgrading to 4.5.3 solved it.

We also have problems in v4.6.0 in some projects with 2D classification jobs that run "forever". The only common feature of these problematic 2D classification jobs that I've noticed so far is a large number of apparently random-noise 2D classes. However, the user of one of these projects told me that 2D classification finished quickly before v4.6.0.
Therefore, I'm also considering downgrading to v4.5.3. I still hope the developers will come up with a patch that solves this issue soon.

I confirm the bug.

As a workaround, our users divide the particles into smaller sets and run them separately.

@dirk @maxim Do you submit your CryoSPARC jobs to a cluster? If so, please can you

  1. post the output of the command
    cryosparcm cli "get_scheduler_targets()"
    
  2. indicate the name of the relevant cluster scheduler lane (as configured inside CryoSPARC)