Ab-init running forever

cameronf · April 1, 2026, 6:46pm

We’ve got Ab-Init jobs that have been running for days but don’t seem to be doing anything. The logs show only heartbeats, and the jobs never progress past Iteration 0. There are some strange warnings in the logs that feel concerning, and then nothing but heartbeats after that:

2026-03-30 10:57:25,193 core run_with_executo INFO | SSD cache complete
2026-03-30 10:57:26,301 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:57:36,325 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:57:46,350 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:57:56,374 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:58:06,398 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:58:16,423 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:58:26,447 core heartbeat INFO | ========= Updating heartbeat
WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade
gpufft: creating new cufft plan (plan id 0 pid 529260)
gpu_id 0
ndims 2
dims 256 256 0
inembed 256 256 0
istride 1
idist 65536
onembed 256 256 0
ostride 1
odist 65536
batch 10
type C2C
wkspc automatic
Python traceback:

HOST ALLOCATION FUNCTION: using numba.cuda.pinned_array
/standard/takcryoem/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(35): warning #68-D: integer conversion resulted in a change of sign
my_nan_count += __shfl_xor_sync(-1, my_nan_count, x);
^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

kernel(44): warning #68-D: integer conversion resulted in a change of sign
my_nan_count += __shfl_xor_sync(-1, my_nan_count, x);
^

kernel(17): warning #177-D: variable "N_I" was declared but never referenced
unsigned N_I = gridDim.x;
^

warnings.warn(msg)
/standard/takcryoem/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/threading.py:1075: RuntimeWarning: divide by zero encountered in scalar divide
my_nan_count += __shfl_xor_sync(-1, my_nan_count, x);
self.run()
/standard/takcryoem/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/threading.py:1075: RuntimeWarning: invalid value encountered in scalar divide
self.run()
/standard/takcryoem/cryosparc/cryosparc_worker/cli/cryosparcw.py:287: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
return run(conf)
/standard/takcryoem/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/threading.py:1075: RuntimeWarning: divide by zero encountered in scalar divide
self.run()
/standard/takcryoem/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/threading.py:1075: RuntimeWarning: invalid value encountered in scalar divide
self.run()
/standard/takcryoem/cryosparc/cryosparc_worker/cli/cryosparcw.py:287: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
return run(conf)
2026-03-30 10:58:36,472 core heartbeat INFO | ========= Updating heartbeat
gpufft: creating new cufft plan (plan id 1 pid 529260)
gpu_id 0
ndims 2
dims 256 256 0
inembed 256 256 0
istride 1
idist 65536
onembed 256 256 0
ostride 1
odist 65536
batch 90
type C2C
wkspc automatic
Python traceback:
/standard/takcryoem/cryosparc/cryosparc_worker/cli/cryosparcw.py:287: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
return run(conf)
2026-03-30 10:58:46,496 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:58:56,521 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:59:06,546 core heartbeat INFO | ========= Updating heartbeat

wtempel · April 1, 2026, 7:27pm

@cameronf Could you please post additional job information? Replace P99 and J199 with the actual IDs of a reconstruction job that is stuck:

project_uid="P99"
job_uid="J199"
cryosparcm joblog $project_uid $job_uid | head -n 30
cryosparcm eventlog $project_uid $job_uid | head -n 30
cryosparcm eventlog $project_uid $job_uid | tail -n 30
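
Heartbeat lines are written roughly every ten seconds, so they can crowd the informative lines out of a 30-line excerpt. A minimal sketch of a filter that drops them, assuming the "Updating heartbeat" wording seen in the job log earlier in this thread (`strip_heartbeats` is a hypothetical helper name, not a CryoSPARC command):

```shell
# Hypothetical helper: drop heartbeat lines from a job log read on stdin.
# Assumes heartbeat lines contain the text "Updating heartbeat",
# as in the log excerpt posted in this thread.
strip_heartbeats() {
    grep -v 'Updating heartbeat'
}

# Intended usage (needs a CryoSPARC master; left commented out here):
#   cryosparcm joblog "$project_uid" "$job_uid" | strip_heartbeats | head -n 30
```

The filter only changes what is displayed; it does not touch job.log itself.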

cameronf · April 1, 2026, 7:49pm

You bet…

~$ cryosparcm joblog $project_uid $job_uid | head -n 30
DeprecationWarning: The command 'joblog' is deprecated.
⚠ "cryosparcm joblog" is deprecated. Use "cryosparcm job log" instead
================= CRYOSPARC =================
Project P4 Job J412
Master udc-an33-12c0 Port 60100
===========================================================================
MAIN PROCESS PID 529260
2026-03-30 10:50:03,927 core monitor INFO | MONITOR PROCESS PID 530580
2026-03-30 10:50:05,268 core monitor INFO | ========= monitor process now waiting for main process
2026-03-30 10:50:05,268 core heartbeat INFO | ========= Updating heartbeat
================= CRYOSPARC =================
Project P4 Job J412
Master udc-an33-12c0 Port 60100
===========================================================================
MAIN PROCESS PID 529260
========= updating job startup information at 2026-03-30 10:50:55.037495
2026-03-30 10:51:05,227 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:51:15,324 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:51:25,349 core heartbeat INFO | ========= Updating heartbeat
========= now starting main process at 2026-03-30 10:51:11.118167
2026-03-30 10:51:29,488 core run INFO | Running job J412 of type homo_abinit
2026-03-30 10:51:29,488 core run INFO | Running job on hostname slurm-any-gpu
2026-03-30 10:51:29,493 core run INFO | Allocated Resources: lane='slurm-any-gpu' lane_type='cluster' hostname='slurm-any-gpu' target=SchedulerTarget(cache_path='/scratch/npa2pc/cryosparc_cache', cache_reserve_mb=None, cache_quota_mb=None, lane='slurm-any-gpu', name='slurm-any-gpu', title='slurm-any-gpu', desc=None, hostname='slurm-any-gpu', worker_bin_path='/standard/takcryoem/cryosparc/cryosparc_worker/bin/cryosparcw', config=Cluster(send_cmd_tpl='{{ command }}', qsub_cmd_tpl='/opt/slurm/current/bin/sbatch {{ script_path_abs }}', qstat_cmd_tpl='/opt/slurm/current/bin/squeue -j {{ cluster_job_id }}', qdel_cmd_tpl='/opt/slurm/current/bin/scancel {{ cluster_job_id }}', qinfo_cmd_tpl='/opt/slurm/current/bin/sinfo', type='cluster', script_tpl='#!/bin/bash\n#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH --comment="created by {{ cryosparc_username }}"\n#SBATCH --output={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.out\n#SBATCH --error={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.err\n#SBATCH --ntasks={{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --partition=gpu\n#SBATCH --mem={{ (ram_gb*2)|int }}G\n#SBATCH --time=3-00:00:00\n\n{{ run_cmd }}\n\n', custom_vars={}, tpl_vars=['cryosparc_username', 'ram_gb', 'job_dir_abs', 'num_cpu', 'project_uid', 'num_gpu', 'command', 'job_uid', 'run_cmd', 'cluster_job_id'], custom_var_names=[])) slots=ResourceSlots(CPU=[0, 1], GPU=[0], RAM=[0]) fixed=FixedResourceSlots(SSD=True) licenses_acquired=1
2026-03-30 10:51:35,373 core heartbeat INFO | ========= Updating heartbeat
#SBATCH --comment="created by uday@virginia.edu"
2026-03-30 10:51:45,397 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:51:55,421 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:52:05,445 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:52:15,470 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:52:25,495 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:52:35,520 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:52:45,544 core heartbeat INFO | ========= Updating heartbeat
2026-03-30 10:52:55,569 core heartbeat INFO | ========= Updating heartbeat
~$ cryosparcm eventlog $project_uid $job_uid | head -n 30
DeprecationWarning: The command 'eventlog' is deprecated.
⚠ "cryosparcm eventlog" is deprecated. Use "cryosparcm job events" instead
[2026-03-30 14:45:27] [166 MB] License is valid.
[2026-03-30 14:45:27] [166 MB] Launching job on lane slurm-any-gpu target slurm-any-gpu ...
[2026-03-30 14:45:28] [166 MB] Launching job on cluster slurm-any-gpu
[2026-03-30 14:45:28] [166 MB] template args: {
"project_uid": "P4",
"job_uid": "J412",
"job_creator": "uday_tak",
"cryosparc_username": "uday@virginia.edu",
"project_dir_abs": "/standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008",
"job_dir_abs": "/standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008/J412",
"job_log_path_abs": "/standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008/J412/job.log",
"job_type": "homo_abinit",
"worker_bin_path": "/standard/takcryoem/cryosparc/cryosparc_worker/bin/cryosparcw",
"num_gpu": 1,
"num_cpu": 2,
"ram_gb": 8,
"run_cmd": "/standard/takcryoem/cryosparc/cryosparc_worker/bin/cryosparcw run --project P4 --job J412 --master udc-an33-12c0 --port 60100 --timeout 20000 --auth >> /standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008/J412/job.log 2>&1 ",
"run_args": "--project P4 --job J412 --master udc-an33-12c0 --port 60100 --timeout 20000 --auth",
"script_path_abs": "/standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008/J412/queue_sub_script.sh",
"cluster_job_id": null
}
[2026-03-30 14:45:28] [166 MB]
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P4_J412
#SBATCH --comment="created by uday@virginia.edu"
#SBATCH --output=/standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008/J412/P4_J412_slurm.out
#SBATCH --error=/standard/takcryoem/Datasets/cryo008_btcap14/CS-cryo008/J412/P4_J412_slurm.err
~$ cryosparcm eventlog $project_uid $job_uid | tail -n 30
DeprecationWarning: The command 'eventlog' is deprecated.
⚠ "cryosparcm eventlog" is deprecated. Use "cryosparcm job events" instead
[2026-03-30 14:58:30] [2915 MB] Generating random initial density for class 2
[2026-03-30 14:58:31] [2923 MB] Generating random initial density for class 3
[2026-03-30 14:58:31] [2868 MB] Generating random initial density for class 4
[2026-03-30 14:58:32] [2863 MB] Done in 10.531s.
[2026-03-30 14:58:32] [FIGURE] Generated initialization!
[asset file="J412_generated_initialization.png" id="69ca8f9834a56d511811435b"]
[asset file="J412_generated_initialization.pdf" id="69ca8f9834a56d511811435d"]
[2026-03-30 14:58:32] [2888 MB] Applying spherical window to 211.07A diameter (falloff to 245.84A)
[2026-03-30 14:58:32] [2888 MB] ( Radius 0.85 to 0.99 )
[2026-03-30 14:58:33]
[2026-03-30 14:58:38] [FIGURE] Structure for Class 000 Iteration 0
[asset file="J412_structure_for_class_000_iteration_0.png" id="69ca8f9d34a56d5118114379"]
[asset file="J412_structure_for_class_000_iteration_0.pdf" id="69ca8f9e34a56d511811437b"]
[2026-03-30 14:58:38] [FIGURE] Structure for Class 001 Iteration 0
[asset file="J412_structure_for_class_001_iteration_0.png" id="69ca8f9e34a56d511811437e"]
[asset file="J412_structure_for_class_001_iteration_0.pdf" id="69ca8f9e34a56d5118114380"]
[2026-03-30 14:58:39] [FIGURE] Structure for Class 002 Iteration 0
[asset file="J412_structure_for_class_002_iteration_0.png" id="69ca8f9e34a56d5118114383"]
[asset file="J412_structure_for_class_002_iteration_0.pdf" id="69ca8f9f34a56d5118114385"]
[2026-03-30 14:58:39] [FIGURE] Structure for Class 003 Iteration 0
[asset file="J412_structure_for_class_003_iteration_0.png" id="69ca8f9f34a56d5118114388"]
[asset file="J412_structure_for_class_003_iteration_0.pdf" id="69ca8f9f34a56d511811438a"]
[2026-03-30 14:58:39] [FIGURE] Structure for Class 004 Iteration 0
[asset file="J412_structure_for_class_004_iteration_0.png" id="69ca8f9f34a56d511811438d"]
[asset file="J412_structure_for_class_004_iteration_0.pdf" id="69ca8f9f34a56d511811438f"]
[2026-03-30 14:58:39] [3047 MB] ----------- Iteration 0 (epoch 0.000). radwn 20.69 resolution 12.00A minisize 90 beta 0.10
[2026-03-30 14:58:42] [3207 MB] Estimating noise model to meet target of 98.17 poses..
Current ESS R is 8924.39 poses..
Current sigma is 0.10 New sigma is 0.10

wtempel · April 1, 2026, 8:33pm

Thanks @cameronf. The preamble of your job log is longer than I expected, so the excerpt is still missing some information. Could you please post the output of this command:

cryosparcm job log P4 J412 | grep hugepage

cameronf · April 1, 2026, 9:10pm

I’m assuming you want to see this:

Transparent hugepages setting: [always] madvise never

As well, I see a warning in the UI:

Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs.

Which obviously sounds suspicious.

wtempel · April 1, 2026, 10:12pm

It is possible that changing transparent_hugepage.enabled on the worker node to madvise or never resolves the job stall (discussion). This change requires approval by and cooperation from your cluster admin team (and root privilege on the worker node).
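
For anyone wanting to confirm the setting directly on a node, here is a minimal sketch. It reads the standard Linux sysfs file for transparent hugepages and reports the active mode, which is the bracketed word in a line such as "[always] madvise never". The helper name `active_thp_mode` is hypothetical. The check has to run on the worker (compute) node where the job executes, which on a cluster is usually not the master.

```shell
# Hypothetical helper: print the active THP mode (the bracketed word)
# from a status line such as "[always] madvise never".
active_thp_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "THP mode on $(hostname): $(active_thp_mode "$(cat "$thp_file")")"
else
    echo "THP status file not readable on this host"
fi

# Changing the mode requires root and lasts only until the next reboot:
#   echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# One way to query a Slurm compute node (assumes the "gpu" partition named
# in the submission script in this thread is the right one for your site):
#   srun --partition=gpu cat /sys/kernel/mm/transparent_hugepage/enabled
```

A persistent change is typically made by the admin team, for example through the `transparent_hugepage=` kernel boot parameter; how that is rolled out depends on the site.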

cameronf · April 1, 2026, 10:49pm

I was afraid you’d say that - I’m not sure I’m going to be able to convince IT to do this. Did something change between v4 and v5 that causes this issue? Things were working fine on 4. I may need to downgrade, unless you guys are planning on addressing this in 5….

cameronf · April 9, 2026, 4:29pm

FWIW - I downgraded to 4.7.1 and everything is running smoothly again…. Not sure how long I can stay on that version but…..

wtempel · April 13, 2026, 4:47pm

Given your description of the problem and system settings:

- We would prioritize testing whether changing transparent_hugepage.enabled to madvise or never resolves the stall.
- We have no plan to implement a workaround for the transparent_hugepage.enabled always setting. CryoSPARC is not the only application for which problems with this setting have been reported.

We cannot rule out other causes for the stall. For example, you may test whether including in your cryosparc_worker/config.sh the line

export CRYOSPARC_NO_PAGELOCK="true"

(guide) provides relief from the stalls.
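
A minimal sketch of adding that line without duplicating it on repeated runs. The helper name `add_no_pagelock` is hypothetical, and the config.sh path in the usage comment is inferred from the worker_bin_path in the logs above, so adjust it to your installation.

```shell
# Hypothetical helper: append the CRYOSPARC_NO_PAGELOCK export to a worker
# config.sh exactly once. Takes the path to config.sh as its only argument.
add_no_pagelock() {
    config_file=$1
    line='export CRYOSPARC_NO_PAGELOCK="true"'
    if [ ! -f "$config_file" ]; then
        echo "no such file: $config_file" >&2
        return 1
    fi
    if grep -qxF "$line" "$config_file"; then
        echo "already set in $config_file"
        return 0
    fi
    # Make sure the existing file ends with a newline before appending.
    if [ -s "$config_file" ] && [ -n "$(tail -c1 "$config_file")" ]; then
        echo >> "$config_file"
    fi
    printf '%s\n' "$line" >> "$config_file"
    echo "added to $config_file"
}

# Usage (path inferred from this thread's logs; an assumption):
#   add_no_pagelock /standard/takcryoem/cryosparc/cryosparc_worker/config.sh
```

Only jobs launched after the change would be expected to pick up the variable, so a stalled job would need to be killed and requeued to test it.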