I’m running a particle subtraction job on about 500k particles. About halfway through the job, it seems to stall, with the final step saying “Processed 254667 of 479016 images.” The job doesn’t display any error, but keeps running. I have already restarted both the job and the cluster this job is running on, after it previously had the same issue. It also stopped about halfway through last time, though not at exactly the same number of images.
When you observe a job stall, please can you
- run these commands on the relevant worker node and post their outputs:
  uname -a
  free -h
  cat /sys/kernel/mm/transparent_hugepage/enabled
- post the 10 latest lines from the job log (under Metadata|Log in the UI for that job)
- if you are using an external cluster manager (slurm or similar), post the stdout and stderr for the cluster job.
Here’s what we get with those 3 commands:
(almalinux 9.6) root@dcc-dhvi-strucbio-gpu-03 [production] ~ # uname -a
Linux dcc-dhvi-strucbio-gpu-03 5.14.0-570.22.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 19 08:10:32 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
(almalinux 9.6) root@dcc-dhvi-strucbio-gpu-03 [production] ~ # free -h
total used free shared buff/cache available
Mem: 304Gi 12Gi 3.7Gi 551Mi 291Gi 292Gi
Swap: 3.9Gi 254Mi 3.7Gi
(almalinux 9.6) root@dcc-dhvi-strucbio-gpu-03 [production] ~ # cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
For the 10 last lines (I restarted the job again, so the exact stopping point is different than before):
========= sending heartbeat at 2026-01-30 13:36:29.710564
========= sending heartbeat at 2026-01-30 13:36:39.734717
========= sending heartbeat at 2026-01-30 13:36:49.759870
========= sending heartbeat at 2026-01-30 13:36:59.783926
========= sending heartbeat at 2026-01-30 13:37:09.812326
========= sending heartbeat at 2026-01-30 13:37:19.833431
========= sending heartbeat at 2026-01-30 13:37:29.857872
========= sending heartbeat at 2026-01-30 13:37:39.875926
========= sending heartbeat at 2026-01-30 13:37:49.900818
========= sending heartbeat at 2026-01-30 13:37:59.926769
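The heartbeats above arrive at roughly 10 s intervals, so the worker process is alive but idle. As a quick sanity check, a small helper of our own (not part of CryoSPARC) can parse the timestamps and confirm the cadence:

```python
from datetime import datetime

def heartbeat_intervals(lines):
    """Return the gaps, in seconds, between consecutive
    'sending heartbeat at ...' log lines."""
    stamps = []
    for line in lines:
        if "sending heartbeat at" in line:
            ts = line.split("sending heartbeat at", 1)[1].strip()
            stamps.append(datetime.fromisoformat(ts))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

log = [
    "========= sending heartbeat at 2026-01-30 13:36:29.710564",
    "========= sending heartbeat at 2026-01-30 13:36:39.734717",
    "========= sending heartbeat at 2026-01-30 13:36:49.759870",
]
print(heartbeat_intervals(log))  # gaps of ~10 s each
```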
Before these, the last 10 lines would have been:
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
gpufft: creating new cufft plan (plan id 2 pid 45460)
gpu_id 0
ndims 3
dims 400 400 400
inembed 400 400 402
istride 1
idist 64320000
onembed 400 400 201
ostride 1
odist 32160000
batch 1
type R2C
wkspc automatic
Python traceback:
========= sending heartbeat at 2026-01-30 12:59:13.995157
/hpc/group/dhvi-strucbio/cryostrucbio3/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(963): warning #177-D: variable "Nb2p1" was declared but never referenced
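For what it's worth, the cufft plan parameters in that log are self-consistent: they describe an in-place 3D real-to-complex transform of a 400^3 box, where the complex output keeps only the non-redundant half of the last axis and the real input is padded accordingly. The arithmetic (our own check, not CryoSPARC code):

```python
# R2C FFT layout for a 400^3 volume, matching the logged plan.
nx = ny = nz = 400

complex_last = nz // 2 + 1           # 201: onembed last dim
real_padded_last = 2 * complex_last  # 402: inembed last dim (in-place padding)

idist = nx * ny * real_padded_last   # elements per input batch
odist = nx * ny * complex_last       # elements per output batch

print(idist, odist)  # 64320000 32160000, as in the log
```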
Thanks @ajm.
Was the job stalled again at the repeated sending heartbeat lines?
Setting transparent_hugepage/enabled to always may lead to stalled jobs.
Please can you test whether the stalls still occur when the setting is changed to madvise or never?
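For reference, the active mode is the bracketed value in that file. A minimal sketch of reading it (a helper of our own, not part of CryoSPARC):

```python
def active_thp_setting(contents: str) -> str:
    """Return the bracketed (active) value from
    /sys/kernel/mm/transparent_hugepage/enabled,
    e.g. '[always] madvise never' -> 'always'."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active value in {contents!r}")

# On a live node you would read the real file:
# with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
#     print(active_thp_setting(f.read()))

print(active_thp_setting("[always] madvise never"))  # always
```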
The job was stalled again at the repeated sending heartbeat steps.
We ran:
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
and restarted the job. It has again stalled, at 261667 of 479016 images, and the most recent log lines are once again all sending heartbeat messages.
Since then, we tried splitting the particle stack into 5 and running separate subtraction jobs. Each small job was able to complete on its own.
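(For illustration only: the actual split was done with CryoSPARC's own tools, but the chunk-size arithmetic for dividing 479016 particles into 5 roughly equal stacks looks like this, with indices standing in for particles:)

```python
def split_stack(n_particles: int, n_chunks: int):
    """Split n_particles indices into n_chunks contiguous ranges
    whose sizes differ by at most one."""
    base, extra = divmod(n_particles, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = split_stack(479016, 5)
print([len(c) for c in chunks])  # [95804, 95803, 95803, 95803, 95803]
```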
Thanks @ajm for the update. We are not sure about the cause of the stall.
We are glad you found this workaround. Thanks for posting.