GPU jobs hang at “Processing …” (CryoSPARC v4.7.1, RTX A6000)

Hi,

I’ve been running CryoSPARC v4.7.1 without issues, but now every GPU job (Patch Motion Correction, CTF Estimation, etc.) hangs right after “Processing …”. The jobs start, allocate some GPU memory, and then sit there indefinitely with no progress and no GPU utilization. CPU-only jobs run fine.

I already tried uninstalling and reinstalling CryoSPARC, but the issue persists.

System details:

  • CryoSPARC v4.7.1
  • 8× NVIDIA RTX A6000 (48 GB VRAM each)
  • NVIDIA driver 575.57.08, CUDA 12.9

Below is the output of the Patch Motion Correction job, which just gets stuck at this point. (Transparent hugepages have always been enabled on this machine and never caused problems before.)

License is valid.

Launching job on lane default target zhian-srv1.ece.gatech.edu …

Running job on master node hostname zhian-srv1.ece.gatech.edu

[CPU: 91.3 MB Avail: 483.37 GB]
Job J2 Started

[CPU: 91.3 MB Avail: 483.37 GB]
Master running v4.7.1, worker running v4.7.1

[CPU: 91.7 MB Avail: 483.37 GB]
Working in directory: /usr/scratch/CryoEM/cryoSPARC/empiar_12111/CS-empiar-12111-2/J2

[CPU: 91.7 MB Avail: 483.37 GB]
Running on lane default

[CPU: 91.7 MB Avail: 483.37 GB]
Resources allocated:

[CPU: 91.7 MB Avail: 483.37 GB]
Worker: zhian-srv1.ece.gatech.edu

[CPU: 91.7 MB Avail: 483.37 GB]
CPU : [0, 1, 2, 3, 4, 5]

[CPU: 91.7 MB Avail: 483.37 GB]
GPU : [0]

[CPU: 91.7 MB Avail: 483.37 GB]
RAM : [0, 1]

[CPU: 91.7 MB Avail: 483.37 GB]
SSD : False

[CPU: 91.7 MB Avail: 483.37 GB]

[CPU: 91.7 MB Avail: 483.37 GB]
Importing job module for job type patch_motion_correction_multi…

[CPU: 261.7 MB Avail: 483.25 GB]
Job ready to run

[CPU: 261.7 MB Avail: 483.25 GB]


[CPU: 261.7 MB Avail: 483.26 GB]
Transparent hugepages are enabled. You may encounter stalls or performance problems with CryoSPARC jobs.

[CPU: 261.7 MB Avail: 483.25 GB]
Job will process this many movies: 300

[CPU: 261.7 MB Avail: 483.25 GB]
Job will output denoiser training data for this many movies: 200

[CPU: 261.7 MB Avail: 483.25 GB]
Random seed: 1120776722

[CPU: 261.7 MB Avail: 483.25 GB]
parent process is 3814530

[CPU: 175.8 MB Avail: 483.24 GB]
Calling CUDA init from 3814571

[CPU: 321.0 MB Avail: 483.11 GB]
– 0.0: processing 1 of 300: J1/imported/002838475534287101175_mic__May21_12.30.09.tif
loading /usr/scratch/CryoEM/cryoSPARC/empiar_12111/CS-empiar-12111-2/J1/imported/002838475534287101175_mic__May21_12.30.09.tif
Loading raw movie data from J1/imported/002838475534287101175_mic__May21_12.30.09.tif …
Done in 0.48s
Processing …

[CPU: 92.1 MB Avail: 482.32 GB]
WARNING: io_uring support disabled (not supported by kernel), I/O performance may degrade

Thanks

Fixed.

Summary:
CryoSPARC GPU jobs (e.g. Patch Motion Correction) were hanging indefinitely: GPU memory was allocated (visible in nvidia-smi), but the job never progressed. Only CryoSPARC was affected; other frameworks such as PyTorch could train models on the same GPUs without issue, so the hardware and drivers were fine.
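
For reference, a quick check along these lines is enough to rule out the driver and hardware (a minimal sketch; any small CUDA workload would do):

# Driver/hardware sanity check outside CryoSPARC: if the matrix multiply and
# synchronize below complete normally, the GPU stack itself is healthy.
import torch

assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
x = torch.randn(4096, 4096, device="cuda")
y = x @ x                    # launches a real kernel on the GPU
torch.cuda.synchronize()     # returns promptly on a healthy driver stack
print(torch.cuda.get_device_name(0), float(y.norm()))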

After digging into it, the problem turned out to be the Numba/llvmlite versions inside CryoSPARC’s worker environment. The versions installed by cryosparcw forcedeps were not compatible with our NVIDIA driver (575.57.08 / CUDA 12.9). That mismatch caused any Numba CUDA kernel to freeze at cuda.synchronize(), which explains why CryoSPARC GPU jobs hung.
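
A minimal Numba CUDA test run with the worker environment’s python should reproduce the hang independently of CryoSPARC (a sketch; the kernel and buffer names are just for illustration):

# With the incompatible numba/llvmlite versions, the synchronize() call below
# never returns; with the pinned versions it finishes immediately.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.shape[0]:
        arr[i] += 1.0

buf = cuda.to_device(np.zeros(1024, dtype=np.float32))
add_one[4, 256](buf)
cuda.synchronize()    # hangs here when numba/llvmlite and the driver mismatch
print("kernel ran:", buf.copy_to_host()[:4])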

Fix:
In the cryosparc_worker_env, downgrade and pin Numba/llvmlite:

conda activate cryosparc_worker_env
python -m pip install --no-cache-dir "numba==0.58.1" "llvmlite==0.41.1"
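
To confirm the downgrade took effect, the worker environment’s python should now report the pinned versions (quick check, nothing CryoSPARC-specific):

# Verify the worker environment picked up the pinned versions.
import llvmlite, numba
print("numba", numba.__version__, "llvmlite", llvmlite.__version__)
assert numba.__version__ == "0.58.1" and llvmlite.__version__ == "0.41.1"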

Then lock them so they aren’t overwritten:

printf "numba ==0.58.1\nllvmlite ==0.41.1\n" > $CONDA_PREFIX/conda-meta/pinned

After restarting CryoSPARC, all GPUs became available again and Patch Motion jobs ran successfully.


Welcome to the forum @zain.shabeeb and thanks for sharing your findings.

We were unfortunately unable to reproduce the GPU hang in our testing (with a different setup and test case):

  • CryoSPARC v4.7.1+250814, including llvmlite-0.42.0 and numba-0.59.1 (via conda-forge)
  • NVIDIA driver 575.51.03
  • an older GPU model
  • patch motion correction of 14sep05c_c_00003gr_00014sq_00010hl_00002es.frames.tif from the EMPIAR-10025 tiff-formatted subset.

We are not sure how the error arose with your specific combination of GPU model, driver installation, and CryoSPARC version, nor what side effects the downgraded, pip-installed llvmlite and numba packages may have (or what their fate will be during a future CryoSPARC update).

Alternative v4.7.1-cuda12 packages ship with llvmlite-0.44.0 and numba-0.61.2 and might work in your circumstances without having to customize individual packages.