Hello CryoSPARC team,
In our lab, we are currently running CryoSPARC v4.6.2 on an Exxact workstation and have been experiencing recurring job failures during Non-Uniform Refinement and Homogeneous Refinement steps — even when enabling low-memory mode and caching particles on the SSD.
CryoSPARC jobs frequently fail with the following error messages:
“Job process terminated abnormally.”
“Job is unresponsive—no heartbeat received in 180 seconds.”
“DIE: allocate: out of memory (reservation insufficient)”
These errors commonly occur when jobs run simultaneously. We’ve attached a document outlining the system specifications and examples of failed jobs; in it, Table A and Table B list two groups of jobs that were run concurrently and failed.
Currently, I am the only active user running cryo-EM jobs on this workstation, and in practice, I can only run one job at a time due to resource limitations. However, two additional graduate students will soon begin their own cryo-EM analyses. As such, we are exploring ways to enhance the system’s capacity and performance to support multiple users and enable efficient, parallel CryoSPARC job execution.
How many GPUs are in that system? I presume four, based on other comments in the document?
Abnormal job termination can have a host of causes, and they’re not always obvious. Check dmesg if possible, although on RHEL-based distros you’ll need root permissions to do so. CUDA allocation errors indicate a lack of VRAM; for those jobs you’ll need larger GPUs (A6000s or similar), although sometimes Low memory mode is enough to squeak through. “DIE: allocate: out of memory (reservation insufficient)” errors indicate a lack of system RAM, which you can confirm by looking at the total RAM assigned to the job after it has died.
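If the kernel’s OOM killer is what’s taking jobs down, it leaves traces in the kernel log. Here’s a minimal sketch of scanning for them in Python (assuming a Linux host and permission to read the kernel ring buffer; the exact message strings vary a little between kernel versions):

```python
import subprocess

# Read the kernel ring buffer (needs root on most RHEL-based
# distros, where kernel.dmesg_restrict defaults to 1).
log = subprocess.run(
    ["dmesg", "--ctime"], capture_output=True, text=True, check=True
).stdout

# Messages the kernel typically emits when it OOM-kills a process.
patterns = ("Out of memory", "oom-kill", "Killed process")

for line in log.splitlines():
    if any(p in line for p in patterns):
        print(line)
```

If those lines show up around the time a job died, it was system RAM, not VRAM.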
What box sizes are you running, and how many particles are you using in, e.g., 3DVA jobs? (3DVA has been reworked to need a lot less memory than it used to, but it still needs a lot, and higher-resolution 3DVA needs even more; 4 Å is high resolution…) NU refine jobs which run out of memory are likely using large box sizes; box sizes larger than ~1000 pixels will happily eat 400+ GB of system RAM in NU refine.
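To see why box size dominates, a back-of-envelope sketch helps: a single single-precision 3D volume costs box³ × 4 bytes and grows cubically, and a refinement holds many working volumes (half-maps, masks, FFT buffers) on top of the particle stack. The arithmetic below is my own rough illustration, not a CryoSPARC internal figure:

```python
def volume_gb(box: int) -> float:
    """Memory for one float32 3D volume of side `box`, in GB."""
    return box**3 * 4 / 1e9

# Cubic scaling: going from 600 to 1000 pixels is ~4.6x per volume.
print(f"{volume_gb(600):.2f} GB per volume at 600 px")   # ~0.86 GB
print(f"{volume_gb(1000):.2f} GB per volume at 1000 px")  # ~4.00 GB
```

Multiply the per-volume figure by the dozens of working volumes a refinement keeps around, and 400+ GB at large box sizes stops being surprising.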
Basically, 256 GB is nowhere near sufficient for some of the more advanced features in CryoSPARC if the system is multiuser.
For a single user who is aware of their system, 512 GB of RAM and a pair of A6000s is solid. For multiuser, you’re looking at wanting 1 TB of RAM. For general processing pipelines, we have a 10-GPU box which is usually running Patch Motion, Patch CTF, picking, 2D/3D classification and homogeneous/heterogeneous or NU refinements, and everything is happy; but when one user wants to do 3D Flex or RBMC of a large box, or NU refinement of large boxes, it’s a case of asking others not to run anything else for a bit. This is with 1 TB of system RAM.
Hey,
Thanks a lot for your reply.
In my typical refinements, I use a box size of 600–700 pixels with around 150,000 particles. When we first got the machine, I was able to run up to four NU refinement jobs in parallel (with a 600-pixel box size) without any issues. More recently, however, I can usually only run one or two NU refinement jobs at a time; attempting more often results in job failures. These failures seem to be related to memory usage, especially when the “Cache particle images on SSD” option is turned off. When I try enabling it, CryoSPARC reports insufficient memory. I also typically use “Low memory mode” for these jobs.
Our system has 4 GPUs, 256 GB of RAM, and a 2 TB SSD. Based on the errors and the system behavior, I’m wondering if upgrading the SSD or expanding our scratch storage might help improve performance and stability. We’re currently considering system upgrades. I’m not sure if expanding the RAM is feasible at the moment, but I’ll check on that.
In the meantime, I’d really appreciate your insight on how we can best optimize our setup. Would prioritizing a storage/SSD upgrade provide significant benefit?
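For context, here is my back-of-envelope estimate of one particle stack, which is roughly what the SSD cache has to hold per job (assuming float32 particle images; my own arithmetic, not a CryoSPARC-reported figure):

```python
def stack_gb(n_particles: int, box: int) -> float:
    """Approximate size of a float32 particle stack, in GB."""
    return n_particles * box**2 * 4 / 1e9

# ~254 GB for one of my typical jobs; a 2 TB SSD only fits a
# handful of these, which gets tight once several users are
# caching datasets at the same time.
print(f"{stack_gb(150_000, 650):.0f} GB")
```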
What version of CryoSPARC were you running when you were able to run four NU refinements simultaneously (and what box sizes)? v4.4 was when the transition to the faster codepath for NU refine occurred, and system requirements increased dramatically as a result.
More storage is always a good idea for cryo-EM image processing. With multiple users, you’ll be surprised how fast 100 TB disappears! (Case in point: three postdocs filled 100 TB of storage with CryoSPARC processing in less than six weeks, although I’ll admit that two of the projects have a lot of datasets involved…)
As far as I can tell, scratch space shouldn’t have any impact on memory usage. If you’re experiencing sudden instability in certain jobs, I’d check dmesg for any memory-related issues, and perhaps run a few passes of Memtest/mprime (mixed) to see whether you get any failures.
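If you want to catch a job creeping toward the RAM ceiling before the OOM killer gets it, even a crude monitor helps. A minimal sketch using psutil (my suggestion, not a CryoSPARC tool; `pip install psutil`):

```python
import time
import psutil

# Poll system memory every 5 seconds and warn as usage nears the
# point where the kernel starts OOM-killing processes.
while True:
    mem = psutil.virtual_memory()
    used_gb = (mem.total - mem.available) / 1e9
    print(f"used {used_gb:.0f} / {mem.total / 1e9:.0f} GB ({mem.percent:.0f}%)")
    if mem.percent > 90:
        print("WARNING: approaching the system RAM limit")
    time.sleep(5)
```

Run it in a second terminal while the suspect job is going; if usage pins near 100% right before the job dies, it’s system RAM rather than hardware.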
If you’re going multiuser on that system, however, absolutely increase storage. A lot. Unless you adopt a “per dataset, per person, per time” strategy, detaching each dataset and moving it to cold storage as it’s finished, you’ll be constantly fighting space issues.