DIE: allocate: out of memory (reservation insufficient)

I’m trying to troubleshoot a Non-uniform Refinement job that CryoSPARC submits to our SLURM cluster. On the surface it looks like an insufficient-memory-request error. I’ve copied the user’s job and have been stepping up the memory request while recording actual memory use with ‘top -bp’ at 1-second intervals. The job fails consistently at around 80 GB of resident memory and just under 1200 GB of virtual memory. I’ve looked at Reference-based motion correction out of memory on cluster - #5 by CleoShen . Our GPU nodes with A100 GPUs have 2 TB of memory, but even increasing the job’s memory request to 1 TB doesn’t help: the job fails with the same error at the same ~80 GB resident / ~1200 GB virtual memory use. As far as I know, our SLURM configuration limits only resident memory, not virtual memory size. The resident memory numbers (80 GB used vs 1 TB requested) are nowhere near the limit, so the limit CryoSPARC hits while trying to allocate memory for the next stage must be something else.
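An equivalent way to record the same resident/virtual numbers without parsing top output is to read /proc directly; a minimal sketch (Linux-only, with the current process’s own PID standing in for the CryoSPARC worker’s):

```python
# Sample resident (VmRSS) and virtual (VmSize) memory of a process once per
# second by reading /proc/<pid>/status (Linux-only). os.getpid() is a
# placeholder here; substitute the CryoSPARC worker PID in practice.
import os
import time

def mem_gb(pid):
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmRSS:", "VmSize:")):
                key = line.split()[0].rstrip(":")
                fields[key] = int(line.split()[1]) / 1e6  # kB -> GB
    return fields["VmRSS"], fields["VmSize"]

for _ in range(3):
    rss, vsz = mem_gb(os.getpid())
    print(f"RSS {rss:.3f} GB  VSZ {vsz:.3f} GB")
    time.sleep(1)
```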

Resources allocated:

CPU : [0, 1, 2, 3]
GPU : [0]
RAM : [0, 1, 2]
SSD : False

Input particles have box size 800
Input particles have pixel size 0.7753

Error:

[CPU: 15.06 GB]

====== Starting Refinement Iterations ======
[CPU: 15.06 GB]

----------------------------- Start Iteration 0
[CPU: 15.06 GB]

Using Max Alignment Radius 20.675 (30.000A)
[CPU: 15.06 GB]

Auto batchsize: 14109 in each split
[CPU: 23.12 GB]

– THR 1 BATCH 500 NUM 3500 TOTAL 1618.0549 ELAPSED 4857.4315 –
[CPU: 45.65 GB]

Processed 28218.000 images in 4867.541s.
[CPU: 49.80 GB]

Computing FSCs…
[CPU: 49.80 GB]

Using full box size 800, downsampled box size 400, with low memory mode disabled.
[CPU: 49.80 GB]

Computing FFTs on GPU.
[CPU: 48.83 GB]

Done in 99.086s

[CPU: 48.83 GB]

Using Filter Radius 116.812 (5.310A) | Previous: 20.675 (30.000A)
[CPU: 64.47 GB]

Non-uniform regularization with compute option: GPU
[CPU: 64.47 GB]

Running local cross validation for A …
[CPU: 134.4 MB]

DIE: allocate: out of memory (reservation insufficient)
[CPU: 140.8 MB]

====== Job process terminated abnormally.

Thanks,

Alex

Turn on Low Memory Mode and see whether it still crashes.

It will run a lot slower, but for some reason (Python, or more specifically PyFFTW, I think) CryoSPARC doesn’t scale linearly with memory like RELION does - there are some box sizes that, no matter what GPU they are run on or how much system RAM is present, will crash.
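If the FFT theory is right, one quick sanity check (a heuristic, not an official CryoSPARC rule) is whether the box size factors entirely into small primes, since FFT libraries generally handle 2/3/5/7-smooth sizes best:

```python
# Heuristic check: does a box size factor entirely into small primes?
# FFT libraries are typically fastest (and best tested) for such sizes.
def is_fft_friendly(n, primes=(2, 3, 5, 7)):
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1  # True if no prime factor larger than 7 remains

print([n for n in (336, 440, 540, 600, 800) if is_fft_friendly(n)])
# → [336, 540, 600, 800]  (440 = 2^3 * 5 * 11 has the factor 11)
```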


Hello CryoSPARC developers,

Could I please get your input? I tried the above, plus these custom parameters on the NU Refine job:

  1. Cache particle images on SSD: false
  2. Disable auto batchsize: true
  3. Show plots from intermediate steps: false
  4. GPU batch size of images: 1
  5. Low-Memory Mode: true

The inputs for this refinement are an imported volume and particles from an Extract Mics (G) job, for which I have tried box sizes from 800 down to 600 px. Nothing has worked. I did notice that with all of the parameters above applied, the job ran for ~5 hrs before it crashed; with Low-Memory Mode alone, it crashed after ~1 hr. The error message has been the same: “Job process terminated abnormally, no outputs.”

The longer runtime is expected when changing settings like batch size; decreasing the batch size that much will slow processing down, sometimes dramatically depending on the system. What iteration does it crash on? Is it crashing at a particular resolution (possibly a memory capacity issue), is it generating NaN results (check for corrupt particles if so), or blank volumes (check filter resolution settings)?

System specifications would be helpful; there is a big difference between running on an 8 GB GPU (no longer supported, but still works if one is careful and box sizes are not too large), an 11 GB card, and a 24 GB card, for example. Same with system memory; 64 GB is not really enough, and even 128 GB can choke in many situations.

If CryoSPARC itself isn’t reporting any errors, you’ll need to check dmesg and other system logs to see if they contain more information.
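As an example of what to look for, OOM-killer events leave distinctive lines in the kernel log; a sketch of a loose filter you could run over dmesg output (exact message wording varies by kernel version):

```python
# Filter kernel-log text for OOM-killer traces. Feed it the output of
# `dmesg` (may require root) or the contents of /var/log/kern.log.
# The pattern is deliberately loose because wording varies across kernels.
import re

OOM_RE = re.compile(r"out of memory|killed process|oom[-_]kill", re.I)

def oom_lines(log_text):
    return [line for line in log_text.splitlines() if OOM_RE.search(line)]

sample = "Out of memory: Killed process 1684099 (python)\nnormal line\n"
print(oom_lines(sample))
# → ['Out of memory: Killed process 1684099 (python)']
```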

Hi @moskalenko, the message out of memory (reservation insufficient) isn’t related to SLURM; it comes from CryoSPARC itself. That section of Non-uniform Refinement is limited to 1 TB of host RAM, which should be sufficient for all cases - in theory, no currently extant GPU can run a job that would exceed that limit. The fact that you’ve encountered this error may hint at a bug in Non-uniform Refinement. Have you tried using low-memory mode, as per the suggestion from rbs_sci?

Hi @andres.cuellar, when the job crashes in this way, are there any additional details in the job’s text log? You can access it in the job dialog under the metadata tab (choose “log”), or on the command line by running cryosparcm joblog Pxx Jyy where xx and yy are replaced with the project and job number, respectively.

We are running into a memory issue on our cluster, with both NU Refinements and Local Refinements, that I think might be related to @moskalenko’s issue.
We operate a cluster that uses SLURM and a mix of A40 and A100 GPUs.
We have a standard memory allocation of 36 GB for these jobs. I am working with data with a box size of 540, which (if my memory calculations are correct) should only require about 1.8 GB of memory. However, when I run the job with 36 GB of memory, it fails with the following error (from the joblog):

========= sending heartbeat at 2024-07-02 16:54:21.348617
gpufft: creating new cufft plan (plan id 6   pid 1684099) 
	gpu_id  0 
	ndims   3 
	dims    540 540 540 
	inembed 540 540 542 
	istride 1 
	idist   158047200 
	onembed 540 540 271 
	ostride 1 
	odist   79023600 
	batch   1 
	type    R2C 
	wkspc   manual 
	Python traceback:

gpufft: creating new cufft plan (plan id 7   pid 1684099) 
	gpu_id  0 
	ndims   3 
	dims    270 270 270 
	inembed 270 270 136 
	istride 1 
	idist   9914400 
	onembed 270 270 272 
	ostride 1 
	odist   19828800 
	batch   1 
	type    C2R 
	wkspc   manual 
	Python traceback:

<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
<string>:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2024-07-02 16:54:31.367279
========= sending heartbeat at 2024-07-02 16:54:41.385938
gpufft: creating new cufft plan (plan id 8   pid 1684099) 
	gpu_id  0 
	ndims   3 
	dims    540 540 540 
	inembed 540 540 271 
	istride 1 
	idist   79023600 
	onembed 540 540 542 
	ostride 1 
	odist   158047200 
	batch   1 
	type    C2R 
	wkspc   manual 
	Python traceback:

========= sending heartbeat at 2024-07-02 16:54:51.410624
========= sending heartbeat at 2024-07-02 16:55:01.427252
/home/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 1684099 Killed                  python -c "import cryosparc_compute.run as run; run.run()" "$@"
slurmstepd: error: Detected 1 oom_kill event in StepId=2246009.0. Some of the step tasks have been OOM Killed.
srun: error: tempest-gpu009: task 0: Out Of Memory

It looks like the Out of Memory error is being thrown once the refinement starts working on unbinned data. If I bump the requested memory up to 360 GB, the job runs with no problem.

An update on my previous post:
We have found that particles with a box size of 540 fail with any memory allocation below 96 GB in both NU Refinement and Local Refinement.
However, particles with a box size of 440 or 336 run fine with a memory allocation of 36 GB.
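For context on why the jump is so steep: a single full-box float32 volume scales cubically with box size. A back-of-the-envelope sketch (this is a lower bound only; refinement holds many such arrays plus FFT workspaces, so real usage is a large multiple of these figures):

```python
# Memory for one cubic float32 volume of side N voxels (4 bytes per voxel).
# Lower bound only: refinement keeps many volumes plus FFT buffers in memory.
def volume_gb(n, bytes_per_voxel=4):
    return n ** 3 * bytes_per_voxel / 1e9

for n in (336, 440, 540):
    print(f"box {n}: {volume_gb(n):.2f} GB per volume")
# A 540 box is ~4x a 336 box and ~1.9x a 440 box per volume, consistent
# with box 540 needing a disproportionately larger allocation.
```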