DIE: allocate: out of memory (reservation insufficient)

I’m trying to troubleshoot a Non-uniform Refinement job that CryoSPARC submits to our SLURM cluster. On the surface it looks like an insufficient memory request. I’ve copied the user’s job and have been stepping up the memory request while recording actual memory use with ‘top -bp’ at 1-second intervals. The job fails consistently at around 80 GB of resident memory and just under 1200 GB of virtual memory. I’ve looked at Reference-based motion correction out of memory on cluster - #5 by CleoShen. Our GPU nodes with A100 GPUs have 2 TB of memory, yet even raising the job’s memory request to 1 TB doesn’t help: the job fails with the same error at the same ~80 GB resident / ~1200 GB virtual memory use. As far as I know, our SLURM configuration only limits resident memory, not virtual memory size. The resident memory numbers (80 GB used vs. 1 TB requested) are nowhere near the limit, so I think whatever CryoSPARC checks when it reserves memory for the next stage must be something else.
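For reference, this is roughly what I’m running to record memory use, along with the sanity checks I’d use to rule out an address-space cap inside the allocation (the PID below is a placeholder):

```bash
# Sample resident (RES) and virtual (VIRT) memory of the job's main process
# once per second; PID is a placeholder for the actual worker process.
PID=123456
top -b -d 1 -p "$PID" | grep --line-buffered "^ *$PID" >> "mem_trace_$PID.log"

# From a shell inside the SLURM allocation: is an address-space limit imposed?
ulimit -v                                   # "unlimited" = no virtual-memory cap
scontrol show config | grep -i vsizefactor  # SLURM's virtual-memory limit factor, if set
```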

Resources allocated:

CPU : [0, 1, 2, 3]
GPU : [0]
RAM : [0, 1, 2]
SSD : False

Input particles have box size 800
Input particles have pixel size 0.7753

Error:

[CPU: 15.06 GB]

====== Starting Refinement Iterations ======
[CPU: 15.06 GB]

----------------------------- Start Iteration 0
[CPU: 15.06 GB]

Using Max Alignment Radius 20.675 (30.000A)
[CPU: 15.06 GB]

Auto batchsize: 14109 in each split
[CPU: 23.12 GB]

-- THR 1 BATCH 500 NUM 3500 TOTAL 1618.0549 ELAPSED 4857.4315 --
[CPU: 45.65 GB]

Processed 28218.000 images in 4867.541s.
[CPU: 49.80 GB]

Computing FSCs…
[CPU: 49.80 GB]

Using full box size 800, downsampled box size 400, with low memory mode disabled.
[CPU: 49.80 GB]

Computing FFTs on GPU.
[CPU: 48.83 GB]

Done in 99.086s

[CPU: 48.83 GB]

Using Filter Radius 116.812 (5.310A) | Previous: 20.675 (30.000A)
[CPU: 64.47 GB]

Non-uniform regularization with compute option: GPU
[CPU: 64.47 GB]

Running local cross validation for A …
[CPU: 134.4 MB]

DIE: allocate: out of memory (reservation insufficient)
[CPU: 140.8 MB]

====== Job process terminated abnormally.

Thanks,

Alex

Turn on Low Memory Mode and see whether it still crashes.

It will run a lot slower, but for some reason (Python, or more specifically PyFFTW, I think) CryoSPARC’s memory use doesn’t scale as linearly with box size as RELION’s does - there are some box sizes that will crash no matter which GPU they run on or how much system RAM is present.
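As a very rough back-of-envelope sketch of why big boxes hurt (this only sizes one single-precision complex volume; a refinement iteration holds several such buffers, often on padded boxes, so peak use is a multiple of this):

```bash
# Size of ONE complex64 (8 bytes/voxel) cubic volume of box^3 voxels.
# Actual peak memory is several times this, since multiple (often padded)
# copies are held at once during FFTs and regularization.
for box in 400 600 800; do
  awk -v b="$box" 'BEGIN { printf "box %4d: %5.1f GiB per complex64 volume\n", b, b*b*b*8/2^30 }'
done
```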


Hello CryoSPARC Developers,

Could I please get your input? I tried the above, plus these custom parameters on the NU Refine job:

  1. Cache particle images on SSD: false
  2. Disable auto batchsize: true
  3. Show plots from intermediate steps: false
  4. GPU batch size of images: 1
  5. Low-Memory Mode: true

The inputs for this refinement are an imported volume and particles from an Extract Mics (G) job, for which I have tried box sizes from 800 down to 600 px. Nothing has worked. With all of the parameters above applied, the job ran for ~5 hrs before it crashed; with Low-Memory Mode alone it crashed after ~1 hr. The error message has been the same each time: “Job process terminated abnormally, no outputs.”

Run time isn’t really informative when you change settings like batch size; cutting the batch size that far will slow processing down, sometimes dramatically depending on the system. What iteration does it crash on? Is it crashing at a particular resolution (possibly a memory capacity issue), generating NaN results (check for corrupt particles if so), or producing blank volumes (check filter resolution settings)?

System specifications would be helpful; there is a big difference between running on an 8 GB GPU (no longer supported, but still workable if you are careful and box sizes are not too large), an 11 GB GPU, and a 24 GB GPU, for example. The same goes for system memory: 64 GB is not really enough, and even 128 GB can choke in many situations.

If CryoSPARC itself isn’t reporting any errors, you’ll need to check dmesg and other system logs to see if they contain more information.
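Something along these lines usually shows whether the kernel OOM killer or SLURM itself ended the process (the job ID is a placeholder):

```bash
# Kernel OOM-killer activity around the time of the crash, with readable timestamps.
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'

# What SLURM accounting recorded for the job (job ID is a placeholder).
sacct -j 123456 --format=JobID,State,ExitCode,MaxRSS,MaxVMSize,ReqMem
```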