I’m trying to troubleshoot a Non-uniform Refinement job that CryoSPARC submits to our SLURM cluster. On the surface, it looks like an insufficient memory request. I’ve copied the user’s job and have been stepping up the memory request while recording actual memory use with ‘top -bp’ at 1-second intervals. The job fails consistently at around 80 GB of resident memory and just under 1200 GB of virtual memory.

I’ve looked at Reference-based motion correction out of memory on cluster - #5 by CleoShen . Our GPU nodes with A100 GPUs have 2 TB of memory. However, even increasing the job memory request to 1 TB doesn’t help: the job fails with the same error at the same ~80 GB resident and ~1200 GB virtual memory use. As far as I know, our SLURM configuration limits only resident memory, not virtual memory size. The resident memory numbers (80 GB used vs. 1 TB requested) are nowhere near the request, so they can’t explain the failure. I think the limit CryoSPARC checks when allocating memory for the next stage must be something else.
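For anyone who wants to reproduce the measurement, here is a minimal sketch of the same kind of sampling done by reading /proc directly rather than scraping top output. It assumes a Linux node, and the function names (`parse_status_kb`, `sample_memory`) are my own for illustration. It is not the exact command I ran.

```python
import time


def parse_status_kb(status_text):
    """Return (VmRSS, VmSize) in kB from the text of /proc/<pid>/status.

    Either value is None if the field is missing (e.g. kernel threads).
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "VmSize"):
            # Lines look like "VmRSS:\t   80000 kB"
            fields[key] = int(rest.split()[0])
    return fields.get("VmRSS"), fields.get("VmSize")


def sample_memory(pid, interval_s=1.0, max_samples=None):
    """Yield (unix_time, rss_kb, vsize_kb) until the process disappears."""
    path = f"/proc/{pid}/status"
    taken = 0
    while max_samples is None or taken < max_samples:
        try:
            with open(path) as fh:
                text = fh.read()
        except (FileNotFoundError, ProcessLookupError):
            break  # process has exited (or /proc is not available)
        rss_kb, vsize_kb = parse_status_kb(text)
        yield time.time(), rss_kb, vsize_kb
        taken += 1
        time.sleep(interval_s)
```

Iterating over `sample_memory(pid)` for the job's main process and writing each tuple to a file gives a resident/virtual trace that can be lined up against the job log timestamps.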
Resources allocated:
…
CPU : [0, 1, 2, 3]
GPU : [0]
RAM : [0, 1, 2]
SSD : False
Input particles have box size 800
Input particles have pixel size 0.7753
Error:
[CPU: 15.06 GB]
====== Starting Refinement Iterations ======
[CPU: 15.06 GB]
----------------------------- Start Iteration 0
[CPU: 15.06 GB]
Using Max Alignment Radius 20.675 (30.000A)
[CPU: 15.06 GB]
Auto batchsize: 14109 in each split
[CPU: 23.12 GB]
-- THR 1 BATCH 500 NUM 3500 TOTAL 1618.0549 ELAPSED 4857.4315 --
[CPU: 45.65 GB]
Processed 28218.000 images in 4867.541s.
[CPU: 49.80 GB]
Computing FSCs…
[CPU: 49.80 GB]
Using full box size 800, downsampled box size 400, with low memory mode disabled.
[CPU: 49.80 GB]
Computing FFTs on GPU.
[CPU: 48.83 GB]
Done in 99.086s
[CPU: 48.83 GB]
Using Filter Radius 116.812 (5.310A) | Previous: 20.675 (30.000A)
[CPU: 64.47 GB]
Non-uniform regularization with compute option: GPU
[CPU: 64.47 GB]
Running local cross validation for A …
[CPU: 134.4 MB]
DIE: allocate: out of memory (reservation insufficient)
[CPU: 140.8 MB]
====== Job process terminated abnormally.
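For comparison with the top numbers, the `[CPU: …]` figures in the log above peak at 64.47 GB before the failure. Below is a small sketch for pulling that peak out of a saved job log. The helper name `peak_cpu_gb` is hypothetical, and treating 1 GB as 1024 MB is an assumption on my part about how the log reports units.

```python
import re

# Matches markers such as "[CPU: 15.06 GB]" or "[CPU: 134.4 MB]"
CPU_RE = re.compile(r"\[CPU:\s*([0-9.]+)\s*(MB|GB)\]")


def peak_cpu_gb(log_text):
    """Return the largest [CPU: ...] value in the log, in GB (None if absent)."""
    values = []
    for amount, unit in CPU_RE.findall(log_text):
        gb = float(amount)
        if unit == "MB":
            gb /= 1024.0  # assumption: binary units
        values.append(gb)
    return max(values) if values else None


if __name__ == "__main__":
    excerpt = """[CPU: 49.80 GB]
Computing FFTs on GPU.
[CPU: 64.47 GB]
Running local cross validation for A
[CPU: 134.4 MB]
DIE: allocate: out of memory (reservation insufficient)
"""
    print(peak_cpu_gb(excerpt))
```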
Thanks,
Alex