Memory overload causing heartbeat error

Hi,

I’m running a homogeneous refinement in cryoSPARC and the job keeps running out of memory. I’m wondering whether something is not configured correctly. Unfortunately, I am not the admin on this cluster. Below is the last segment of the event log, if that helps. I’ve given this process 96 GB of RAM, a single GPU, and 16 CPU cores (though I am not sure of the details of the processors at this time).

:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
========= sending heartbeat at 2025-01-24 14:18:27.295828
:1: RuntimeWarning: invalid value encountered in true_divide
========= sending heartbeat at 2025-01-24 14:18:37.340825
========= sending heartbeat at 2025-01-24 14:18:47.357681
========= sending heartbeat at 2025-01-24 14:18:57.375894
/blue/rmckenna/apps/cryosparc/cryosparc_worker/cryosparc_compute/sigproc.py:656: FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1.
x = n.linalg.lstsq(w.reshape((-1,1))*A, w*b)[0]
========= sending heartbeat at 2025-01-24 14:19:07.393951
========= sending heartbeat at 2025-01-24 14:19:17.405793
/blue/rmckenna/apps/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:571: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
========= sending heartbeat at 2025-01-24 14:19:27.420333
/blue/rmckenna/apps/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:44: RuntimeWarning: invalid value encountered in sqrt
cradwn = n.sqrt(cradwn)
========= sending heartbeat at 2025-01-24 14:19:37.435207
========= sending heartbeat at 2025-01-24 14:19:47.452905
========= sending heartbeat at 2025-01-24 14:19:57.468302
========= sending heartbeat at 2025-01-24 14:20:07.487353
========= sending heartbeat at 2025-01-24 14:20:17.504962
========= sending heartbeat at 2025-01-24 14:20:27.645527
========= sending heartbeat at 2025-01-24 14:20:37.663808
========= sending heartbeat at 2025-01-24 14:20:47.677319
/blue/rmckenna/apps/cryosparc/cryosparc_worker/bin/cryosparcw: line 153: 404114 Killed python -c "import cryosparc_compute.run as run; run.run()" "$@"
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=56603382.batch. Some of your processes may have been killed by the cgroup out-of-memory

Just leaving this here in case anyone else runs into it, since the fix was quite simple but annoying to track down. Double-check with your administrator that the command line you’re using to adjust the resource allocation is properly formatted.
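
For anyone who wants a quick sanity check before going to their admin, here is a minimal sketch of one way to see what a memory request actually amounts to. It assumes the job is submitted through an sbatch-style script and that memory is requested with a `#SBATCH --mem` line; the thread does not say what exactly was malformed in this case, so treat that as an illustration only. The helper name `requested_mem_mb` is hypothetical and is not part of cryoSPARC or Slurm. The case it is meant to catch is a value with no unit suffix, which Slurm reads as megabytes.

```python
import re

# Hypothetical helper for illustration; not part of cryoSPARC or Slurm.
# Matches "#SBATCH --mem=96G" and "#SBATCH --mem 96G" (but not --mem-per-cpu).
MEM_DIRECTIVE = re.compile(r"^#SBATCH\s+--mem(?:=|\s+)(.*)$")
# An integer with an optional K, M, G, or T suffix.
MEM_VALUE = re.compile(r"^(\d+)([KMGT]?)$", re.IGNORECASE)
# Multipliers to megabytes; a bare number is taken as megabytes.
TO_MB = {"K": 1 / 1024, "": 1, "M": 1, "G": 1024, "T": 1024 * 1024}


def requested_mem_mb(script_text):
    """Return the first --mem request in megabytes, or None if there is none.

    Raises ValueError if a --mem line is present but its value does not
    look like <integer>[K|M|G|T].
    """
    for line in script_text.splitlines():
        directive = MEM_DIRECTIVE.match(line.strip())
        if directive is None:
            continue
        value = directive.group(1).strip()
        parsed = MEM_VALUE.match(value)
        if parsed is None:
            raise ValueError(f"unrecognized --mem value: {value!r}")
        number, unit = parsed.groups()
        return int(number) * TO_MB[unit.upper()]
    return None


if __name__ == "__main__":
    intended = "#!/bin/bash\n#SBATCH --gres=gpu:1\n#SBATCH --mem=96G\n"
    missing_unit = "#!/bin/bash\n#SBATCH --gres=gpu:1\n#SBATCH --mem=96\n"
    print(requested_mem_mb(intended))      # 98304
    print(requested_mem_mb(missing_unit))  # 96
```

If the number printed is far below what you meant to request, that would be consistent with the cgroup oom-kill in the log above, and it is worth showing the exact line to your administrator.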

Thanks for the update. So with the help of your sys admin, you were able to resolve this issue?