A user of our cluster is trying to carry out non-uniform (NU) refinement and is experiencing job failures that appear to be due to memory issues.
We are looking at a reasonably big particle, where a box size of 864 px lets us see important additional features. We can run NU refinement successfully at a box size of 810 px, but anything larger seems to result in job failure.
I am hoping to get some advice on the likely cause of the failures, and on whether there is any way of predicting how much memory (either system RAM or VRAM) NU refinement is likely to need.
Our cluster nodes have 250 GB of system RAM and 48 GB of VRAM (A6000) available, and my reading of this forum has led me to think that they should be able to handle a box size larger than 810 px. We are not using CTF refinement during NU refinement jobs, and we are running in low-memory mode since we have experienced memory-related job failures.
Are there any characteristics of samples or systems that can lead to higher-than-usual memory requirements?
@EdLowe Could you please post the exact error message at the point of job failure?
Also, can one assume that the failed job had exclusive and complete use of the node's system RAM and GPU VRAM? Some cluster workload managers limit jobs to a subset of the node's resources. Such constraints and potential node sharing would not be a problem per se, but they are relevant to working out the cause of the failure.
Thanks very much for your reply. As far as I know, the job has full and exclusive access to the node's RAM in this case. The job is allocated 240 GB of system RAM by Slurm, and under these conditions the node will not accept any other jobs. I am not aware of any constraint on the job's access to VRAM; if Slurm or the cryoSPARC configuration is imposing one, it is without my knowledge.
The slurm.err file contains only this line:
slurmstepd-linux1006: error: Detected 1 oom-kill event(s) in step 7500202.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
The largest amount of RAM I can see reported as used in the event log is 194.26 GB. Large, but quite a long way short of the system RAM available.
These are the lines in the job's .log following the last successful cycle of refinement (with many heartbeat lines removed):
<string>:1: RuntimeWarning: invalid value encountered in true_divide
<string>:1: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html
<string>:1: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html
/mnt/beegfs/software/structural_biology/release/cryosparc/seiradake/cryosparc/cryosparc_worker/cryosparc_compute/sigproc.py:660: FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1.
x = n.linalg.lstsq(w.reshape((-1,1))*A, w*b)[0]
/mnt/beegfs/software/structural_biology/release/cryosparc/seiradake/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:571: RuntimeWarning: divide by zero encountered in log
logabs = n.log(n.abs(fM))
/mnt/beegfs/software/structural_biology/release/cryosparc/seiradake/cryosparc/cryosparc_worker/cryosparc_compute/plotutil.py:44: RuntimeWarning: invalid value encountered in sqrt
cradwn = n.sqrt(cradwn)
<string>:1: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
<string>:1: RuntimeWarning: invalid value encountered in true_divide
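In case it is useful to anyone else debugging similar failures, here is a minimal sketch of how one might ask the kernel directly for a job cgroup's memory limit and high-water mark; the kernel updates the peak counter on every allocation, so it catches transient spikes that a polled figure like the 194.26 GB above can miss. The cgroup path at the bottom is only a hypothetical example (Slurm lays out job cgroups differently between installations, and between cgroup v1 and v2; memory.peak also needs a reasonably recent kernel):

# Sketch: read the kernel's memory limit and high-water mark for a job cgroup.
# The cgroup path is an assumption; adjust it to match how Slurm names job
# cgroups on your nodes. Handles cgroup v2 filenames with v1 fallbacks.
from pathlib import Path

def read_bytes(path: Path) -> int:
    # cgroup v2 writes the literal string "max" when no limit is set
    text = path.read_text().strip()
    return -1 if text == "max" else int(text)

def report(cgroup: Path) -> None:
    limit = cgroup / "memory.max"            # cgroup v2
    peak = cgroup / "memory.peak"            # cgroup v2, kernel >= 5.19
    if not limit.exists():
        limit = cgroup / "memory.limit_in_bytes"      # cgroup v1
        peak = cgroup / "memory.max_usage_in_bytes"   # cgroup v1
    for label, p in (("limit", limit), ("peak", peak)):
        if p.exists():
            print(f"{label}: {read_bytes(p) / 2**30:.1f} GiB")
        else:
            print(f"{label}: {p} not found")

# Hypothetical path for the failed step's cgroup on our cluster:
report(Path("/sys/fs/cgroup/system.slice/slurmstepd.scope/job_7500202"))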
Quick comment: some distros have the OOM daemon set extremely aggressively (i.e., it will kill the process long before you consume all system RAM). Ubuntu had problems with this initially in beta testing of 24.04, if I remember correctly. Server distros tend to prioritise not having the system hard-lock (which can happen on OOM).
Also, some memory spikes are faster than user-accessible polling reads (or what cryoSPARC reports) will catch by default. If a 1050 px box can consume 440 GB+ in NU refinement, it wouldn't surprise me if 850+ px is maxing out a system with 250 GB. A quick test (not recommended for long-term deployment) would be to create a 250 GB swapfile on an NVMe SSD for that node and see whether usage breaks 250 GB. Then you know you need an upgrade.
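As a rough sanity check of that scaling, here is the cube-law extrapolation from the 1050 px / 440 GB data point. Treating peak RAM as proportional to box size cubed is an assumption on my part, not a documented cryoSPARC formula, and the reference point is anecdotal:

# Back-of-envelope extrapolation, assuming peak host RAM in NU refinement
# scales roughly with the number of voxels, i.e. with box size cubed.
# The 1050 px / 440 GB reference point is the anecdotal figure quoted
# above, not a documented cryoSPARC number.
REF_BOX_PX, REF_PEAK_GB = 1050, 440.0

def estimated_peak_gb(box_px: int) -> float:
    """Estimate peak system RAM (GB) for a given box size under the cube law."""
    return REF_PEAK_GB * (box_px / REF_BOX_PX) ** 3

for box_px in (810, 864, 1050):
    print(f"box {box_px:4d} px -> ~{estimated_peak_gb(box_px):.0f} GB peak RAM")

# Output:
# box  810 px -> ~202 GB peak RAM
# box  864 px -> ~245 GB peak RAM
# box 1050 px -> ~440 GB peak RAM

On those (admittedly rough) numbers, 810 px squeaking under a 240 GB cgroup limit while 864 px gets OOM-killed is exactly what you would expect.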
Thanks, this is helpful. In this specific case we were not resolution-limited and were able to get around the problem by Fourier cropping to 810 px, but larger systems, or similar systems at higher resolution, are likely in our future, so we will need to look at expanding the RAM on at least a few nodes.