I was hoping to get some help estimating CPU memory requirements for homogeneous refinement jobs, as I have a colleague who keeps running into OOM errors on our cluster.
The cluster is equipped with A40s (48 GB VRAM), and each A40 has 128 GB of CPU memory available to it. We have set the RAM variable so that we are requesting the full 120 GB, and the submission script and job logs show that the memory is being properly requested.
The virus has an unbinned box size of 1260 pixels, and they have binned the dataset to a box size of 630. They have 300,000 particles, which have successfully refined into a high-resolution structure. However, they are clearly Nyquist-limited, so we are trying to unbin.
Every unbinned job fails, with the cgroup killing the process for running out of memory. I don’t think it’s GPU memory, because I don’t believe GPU out-of-memory errors trigger the cgroup OOM killer, which we are obviously triggering. The strange thing, though, is that while the 300,000-particle dataset at a box size of 630 works, even dropping down to just 30,000 particles at a box size of 1260 fails every single time.
If I do some back-of-the-napkin math here, assuming 8 bits (1 byte) per pixel:
300,000 particles * 630 pixels X * 630 pixels Y * 8 bits/pixel = ~120 GB.
30,000 particles * 1260 pixels X * 1260 pixels Y * 8 bits/pixel = ~50 GB.
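For anyone who wants to check or vary these numbers, here is the same arithmetic as a small Python sketch. The function name `stack_size_gb` is just something I made up for illustration, and the 2- and 4-byte cases are my own assumption about what 16-bit or 32-bit storage would cost; I don’t know what CryoSPARC actually holds in memory.

```python
def stack_size_gb(n_particles, box, bytes_per_pixel=1):
    """Raw size of a 2D particle stack in GB (1 GB = 1e9 bytes).

    bytes_per_pixel=1 matches the 8 bits/pixel assumed above;
    2 and 4 are hypothetical 16-bit and 32-bit cases.
    """
    return n_particles * box * box * bytes_per_pixel / 1e9


datasets = [
    ("binned, 300,000 particles, box 630", 300_000, 630),
    ("unbinned, 30,000 particles, box 1260", 30_000, 1260),
]

for label, n_particles, box in datasets:
    for bytes_per_pixel in (1, 2, 4):
        size = stack_size_gb(n_particles, box, bytes_per_pixel)
        print(f"{label}, {bytes_per_pixel} byte(s)/pixel: {size:.1f} GB")
```

If the pixels are really stored at 4 bytes each, both estimates would be four times larger, so the answer depends heavily on that assumption.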
Now, I realize it probably also needs to hold the maps in memory (another ~2-3 GB, unless those are 16 bit?), and I know that local CTF refinement and global CTF refinement also use additional memory. But I can’t make sense of why the smaller estimated footprint is the one that fails. Does the CryoSPARC team have any rule of thumb relating box size to CPU memory, and is there something weird going on here?