Our cluster includes nodes that each have four 80 GB A100 GPUs, as well as older nodes that each have four 32 GB V100 GPUs. We are considering splitting up some of the 80 GB A100s using MIG (Multi-Instance GPU). However, looking at the jobs run on the A100 nodes this year, about 95% of them used more than 40 GB of GPU memory. (These jobs include CryoSPARC, RELION, AlphaFold, and a few others.)
This is surprising, because when we only had the 32 GB V100 nodes, we never saw jobs run out of GPU memory.
Is CryoSPARC able to detect how much GPU memory is available and configure each job accordingly? For example, will it process fewer images at a time when only 40 GB is available, and more when 80 GB is available? That would explain why everything worked on the 32 GB V100s.
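To be clear about the kind of behavior I mean, here is a minimal sketch of "query free GPU memory, then scale the batch size" using pynvml. This is just an illustration of what I'm asking about, not a claim about how CryoSPARC is actually implemented, and the per-image memory footprint is a made-up number:

```python
# Illustration only: pick a batch size from the free memory on the first GPU.
# (Not CryoSPARC's actual logic -- just the adaptive behavior I'm asking about.)
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first visible GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
free_gb = mem.free / 1024**3
pynvml.nvmlShutdown()

GB_PER_IMAGE = 0.05   # hypothetical per-image footprint, purely for illustration
batch_size = max(1, int(free_gb * 0.8 / GB_PER_IMAGE))   # leave ~20% headroom
print(f"{free_gb:.1f} GB free -> process {batch_size} images at once")
```

If CryoSPARC does something along these lines internally, that would explain why the same job types fit on 32 GB V100s but grow to fill 80 GB on the A100s.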
Thanks,
Matthew