Is it possible to add a variable for the estimated cache space required by a job (say, input particle stack size + 10%) that can be passed to cluster submission scripts?
Thanks! It could be pretty simple: the total size of the particle stacks called by a job + 10%, for instance, would work well.
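For illustration, a minimal sketch of that estimate in Python; the function name and inputs are hypothetical, not part of any existing CryoSPARC API:

```python
import os

def estimated_cache_bytes(particle_files, overhead=0.10):
    """Estimate a job's cache requirement: total on-disk size of the
    particle stacks it calls, plus a fixed fractional overhead
    (10% here, per the suggestion above). Hypothetical helper."""
    total = sum(os.path.getsize(f) for f in particle_files)
    return int(total * (1.0 + overhead))
```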
We have cluster nodes with large caches and several GPUs each, but it's not hard to fill the caches when a few jobs run concurrently. Previously the scheduler was limited to just one job per node for this reason, which was a huge waste of GPUs. Now we've opened it up, but cache space collisions are happening fairly often. We can set a generous cache requirement as a workaround, but that is inefficient in terms of GPU utilization. So a simple, moderate over-estimate of the cache requirement would really help.
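As a sketch of how such a variable might be consumed, assuming SLURM as the scheduler: sbatch's `--tmp` option requests a minimum amount of node-local scratch, so jobs with overlapping cache needs won't be co-scheduled onto a node that can't hold both. The wrapper itself is hypothetical:

```python
import subprocess

def submit_with_cache_request(script_path, cache_gb):
    """Submit a job script while asking SLURM for at least `cache_gb`
    of node-local scratch via --tmp. Sketch only; the wrapper and its
    arguments are assumptions, not existing tooling."""
    subprocess.run(["sbatch", f"--tmp={cache_gb}G", script_path], check=True)
```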
It would also be great if particles were re-stacked on the cache (which is what Relion does): then only the particles a job actually uses are copied, and the cache requirement is smaller.
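A rough sketch of what re-stacking to the cache could look like, using the mrcfile library; this illustrates the idea, not how CryoSPARC's cache currently behaves:

```python
import mrcfile
import numpy as np

def restack_to_cache(src_path, keep_indices, dst_path):
    """Copy only the particle images a job actually uses from a large
    source stack into a compact stack on the cache drive, so the cache
    holds just the used subset. Illustrative only."""
    with mrcfile.mmap(src_path, mode="r") as src:
        # Memory-mapped read: only the selected slices are touched.
        subset = np.asarray(src.data[keep_indices])
    with mrcfile.new(dst_path, overwrite=True) as dst:
        dst.set_data(subset)
```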