We are running CryoSPARC (including Topaz) on a compute cluster that uses cgroups to make sure that jobs don't overstep their resource allocation (CPU cores, GPUs, memory).
It seems that Topaz's "Number of parallel Threads" parameter in the CryoSPARC job builder is not communicated to the cluster job template, leading to oversubscription of CPU cores: the job template reserves only "Number of CPUs", while Topaz starts "Number of CPUs" × "Number of parallel Threads" workers during the preprocessing step. Is that correct, or am I missing something obvious here?
For the time being, I advised our users to set the number of parallel threads to 1.
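For anyone who wants to verify the oversubscription on their own node, counting the lightweight processes (threads) of the running Topaz workers is a quick check; note that the `pgrep -f topaz` pattern is just an assumption about how the process shows up on your system:

```shell
# For each topaz process, print its number of OS-level threads (NLWP);
# compare the totals against the cores reserved in the cluster template.
for pid in $(pgrep -f topaz); do
    printf '%s: %s threads\n' "$pid" "$(ps -o nlwp= -p "$pid" | tr -d ' ')"
done
```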
To make the Parallel Threads option usable on clusters (presuming that my assumption above is indeed correct), I currently see several possible solutions:
- Forward the CPU reservation as "Number of CPUs" × "Number of parallel Threads" instead of only "Number of CPUs"; this will probably lead to suboptimal utilization of resources, since multithreading seems to be used only in the preprocessing step.
- If there is no way to forward the additional number of threads, a special Topaz cluster lane that reserves a fixed number of threads per CPU (and communicating this to the users) could be an option, although that is hardly a step forward from the current situation and has the same problem as the first option for thread counts > 1.
- Hard-set `--num-workers=1` for preprocessing and pass `--num-workers=<Number of parallel Threads>` only to the training step, where (if I read the run_topaz.py code correctly) multiprocessing is not used.
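To illustrate what the first option would look like on the template side, here is a sketch of a SLURM `cluster_script.sh` fragment. The `{{ num_threads }}` variable is hypothetical; CryoSPARC would have to expose it alongside the existing `{{ num_cpu }}`, which is exactly what seems to be missing today:

```shell
#!/usr/bin/env bash
# Fragment of a cluster_script.sh template (Jinja2 syntax, as used by the
# CryoSPARC cluster integration). {{ num_threads }} is a HYPOTHETICAL
# variable; the other variables shown are standard template fields.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={{ num_cpu * num_threads }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ ram_gb }}G

{{ run_cmd }}
```

A fixed multiplier, e.g. `{{ num_cpu * 2 }}`, would implement the dedicated-lane fallback from the second option without any change on the CryoSPARC side, with the caveats described above.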
What are your opinions and observations on this? Is there a simple solution that I might have missed?