Topaz Threads not appearing in cluster integration

sittr · September 20, 2022, 9:30am

Hello everyone,

We are running CryoSparc (including Topaz) on a compute cluster which uses cgroups to make sure that jobs don’t overstep their resource allocation (CPU cores, GPUs, memory).
It seems that Topaz’s “Number of parallel Threads” parameter in the CryoSparc job builder is not communicated to the cluster job template, leading to oversubscription of CPU cores (i.e. the job template only reserves “Number of CPUs” while Topaz starts “Number of CPUs” x “Number of parallel Threads” during the preprocessing step). Is that correct or am I missing something obvious here?
For the time being, I advised our users to set the number of parallel threads to 1.
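
To make the arithmetic concrete, here is a minimal sketch of the oversubscription described above. The function names and the example values are hypothetical and are not part of CryoSparc or Topaz; the only assumption taken from the post is that preprocessing starts “Number of CPUs” x “Number of parallel Threads” threads while the cluster job reserves only “Number of CPUs” cores.

```python
def started_threads(num_cpus: int, parallel_threads: int) -> int:
    """Threads started during preprocessing, per the behaviour described in the post."""
    return num_cpus * parallel_threads


def oversubscription_factor(reserved_cpus: int, num_cpus: int, parallel_threads: int) -> float:
    """Ratio of started threads to reserved cores; 1.0 means no oversubscription."""
    return started_threads(num_cpus, parallel_threads) / reserved_cpus


if __name__ == "__main__":
    # Example values only: 4 CPUs requested in the job builder, 8 parallel threads.
    reserved = 4  # what the cluster job template reserves ("Number of CPUs")
    print(started_threads(4, 8))                    # threads competing for the reserved cores
    print(oversubscription_factor(reserved, 4, 8))  # how many threads share each core
```

With the parallel threads set to 1, the factor in this sketch is 1.0, which is why that setting avoids the problem under cgroups.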

To make the Parallel Threads option usable for clusters (presuming that my above assumption is indeed correct), I’m currently seeing several possible solutions:

1. Forward the CPU reservation as x instead of only ; this will probably lead to suboptimal utilization of resources, as it seems that multithreading is only used in the preprocessing step.
2. If there is no way to forward the additional number of threads, a special Topaz cluster lane that reserves a fixed number of threads per CPU (and communicating this to the users) could be an option, although that is hardly a step forward from the current situation and has the same problems as 1) with thread counts > 1.
3. Hard-set --num-workers=1 for preprocessing, i.e. only use the multithreading for this step, and then use --num-workers= for the training step, where (if I read the run_topaz.py code correctly) multiprocessing is not used.
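
The third option (one worker with several threads for preprocessing, then several single-threaded workers for training) can be sketched as a per-step plan whose peak CPU demand is the larger of the two steps rather than their product. This is only an illustration: `StepPlan`, `plan_option_3` and `peak_demand` are hypothetical names, and the number of training workers is left as an explicit argument because the post does not state its value.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StepPlan:
    step: str
    workers: int
    threads_per_worker: int

    @property
    def cpu_demand(self) -> int:
        # Cores a step can keep busy at once.
        return self.workers * self.threads_per_worker


def plan_option_3(threads: int, train_workers: int) -> list[StepPlan]:
    """Preprocess with a single multithreaded worker, train with single-threaded workers."""
    return [
        StepPlan("preprocess", workers=1, threads_per_worker=threads),
        StepPlan("train", workers=train_workers, threads_per_worker=1),
    ]


def peak_demand(plans: list[StepPlan]) -> int:
    """Largest CPU demand over all steps; this is what the reservation has to cover."""
    return max(plan.cpu_demand for plan in plans)


if __name__ == "__main__":
    # Example values only.
    for plan in plan_option_3(threads=4, train_workers=4):
        print(plan.step, plan.cpu_demand)
```

Under these assumptions, a reservation equal to the peak demand is enough for both steps, whereas the first option would reserve the product of CPUs and threads for the whole job.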

What are your opinions and observations on this? Is there a simple solution that I might have missed?

Kind regards,
R. Sitt

kpahil · September 21, 2022, 1:25am

For what it’s worth, I’ve noticed the same problem, which has sometimes caused Topaz jobs to hang because of oversubscription of CPU cores (though my solution was just to do the preprocessing outside of the CryoSparc GUI).

sittr · September 21, 2022, 12:08pm

Addendum: Apparently the formatter/markdown swallows text in angle brackets; the first part of ‘1.’ should read:

Forward the CPU reservation as “Amount of CPUs” x “Amount of Parallel Threads” instead of only “Amount of CPUs”; […]

CleoShen · September 21, 2022, 5:44pm

Same here. One more thing I found is that the Topaz job fails if I request two or more GPUs; it only works with one GPU.

wtempel · September 22, 2022, 6:44pm

@sittr Welcome to the forum. Thank you for reporting the issue and sharing your suggestions. We are looking into it.