I am able to thread jobs across multiple GPUs, but we can’t launch more than four jobs at a time. I don’t see anything about this in the installation guide or here, and I am wondering whether I configured the installation incorrectly.
A quick follow-up: I am noticing that some of the jobs are throwing the following error:
Job is unresponsive - no heartbeat received in 30 seconds.
The job then shows as “Failed”, and removes itself from the queue.
However, the job keeps running in the background after the heartbeat error is thrown and runs to completion, taking up resources without showing in the queue. I believe this is causing the resource allocation problem.
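To illustrate how a job can be marked "Failed" while its process keeps running, here is a minimal sketch of a heartbeat check. It is not cryoSPARC's actual implementation: `JobRecord`, `check_heartbeats`, and `HEARTBEAT_TIMEOUT_S` are hypothetical names, and the only detail taken from this thread is the 30-second timeout in the error message.

```python
from dataclasses import dataclass

# Assumption: mirrors the "30 seconds" in the error message above.
HEARTBEAT_TIMEOUT_S = 30.0


@dataclass
class JobRecord:
    """Hypothetical scheduler-side record for one job."""
    job_id: str
    last_heartbeat: float  # timestamp of the last heartbeat, in seconds
    status: str = "running"


def check_heartbeats(jobs, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Mark running jobs as failed when their heartbeat is older than `timeout`.

    Only the bookkeeping changes here. Nothing in this function terminates
    the underlying process, so a stalled-but-alive job could keep holding
    GPUs and memory after it disappears from the queue.
    """
    newly_failed = []
    for job in jobs:
        if job.status == "running" and now - job.last_heartbeat > timeout:
            job.status = "failed"
            newly_failed.append(job.job_id)
    return newly_failed
```

In a design like this, a job whose heartbeat is merely delayed (for example, because the machine is under memory pressure) is reported as failed even though it is still doing work, which would match the behavior described above.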
Hi @bowman, thanks for reporting this. We have found a bug that causes num_cpus*2 redundant threads to be created for every cryoSPARC worker process, and we will fix it. The heartbeat error is probably a separate problem. Can you tell at what point during a processing run, and in which job types, the heartbeat error occurs?
We isolated it to a cache issue: when cryoSPARC read files, the kernel was caching them in memory, which is not needed. We now clear the unused cache constantly, and that has stopped the heartbeat errors.
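The post does not say how the cache is cleared. One common way on Linux is the kernel's `/proc/sys/vm/drop_caches` interface, so the sketch below assumes that approach. The function names are hypothetical; `drop_page_cache` needs root and only frees clean (already written back) pages.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer value}.

    Most fields are reported in kB, e.g. 'Cached:  4096 kB'.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def read_cached_kb(path="/proc/meminfo"):
    """Return the current page cache size in kB (0 if the field is missing)."""
    with open(path) as fh:
        return parse_meminfo(fh.read()).get("Cached", 0)


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to free clean page cache. Linux only, requires root."""
    os.sync()  # write dirty pages to disk first so more cache can be freed
    with open(path, "w") as fh:
        fh.write("1\n")  # 1 = page cache only; 3 would also drop dentries/inodes
```

Calling `read_cached_kb()` before and after `drop_page_cache()` shows how much memory was released. Running this on a schedule (for example from cron) would approximate "clearing the cache constantly", though whether that matches the setup described here is an assumption.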