Processes stalling when submitting multiple jobs on cluster

open

#1

hi,

We have cryosparc installed on our cluster with 5 GPU nodes, each with 4 cards. Currently our IT people have it set up so we can submit cryosparc jobs and they get delegated to those cards in a non-random way (it seems to always start on the same GPU box until that is full, and then it moves to the second GPU box). This seems to make good use of resources, since multiple cryosparc jobs can share the same node rather than running 1 job per node (occupying all nodes).

Anyway, the problem is that when I try to run a few jobs on the same node (particularly big jobs with lots of particles) then they eventually all kinda fizzle out and stop running. There is no error, it just kinda gets stuck on “computing FSCs” for example. It can stay stuck for days. My solution has been to simply run 1 job at a time, or instead to submit the same job 5x until it starts using a second GPU node, then cancel the first 4 submissions. It’s kinda silly. Not really sure if this is a cryosparc issue or an issue for our IT guys. Even when I turn off “cache on SSD” it still happens.

Anyway, keep up the good work, thanks in advance.


#2

Hey @orangeboomerang,

Does this problem still apply? It’s possible that the job has written more information to its log, which you can read by running:
cryosparcm joblog <project_uid> <job_uid>
You might be able to monitor it using your cluster’s process monitoring tools as well.
The specific stage of the refinement job you’re referring to is very CPU-intensive. Is it possible that somehow the cluster gets bogged down? Maybe monitoring at this stage is worth it.
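If it helps, here is a rough sketch for checking whether a node is CPU-bound while a job sits at that stage. It assumes you can run Python directly on the GPU node (for example over SSH) and that it is a Linux/Unix machine. The function names and the threshold of 1.0 are illustrative choices of mine, not anything provided by cryoSPARC.

```python
import os


def load_per_core(load_avg: float, n_cores: int) -> float:
    """Return the load average normalised by the number of CPU cores."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return load_avg / n_cores


def is_oversubscribed(load_avg: float, n_cores: int, threshold: float = 1.0) -> bool:
    """True when more runnable processes are queued than there are cores.

    The threshold of 1.0 per core is a rule of thumb (an assumption),
    not a hard limit.
    """
    return load_per_core(load_avg, n_cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"cores={cores} load(1m)={one_min:.2f} load(15m)={fifteen_min:.2f}")
    if is_oversubscribed(five_min, cores):
        print("The node looks CPU-bound: concurrent jobs may be starving each other.")
    else:
        print("CPU load looks fine: the stall is probably caused by something else.")
```

Running this a few times on the node while several jobs are stuck should show whether the load stays well above the core count. If it does, that points to the scheduler packing too many CPU-heavy jobs onto one box rather than to a problem in the job itself.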