Processes stalling when submitting multiple jobs on cluster

orangeboomerang · July 8, 2019, 5:45pm

hi,

we have cryosparc installed on our cluster with 5x GPU nodes, each with 4 cards. Currently our IT people have it set up so we can submit cryosparc jobs and they get delegated to those cards in a non-random way (it seems to always start on the same GPU box until that is full and then moves to the second GPU box). In this way it seems that optimal use of resources is achieved, since multiple cryosparc jobs can share the same node rather than running 1 job per node (occupying all nodes).

Anyway, the problem is that when I try to run a few jobs on the same node (particularly big jobs with lots of particles) then they eventually all kinda fizzle out and stop running. There is no error, it just kinda gets stuck on “computing FSCs” for example. It can stay stuck for days. My solution has been to simply run 1 job at a time, or instead to submit the same job 5x until it starts using a second GPU node, then cancel the first 4 submissions. It’s kinda silly. Not really sure if this is a cryosparc issue or an issue for our IT guys. Even when I turn off “cache on SSD” it still happens.

Anyway, keep up the good work, thanks in advance.

stephan · July 30, 2019, 5:51pm

Hey @orangeboomerang,

Does this problem still apply? It’s possible that the job has written more information to its log, which you can read by running:
cryosparcm joblog <project_uid> <job_uid>
You might be able to monitor it using your cluster’s process monitoring tools as well.
The specific stage of the refinement job you’re referring to is very CPU-intensive. Is it possible that somehow the cluster gets bogged down? Maybe monitoring at this stage is worth it.
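One rough way to check whether a node is bogged down at this stage is to compare its load average to its core count. The following is only a minimal sketch, not part of cryoSPARC: the function names and the 1.0 threshold are assumptions for illustration, and `os.getloadavg()` is only available on Unix-like systems. It would need to be run on the GPU node itself while the job sits at the FSC step.

```python
import os


def load_per_core(load_1min, n_cores):
    """Ratio of the 1-minute load average to the number of CPU cores."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return load_1min / n_cores


def node_is_oversubscribed(threshold=1.0):
    """True if the load per core exceeds the threshold.

    The default of 1.0 is an assumed rule of thumb, not a cryoSPARC setting.
    """
    load_1min, _, _ = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1        # cpu_count() may return None
    return load_per_core(load_1min, cores) > threshold
```

A ratio well above 1 while several jobs are at the FSC stage would be consistent with the CPUs being oversubscribed, although it would not prove that this is the cause of the stall.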

orangeboomerang · November 26, 2019, 5:13pm

I checked the job.log in the output folder and it is ~6000 lines saying “sending heartbeat”. Then once every several thousand lines there is something like “0.143 at 49.206 radwn. 0.5 at 36.500 radwn. Took 2726.847s. ---- Computing FSC with mask 6.25 to 23.00”
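Since the log is mostly heartbeat lines, a small filter can pull out just the FSC timings so it is easier to see how long each step takes. This is a minimal sketch that assumes the timing lines look like the one quoted above (“Took 2726.847s.”); the regular expression and the function name are hypothetical and not part of cryoSPARC.

```python
import re
import sys

# Matches the timing fragment quoted above, e.g. "Took 2726.847s."
TOOK_RE = re.compile(r"Took\s+([0-9]+(?:\.[0-9]+)?)s")


def fsc_timings(lines):
    """Return the 'Took <seconds>s' values found in job.log lines, as floats."""
    timings = []
    for line in lines:
        # Skip the heartbeat noise that makes up most of the log.
        if "sending heartbeat" in line:
            continue
        for match in TOOK_RE.finditer(line):
            timings.append(float(match.group(1)))
    return timings


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python fsc_timings.py /path/to/job.log
    with open(sys.argv[1]) as fh:
        for seconds in fsc_timings(fh):
            print(f"{seconds:.1f} s ({seconds / 60:.1f} min)")
```

For the line quoted above this would report a single step of roughly 45 minutes, which can then be compared between a run on a shared node and a run on an empty one.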

This problem definitely persists. I suspect it may be worse when certain users try to submit jobs using more resources than everyone else (e.g. setting 8 GPUs to parallelize when that is probably impossible on our setup).

I’ve gotten into the habit of resubmitting jobs time and again until one finally works because nobody else has submitted to that node.