Failed job still running?

lizellelubbe · July 2, 2021, 6:23pm

Hi,
I just got an error: “job is unresponsive - no heartbeat received in 30 seconds” but the job seems to still be producing output.

It still shows up in nvidia-smi as a python job but disappeared from the cryosparc queue and is marked as failed. Has anyone seen this before?
I don’t know whether it will keep running or produce results, and I also don’t know how to stop it so that I can resubmit another job.
I am using cryosparc v3.2.0+210511

ClaudiaKielkopf · July 5, 2021, 11:03am

Hej there,

yes, I’ve also seen this before on the HPC cluster we use, no idea what’s causing this… My jobs usually finish despite this weird behaviour. Did yours too in the end? It made me a bit more cautious about instantly cloning or clearing failed jobs.

Cheers,
Claudia

apunjani · July 5, 2021, 3:15pm

Hi @ClaudiaKielkopf, @lizellelubbe,

The heartbeat system in cryoSPARC is what we use to monitor running jobs. Each job sends a heartbeat every 30 seconds by default, and if the master process doesn’t receive it, the master marks the job as failed (typically this happens because a compute node goes down or the process fails for an unknown reason). Sometimes, however, the job is still running but fails to send its heartbeat (due to a network error, a stall somewhere in the cluster system, etc.). If more than 30 seconds elapse, the master will think the job has failed, but the job will keep running. This is not a problem: the job will complete correctly and output results, and you can use these for further processing. But it does confuse the master application and the user!
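
To illustrate the failure mode described above, here is a minimal Python sketch of a heartbeat timeout check. It is a toy model under stated assumptions, not cryoSPARC's actual code: `JobRecord`, `check_heartbeat`, and `simulate` are hypothetical names, and time is simulated in whole seconds rather than read from a clock.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Master-side view of one job (hypothetical structure)."""
    last_heartbeat: float  # arrival time of the most recent heartbeat, in seconds
    status: str = "running"


def check_heartbeat(job: JobRecord, now: float, timeout: float = 30.0) -> str:
    """Mark a running job as failed if its last heartbeat is older than `timeout`."""
    if job.status == "running" and now - job.last_heartbeat > timeout:
        job.status = "failed"
    return job.status


def simulate(arrivals, end, timeout=30.0):
    """Replay heartbeat arrival times; the master checks once per simulated second."""
    job = JobRecord(last_heartbeat=0.0)
    pending = sorted(arrivals)
    for now in range(0, int(end) + 1):
        # Deliver every heartbeat that has arrived by this second.
        while pending and pending[0] <= now:
            job.last_heartbeat = pending.pop(0)
        # Once marked failed, the job stays failed even if heartbeats resume.
        check_heartbeat(job, float(now), timeout)
    return job.status


# The worker is alive the whole time, but a stall delays the heartbeats
# due at t=90 and t=120; the next one only arrives at t=135.
stalled = [30, 60, 135, 150, 180]

print(simulate(stalled, end=180, timeout=30))   # failed
print(simulate(stalled, end=180, timeout=120))  # running
```

With the 30-second timeout, the master records a failure at t = 91 s even though the worker resumes sending heartbeats at t = 135 s. With a longer timeout, the same stall is tolerated and the job stays marked as running.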
The way to work around this is to increase the heartbeat time. You can do this with the instructions posted here:

lizellelubbe · July 5, 2021, 4:40pm

Hi @ClaudiaKielkopf and @apunjani,
My job finished in the end but I am glad to now understand what happened and how to increase the heartbeat time. Thanks for the replies!