Failed job still running?

apunjani · July 5, 2021, 3:15pm

The heartbeat system in cryoSPARC is what we use to monitor running jobs. Jobs send a heartbeat every 30 seconds by default, and if the master process doesn’t receive it, it marks the job as failed (typically this happens because a compute node goes down or the process fails for an unknown reason). Sometimes, however, a job is still running but fails to send its heartbeat (due to a network error, a stall somewhere in the cluster system, etc.). If more than 30 seconds elapse, the master will think the job has failed, but the job will keep running. This is not a problem for the results, i.e. the job will complete correctly and write its outputs, and you can use these for further processing. But it does confuse the master application and the user!
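To make the failure mode concrete, here is a minimal Python sketch of a timeout-based heartbeat check. It is not cryoSPARC's actual code: the `Job` class, the `master_status` function, and the `HEARTBEAT_TIMEOUT_S` name are hypothetical, and the 30-second value is simply the figure quoted above.

```python
from dataclasses import dataclass

# Illustrative value taken from the post; not read from any cryoSPARC config.
HEARTBEAT_TIMEOUT_S = 30.0


@dataclass
class Job:
    """Hypothetical stand-in for a job as seen by the master."""
    last_heartbeat: float          # time (s) the master last received a heartbeat
    actually_running: bool = True  # ground truth on the worker; invisible to the master


def master_status(job: Job, now: float, timeout: float = HEARTBEAT_TIMEOUT_S) -> str:
    """Return the status the master would infer from heartbeats alone."""
    if now - job.last_heartbeat > timeout:
        return "failed"
    return "running"


job = Job(last_heartbeat=0.0)

# Heartbeat arrived recently: master and worker agree.
print(master_status(job, now=20.0))

# Heartbeat delayed past the timeout: the master reports a failure
# even though job.actually_running is still True.
print(master_status(job, now=45.0))

# With a longer timeout, the same delay is tolerated.
print(master_status(job, now=45.0, timeout=120.0))
```

The point of the sketch is that the master only sees heartbeat timestamps, so a delayed heartbeat and a dead job look identical to it. A longer timeout tolerates longer stalls, at the cost of genuinely dead jobs taking longer to be flagged.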
The way to work around this is to increase the heartbeat timeout, i.e. how long the master waits for a heartbeat before marking the job as failed. You can do this with the instructions posted here: