No heartbeat in 4.2.1

Hi,

I've seen others have had this problem before. In my case I think it is triggered when the network filesystem has a "hiccup". This is unfortunate because the jobs themselves frequently do not die; they keep running and updating the DB. In many cases the job will eventually switch from failed to completed, but the danger is that the scheduler loses touch with the job, can't kill it, and may schedule another job on the same GPU.

Is the heartbeat system using a temporary file in the job directory to communicate between the worker and the scheduler?

Looking for a clean way to mark the job as still running.

Thanks,
-Craig

Just notes from investigating… it seems the DB 'heartbeat_at' field is what keeps the job looking alive in the UI, but then I'm not sure why a filesystem hiccup would cause the heartbeat update to fail.

import datetime
db.jobs.update_one({'project_uid': 'P192', 'uid': 'J357'}, {'$set': {'status': 'running', 'failed_at': None, 'heartbeat_at': datetime.datetime.utcnow()}})
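
For anyone else poking at this, it may help to look at the job document before changing it. A minimal sketch, assuming the same pymongo-style db handle used above (the projection just limits the output to the relevant fields):

# Inspect the job document first, using the same db handle as above.
# The projection restricts the output to the fields of interest.
doc = db.jobs.find_one(
    {'project_uid': 'P192', 'uid': 'J357'},
    {'status': 1, 'failed_at': 1, 'heartbeat_at': 1, '_id': 0},
)
print(doc)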

I don't think the heartbeat update is performed by the worker itself; is it done by the scheduler on the master server instead?

I found out how the scheduler checks for jobs with old heartbeats. I gave my errant jobs an extra day to wrap up:

from datetime import datetime, timedelta
db.jobs.update_one({'project_uid': 'P192', 'uid': 'J357'}, {'$set': {'status': 'running', 'failed_at': None, 'heartbeat_at': datetime.utcnow()+timedelta(days=1)}})
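
To be clear about why this works, here is my rough mental model of the staleness check, not CryoSPARC's actual code: the scheduler presumably treats a job as dead once heartbeat_at is older than the timeout, so pushing heartbeat_at into the future buys that much extra time.

from datetime import datetime, timedelta

HEARTBEAT_SECONDS = 60  # heartbeat timeout in seconds (the default is 60)

def heartbeat_is_stale(heartbeat_at, now=None):
    # A job looks dead once its last heartbeat is older than the timeout window.
    now = now or datetime.utcnow()
    return heartbeat_at < now - timedelta(seconds=HEARTBEAT_SECONDS)

# Setting heartbeat_at to utcnow() + timedelta(days=1) therefore keeps the job
# looking alive for roughly an extra day on top of the normal timeout.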

@yoshiokc Increasing the heartbeat timeout interval may also mitigate the issue. You could include
export CRYOSPARC_HEARTBEAT_SECONDS=600
inside
/path/to/cryosparc_master/config.sh.
A CryoSPARC restart is required for the customized setting (default: 60 seconds) to take effect.

Thanks @wtempel,

I tried that already… oddly enough, whatever network or filesystem glitch is triggering the heartbeat timeout seems to cause it regardless of whether connectivity is lost for 5 seconds or for minutes. That's why I was curious how the heartbeat from workers to the scheduler is implemented (i.e., is it doing something like opening a file and checking for updates? In that case a stale file descriptor might never recover even if the network and filesystem do).

I am passing along my colleague's description of an alternative failure mode that is independent of CRYOSPARC_HEARTBEAT_SECONDS (a rough sketch of this retry loop follows the list):

  1. The worker attempts to send a heartbeat to the master
  2. Due to a network error, the heartbeat request fails
  3. The worker waits 10 seconds before attempting another heartbeat
  4. After 3 consecutive heartbeat failures, the worker terminates itself
  5. After not receiving any heartbeat, command marks the worker as Failed
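
For anyone curious what that sequence looks like in code, here is a minimal sketch of such a retry loop. This is my own illustration, not CryoSPARC's implementation; heartbeat_loop and the send_heartbeat callable are hypothetical names:

import time

RETRY_WAIT_SECONDS = 10   # step 3: wait between failed attempts
MAX_FAILURES = 3          # step 4: give up after 3 consecutive failures

def heartbeat_loop(send_heartbeat, interval_seconds=60):
    # send_heartbeat is a hypothetical callable that raises OSError on network error.
    failures = 0
    while True:
        try:
            send_heartbeat()              # step 1: tell the master we are alive
            failures = 0                  # a success resets the failure counter
            time.sleep(interval_seconds)
        except OSError:                   # step 2: the heartbeat request failed
            failures += 1
            if failures >= MAX_FAILURES:
                return                    # step 4: the worker terminates itself
            time.sleep(RETRY_WAIT_SECONDS)  # step 3: back off, then retry

Step 5 happens on the master side and is outside this sketch: once heartbeats stop arriving, command marks the worker as failed.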

Thanks @wtempel, is it possible the worker isn't terminating in all circumstances/jobs? I've definitely seen a few jobs still running even after the server logs a 'no heartbeat detected for last ## secs'.

It is possible that a worker process continues running after the 'no heartbeat' message is logged, but perhaps under a different, CRYOSPARC_HEARTBEAT_SECONDS-dependent, failure mode? For a future CryoSPARC release, we plan to change how worker processes are handled following a CRYOSPARC_HEARTBEAT_SECONDS-dependent job failure.

Cool, thanks for checking on this!