Hi,
I’ve seen others have had this problem before; in my case I think it is triggered when the network filesystem has a “hiccup”. This is unfortunate because the jobs themselves frequently do not die, but keep running and updating the DB. In many cases the status will eventually switch from failed to completed, but the danger is that the scheduler loses touch with the job, can’t kill it, and may schedule another job on the same GPU.
Is the heartbeat system using a temporary file in the job directory to communicate between the worker and the scheduler?
Looking for a clean way to mark the job as still running.
Thanks,
-Craig
Just some notes from investigating… it seems the DB ‘heartbeat_at’ field is what keeps the job looking alive in the UI, but I’m still not sure why the filesystem is causing the heartbeat update to fail.
import datetime
db.jobs.update_one({'project_uid': 'P192', 'uid': 'J357'}, {'$set': {'status': 'running', 'failed_at': None, 'heartbeat_at': datetime.datetime.utcnow()}})
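(For anyone repeating this: db above is a pymongo handle to the CryoSPARC database. A minimal connection sketch, assuming a default install where MongoDB listens on base port + 1, i.e. 39001, and the database is named 'meteor'; both may differ on your system:)
from pymongo import MongoClient

# Assumptions: default base port 39000 (MongoDB at 39001) and database name 'meteor';
# check your own cryosparc_master/config.sh if these don't match your install.
client = MongoClient('localhost', 39001)
db = client['meteor']

# sanity check before editing anything
print(db.jobs.find_one({'project_uid': 'P192', 'uid': 'J357'}, {'status': 1, 'heartbeat_at': 1}))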
I don’t think the process to update the heartbeat is run from the worker, but from the scheduler on the master server?
I found out how the scheduler checks for jobs with old heartbeats. I gave my errant jobs an extra day to wrap up:
from datetime import datetime, timedelta
# push heartbeat_at a day into the future so the stale-heartbeat check leaves the job alone
db.jobs.update_one({'project_uid': 'P192', 'uid': 'J357'}, {'$set': {'status': 'running', 'failed_at': None, 'heartbeat_at': datetime.utcnow()+timedelta(days=1)}})
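To make that concrete, here is roughly how I picture the scheduler-side check working; this is only my guess based on the fields above, not CryoSPARC’s actual code:
from datetime import datetime, timedelta

def find_stale_jobs(db, timeout_seconds=60):
    """Guess at the scheduler-side check: running jobs whose last heartbeat is older than the timeout."""
    cutoff = datetime.utcnow() - timedelta(seconds=timeout_seconds)
    return list(db.jobs.find(
        {'status': 'running', 'heartbeat_at': {'$lt': cutoff}},
        {'project_uid': 1, 'uid': 1, 'heartbeat_at': 1}
    ))

# e.g. stale = find_stale_jobs(db); any job listed here is presumably what the scheduler treats as lost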
@yoshiokc Adding an increased heartbeat timeout interval may also mitigate the issue. You could include
export CRYOSPARC_HEARTBEAT_SECONDS=600
inside /path/to/cryosparc_master/config.sh. A CryoSPARC restart would be required for the customized setting, whose default is 60 seconds, to become effective.
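To gauge whether a larger timeout would actually help, you could watch how old the heartbeat_at timestamps of running jobs become during one of these hiccups. A minimal sketch, assuming a pymongo db handle like the one used earlier in the thread (this is not an official tool):
from datetime import datetime

def heartbeat_ages(db):
    """Print how many seconds have passed since the last heartbeat of each running job."""
    now = datetime.utcnow()
    for job in db.jobs.find({'status': 'running'}, {'project_uid': 1, 'uid': 1, 'heartbeat_at': 1}):
        if job.get('heartbeat_at') is not None:
            age = (now - job['heartbeat_at']).total_seconds()
            print(job['project_uid'], job['uid'], '%.0fs since last heartbeat' % age)

# ages that repeatedly approach the configured timeout suggest jobs are at risk of being marked failed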
Thanks @wtempel,
I tried that already… oddly enough, whatever network or filesystem glitch is triggering the heartbeat timeout seems to happen regardless of whether connectivity is lost for 5 seconds or for minutes. That’s why I was curious how the heartbeat from workers to the scheduler is implemented (i.e. is it doing something like opening a file and checking for updates? In that case a stale file descriptor might never recover even after the network and filesystem do).
I am passing along my colleague’s description of an alternative failure mode that is independent of CRYOSPARC_HEARTBEAT_SECONDS (sketched in code after the list):
- The worker attempts to send a heartbeat to the master
- Due to a network error, the heartbeat request fails
- The worker waits 10 seconds before attempting another heartbeat
- After 3 consecutive heartbeat failures, the worker terminates itself
- After not receiving any heartbeat, command (the master process) marks the worker as failed
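For illustration only, that retry behaviour could be sketched as below; the send_heartbeat callable and the regular interval are placeholders, not CryoSPARC’s actual worker code:
import sys
import time

RETRY_WAIT_SECONDS = 10       # wait between attempts after a failure, as described above
MAX_CONSECUTIVE_FAILURES = 3  # worker terminates itself after this many failures in a row

def heartbeat_loop(send_heartbeat, interval_seconds=30):
    """Illustrative worker-side loop; send_heartbeat is a placeholder callable that raises on network errors."""
    failures = 0
    while True:
        try:
            send_heartbeat()
            failures = 0
            time.sleep(interval_seconds)
        except Exception:  # e.g. a network error while contacting the master
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                sys.exit(1)  # worker gives up and terminates itself
            time.sleep(RETRY_WAIT_SECONDS)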
Thanks @wtempel, is it possible the worker isn’t terminating in all circumstances/jobs? I’ve definitely seen a few jobs still running even after the server logs a ‘no heartbeat detected for last ## secs’.
It is possible that a worker process continues running after such a failure, but perhaps under a different, CRYOSPARC_HEARTBEAT_SECONDS-dependent, failure mode? For a future CryoSPARC release, we plan to change how worker processes are handled following a CRYOSPARC_HEARTBEAT_SECONDS-dependent job failure.
cool, thanks for checking on this!