Job is unresponsive - no heartbeat received in 30 seconds

donaldb · February 16, 2019, 1:56pm

I often get these error messages when my jobs are running. The jobs still complete and are not dead. This causes problems with queueing, as the next job in the queue will launch. This can then crash both jobs as they are launched on the same GPU.

I think this is often caused by the slightly laggy networked file system I am storing my project directories in.

Would it be possible for you to extend the ‘Unresponsive time’, perhaps to 120 seconds, to reduce the frequency of these warnings and make it more tolerant of slightly sub-optimal file storage setups?

Thanks a lot.

apunjani · February 22, 2019, 6:37pm

Hi @donaldb,

This is a good idea - we can make it a configurable number in the next version!

Ali

marino-j · March 10, 2020, 11:50am

Hi - I often get this error too. Is there a way to change that number? Do I remember correctly that in cryoSPARC v1 one could change it?
Many thanks!

stephan · March 10, 2020, 2:43pm

Hi @marino-j,

Yes, there is a way to change this number. You have to set the environment variable CRYOSPARC_HEARTBEAT_SECONDS in the cryosparc2_master/config.sh file.
For example, append this line to cryosparc2_master/config.sh:
export CRYOSPARC_HEARTBEAT_SECONDS=180
Then restart cryoSPARC:
cryosparcm restart
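
If the setting may need to be changed more than once, appending by hand can leave duplicate export lines in config.sh. The sketch below is one possible way to make the change repeatable; set_heartbeat is a hypothetical helper written for illustration, not part of cryoSPARC, and it assumes the variable is written in the same export form as above.

```shell
# set_heartbeat CONFIG_FILE SECONDS
# Hypothetical helper (not part of cryoSPARC): sets
# CRYOSPARC_HEARTBEAT_SECONDS in CONFIG_FILE, replacing an existing
# export line or appending one if none is present.
set_heartbeat() {
    local config="$1" seconds="$2"
    local line="export CRYOSPARC_HEARTBEAT_SECONDS=${seconds}"

    # Accept only a plain non-negative integer.
    case "$seconds" in
        ''|*[!0-9]*) echo "seconds must be an integer" >&2; return 1 ;;
    esac

    if grep -q '^export CRYOSPARC_HEARTBEAT_SECONDS=' "$config"; then
        # Replace the existing line via a temp file (avoids sed -i
        # portability differences and keeps the file's permissions).
        local tmp
        tmp="$(mktemp)" || return 1
        sed "s/^export CRYOSPARC_HEARTBEAT_SECONDS=.*/${line}/" "$config" > "$tmp" \
            && cat "$tmp" > "$config"
        rm -f "$tmp"
    else
        # Make sure the file ends with a newline before appending.
        if [ -n "$(tail -c1 "$config")" ]; then
            printf '\n' >> "$config"
        fi
        printf '%s\n' "$line" >> "$config"
    fi
}

# Example usage (path taken from the post above):
#   set_heartbeat cryosparc2_master/config.sh 180 && cryosparcm restart
```

The restart is left to the caller on purpose, since restarting cryoSPARC while jobs are running may not be desirable.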

marino-j · March 10, 2020, 4:03pm

Done! Thanks a lot for your help.

donghuachen · July 24, 2020, 5:35am

@apunjani, I am getting this error during Heterogeneous Refinement in v2.15.0, and JOB DETAILS already showed FAILED. However, my job somehow continued automatically after some time (I am not sure after how many seconds). Would it be possible to make 180 seconds the default for this parameter (export CRYOSPARC_HEARTBEAT_SECONDS=180) in the next version? I don’t want to bother our cluster admin to make the change at this moment. Thanks so much!