My CS jobs do not read "export CRYOSPARC_HEARTBEAT_SECONDS"?

N.T · October 8, 2021, 7:03pm

Hi all,

My jobs on an HPC cluster often fail with the error message:
“Job is unresponsive - no heartbeat received in 30 seconds.”
I increased “CRYOSPARC_HEARTBEAT_SECONDS” to 180 and then to 3600, but I still get the same error.

cryosparcm status

gives “export CRYOSPARC_HEARTBEAT_SECONDS=3600” properly.

Did the CS job ignore the setting, or did it see no heartbeat for 3600 sec and still report the “30 sec” error?

Let me know if anything is wrong with that line, or if any log would be useful for diagnosis.
Thanks for your help!

N.T · October 11, 2021, 5:29pm

Not sure if these are relevant, but the system configuration is:
CentOS Linux 7, Linux 3.10.0-1160.42.2.el7.x86_64
CUDA 9.2.88
Master and worker are installed in the same node.

This has been a long-standing issue since v2.9, but the current version I use is v3.2.0+210817.
Also, when I change the port number in config.sh, CS freezes during launch (so I use the original port).
The no-heartbeat error blocks my data processing workflow because “heavy” jobs never complete.
Any help is appreciated. Thanks,

stephan · October 13, 2021, 3:05pm

Hi @N.T,

Usually, when users report that the environment variables they set aren’t working, it’s because they haven’t done a full restart of cryosparc: cryosparcm stop then cryosparcm start.
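In your case, the error message you’re seeing is actually hard coded (this is going to be fixed in the next patch), so the second scenario you described (no heartbeat for 3600 sec, with the error still reporting “30 sec”) is most likely what happened.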

This heartbeat error occurs when the process running the cryoSPARC job doesn’t report back to the master instance. This can happen if the process was killed abnormally, or if network issues are hampering the API call back to the master instance. I’d suggest checking whether your cluster resource manager is killing off cryoSPARC jobs: is there a time limit on jobs? Are there network errors? You can use the cluster resource manager’s API to query the job’s history, and check the job’s own log as well: cryosparcm joblog <PUID> <JUID>
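
As a rough sketch of the checks described above: the helper below only reads the heartbeat value from a config.sh file, and the remaining steps are listed as comments because they depend on your installation. The function name heartbeat_from_config is made up for illustration, the IDs are placeholders, and the sacct line assumes a SLURM scheduler, which has not been stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of cryoSPARC): print the value from the last
# `export CRYOSPARC_HEARTBEAT_SECONDS=...` line in the given config.sh file.
heartbeat_from_config() {
    sed -n 's/^[[:space:]]*export[[:space:]]\{1,\}CRYOSPARC_HEARTBEAT_SECONDS="\{0,1\}\([0-9]\{1,\}\).*/\1/p' "$1" | tail -n 1
}

# 1. After editing config.sh, do a full restart so the new value is picked up,
#    then confirm what the running instance reports:
#      cryosparcm stop
#      cryosparcm start
#      cryosparcm status | grep CRYOSPARC_HEARTBEAT_SECONDS
#
# 2. Check the job's own log (project and job IDs are placeholders):
#      cryosparcm joblog <PUID> <JUID>
#
# 3. Ask the scheduler how the job ended. ASSUMPTION: SLURM; use the
#    equivalent command for your own resource manager:
#      sacct -j <scheduler_job_id> --format=JobID,State,ExitCode,Elapsed,Timelimit
```

For example, calling heartbeat_from_config with the path to your own config.sh should print the number you set; if it prints nothing, the export line is missing or malformed.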

N.T · October 13, 2021, 6:04pm

Thanks for your reply and help, @stephan.

I carefully ran cryosparcm stop/start after each change, several times, so that part should be fine.
It’s good to know that it reports “30 sec” regardless, and it sounds like the problem is that the job is being killed by the resource manager.
Jobs should be allowed to run for ~7 days on that node, but I should probably set that 7-day limit explicitly just in case.

I always clear failed jobs to save some storage space, but the next time I see the same error, I will check cryosparcm joblog.

Best, -NT