Job is unresponsive - no heartbeat received in 30 seconds

Hi,

We have quite a few jobs that die with the error message “Job is unresponsive - no heartbeat received in 30 seconds.” Sometimes simply re-submitting them is enough to make them work.

Do you know how to debug this kind of error?

Our system is:

  • Cuda: cuda91
  • OS: RHEL 7.4
  • Cluster: Slurm
  • CryoSparc: 2.4.5

Thanks,
Best,
Nicolas