Hi,
We have quite a few jobs that die with the error message “Job is unresponsive - no heartbeat received in 30 seconds.” Sometimes simply re-submitting them is enough to make them work.
Do you know how to debug this kind of error?
Our system is:
- Cuda: cuda91
- OS: RHEL 7.4
- Cluster: Slurm
- CryoSparc 2.4.5
Thanks,
Best,
Nicolas