Dear community,
This morning two of my Homogeneous Refinement (New) jobs were killed with the same error message: “job is unresponsive, no heartbeat received in 30 seconds”. Can anyone please tell me how to avoid this in the future? It’s pretty frustrating, especially since my job was probably killed near the end of the last iteration, after running for 2 days.
Thank you so much for your help.
The new refinement job currently requires more CPU RAM than the legacy refinement. Could it be that running your two jobs at the same time caused the machine to start swapping? Swapping is usually what stalls the system long enough to trigger the heartbeat error (which indicates that the job is non-responsive).
Also, how large is your dataset? Generally refinements should not be taking 2 days!
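If you want to check whether the machine is under memory pressure before queuing several jobs at once, something like the following could help. This is only a rough sketch, not part of the refinement software: it assumes a Linux worker node where `/proc/meminfo` is readable, and the function names (`parse_meminfo`, `memory_pressure`) and the 10% threshold are hypothetical choices for illustration.

```python
# Rough sketch (assumptions: Linux host, /proc/meminfo available).
# Function names and the 10% threshold are illustrative, not from any tool.

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_pressure(info, min_available_fraction=0.10):
    """Summarise available RAM and swap usage from a parsed meminfo dict."""
    total = info["MemTotal"]
    # Older kernels lack MemAvailable; fall back to MemFree in that case.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    fraction = available / total
    return {
        "available_fraction": fraction,
        "swap_used_kb": swap_used,
        "low_memory": fraction < min_available_fraction,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        report = memory_pressure(parse_meminfo(f.read()))
    print(report)
    if report["low_memory"] or report["swap_used_kb"] > 0:
        print("Memory is tight or swap is in use; consider running fewer jobs at once.")
```

Running it on the worker while jobs are active gives a quick indication of whether RAM is nearly exhausted or swap is already being used.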
Hello,
You are right. I was running 4 jobs on 4 GPUs. I’m currently re-running the job; hopefully it will go smoothly.
I have ~160k particles in that dataset. Hopefully this time it won’t take 2 days :)
Thank you so much for your help.
—Da