One of our users has maxed out our scratch and swap space while running a non-uniform refinement job, and there have been no updates in the CryoSPARC web UI for the past day and a half.
When you try to view the job log, all it says is “sending heartbeat.” nvidia-smi still shows CryoSPARC running a Python job, but the VRAM usage seems low. How can we definitively verify that the job is still running?
Unfortunately, I cannot answer this question. I would nevertheless like to pass along an opinion from one of our team members, which I agree with:
Pretty much any time a job gets deep into swap, I would personally recommend killing it. Even if it does eventually complete, it could cause the machine to become unresponsive, and could take orders of magnitude longer than it should.
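If it helps when deciding whether to kill the job, here is a minimal sketch for checking how much of a process has been pushed into swap. It is not a CryoSPARC tool and it cannot prove the job is making progress. It assumes a Linux host and that you take the PID from nvidia-smi; the function names are made up for illustration. It only reads the `State`, `VmRSS` and `VmSwap` fields from `/proc/<pid>/status`.

```python
import sys


def parse_proc_status(text):
    """Pull the process state and memory counters out of /proc/<pid>/status text."""
    info = {"state": None, "vm_rss_kb": None, "vm_swap_kb": None}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "State":
            info["state"] = value  # e.g. "D (disk sleep)" or "R (running)"
        elif key == "VmRSS":
            info["vm_rss_kb"] = int(value.split()[0])  # resident memory, kB
        elif key == "VmSwap":
            info["vm_swap_kb"] = int(value.split()[0])  # swapped-out memory, kB
    return info


def inspect_pid(pid):
    """Read and parse the status file for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_status(fh.read())


# Hypothetical usage: python check_swap.py <pid>
if __name__ == "__main__" and len(sys.argv) == 2:
    print(inspect_pid(int(sys.argv[1])))
```

A large `vm_swap_kb` relative to `vm_rss_kb`, together with a state that stays at `D (disk sleep)` across repeated checks, would suggest the process is spending its time waiting on swap I/O rather than computing, which is the situation described above.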