Removing Zombie Jobs

gebauer · November 14, 2022, 11:01am

Hi all,

During the installation of a new node, we did a couple of test runs and then changed its registration from “cluster” to “internal managed node”. Now we have several zombie jobs that cannot be killed: killing them relies on a command (scancel) that is no longer available on the node, since Slurm is no longer installed there. So the “kill job” button does not work, and we also cannot mark the jobs as completed (the button is greyed out).
Can we somehow get rid of these zombie jobs?

I think this is mostly the result of our botched installation.

Best
Jan

wtempel · November 14, 2022, 9:27pm

For a hypothetical “zombie” job J123 in project P999, you may try the following, running the commands under the Linux account that runs the CryoSPARC instance:

1. Ensure the corresponding process on the compute node has been terminated.
2. Mark the job as killed:
   cryosparcm cli "set_job_status('P999', 'J123', 'killed')"
3. Mark the job as completed:
   cryosparcm cli "set_job_status('P999', 'J123', 'completed')"
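If several jobs are affected, the same two `cryosparcm cli "set_job_status(...)"` calls can be wrapped in a small loop. The script below is a minimal sketch, not an official CryoSPARC tool: the file name `clear_zombie_jobs.sh`, the function names and the `DRY_RUN` / `MARK_COMPLETED` switches are hypothetical, and it assumes you have already confirmed that the jobs' processes on the compute node are gone.

```shell
#!/usr/bin/env bash
# Hypothetical helper (clear_zombie_jobs.sh): applies the two
# `cryosparcm cli "set_job_status(...)"` calls to several jobs of one
# project. Run it as the Linux account that runs the CryoSPARC instance,
# and only after the jobs' processes on the compute node are confirmed gone.
#
# Environment switches (assumed names, not CryoSPARC settings):
#   DRY_RUN=1         print the commands instead of running them
#   MARK_COMPLETED=0  stop after marking each job as killed

# Run (or, in dry-run mode, print) a single `cryosparcm cli` expression.
cryosparc_cli() {
  local expr="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'cryosparcm cli "%s"\n' "$expr"
  else
    cryosparcm cli "$expr"
  fi
}

# clear_zombie_jobs PROJECT JOB [JOB...]
clear_zombie_jobs() {
  if [ "$#" -lt 2 ]; then
    echo "usage: clear_zombie_jobs PROJECT JOB [JOB...]" >&2
    return 2
  fi
  local project="$1"
  shift
  local job
  for job in "$@"; do
    # Mark the job as killed first; stop if that call fails.
    cryosparc_cli "set_job_status('${project}', '${job}', 'killed')" || return 1
    # Also mark it completed unless MARK_COMPLETED=0.
    if [ "${MARK_COMPLETED:-1}" = "1" ]; then
      cryosparc_cli "set_job_status('${project}', '${job}', 'completed')" || return 1
    fi
  done
}

# When executed directly with arguments (not sourced), process them.
if [ "${BASH_SOURCE[0]}" = "$0" ] && [ "$#" -gt 0 ]; then
  set -eu
  clear_zombie_jobs "$@"
fi
```

Preview first with `DRY_RUN=1 ./clear_zombie_jobs.sh P999 J123 J124`, then rerun without `DRY_RUN` to apply. Setting `MARK_COMPLETED=0` stops after the “killed” status, which is enough unless you want to keep the jobs' preliminary outputs.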

gebauer · November 15, 2022, 4:31pm

Dear wtempel,

this is exactly what I needed.
Do I really need to both kill AND complete the jobs?
After killing them, they are gone from the statistics…

Best and thanks
Jan

wtempel · November 15, 2022, 7:14pm

You don’t have to, unless you are interested in the given jobs’ preliminary outputs.