No heartbeat error in v4.4

wtempel · February 29, 2024, 9:15pm

@luisshulk Are you and the cluster admin still observing that nodes are drained due to Reason=Kill task failed? There are several discussions of this topic.

The motivation behind this seemingly annoying node drain is explained here.

There are also debugging and resolution suggestions.

This observation makes me wonder whether adding the scancel --full option (short form -f) to your cluster target configuration might help. That would mean changing

"qdel_cmd_tpl": "scancel {{ cluster_job_id }}"

to

"qdel_cmd_tpl": "scancel -f {{ cluster_job_id }}"

I have not tested this.
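
To make the effect of the change concrete, here is a small sketch of the command each template would produce. It assumes the {{ cluster_job_id }} placeholder is simply replaced with the Slurm job ID before the command runs; render_cmd is a hypothetical helper for illustration, not part of any real tool, and the job ID 12345 is made up.

```python
import re


def render_cmd(template: str, cluster_job_id: str) -> str:
    """Hypothetical helper: fill the {{ cluster_job_id }} placeholder.

    Assumes plain text substitution; the real templating may differ.
    """
    pattern = r"\{\{\s*cluster_job_id\s*\}\}"
    # A callable replacement avoids any backslash interpretation in the ID.
    return re.sub(pattern, lambda _match: cluster_job_id, template)


old_tpl = "scancel {{ cluster_job_id }}"
new_tpl = "scancel -f {{ cluster_job_id }}"

# Example job ID chosen only for illustration.
print(render_cmd(old_tpl, "12345"))
print(render_cmd(new_tpl, "12345"))
```

According to the scancel man page, -f/--full also signals the batch script and its child processes, which is why it might help here if leftover child processes are what keep the kill from completing.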