We have a cluster of 4 nodes and submit jobs using Slurm. Occasionally, two jobs that each require >50% of scratch space get sent to the same node, and the second job has to wait for hours until scratch becomes available. My two workarounds are to keep resubmitting the waiting job until it lands on a different node, or to set up additional lanes corresponding to each node "for emergency use only", but there must be a better way. Is it possible to expose Slurm's --nodelist (-w) option so it can be passed through at submission time in these cases, or is there some other solution?
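The original example script is not included here, but the idea can be sketched as follows. This is an assumed, minimal illustration, not the poster's actual template: a wrapper writes a batch script whose header line picks up whatever extra Slurm flags are stored in the variable `extra_param`. The names `extra_param`, `job.sbatch`, and `./my_program` are all illustrative.

```shell
#!/bin/bash
# Hedged sketch: append user-supplied Slurm flags to a header line of a
# generated batch script. extra_param could equally be "-x node1" (exclude
# node1) or "-w node2 --mem=60G --constraint=intel" (several flags at once).
extra_param="-w node[2-4]"

# Write the batch script; the value of extra_param lands at the end of an
# #SBATCH header line, so Slurm parses it like any other submission flag.
cat > job.sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=1 $extra_param
srun ./my_program
EOF

# Show the header line that now carries the node restriction; in real use
# you would follow with: sbatch job.sbatch
grep -- '--ntasks' job.sbatch
```

Because `#SBATCH` lines accept any command-line flag, leaving a variable like this at the end of one header line gives you a single hook for per-submission node selection without defining extra lanes.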
In the above example, the variable extra_param, appended to the end of any line in the script header, can be defined as "-w node[2-4]" or "-x node1" at submission time to include or exclude certain nodes, respectively.
It can also be set to a string of several Slurm flags, e.g. "-w node2 --mem=60G --constraint=intel".