Endless "Cache waiting" errors when multiple non-uniform refinement jobs are launched at once

open

#1

Hi,

Often when I launch multiple non-uniform refinement jobs from the same dataset at once, all the jobs get stuck with the “cache waiting for requested files to become unlocked” error (attached). This may happen with other job types too, but non-uniform refinement is where I have noticed it.

This error does not appear if jobs are launched sequentially. However, once the jobs are stuck in this state, even killing all but one of them does not let the remaining job proceed to completion. There is an easy workaround - just wait 30 s between submitting each job - but this still seems like a bug of some kind (still present in v2.11).
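For anyone queuing jobs from a script rather than by hand, the workaround could be sketched roughly as below. This is only an illustration under assumptions: `submit` is a hypothetical placeholder for whatever call actually queues a job in your setup (it is not a cryoSPARC function), and the 30 s default is simply the delay that has worked for me, not a documented value.

```python
import time
from typing import Callable, Iterable, List, TypeVar

J = TypeVar("J")
R = TypeVar("R")


def submit_staggered(
    jobs: Iterable[J],
    submit: Callable[[J], R],
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Submit jobs one at a time, pausing between consecutive submissions.

    `submit` is a hypothetical callable supplied by the caller; it stands in
    for whatever actually queues a job. `sleep` is injectable so the delay
    can be skipped in tests.
    """
    results: List[R] = []
    for index, job in enumerate(jobs):
        # No pause before the first job; wait delay_s before each later one.
        if index > 0:
            sleep(delay_s)
        results.append(submit(job))
    return results
```

Calling `submit_staggered(job_list, my_submit_function)` would then space the submissions 30 s apart instead of firing them all at once.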

Cheers
Oli


#2

Hi @olibclarke,
We haven’t been able to reproduce this ourselves. What’s the setup? Are the jobs being launched on the same standalone worker node? Is the master running on the same node?


#3

Hi Ali,

This happens frequently on both of my systems (standalone GPU workstations) and on the standalone GPU workstation of an adjacent lab. In all cases the master and worker are on the same node, yes. Happy to provide any info you need to debug, just let me know what would be useful.

Cheers
Oli


#4

Another possibly related bug - if I start a local refinement job with multiple particle inputs, I get the same endless cache waiting error, even if it is the only job submitted: