Endless "Cache waiting" errors when multiple non-uniform refinement jobs are launched at once

olibclarke · September 17, 2019, 5:47pm

Hi,

Often when I launch multiple non-uniform refinement jobs (from the same dataset) at once, all the jobs get stuck with the “cache waiting for requested files to become unlocked” error (attached). This may happen with other job types too, but that is where I have noticed it.

This error does not appear if jobs are launched sequentially. Once the jobs are stuck in this state, though, even killing all but one of them does not let the remaining job proceed to completion. There is an easy workaround - just wait 30s between submitting each job - but this still seems like a bug of some kind (still present in v2.11).
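For anyone scripting their submissions, the 30s workaround can be sketched roughly as below. This is only an illustration: `submit_staggered` and the `submit` callable are hypothetical names, not part of cryoSPARC, and you would need to supply whatever function actually queues a job on your system.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def submit_staggered(
    jobs: Iterable[T],
    submit: Callable[[T], R],
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Submit jobs one at a time, pausing delay_s seconds between launches.

    `submit` is a hypothetical, user-supplied callable that queues one job.
    `sleep` is injectable so the pause can be stubbed out when testing.
    """
    results: List[R] = []
    for index, job in enumerate(jobs):
        if index > 0:
            # Wait before every launch except the first, so that no two
            # jobs start at the same moment.
            sleep(delay_s)
        results.append(submit(job))
    return results
```

Calling `submit_staggered(job_list, my_submit_function)` would launch the first job immediately and each later one 30s after the previous launch. The delay value is just the one that happens to work for me, not a documented threshold.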

Cheers
Oli

apunjani · September 23, 2019, 3:30pm

Hi @olibclarke,
We haven’t been able to reproduce this ourselves - what’s the setup? Are the jobs being launched on the same standalone worker node? Is the master running on the same node?

olibclarke · September 23, 2019, 3:43pm

Hi Ali,

This happens frequently on both of my systems (standalone GPU workstations) and on the standalone GPU workstation of an adjacent lab. In all cases the master and worker are on the same node, yes. Happy to provide any info you need to debug, just let me know what would be useful.

Cheers
Oli

olibclarke · September 25, 2019, 2:59pm

Another possibly related bug - if I start a local refinement job with multiple particle inputs, I get this same endless cache waiting error, even if it is the only job submitted:

olibclarke · March 9, 2020, 12:15am

Hi @apunjani,

Did you have any luck figuring this out? Is there any other info I can provide? It is still a frequent occurrence for me and for at least two other groups that I know of at Columbia. If I queue a series of NU or local refinement jobs, almost always at some point two of them launch at the same time and end up fighting over the cache. This is very frustrating, particularly if you launch them overnight and then check in the morning to find they have been sitting there not doing anything. Happy to provide any logs or files that would be useful to debug!

Cheers
Oli

nfrasser · March 11, 2020, 6:07pm

Hi @olibclarke, we’ve been looking into this issue over the last couple of days and we have a potential fix! We’re still doing some more testing and we’ll update you shortly with release information.

olibclarke · March 11, 2020, 6:44pm

Great, looking forward to testing, thanks!!