I often get these error messages while my jobs are running. The jobs are not dead and still complete. This causes problems with queueing, because the next job in the queue launches anyway. Both jobs can then crash, since they end up running on the same GPU.
I think this is often caused by the slightly laggy networked file system I am storing my project directories in.
Would it be possible for you to extend the ‘Unresponsive time’, maybe to 120 seconds, to reduce the frequency of these warnings and make it more tolerant of slightly sub-optimal file storage setups?
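To illustrate what I have in mind, here is a minimal sketch. It assumes the check compares the time of a job's last update against a fixed threshold. I don't know how the check is actually implemented, so `is_unresponsive`, `heartbeat_path` and `UNRESPONSIVE_TIMEOUT_SECS` are hypothetical names, and the use of a heartbeat file is my guess.

```python
import os
import time

# Hypothetical setting: the real name and default in the tool may differ.
UNRESPONSIVE_TIMEOUT_SECS = 120.0


def is_unresponsive(heartbeat_path, timeout_secs=UNRESPONSIVE_TIMEOUT_SECS, now=None):
    """Return True if the heartbeat file was last modified more than
    timeout_secs seconds ago (or cannot be read at all)."""
    if now is None:
        now = time.time()
    try:
        last_update = os.path.getmtime(heartbeat_path)
    except OSError:
        # A missing or unreadable file is treated as unresponsive here.
        return True
    return (now - last_update) > timeout_secs
```

With a threshold like this exposed as a setting, a job whose updates are delayed by a minute on a slow file system would not be flagged at 120 seconds, while it would be at a shorter threshold.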
Thanks a lot.