Here is the htop output in the stalled state:
I tried disabling transparent hugepages, but it didn't make any difference, at least to the currently running stalled job. (Just tested: it also doesn't help on restart.)
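For anyone wanting to confirm that the transparent hugepages setting actually took effect, here is a minimal sketch, assuming a Linux system that exposes the usual sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`. The helper names `parse_thp_mode` and `current_thp_mode` are hypothetical, made up for this illustration; the active mode is the one shown in square brackets in that file.

```python
from pathlib import Path

# Usual sysfs location of the THP setting on Linux (assumed to exist here).
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode, i.e. the token wrapped in square brackets.

    The file content looks like "always madvise [never]".
    Returns None if no bracketed token is found.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path=THP_PATH):
    """Read the sysfs file and return the active mode, or None if unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

If this reports `never`, THP is disabled system-wide, which would rule it out as the cause of the stall.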
EDIT:
For comparison, here is htop during the same job prior to stalling:
Once it stalls, a single thread maxes out at 100% and stays there, with all the others at zero…
I tried increasing the oversubscription threshold so it only processes one mic per GPU, and also reducing and increasing the memory fraction; the behavior is the same in every case.
EDIT2:
Restarting the system seems to have solved the problem (or at least it has progressed further than it did before)…