Dear developers,
I’m running into a strange problem with 2D classification.
I have a server with four V100 GPUs and 384 GB of memory. When I run four identical 2D classification jobs (each using one GPU), one of the four jobs dies while the other three stay alive and run to the end.
This behavior seems random: sometimes the job dies after a few iterations (5-15), and sometimes it dies at the first iteration. Could you give me some advice on how to avoid this kind of sporadic OOM (out-of-memory) error?
Many thanks in advance!
Sincerely yours,
Gaoxing