2D classification out of memory when running multiple jobs

Dear developers,

I’m running into a weird situation when running 2D classification.

I have a server with 4 V100 GPU cards and 384 GB of memory. When I run 4 identical 2D classification jobs (each using 1 card), one of the 4 jobs dies, but the other three stay alive and run to the end.

This behavior is somewhat random: sometimes the job dies after a few iterations (5-15), and sometimes it dies at the first iteration. Could you give me some advice on how to avoid this kind of sporadic OOM error?

Many thanks in advance!
Sincerely yours,
Gaoxing