Seems like the same error is occurring here, and it’s possibly CentOS 7 related. The fact that a job can die on iteration 6 (out of 40) and then run to completion after cloning suggests a bug or an intermittent hardware-related issue. No other GPU- or RAM-consuming process is running when the job dies.
Thanks for the response, @leetleyang. I’ll continue this discussion in the other post.