Card dropping during 2D classification?

I keep getting “cuMemHostAlloc failed: out of memory” seemingly at random on some 2D classification jobs. When I clone and re-run them, they may finish fine. One job died at iteration 6. The datasets are not huge (40–70 GB, 200k–1,000k particles, classified into 50 or 200 classes), and I am fairly sure the jobs should not need that much memory. The failure rate seemed higher when running with 2 GPUs, but now it seems to be increasing even with just 1 GPU.

Is it possible the 3080 Ti cards are dropping or overheating? Is there a diagnostic I can run while the job is running? I’m on v3.3.1+220315.
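
For context, this is roughly the sort of check I had in mind (just a sketch, run from a second terminal; the interval and log name are arbitrary):

```bash
# Log GPU temperature, utilisation and VRAM every 5 s while the job runs
# (the 5 s interval and the log filename are arbitrary choices).
nvidia-smi \
  --query-gpu=timestamp,index,name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5 >> gpu_monitor.csv

# Afterwards, check the kernel log for NVIDIA Xid errors, which would point
# to a card falling off the bus rather than a memory problem.
dmesg | grep -i xid
```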

Unless I’m mistaken, that OOM message relates to your system RAM rather than VRAM (cuMemHostAlloc allocates page-locked host memory, not GPU memory). If this is a cluster submission, your job may be terminated by the scheduler when it consumes more memory than was requested.
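
As a quick check, and assuming your scheduler is SLURM (other schedulers have equivalents), accounting output along these lines should show whether the scheduler killed the job for exceeding its memory request; the job ID below is just a placeholder:

```bash
# Ask SLURM accounting about the failed job (12345 is a placeholder job ID).
# An OUT_OF_MEMORY state, or MaxRSS approaching ReqMem, would indicate the
# scheduler terminated the job for exceeding its requested system RAM.
sacct -j 12345 --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed
```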

You may want to review the exact setup of your cluster submission script. The default {{ ram_gb }} value for the class_2d job type is often limiting. You may wish to either a) create a bespoke lane with e.g. {{ (ram_gb*2)|int }} (or similar) for such situations (see the sketch below), or b) edit the job-specific memory requirement as suggested here.
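
For illustration only, a high-memory lane’s cluster_script.sh could differ from your standard one solely in the memory request. The sketch below is SLURM-flavoured and uses placeholder directives, so adapt it to whatever your existing template already contains:

```bash
#!/usr/bin/env bash
# Sketch of a cluster_script.sh for a bespoke "high-memory" lane (SLURM syntax).
# Only the --mem line differs from a standard lane: the scheduler request is
# doubled relative to the {{ ram_gb }} value cryoSPARC estimates for the job.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*2)|int }}G
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

{{ run_cmd }}
```

Registering that as a separate lane lets you queue only the problematic class_2d jobs to it, while everything else keeps the default request.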

Cheers,
Yang

This looks like the same error reported here, possibly CentOS 7 related. The fact that a job can die at iteration 6 (out of 40) and then complete after cloning seems like either a bug or an intermittent hardware issue. There is no other GPU- or RAM-consuming process running when the job dies.

Thanks for the response, @leetleyang. I’ll continue this discussion in the other post.