Hi,
On our new GPU system (2x RTX 3090, CentOS 7, cryoSPARC 3.2), we are sometimes seeing jobs launch prematurely from the queue.
Job 1 will be running, and Job 2 is listed as queued (GPU not available), but then Job 2 starts halfway through the runtime of Job 1 (before the GPU is free), and both jobs crash with a CUDA MemAlloc error. Thoughts? I have not seen this on our other systems.
Cheers
Oli
Hi @olibclarke,
If you look at Job 1's streamlog from the start and filter by type "traceback", do you see any heartbeat errors?
This would happen if Job 1 temporarily stalled at some point and was reported as dead, allowing Job 2 to start.
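To make the failure mode concrete, here is a toy sketch of heartbeat-based scheduling. It is not cryoSPARC's actual scheduler code: the `Job` and `ToyScheduler` names, the `tick` method and the 30-second timeout are all hypothetical, chosen only to illustrate how a stalled job can be declared dead while its process still holds GPU memory.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    gpu: int
    last_heartbeat: float  # time (seconds) of the most recent heartbeat


@dataclass
class ToyScheduler:
    heartbeat_timeout: float = 30.0  # hypothetical value, for illustration only
    running: dict = field(default_factory=dict)  # GPU index -> Job
    queued: list = field(default_factory=list)

    def submit(self, job, now):
        """Start the job if its GPU looks free, otherwise queue it."""
        if job.gpu in self.running:
            self.queued.append(job)
        else:
            job.last_heartbeat = now
            self.running[job.gpu] = job

    def tick(self, now):
        """Drop jobs whose heartbeat is stale, then launch queued jobs."""
        for gpu, job in list(self.running.items()):
            if now - job.last_heartbeat > self.heartbeat_timeout:
                # The scheduler only knows the heartbeat stopped. The process
                # itself may still be alive and holding GPU memory.
                del self.running[gpu]
        launched = []
        for job in list(self.queued):
            if job.gpu not in self.running:
                self.queued.remove(job)
                job.last_heartbeat = now
                self.running[job.gpu] = job
                launched.append(job.name)
        return launched


if __name__ == "__main__":
    sched = ToyScheduler()
    sched.submit(Job("job1", gpu=0, last_heartbeat=0.0), now=0.0)
    sched.submit(Job("job2", gpu=0, last_heartbeat=0.0), now=1.0)
    print(sched.tick(now=10.0))  # job1 is still within the timeout
    print(sched.tick(now=45.0))  # job1 sent no heartbeat for 45 s
```

In this sketch job1 never sends another heartbeat, so the second `tick` treats GPU 0 as free and launches job2 even though job1 was never confirmed to have exited. Two processes then compete for the same card's memory, which would be consistent with both failing on allocation.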
Hi @apunjani,
I deleted the jobs so I can’t check, but I think it was a hardware glitch - shortly after that, one of the GPU cards went completely offline, and everything has seemed normal since a reboot.
Cheers
Oli