Why my computer keeps losing GPUs in the middle of processing

Hi folks,

The GPU computer I am running CryoSPARC on keeps losing GPUs during image processing. The error message is “pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal”, and nvidia-smi returns “Unable to determine the device handle for GPU 0000:3B:00.0: GPU is lost. Reboot the system to recover this GPU”. We keep restarting the computer, but it keeps happening. How can we fix this issue?
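
In case it helps with diagnosing, this is roughly what I can run outside CryoSPARC (a sketch, assuming the worker’s pycuda is importable) to see which device ordinals the driver still reports when this happens:

```python
# Minimal sketch, assuming the worker's pycuda is importable:
# enumerate the CUDA devices the driver can still see.
import pycuda.driver as cuda

cuda.init()
count = cuda.Device.count()
print(f"Driver reports {count} CUDA device(s)")
for i in range(count):
    try:
        dev = cuda.Device(i)
        print(f"  GPU {i}: {dev.name()} ({dev.pci_bus_id()})")
    except cuda.LogicError as err:
        # The "invalid device ordinal" error from the job shows up here
        print(f"  GPU {i}: query failed: {err}")
```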

Thanks,
Guobin

Hi Guobin,

This is a common problem for us as well. My understanding is that it’s related to known issues with transient spikes in GPU power draw. If it’s happening when you’re running on all GPUs, you can try using fewer resources until you get past that job. For us, the problem does seem to be more of an issue earlier in processing, when particle stacks are larger and the alignments are worse. If you can reproduce the problem when running on only the problematic GPU, then it could be a specific hardware issue.
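
If you’re able to watch power draw while a job runs, something along these lines (a rough sketch, assuming nvidia-smi is on the PATH) will show whether the draw is bumping against the board power limit right before a card drops out:

```python
# Rough sketch, assuming nvidia-smi is on the PATH: log per-GPU power draw,
# power limit and temperature once per second while a job is running.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,power.limit,temperature.gpu",
    "--format=csv,noheader",
]

while True:
    result = subprocess.run(QUERY, capture_output=True, text=True)
    print(time.strftime("%H:%M:%S"), (result.stdout or result.stderr).strip())
    time.sleep(1)
```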

Best,

Ryan

Hi Ryan,

Thank you very much for your input! It is really helpful!
It seems that all GPUs (not just one particular GPU) are lost on my computer when it happens. Was it the same when it happened to your computer?

Best,
Guobin

Hi Guobin,

I’ve seen a range of different symptoms. Sometimes a single GPU will just disappear from nvidia-smi, and sometimes nvidia-smi will give the error you posted. I was assuming that it was GPU 0 based on “GPU 0000:3B:00.0:”.

Other times when this happens, the computer will completely crash and reboot. Try running the job with one fewer GPU. If you were only running it on one GPU when this happened, then try restarting the job on a different one.
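
If you want to isolate a card outside of CryoSPARC, a quick test along these lines (just a sketch - the device index and the workload are placeholders) keeps one GPU at a time busy so you can see which one misbehaves:

```python
# Sketch: pin the process to one physical GPU at a time via CUDA_VISIBLE_DEVICES
# (must be set before any CUDA initialization), then run a small repeated
# workload to see whether that particular card drops out.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # change per run: "0", "1", "2", "3"

import numpy as np
import pycuda.autoinit                      # creates a context on the visible GPU
import pycuda.gpuarray as gpuarray

data = gpuarray.to_gpu(np.random.rand(4096, 4096).astype(np.float32))
for _ in range(1000):
    data = data * 1.0001                    # trivial load, just to keep the card busy
print("Completed on:", pycuda.autoinit.device.name())
```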

How many GPUs and what power supply does this computer have?

Ryan

Hi Ryan,

Thanks again for your input! I will try fewer GPUs to see how it goes. There are 4 GPUs, but it seems that only 3 are allowed to run. I’m not sure about the power supply; the machine is strictly controlled by the ITD due to cyber security policy.

Best,
Guobin

It’s a long shot, but can you provide any information about the CPUs and chipset? I had to deal with a Threadripper system which, through nearly two years of BIOS updates, had a PCI-E bus bug where one of the GPUs would “fall off the bus” randomly. It was always the same GPU - its identical twin was fine. If it happened during any sort of GPGPU load, this obviously led to a crash. I doubt you can run dmesg to check if you don’t control the box yourself… but the only way to recover was a reboot.
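
For anyone who does control their own box, something like this (a sketch, assuming dmesg is readable) pulls the NVIDIA driver messages - the Xid and “fallen off the bus” lines - out of the kernel log:

```python
# Sketch, assuming dmesg is readable on the box: print the kernel log lines
# from the NVIDIA driver (NVRM / Xid) and any "fallen off the bus" events.
import re
import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if re.search(r"NVRM|Xid|fallen off the bus", line, re.IGNORECASE):
        print(line)
```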

This sounds awfully similar to issues we’ve had on a system with an AMD Ryzen Threadripper 3960X and ASRock TRX40 Creator motherboard. What was the solution?

A combination of things; the penultimate BIOS release made the issue happen a lot less frequently. Oddly, the most recent BIOS made it worse again. I also moved that system to kernel 5.4, and the issue has not occurred since - it’s currently running 5.15.
