"GPU lost" issue with RTX 3090 card

Hi, I constantly get a “GPU lost” error when using 8x (or a subset) of RTX 3090 GPU cards to run several local refinement jobs in parallel (box size 512 pixels).

Running “nvidia-smi” afterwards gives this error:
Unable to determine the device handle for GPU 0000:1C:00.0: GPU is lost. Reboot the system to recover this GPU

If I exclude GPU 1C from the available list, GPU 1B is lost instead, so the issue is unlikely to be a single defective GPU card.
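
For anyone who wants to track which cards drop out across runs, here is a minimal Python sketch that pulls the PCI bus IDs out of nvidia-smi output. It assumes the message has exactly the wording quoted above; other driver versions may phrase it differently. The names find_lost_gpus and run_nvidia_smi are made up for illustration and are not part of any tool.

```python
import re
import subprocess

# Pattern based on the message quoted above (assumption: the wording is
# the same on your driver version).
LOST_GPU_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*([0-9A-Fa-f:.]+?):\s*GPU is lost"
)


def find_lost_gpus(text):
    """Return the PCI bus IDs that the given nvidia-smi output reports as lost."""
    return LOST_GPU_RE.findall(text)


def run_nvidia_smi():
    """Run nvidia-smi and return stdout plus stderr, or '' if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=60
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""
    return result.stdout + result.stderr


if __name__ == "__main__":
    lost = find_lost_gpus(run_nvidia_smi())
    print("Lost GPUs:", lost if lost else "none reported")
```

Logging the result after each batch of jobs would show whether the same slot fails every time or the failure moves between cards, as it did here.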

Rui


Update: after fully removing the GPUs and re-inserting them, the issue no longer occurs.
It seems to have been a hardware problem on my side.
