Hi, I constantly get a “GPU lost” error when using all 8 (or a subset) of my 3090 GPU cards to run several local refinement jobs in parallel (box size 512 pixels).
Running “nvidia-smi” after the failure gives this error:
Unable to determine the device handle for GPU 0000:1C:00.0: GPU is lost. Reboot the system to recover this GPU
If I exclude GPU 1C from the available list, GPU 1B is lost instead, so the issue is unlikely to be a single defective GPU card.
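
To keep track of which bus IDs drop out across runs, the error line can be parsed out of the nvidia-smi output. This is only a minimal sketch: the pattern is an assumption based on the single message quoted above, and the function name `find_lost_gpus` is hypothetical. Other driver versions may word the message differently.

```python
import re
from typing import List

# Pattern assumed from the one error line quoted in this post.
# The optional whitespace after "GPU" tolerates minor formatting differences.
LOST_GPU_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
    r":\s*GPU is lost"
)


def find_lost_gpus(smi_output: str) -> List[str]:
    """Return the PCI bus IDs reported as lost in the given nvidia-smi text."""
    return [m.group(1).upper() for m in LOST_GPU_RE.finditer(smi_output)]
```

Feeding it the captured output after each failure and logging the result with a timestamp would show whether the lost device always sits at a neighbouring bus ID (1B, 1C) or moves around.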