I think the GPU memory bug may be back.
We are using RTX A5000 (24 GB).
CryoSPARC v4.4.1.
A few jobs are not working. For example, an extract job that works on CPUs or a single GPU fails when using 2 GPUs:
line 412, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS
Marking J2/motioncorrected/002042912766554708070_movie_02304_group50_patch_aligned_doseweighted.mrc as incomplete and continuing…
Is there a way to test the local CUDA install and force CryoSPARC to use it rather than the CUDA distributed with CryoSPARC?
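To clarify what I mean by testing, this is roughly the standalone check I have in mind, run against whichever Python/CUDA stack CryoSPARC actually uses (just a rough sketch using numba directly; the array size and block size are arbitrary):

```python
# Rough standalone numba/CUDA check: copy an array to each GPU, run a trivial
# kernel, and copy it back. This exercises the same host-to-device copy path
# (cuMemcpyHtoDAsync) that fails in the extract job.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0

for dev_id in range(len(cuda.gpus)):
    with cuda.gpus[dev_id]:                      # test one GPU at a time
        host = np.zeros(1 << 20, dtype=np.float32)
        dev = cuda.to_device(host)               # host -> device copy
        add_one[(host.size + 255) // 256, 256](dev)
        out = dev.copy_to_host()                 # device -> host copy
        assert np.allclose(out, 1.0), f"GPU {dev_id} returned wrong results"
        print(f"GPU {dev_id}: OK")
```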
Additionally, when running many jobs on one GPU we get this:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] cuModuleLoadDataEx error:
Marking J2/motioncorrected/010981471637791344920_movie_00001_group85_patch_aligned_doseweighted.mrc as incomplete and continuing…
Did you run the CryoSPARC “Test Worker GPUs” job? Does it complete without errors with the advanced settings? I have a few workstations running 2x A5000s, but I’ve updated the driver versions to 535.
Unfortunately still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS
Marking J152/motioncorrected/001841541197226233398_movie_00164_group22_patch_aligned_doseweighted.mrc as incomplete and continuing…
CUDA is now 12.2, but I think CryoSPARC is using its own CUDA. I’m not sure though, as I don’t know how to confirm which CUDA version CryoSPARC is using, or how to change it to test different versions.
CryoSPARC doesn’t work with CUDA 12, which is why it now bundles its own CUDA 11.8.
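If you want to confirm which toolkit the worker is actually picking up, you can query it from within the worker’s own Python environment, for example via cryosparcw call (a quick sketch, not an official tool):

```python
# Report what the worker environment's numba/CUDA stack sees.
# Run with the worker's Python, e.g.:  cryosparcw call python check_cuda.py
import numba
from numba import cuda

print("numba version:", numba.__version__)
print("CUDA runtime version reported to numba:", cuda.runtime.get_version())
cuda.detect()   # lists each visible GPU and whether numba considers it supported
```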
Please check the dmesg output from when the error occurs, since the CryoSPARC error output, as you have discovered, doesn’t go into fine detail. dmesg will identify the PCI address of the fault. If it is consistent, it might indicate a failing card.
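The driver reports these faults as NVRM “Xid” lines in the kernel log. If it helps, a throwaway script along these lines (assuming dmesg is readable by your user, otherwise run it with sudo) will tally them by PCI address so you can see whether one card keeps coming up:

```python
# Tally NVIDIA Xid fault reports from the kernel log by PCI address.
import re
import subprocess
from collections import Counter

log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
# Typical line: "NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, ..."
xids = re.findall(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\):\s*(\d+)", log)

if not xids:
    print("No Xid messages found in dmesg.")
for (pci, code), count in Counter(xids).items():
    print(f"PCI {pci}: Xid {code} seen {count} time(s)")
```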
This is very strange, @abasle, that the GPU test completed successfully. Before v4.4.1 we had a workaround to point cryosparc_master and cryosparc_worker to CUDA v11.8, since installing the newer NVIDIA drivers made CUDA v12.x the default. I used that job to check that the GPU workers actually worked with 1) the NVIDIA driver and 2) the CUDA version. With the recent version update, the bundled CUDA v11.8 worked fine. We have a lot of Anaconda environments working in a similar fashion. I agree with @rbs_sci about tracing the hardware issue.
From the output of @abasle’s nvidia-smi it is CUDA v12.2, but what is the CUDA path in cryosparc_worker and cryosparc_master? --cudapath /usr/local/cuda would likely be incorrect.
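One quick way to check is to look for any leftover CUDA setting in the worker config, for instance like this (adjust the path to your actual cryosparc_worker install; the location below is just a guess):

```python
# Print any CUDA-related lines left over in the worker config from pre-v4.4 installs.
from pathlib import Path

config = Path("/opt/cryosparc/cryosparc_worker/config.sh")   # hypothetical install location
for line in config.read_text().splitlines():
    if "CUDA" in line:
        print(line)
```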
If you have the time, you can take @rbs_sci’s recommendation to the extreme and run the Extensive Validation job while choosing which resources (GPUs) it uses: make a project and workspace for each GPU, run one Extensive Validation job per GPU selecting only that GPU, and maybe one more job with all GPUs.