RTX A5000 CUDA memory bug

Hello,

I think the GPU memory bug may be back.
We are using RTX A5000 (24 GB).
CryoSPARC v4.4.1.
A few jobs are not working; for example, an extract job that works with CPUs or with one GPU fails when using 2 GPUs:
line 412, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS

Marking J2/motioncorrected/002042912766554708070_movie_02304_group50_patch_aligned_doseweighted.mrc as incomplete and continuing…

Is there a way to test the local CUDA install and force CryoSPARC to use it rather than the CUDA distributed with CryoSPARC?

Additionally, when running many jobs, one GPU gives this:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] cuModuleLoadDataEx error:
Marking J2/motioncorrected/010981471637791344920_movie_00001_group85_patch_aligned_doseweighted.mrc as incomplete and continuing…

Cheers,
Arnaud

What driver version are you using?

Do you have a single A5000?

If you have multiple cards, could the error be occurring only on one card?
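
If it helps, a quick way to pull that information in one go (assuming nvidia-smi is on the path):

nvidia-smi --query-gpu=index,name,driver_version,pci.bus_id --format=csv

That lists the driver version along with each card's index and PCI bus ID, which makes it easier to see whether the failures follow a single card.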

Wed Dec 27 07:57:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:17:00.0 Off | Off |
| 30% 25C P8 15W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:3D:00.0 Off | Off |
| 30% 27C P8 21W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 On | 00000000:50:00.0 Off | Off |
| 30% 26C P8 14W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 On | 00000000:63:00.0 Off | Off |
| 30% 25C P8 16W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 On | 00000000:99:00.0 Off | Off |
| 30% 25C P8 23W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 On | 00000000:BD:00.0 Off | Off |
| 30% 24C P8 17W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 On | 00000000:CF:00.0 Off | Off |
| 30% 25C P8 15W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 On | 00000000:E1:00.0 Off | Off |
| 30% 24C P8 18W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |

We have two servers with 8 cards each. I’d rather not blame the hardware at this stage.

I’d like to test an older version of CryoSPARC, but I have not managed to try that yet.

I also wanted to try making the latest CryoSPARC use our local install of CUDA, but I’m not sure how to do that now that CUDA is shipped with CryoSPARC.

Did you run the CryoSPARC “Test Worker GPUs” job? Does it complete without errors with the advanced settings? I have a few workstations running 2x A5000s, but I’ve updated the driver version to 535.

Hi Mark-A-Nakasone,

Thanks a lot, I had not thought about that. The tests run fine and do complete.

I’ll try updating the driver to 535 and see if that helps.

Cheers,
Arnaud

Unfortunately, there is still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS

Marking J152/motioncorrected/001841541197226233398_movie_00164_group22_patch_aligned_doseweighted.mrc as incomplete and continuing…

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:17:00.0 Off | Off |
| 30% 32C P2 60W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:3D:00.0 Off | Off |
| 30% 33C P2 58W / 230W | 3624MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:50:00.0 Off | Off |
| 30% 32C P2 59W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:63:00.0 Off | Off |
| 30% 32C P2 64W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:99:00.0 Off | Off |
| 30% 32C P2 65W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:BD:00.0 Off | Off |
| 30% 31C P2 59W / 230W | 5920MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:CF:00.0 Off | Off |
| 30% 32C P2 61W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:E1:00.0 Off | Off |
| 30% 31C P2 62W / 230W | 4816MiB / 24564MiB | 9% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

CUDA is now 12.2, but I think CryoSPARC is using its own CUDA. It is not clear though, as I don’t know how to confirm which CUDA version CryoSPARC is using, or how to change it to test different versions.

Would anyone know?

Cheers,
Arnaud

CryoSPARC doesn’t work with CUDA 12, which is why it now bundles its own CUDA 11.8. :slight_smile:
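
If you want to confirm which toolkit the worker environment itself is using, here is a quick sketch (assuming a standard install where bin/cryosparcw is available, that cryosparcw call is supported in your version, and a recent numba that exposes cuda.runtime.get_version(); the install path is a placeholder):

/path/to/cryosparc_worker/bin/cryosparcw call python -c "from numba import cuda; print(cuda.runtime.get_version())"

On v4.4.1 that should print something like (11, 8), i.e. the bundled toolkit, regardless of the system-wide CUDA 12.2 reported by nvidia-smi.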

Please check the dmesg output from when the error occurs; as you have discovered, the CryoSPARC error output doesn’t go into fine detail. dmesg will identify the PCI address of the fault, and if it is consistently the same address, that might indicate a failing card.
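
For example, something along these lines (a rough sketch; the exact message text varies with driver version):

sudo dmesg -T | grep -iE "NVRM|Xid"

The NVIDIA Xid messages include the PCI bus ID of the GPU that faulted, which you can match against the Bus-Id column in nvidia-smi to see whether it is always the same card.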

There are multiple reasons why I originally suggested faulty hardware. Other users have had the same error with a(n apparently) failing card, and I had all sorts of weird issues on a system where one ECC-RDIMM DRAM stick had failed.

A quick test is simple: assign one GPU at a time to the job, and if one throws errors, you have your culprit.

Or roll back to 4.3.1 and test with that as well.
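
If you go that route, the forced-version update has looked something like this in the past (a sketch only; I am assuming the --version flag is still supported, so double-check the guide before running it, and remember the workers must be updated to the matching version afterwards):

cryosparcm update --version=v4.3.1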

This is very strange, @abasle, that the GPU test completed without issues. Before v4.4.1 we had a workaround to point the CryoSPARC master and worker to CUDA v11.8, since installing the newer NVIDIA drivers made CUDA v12.x the default. I used that job to check when the GPU workers actually worked with 1) the NVIDIA driver and 2) the CUDA version. With the recent version update, the bundled CUDA v11.8 worked fine. We have a lot of Anaconda environments working in a similar fashion. I agree with @rbs_sci about tracing the hardware issue.

From the output of @abasle’s nvidia-smi it is CUDA v12.2, but what is the CUDA path in cryosparc_worker & _master? --cudapath /usr/local/cuda would likely be incorrect.
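
A quick way to check what the worker is currently pointed at (a sketch, assuming the default layout where the worker config lives in cryosparc_worker/config.sh; adjust the path for your install):

grep -i cuda /path/to/cryosparc_worker/config.sh

On pre-v4.4 installs this would show a CRYOSPARC_CUDA_PATH entry; on v4.4+ there should be no external CUDA path left, since the toolkit is bundled.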

If you have the time, you can take @rbs_sci’s recommendation to the extreme and run the Extensive Validation job while choosing which resources (GPUs) it uses: make a project and workspace for each GPU, run one Extensive Validation job per GPU selecting only that GPU, and maybe run another with all GPUs.