RTX A5000 CUDA memory bug

Hello,

I think the GPU memory bug may be back.
We are using RTX A5000 (24 GB).
CryoSPARC v4.4.1.
A few jobs are not working; for example, an extract job that works with CPUs or with one GPU fails when using 2 GPUs:
line 412, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS

Marking J2/motioncorrected/002042912766554708070_movie_02304_group50_patch_aligned_doseweighted.mrc as incomplete and continuing…

Is there a way to test the local CUDA install and force CryoSPARC to use it rather than the CUDA distributed with CryoSPARC?

Additionally, when running many jobs, one GPU gives this:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] cuModuleLoadDataEx error:
Marking J2/motioncorrected/010981471637791344920_movie_00001_group85_patch_aligned_doseweighted.mrc as incomplete and continuing…

Cheers,
Arnaud

What driver version are you using?

Do you have a single A5000?

If you have multiple cards, could the error be occurring only on one card?
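
If it helps, a quick way to pull that information in one go (assuming nvidia-smi is on the path):

nvidia-smi --query-gpu=index,name,driver_version,pci.bus_id --format=csv

That lists the driver version along with each card's index and PCI bus ID, which makes it easier to see whether the failures follow a single card.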

Wed Dec 27 07:57:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:17:00.0 Off | Off |
| 30% 25C P8 15W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:3D:00.0 Off | Off |
| 30% 27C P8 21W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 On | 00000000:50:00.0 Off | Off |
| 30% 26C P8 14W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 On | 00000000:63:00.0 Off | Off |
| 30% 25C P8 16W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 On | 00000000:99:00.0 Off | Off |
| 30% 25C P8 23W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 On | 00000000:BD:00.0 Off | Off |
| 30% 24C P8 17W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 On | 00000000:CF:00.0 Off | Off |
| 30% 25C P8 15W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 On | 00000000:E1:00.0 Off | Off |
| 30% 24C P8 18W / 230W | 5MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 3310 G /usr/lib/xorg/Xorg 4MiB |

We have two servers with 8 cards each. I’d rather not blame the hardware at this stage.

I’d like to test an older version of CryoSPARC, but I have not managed to try that yet.

I also wanted to try making the latest CryoSPARC use our local install of CUDA, but I’m not sure how to do that now that CUDA is shipped with CryoSPARC.

Did you run the CryoSPARC “Test Worker GPUs” job? Does it complete without errors with the advanced settings? I have a few workstations running 2x A5000s, but I’ve updated the driver version to 535.

Hi Mark-A-Nakasone,

Thanks a lot, I had not thought about that. The tests run fine and do complete.

I’ll try updating the driver to 535 and see if that helps.

Cheers,
Arnaud

Unfortunately, there is still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS

Marking J152/motioncorrected/001841541197226233398_movie_00164_group22_patch_aligned_doseweighted.mrc as incomplete and continuing…

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:17:00.0 Off | Off |
| 30% 32C P2 60W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:3D:00.0 Off | Off |
| 30% 33C P2 58W / 230W | 3624MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:50:00.0 Off | Off |
| 30% 32C P2 59W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:63:00.0 Off | Off |
| 30% 32C P2 64W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:99:00.0 Off | Off |
| 30% 32C P2 65W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:BD:00.0 Off | Off |
| 30% 31C P2 59W / 230W | 5920MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:CF:00.0 Off | Off |
| 30% 32C P2 61W / 230W | 4816MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:E1:00.0 Off | Off |
| 30% 31C P2 62W / 230W | 4816MiB / 24564MiB | 9% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

CUDA is now 12.2, but I think CryoSPARC is using its own CUDA. It is not clear though, as I don’t know how to confirm which CUDA version CryoSPARC is using, or how to change it to test different versions.

Would anyone know?

Cheers,
Arnaud

CryoSPARC doesn’t work with CUDA 12, which is why it now bundles its own CUDA 11.8. :slight_smile:
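
If you want to confirm which toolkit the worker environment itself is using, here is a quick sketch (assuming a standard install where bin/cryosparcw is available, that cryosparcw call is supported in your version, and a recent numba that exposes cuda.runtime.get_version(); the install path is a placeholder):

/path/to/cryosparc_worker/bin/cryosparcw call python -c "from numba import cuda; print(cuda.runtime.get_version())"

On v4.4.1 that should print something like (11, 8), i.e. the bundled toolkit, regardless of the system-wide CUDA 12.2 reported by nvidia-smi.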

Please check the dmesg output from when the error occurs; as you have discovered, the CryoSPARC error output doesn’t go into fine detail. dmesg will identify the PCI address of the fault, and if it is consistently the same address, that might indicate a failing card.
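
For example, something along these lines (a rough sketch; the exact message text varies with driver version):

sudo dmesg -T | grep -iE "NVRM|Xid"

The NVIDIA Xid messages include the PCI bus ID of the GPU that faulted, which you can match against the Bus-Id column in nvidia-smi to see whether it is always the same card.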

There are multiple reasons why I originally suggested faulty hardware. Other users have had the same error with a(n apparently) failing card, and I had all sorts of weird issues on a system where one ECC-RDIMM DRAM stick had failed.

A quick test is simple: assign one GPU at a time to the job, and if one throws errors, you have your culprit.

Or roll back to 4.3.1 and test with that as well.
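
If you go that route, the forced-version update has looked something like this in the past (a sketch only; I am assuming the --version flag is still supported, so double-check the guide before running it, and remember the workers must be updated to the matching version afterwards):

cryosparcm update --version=v4.3.1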

This is very strange, @abasle, that the GPU test completed without issues. Before v4.4.1 we had a workaround to point the CryoSPARC master and worker to CUDA v11.8, since installing the newer NVIDIA drivers made CUDA v12.x the default. I used that job to check when the GPU workers actually worked with 1) the NVIDIA driver and 2) the CUDA version. With the recent version update, the bundled CUDA v11.8 worked fine. We have a lot of Anaconda environments working in a similar fashion. I agree with @rbs_sci about tracing the hardware issue.

From the output of @abasle’s nvidia-smi it is CUDA v12.2, but what is the CUDA path in cryosparc_worker & _master? --cudapath /usr/local/cuda would likely be incorrect.
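
A quick way to check what the worker is currently pointed at (a sketch, assuming the default layout where the worker config lives in cryosparc_worker/config.sh; adjust the path for your install):

grep -i cuda /path/to/cryosparc_worker/config.sh

On pre-v4.4 installs this would show a CRYOSPARC_CUDA_PATH entry; on v4.4+ there should be no external CUDA path left, since the toolkit is bundled.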

If you have the time, you can take @rbs_sci’s recommendation to the extreme and run the Extensive Validation job while choosing which resources (GPUs) it uses: make a project and workspace for each GPU, run one Extensive Validation job per GPU selecting only that GPU, and maybe run another with all GPUs.