I think the GPU memory bug may be back.
We are using RTX A5000 (24 GB).
CryoSPARC v4.4.1.
A few jobs are not working. For example, an extract job that works on CPUs or a single GPU fails when using 2 GPUs:
line 412, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS
Marking J2/motioncorrected/002042912766554708070_movie_02304_group50_patch_aligned_doseweighted.mrc as incomplete and continuing…
Is there a way to test the local CUDA install and force CryoSPARC to use it rather than the CUDA distributed with CryoSPARC?
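To clarify what I mean by testing, this is roughly the standalone check I have in mind, run against whichever Python/CUDA stack CryoSPARC actually uses (just a rough sketch using numba directly; the array size and block size are arbitrary):

```python
# Rough standalone numba/CUDA check: copy an array to each GPU, run a trivial
# kernel, and copy it back. This exercises the same host-to-device copy path
# (cuMemcpyHtoDAsync) that fails in the extract job.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0

for dev_id in range(len(cuda.gpus)):
    with cuda.gpus[dev_id]:                      # test one GPU at a time
        host = np.zeros(1 << 20, dtype=np.float32)
        dev = cuda.to_device(host)               # host -> device copy
        add_one[(host.size + 255) // 256, 256](dev)
        out = dev.copy_to_host()                 # device -> host copy
        assert np.allclose(out, 1.0), f"GPU {dev_id} returned wrong results"
        print(f"GPU {dev_id}: OK")
```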
Additionally, when running many jobs on one GPU we get this:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] cuModuleLoadDataEx error:
Marking J2/motioncorrected/010981471637791344920_movie_00001_group85_patch_aligned_doseweighted.mrc as incomplete and continuing…
Did you run the CryoSPARC “Test Worker GPUs” job? Does it complete without errors with the advanced settings? I have a few workstations running 2x A5000s, but I’ve updated the driver versions to 535.
Unfortunately still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_ILLEGAL_ADDRESS] Call to cuMemcpyHtoDAsync results in CUDA_ERROR_ILLEGAL_ADDRESS
Marking J152/motioncorrected/001841541197226233398_movie_00164_group22_patch_aligned_doseweighted.mrc as incomplete and continuing…
CUDA is now 12.2, but I think CryoSPARC is using its own CUDA. I’m not sure though, as I don’t know how to confirm which CUDA version CryoSPARC is using, or how to change it to test different versions.
CryoSPARC doesn’t work with CUDA 12, which is why it now bundles its own CUDA 11.8.
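If you want to confirm which toolkit the worker is actually picking up, you can query it from within the worker’s own Python environment, for example via cryosparcw call (a quick sketch, not an official tool):

```python
# Report what the worker environment's numba/CUDA stack sees.
# Run with the worker's Python, e.g.:  cryosparcw call python check_cuda.py
import numba
from numba import cuda

print("numba version:", numba.__version__)
print("CUDA runtime version reported to numba:", cuda.runtime.get_version())
cuda.detect()   # lists each visible GPU and whether numba considers it supported
```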
Please check the dmesg output from when the error occurs, since the CryoSPARC error output, as you have discovered, doesn’t go into fine detail. dmesg will identify the PCI address of the fault. If it is consistent, it might indicate a failing card.
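The driver reports these faults as NVRM “Xid” lines in the kernel log. If it helps, a throwaway script along these lines (assuming dmesg is readable by your user, otherwise run it with sudo) will tally them by PCI address so you can see whether one card keeps coming up:

```python
# Tally NVIDIA Xid fault reports from the kernel log by PCI address.
import re
import subprocess
from collections import Counter

log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
# Typical line: "NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, ..."
xids = re.findall(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\):\s*(\d+)", log)

if not xids:
    print("No Xid messages found in dmesg.")
for (pci, code), count in Counter(xids).items():
    print(f"PCI {pci}: Xid {code} seen {count} time(s)")
```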
This is very strange, @abasle, that the GPU test completed successfully. Before v4.4.1 we had a workaround to point cryosparc_master and cryosparc_worker to CUDA v11.8, since installing the newer NVIDIA drivers made CUDA v12.x the default. I used that job to check that the GPU workers actually worked with 1) the NVIDIA driver and 2) the CUDA version. With the recent version update, the bundled CUDA v11.8 worked fine. We have a lot of Anaconda environments working in a similar fashion. I agree with @rbs_sci about tracing the hardware issue.
From the output of @abasle’s nvidia-smi it is CUDA v12.2, but what is the CUDA path in cryosparc_worker and cryosparc_master? --cudapath /usr/local/cuda would likely be incorrect.
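One quick way to check is to look for any leftover CUDA setting in the worker config, for instance like this (adjust the path to your actual cryosparc_worker install; the location below is just a guess):

```python
# Print any CUDA-related lines left over in the worker config from pre-v4.4 installs.
from pathlib import Path

config = Path("/opt/cryosparc/cryosparc_worker/config.sh")   # hypothetical install location
for line in config.read_text().splitlines():
    if "CUDA" in line:
        print(line)
```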
If you have the time, you can take @rbs_sci’s recommendation to the extreme and run the Extensive Validation job while choosing which resources (GPUs) it uses: make a project and workspace for each GPU, run one Extensive Validation job per GPU selecting only that GPU, and maybe one more job with all GPUs.