Improve GPU validation tests

ebirn · September 24, 2024, 10:18am

Hi all,
After upgrading to CryoSPARC v4.6.0 we ran the validation tests as usual (launch, SSD, GPU). All test jobs completed successfully. This is a useful feature for quickly validating that the instance works after changes and updates.

However, the GPU test failed to detect an issue with the numba library in the worker env, which was not linked against the correct/latest CUDA version (it was running against < 11.6 but should’ve been running with CUDA 12.x).

For this reason I would like to propose modifying the GPU validation test (cryosparcm test workers) so that it runs a minimal compute example that exercises the essential libraries. In our specific case the test passed, but the jobs we ran afterwards hit a CUDA init error when loading the Python numba library. The error was:

    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_UNKNOWN (999)
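
To illustrate the kind of minimal compute check meant here, a rough sketch follows. It is not how cryosparcm test workers works today. The function name gpu_smoke_test and the kernel add_one are made up for this example, and it assumes numba and numpy are importable in the worker env. It launches one tiny kernel through numba.cuda, which forces driver initialization, so a failure like the one above would be reported instead of going unnoticed.

```python
def gpu_smoke_test(n=1024):
    """Run a tiny numba.cuda kernel and return (ok, message) instead of raising.

    Hypothetical helper for illustration; not part of CryoSPARC.
    """
    try:
        import numpy as np
        from numba import cuda
    except Exception as exc:
        return False, f"import failed: {type(exc).__name__}: {exc}"

    try:
        @cuda.jit
        def add_one(arr):
            i = cuda.grid(1)      # global thread index
            if i < arr.size:      # guard threads past the end of the array
                arr[i] += 1.0

        host = np.zeros(n, dtype=np.float32)
        dev = cuda.to_device(host)            # triggers driver init and context creation
        threads = 128
        blocks = (n + threads - 1) // threads
        add_one[blocks, threads](dev)         # compiles and launches the kernel
        result = dev.copy_to_host()
    except Exception as exc:
        # Driver init errors such as CudaSupportError end up here.
        return False, f"{type(exc).__name__}: {exc}"

    if not np.all(result == 1.0):
        return False, "kernel ran but produced wrong values"
    return True, f"kernel OK on {n} elements"


ok, message = gpu_smoke_test()
print("PASS" if ok else "FAIL", "-", message)
```

Running this with the worker's Python interpreter on a GPU node should print PASS only if the driver initializes, the kernel compiles, and the result is correct.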

Best,
Erich

nfrasser · September 26, 2024, 4:12pm

Hi @ebirn, thank you for reporting this. We’ve recorded this and will think of ways to mitigate this in the future. For my information, what did you have to change on your end to get this to work?

ebirn · October 7, 2024, 1:48pm

Our fix was to update the CUDA version that CryoSPARC uses. This happened on a cluster where we have multiple CUDA versions installed in parallel and switch between them via an environment variable, so updating the env var for CryoSPARC was a quick and easy fix. To be safe, we also re-installed the CryoSPARC worker on the cluster, so that any binary code generated during install would be built against the correct CUDA version.
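
For anyone debugging a similar setup, a small sketch like the one below can show which CUDA installation a job environment points at. The variable names CUDA_HOME and CUDA_PATH are assumptions chosen for illustration (this post does not name the variable our cluster uses), and cuda_env_report is a hypothetical helper, not a CryoSPARC function.

```python
import os

# Assumed variable names; adjust to whatever your site actually sets.
CANDIDATE_VARS = ("CUDA_HOME", "CUDA_PATH")


def cuda_env_report(environ=None):
    """Collect CUDA-related settings from an environment mapping."""
    env = os.environ if environ is None else environ
    report = {name: env.get(name) for name in CANDIDATE_VARS}
    # Keep only library search path entries that mention "cuda".
    lib_dirs = [
        p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
        if "cuda" in p.lower()
    ]
    report["LD_LIBRARY_PATH (cuda entries)"] = lib_dirs
    return report


for key, value in cuda_env_report().items():
    print(f"{key}: {value}")
```

Running it once in a login shell and once inside a submitted job makes it easy to spot a job environment that still references an older CUDA version.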

wtempel · October 7, 2024, 6:37pm

Thanks @ebirn. Could you please show the code that sets the relevant environment for CryoSPARC jobs, and where that code is included? If the code is included in the cluster script template, could you please post the template?