Heterogeneous refinement failed with CudaAPIError

Hi guys,
I got an error message when I ran heterogeneous refinement. It worked well before I updated CryoSPARC. Does anyone know what happened?
Traceback (most recent call last):
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_NO_DEVICE] Call to cuInit results in CUDA_ERROR_NO_DEVICE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/hetero_refine/run.py", line 298, in cryosparc_master.cryosparc_compute.jobs.hetero_refine.run.run_hetero_refine
  File "cryosparc_master/cryosparc_compute/jobs/hetero_refine/run.py", line 276, in cryosparc_master.cryosparc_compute.jobs.hetero_refine.run.run_hetero_refine.process_images
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 950, in cryosparc_master.cryosparc_compute.engine.engine.process
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 47, in cryosparc_master.cryosparc_compute.gpu.gpucore.initialize
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 220, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 144, in get_or_create_context
    return self._activate_context_for(devnum)
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 176, in _activate_context_for
    gpu = self.gpus[devnum]
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 40, in __getitem__
    return self.lst[devnum]
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
    return self.cuDeviceGetCount()
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)

What are the outputs of these commands on the worker node where you observed CUDA_ERROR_NO_DEVICE:

nvidia-smi --query-gpu=index,name,driver_version --format=csv
/home/xx/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
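
If it is easier to collect both outputs in one go, here is a minimal Python sketch that runs the two commands above and records the exit code and combined output of each. The helper names run_diagnostic and report are made up for this example, the 60 second timeout and the 127/126/124 exit codes for "not found", "not executable" and "timed out" are arbitrary choices, and the cryosparcw path is the one from this thread, so adjust it to your installation.

```python
import subprocess

# The two commands requested above; the cryosparcw path is taken from this thread.
COMMANDS = [
    ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv"],
    ["/home/xx/cryosparc/cryosparc_worker/bin/cryosparcw", "gpulist"],
]


def run_diagnostic(cmd, timeout=60):
    """Run one command and return (exit_code, combined stdout+stderr) without raising."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error text such as tracebacks in the output
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        return 127, f"command not found: {cmd[0]}"
    except PermissionError:
        return 126, f"not executable: {cmd[0]}"
    except subprocess.TimeoutExpired:
        return 124, f"timed out after {timeout}s: {' '.join(cmd)}"
    return proc.returncode, proc.stdout.strip()


def report(commands):
    """Build a plain-text report that can be pasted into a forum reply."""
    parts = []
    for cmd in commands:
        code, out = run_diagnostic(cmd)
        parts.append(f"$ {' '.join(cmd)}\n[exit code {code}]\n{out}")
    return "\n\n".join(parts)
```

Calling print(report(COMMANDS)) on the worker node prints both results, including any error text, so nothing is lost if a command fails.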

For the first command, nvidia-smi --query-gpu=index,name,driver_version --format=csv, the output was:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
For the second command, /home/xx/cryosparc/cryosparc_worker/bin/cryosparcw gpulist, the output was:
Traceback (most recent call last):
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_NO_DEVICE] Call to cuInit results in CUDA_ERROR_NO_DEVICE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/xx/cryosparc/cryosparc_worker/bin/connect.py", line 28, in print_gpu_list
    num_devs = len(cuda.gpus)
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 49, in __len__
    return len(self.lst)
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
    return self.cuDeviceGetCount()
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/xx/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)

"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" is the kind of error that may occur after an nvidia driver update and might be cured by a system reboot.
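
One thing that can be checked without nvidia-smi is whether the nvidia kernel module is currently loaded. A missing module is only one possible reason for nvidia-smi being unable to talk to the driver, and nothing in this thread confirms it is the cause here, so treat the following as a rough sketch. It assumes a Linux host where /proc/modules lists the loaded modules (module name in the first column) and where /proc/driver/nvidia/version is present while the driver is loaded. The function names are invented for this example.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


def nvidia_module_loaded(proc_modules_text):
    """True if the core 'nvidia' module itself (not only nvidia_uvm etc.) is listed."""
    return "nvidia" in loaded_modules(proc_modules_text)


def check_host(proc_modules=Path("/proc/modules"),
               version_file=Path("/proc/driver/nvidia/version")):
    """Summarize the driver state of this machine as a one-line string."""
    if not proc_modules.exists():
        return "cannot check: /proc/modules not found (is this a Linux host?)"
    if not nvidia_module_loaded(proc_modules.read_text()):
        return "nvidia kernel module is NOT loaded"
    if version_file.exists():
        lines = version_file.read_text().splitlines()
        if lines:
            return "nvidia kernel module loaded: " + lines[0].strip()
    return "nvidia kernel module loaded (version file not readable)"
```

Running print(check_host()) on the worker node shows whether the module is present. If it is not loaded even after a reboot, that points toward a driver installation problem rather than anything inside CryoSPARC.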

We rebooted the server just a few days ago. Besides rebooting, could we run some command in the terminal to update the nvidia driver?

Some computers are configured to update software automatically in the background. If the nvidia driver was updated in this way, nvidia-related commands may be broken until the system is rebooted.

You may want to contact your IT support for help with the nvidia driver (v520 or newer for CryoSPARC v4.4) and a functional nvidia-smi command.
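
Once nvidia-smi works again, the driver version it reports can be compared against the v520 minimum mentioned above. The sketch below parses the CSV output of the nvidia-smi query used earlier in this thread and checks only the major version number. The function names are invented for this example, and any GPU names or version strings you feed it for testing are placeholders, not output from this system.

```python
import csv
import io

MIN_DRIVER_MAJOR = 520  # minimum stated in this thread for CryoSPARC v4.4


def parse_gpu_csv(text):
    """Parse output of: nvidia-smi --query-gpu=index,name,driver_version --format=csv

    Returns one dict per GPU, keyed by the CSV header names.
    Returns an empty list if there are no data rows (for example, error text only).
    """
    rows = [row for row in csv.reader(io.StringIO(text), skipinitialspace=True) if row]
    if len(rows) < 2:
        return []
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in rows[1:]]


def driver_ok(version, minimum_major=MIN_DRIVER_MAJOR):
    """True if the major part of a version like '535.104.05' meets the minimum."""
    major = int(version.split(".")[0])
    return major >= minimum_major
```

For each entry returned by parse_gpu_csv, driver_ok(entry["driver_version"]) indicates whether that GPU's driver meets the minimum. Comparing only the major number is a simplification that is enough for a "520 or newer" check.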