Hello,
I just upgraded to v4.4.1 and got errors saying that my CUDA drivers were incompatible. I had a fairly old driver installed and was running CUDA 10.1, so I updated both the NVIDIA driver and the CUDA toolkit to the latest versions. I am running CentOS 7 with four GeForce RTX 2080 Ti cards.
After updating the drivers, I get errors very similar to those in this thread: "CUDA issue after updating to v.4.4" (post #6 by KyleBarrie).
However, that issue seemed to be a version mismatch between nvidia-smi and the driver. I don't appear to have that problem: nvidia-smi reports 545.23.08 for both itself and the driver:
$ nvidia-smi
Fri Feb 16 14:22:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:18:00.0 Off | N/A |
| 31% 33C P8 3W / 250W | 18MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 2080 Ti Off | 00000000:3B:00.0 On | N/A |
| 31% 33C P8 6W / 250W | 38MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 2080 Ti Off | 00000000:86:00.0 Off | N/A |
| 32% 32C P8 3W / 250W | 7MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 2080 Ti Off | 00000000:AF:00.0 Off | N/A |
| 30% 31C P8 3W / 250W | 7MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Here are the PATH-related variables in the cryoSPARC worker environment:
$ ./bin/cryosparcw call env | grep PATH
MANPATH=/usr/share/man/openmpi-x86_64:/usr/share/man:/usr/local/share/man
NUMBA_CUDA_INCLUDE_PATH=/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:/data/software/repo/relion/4.0.1/lib:/usr/lib64/openmpi/lib:/usr/local/cuda-8.0/lib64:/usr/local/cuda-10.1/lib64:/usr/local/bsoft/lib:/usr/local/lib:/usr/local/lib
PATH=/data/software/cryosparc/cryosparc2_worker/bin:/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/data/software/cryosparc/cryosparc2_worker/deps/anaconda/condabin:/home/spuser/anaconda3/bin:/home/spuser/anaconda3/condabin:/data/software/cryosparc/cryosparc2_master/bin:/opt/bin:/opt/cistem-1.0.0-beta:/opt/frealign_v9.11/bin:/data/software/repo/relion/4.0.1/bin:/opt/pyem:/usr/lib64/openmpi/bin:/usr/local/cuda-10.1/bin:/usr/local/cuda-8.0/bin:/opt/bin:/opt/cistem-1.0.0-beta:/opt/frealign_v9.11/bin:/usr/lib64/qt-3.3/bin:/usr/local/bsoft/bin:/usr/local/MATLAB/R2021a/bin:/usr/local/phenix-1.19.2-4158/build/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/bin/appion:/home/spuser/.local/bin:/home/spuser/bin
MODULEPATH=/opt/sp/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles
LIBTBX_OPATH=
CRYOSPARC_PATH=/data/software/cryosparc/cryosparc2_worker/bin
PYTHONPATH=/data/software/cryosparc/cryosparc2_worker
QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
CRYOSPARC_CUDA_PATH=/usr/local/cuda
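One thing I notice in this output is that LD_LIBRARY_PATH and PATH still contain entries for /usr/local/cuda-10.1 and /usr/local/cuda-8.0, even though the driver now reports CUDA 12.3. I have not confirmed that these are related to the error. As a rough sketch for listing such entries, something like the following could be used; the function name `versioned_cuda_entries` and the regular expression are my own and not part of cryoSPARC:

```python
import os
import re

# Matches directory components such as "cuda-10.1" or "cuda-8.0".
# An unversioned "/usr/local/cuda" entry is deliberately not matched.
CUDA_DIR = re.compile(r"cuda-(\d+(?:\.\d+)*)")


def versioned_cuda_entries(path_value):
    """Return (entry, version) pairs for colon-separated entries that
    name a versioned CUDA directory."""
    found = []
    for entry in path_value.split(":"):
        match = CUDA_DIR.search(entry)
        if match:
            found.append((entry, match.group(1)))
    return found


if __name__ == "__main__":
    # Report versioned CUDA directories in the two search-path variables.
    for var in ("LD_LIBRARY_PATH", "PATH"):
        for entry, version in versioned_cuda_entries(os.environ.get(var, "")):
            print(f"{var}: {entry} (CUDA {version})")
```

Saved under a hypothetical name such as check_cuda_paths.py, it could presumably be run in the worker environment with ./bin/cryosparcw call python check_cuda_paths.py, the same way the env command above was run.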
Even so, cryoSPARC now fails to detect my GPUs when I connect the worker:
$ ./bin/cryosparcw connect --worker localhost --master localhost --port 39000 --gpus 0,1,2,3
---------------------------------------------------------------
CRYOSPARC CONNECT --------------------------------------------
---------------------------------------------------------------
Attempting to register worker localhost to command localhost:39002
Connecting as unix user spuser
Will register using ssh string: spuser@localhost
If this is incorrect, you should re-run this command with the flag --sshstr <ssh string>
---------------------------------------------------------------
Connected to master.
---------------------------------------------------------------
Current connected workers:
baker1
---------------------------------------------------------------
Worker will be registered with 64 CPUs.
Autodetecting available GPUs...
Traceback (most recent call last):
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
self.cuInit(0)
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_UNKNOWN] Call to cuInit results in CUDA_ERROR_UNKNOWN
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bin/connect.py", line 233, in <module>
gpu_devidxs = check_gpus()
File "bin/connect.py", line 101, in check_gpus
num_devs = print_gpu_list()
File "bin/connect.py", line 28, in print_gpu_list
num_devs = len(cuda.gpus)
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 49, in __len__
return len(self.lst)
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
numdev = driver.get_device_count()
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
return self.cuDeviceGetCount()
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
self.ensure_initialized()
File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_UNKNOWN (999)
Where should I start to debug this issue?
Thanks!
Rick