Upgrade to 4.4.1 leads to CUDA incompatibility

RickBaker · February 16, 2024, 8:41pm

Hello,

I just upgraded to v4.4.1 and got errors saying that my CUDA drivers were incompatible. I had a quite old driver version installed and was running CUDA 10.1, so I updated my CUDA drivers and toolkit to the latest version. I am running CentOS 7 with GeForce RTX 2080 Ti GPUs.

After updating the drivers, I get errors very similar to those in this thread: CUDA issue after updating to v.4.4 - #6 by KyleBarrie

However, that issue seemed to be a version mismatch between nvidia-smi and the drivers. I don’t have that mismatch, based on the output of nvidia-smi:

$ nvidia-smi
Fri Feb 16 14:22:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:18:00.0 Off |                  N/A |
| 31%   33C    P8               3W / 250W |     18MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:3B:00.0  On |                  N/A |
| 31%   33C    P8               6W / 250W |     38MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:86:00.0 Off |                  N/A |
| 32%   32C    P8               3W / 250W |      7MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:AF:00.0 Off |                  N/A |
| 30%   31C    P8               3W / 250W |      7MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
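
As a rough way to check for the kind of mismatch mentioned above, the driver version in the nvidia-smi banner can be compared with the version of the loaded kernel module. The sketch below is only an illustration: the function names are made up for this example, and it assumes the kernel module version is readable from /proc/driver/nvidia/version in the usual "NVRM version: ..." format.

```python
import os
import re


def parse_smi_banner(text):
    """Extract the versions from the nvidia-smi banner line, or return None."""
    m = re.search(
        r"NVIDIA-SMI\s+(\S+)\s+Driver Version:\s+(\S+)\s+CUDA Version:\s+(\S+)",
        text,
    )
    if m is None:
        return None
    return {"nvidia_smi": m.group(1), "driver": m.group(2), "cuda": m.group(3)}


def parse_kernel_module_version(text):
    """Extract the dotted version number from an 'NVRM version:' line, or return None."""
    m = re.search(r"NVRM version:.*?\s(\d+(?:\.\d+)+)\s", text)
    return m.group(1) if m else None


if __name__ == "__main__":
    # Banner line copied from the nvidia-smi output above.
    banner = "| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |"
    info = parse_smi_banner(banner)
    print("nvidia-smi reports:", info)

    # Assumption: this file exists when the NVIDIA kernel module is loaded.
    proc_file = "/proc/driver/nvidia/version"
    if os.path.exists(proc_file):
        with open(proc_file) as fh:
            loaded = parse_kernel_module_version(fh.read())
        print("loaded kernel module:", loaded)
        print("match:", loaded == info["driver"])
    else:
        print(proc_file, "not found; is the kernel module loaded?")
```

If the two versions differ, the user-space tools and the loaded kernel module come from different driver installs, which is the situation the linked thread describes.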

$ ./bin/cryosparcw call env | grep PATH
MANPATH=/usr/share/man/openmpi-x86_64:/usr/share/man:/usr/local/share/man
NUMBA_CUDA_INCLUDE_PATH=/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:/data/software/repo/relion/4.0.1/lib:/usr/lib64/openmpi/lib:/usr/local/cuda-8.0/lib64:/usr/local/cuda-10.1/lib64:/usr/local/bsoft/lib:/usr/local/lib:/usr/local/lib
PATH=/data/software/cryosparc/cryosparc2_worker/bin:/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/data/software/cryosparc/cryosparc2_worker/deps/anaconda/condabin:/home/spuser/anaconda3/bin:/home/spuser/anaconda3/condabin:/data/software/cryosparc/cryosparc2_master/bin:/opt/bin:/opt/cistem-1.0.0-beta:/opt/frealign_v9.11/bin:/data/software/repo/relion/4.0.1/bin:/opt/pyem:/usr/lib64/openmpi/bin:/usr/local/cuda-10.1/bin:/usr/local/cuda-8.0/bin:/opt/bin:/opt/cistem-1.0.0-beta:/opt/frealign_v9.11/bin:/usr/lib64/qt-3.3/bin:/usr/local/bsoft/bin:/usr/local/MATLAB/R2021a/bin:/usr/local/phenix-1.19.2-4158/build/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/bin/appion:/home/spuser/.local/bin:/home/spuser/bin
MODULEPATH=/opt/sp/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles
LIBTBX_OPATH=
CRYOSPARC_PATH=/data/software/cryosparc/cryosparc2_worker/bin
PYTHONPATH=/data/software/cryosparc/cryosparc2_worker
QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
CRYOSPARC_CUDA_PATH=/usr/local/cuda
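
Note that the LD_LIBRARY_PATH above still lists /usr/local/cuda-10.1 and /usr/local/cuda-8.0 directories even though nvidia-smi reports CUDA 12.3. Whether those entries contributed to the error is not established in this thread, but they are easy to list. A minimal sketch, with a hypothetical helper name and assuming toolkit directories follow the cuda-X.Y naming seen above:

```python
import os
import re


def cuda_toolkit_entries(path_value):
    """Return (entry, version) pairs for path entries that contain 'cuda-X.Y'."""
    found = []
    for entry in path_value.split(":"):
        m = re.search(r"cuda-(\d+(?:\.\d+)*)", entry)
        if m:
            found.append((entry, m.group(1)))
    return found


if __name__ == "__main__":
    # Inspect the current environment; run this inside the worker environment
    # to see what cryoSPARC jobs would inherit.
    for entry, version in cuda_toolkit_entries(os.environ.get("LD_LIBRARY_PATH", "")):
        print(f"CUDA {version}: {entry}")
```

Applied to the LD_LIBRARY_PATH value shown above, this reports three entries: two for CUDA 10.1 and one for CUDA 8.0.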

Even so, cryoSPARC is now unable to detect my GPUs:

$ ./bin/cryosparcw connect --worker localhost --master localhost --port 39000 --gpus 0,1,2,3
 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker localhost to command localhost:39002
  Connecting as unix user spuser
  Will register using ssh string: spuser@localhost
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string>
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    baker1
 ---------------------------------------------------------------
  Worker will be registered with 64 CPUs.
  Autodetecting available GPUs...
Traceback (most recent call last):
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_UNKNOWN] Call to cuInit results in CUDA_ERROR_UNKNOWN

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/connect.py", line 233, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 101, in check_gpus
    num_devs = print_gpu_list()
  File "bin/connect.py", line 28, in print_gpu_list
    num_devs = len(cuda.gpus)
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 49, in __len__
    return len(self.lst)
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
    return self.cuDeviceGetCount()
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/data/software/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_UNKNOWN (999)

Where should I start to debug this issue?

Thanks!
Rick
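
One possible starting point is to call cuInit directly, outside cryoSPARC and numba, since the traceback shows that this is the call that fails. The sketch below loads the driver library with ctypes; the library name libcuda.so.1 and the helper names are assumptions of this example, and the table of result codes is deliberately partial.

```python
import ctypes

# A few CUresult codes from the CUDA driver API; 999 is the one in the traceback.
KNOWN_CODES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe(code):
    """Map a CUresult integer to a name, falling back to the raw number."""
    return KNOWN_CODES.get(code, f"CUresult {code}")


def cu_init_status(lib_name="libcuda.so.1"):
    """Call cuInit(0) and return (code, description); code is None if loading fails."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    code = lib.cuInit(0)
    return code, describe(code)


if __name__ == "__main__":
    print(cu_init_status())
```

If this also returns 999 when run with the system Python, the failure is likely in the driver installation itself rather than in the cryoSPARC worker environment. That is an inference from where the call fails, not something confirmed in this thread.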

RickBaker · February 17, 2024, 11:59pm

I think I have fixed the issue. cryoSPARC can find the GPUs and motion correction is running right now.

I purged my machine of Nvidia drivers and did a fresh install.

I would highly recommend letting a sysadmin do this, if you have one. I tried a number of different ways to upgrade, install, and remove the Nvidia drivers, and more and more things kept breaking. This finally did the trick:

sudo yum -y remove '*nvidia*'
sudo yum -y install nvidia-driver nvidia-settings
sudo yum -y install cuda-drivers cuda
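
After a reinstall like this, the same check that failed in connect.py, len(cuda.gpus), can be repeated on its own before reconnecting the worker. This is a minimal sketch with a hypothetical function name; it assumes numba is importable, so it would need to run inside the worker environment (possibly via cryosparcw call, by analogy with the cryosparcw call env command above).

```python
def count_visible_gpus():
    """Return the number of GPUs numba can see, 0 on driver failure, None without numba."""
    try:
        from numba import cuda
        from numba.cuda.cudadrv.error import CudaSupportError
    except ImportError:
        # numba is not installed in this Python environment.
        return None
    try:
        # Same expression that raised in connect.py's print_gpu_list().
        return len(cuda.gpus)
    except CudaSupportError as exc:
        print(f"CUDA driver could not be initialized: {exc}")
        return 0


if __name__ == "__main__":
    print("GPUs visible to numba:", count_visible_gpus())
```

On the machine described in this thread, a working driver should report 4 GPUs, matching the nvidia-smi output.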