"not a valid CUresult" error

tsengoku · March 1, 2024, 7:52am

Hi.

We have 4 PCs.
Master: 163.212.164.235
Workers: 163.212.164.235, 163.212.164.234 (and two others)
163.212.164.234 has 2 GTX 1080Tis, and I am having trouble getting it to work after upgrading to CS v4.4.1. (The other three workers are fine).

When I tried to connect it to the master, I got the following error.

cryosparcw connect --worker 163.212.164.234 --master 163.212.164.235 --port 39000 --gpus 0,1 --update
 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker 163.212.164.234 to command 163.212.164.235:39002
  Connecting as unix user cryo2
  Will register using ssh string: cryo2@163.212.164.234
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string>
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    163.212.164.235
    163.212.164.244
    163.212.164.234
    163.212.164.230
 ---------------------------------------------------------------
  Worker will be registered with 64 CPUs.
 ---------------------------------------------------------------
  Updating target 163.212.164.234
  Current configuration:
               cache_path :  /ssd
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 25428623360, 'name': 'NVIDIA GeForce RTX 3090'}]
                 hostname :  163.212.164.234
                     lane :  default
             monitor_port :  None
                     name :  163.212.164.234
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,                                 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56                                , 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
                  ssh_str :  cryo2@163.212.164.234
                    title :  Worker node 163.212.164.234
                     type :  node
          worker_bin_path :  /home/users/cryo2/cryosparc2/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------
  Autodetecting available GPUs...
Traceback (most recent call last):
  File "bin/connect.py", line 165, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 101, in check_gpus
    num_devs = print_gpu_list()
  File "bin/connect.py", line 28, in print_gpu_list
    num_devs = len(cuda.gpus)
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba                                /cuda/cudadrv/devices.py", line 49, in __len__
    return len(self.lst)
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba                                /cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba                                /cuda/cudadrv/driver.py", line 425, in get_device_count
    return self.cuDeviceGetCount()
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba                                /cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba                                /cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba                                /cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "cuda/cuda.pyx", line 11326, in cuda.cuda.cuInit
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/enum.py", line 339,                                 in __call__
    return cls.__new__(cls, value)
  File "/home/users/cryo2/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/enum.py", line 663,                                 in __new__
    raise ve_exc
ValueError: 4294967295 is not a valid CUresult

How can I fix this error?

Thanks,
Toru

wtempel · March 1, 2024, 2:13pm

Welcome to the forum @tsengoku .
Please confirm:

cryosparcw connect is being run on the applicable worker.
The nvidia driver is at version 520 or above.

What are the outputs of these commands:

hostname -f
host $(hostname -f)
ip addr | grep '163.212.164.23'
nvidia-smi

tsengoku · March 2, 2024, 1:44am

Thanks, @wtempel.

Yes
It’s 525.78.01 (see below).

cryo2@EMPC2:~$ hostname -f
EMPC2
cryo2@EMPC2:~$ host $(hostname -f)
EMPC2 has address 127.0.1.1
cryo2@EMPC2:~$ ip addr | grep '163.212.164.23'
    inet 163.212.164.234/24 brd 163.212.164.255 scope global noprefixrou                                       te enp11s0
cryo2@EMPC2:~$ nvidia-smi
Sat Mar  2 10:36:09 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0C:00.0 Off |                  N/A |
| 23%   27C    P8     8W / 250W |      2MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:42:00.0 Off |                  N/A |
| 25%   30C    P8    11W / 250W |     89MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      2752      G   /usr/lib/xorg/Xorg                 33MiB |
|    1   N/A  N/A      2938      G   /usr/bin/gnome-shell               53MiB |
+-----------------------------------------------------------------------------+

nfrasser · March 6, 2024, 7:26pm

Hi @tsengoku, we are still investigating this. Could you also please post the output of the following command?

cryosparcw call numba -s

nfrasser · March 6, 2024, 7:35pm

One more question: When was your Nvidia driver last updated? Have you restarted since that update?

nfrasser · May 7, 2024, 7:07pm

For anyone else that encounters this with CryoSPARC v4.4, please try updating to the latest v4.5 release or newer. It that does not work, you may try downgrading to v4.3.1:

cryosparcm update --version=v4.3.1