V4 upgrade - cuInit failed: no CUDA-capable device is detected

Greetings.

We upgraded one of our workstations from the latest V3 build to the latest V4 build. All appeared to go well.

We are getting the following error with jobs:

[CPU: 392.0 MB]
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 630, in cryosparc_compute.jobs.helix.run_refine.run
  File "/executor/opt/cryoem/cryosparc/cryosparc_worker/cryosparc_compute/alignment.py", line 113, in align_symmetry
    cuda_core.initialize([cuda_dev])
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 29, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected

But cryosparcw does see the GPUs:

cryosparcw gpulist
Detected 2 CUDA devices.

  id     pci-bus       name
   0     0000:21:00.0  GeForce RTX 2080 Ti
   1     0000:31:00.0  GeForce RTX 2080 Ti

Please can you confirm:

  1. this is a single-workstation, combined master/worker CryoSPARC instance.
  2. all update steps were performed on this specific computer (as opposed to on another computer that may share the same filesystem)

and post the outputs of a few diagnostic commands (CUDA path, nvcc version, pycuda version, and nvidia-smi).

Of course, glad to provide this…

It's a single system, but it is running SLURM for multi-user access, so it is configured more like a cluster, using cluster_script.sh.
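For reference, the template follows the general shape of the stock SLURM example from the CryoSPARC docs. A rough sketch is below; the partition name and resource lines are placeholders rather than our actual values, and the {{ ... }} fields are filled in by CryoSPARC at submission time:

#!/usr/bin/env bash
# cluster_script.sh sketch for a single-node SLURM setup (placeholder values)
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_dir_abs }}/slurm-%j.out
#SBATCH --error={{ job_dir_abs }}/slurm-%j.err

# CryoSPARC substitutes the actual worker command here
{{ run_cmd }}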

Steps to update:

cryosparcm stop
cryosparcm update
cd cryosparc_worker/
cp …/cryosparc_master/cryosparc_worker.tar.gz .
cryosparcw update
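
To double-check that the master and the worker actually ended up on the same build after those steps, comparing the two version files should be enough (a sketch; the install paths below are placeholders for our actual locations):

# Both files should show the same v4.x release after a successful update
cat /path/to/cryosparc_master/version
cat /path/to/cryosparc_worker/version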

Troubleshooting feedback:

echo $CRYOSPARC_CUDA_PATH
/opt/local/cuda-10.0

${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(10, 0, 0)

uname -a && free -g && nvidia-smi
Linux executor.structbio.pitt.edu 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            503         131          74           0         297         370
Swap:            15           0          15
Tue Dec 6 11:59:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:21:00.0 Off |                  N/A |
| 27%   34C    P8    19W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:31:00.0 Off |                  N/A |
| 27%   32C    P8    20W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I wonder whether your cluster script might be using CUDA_VISIBLE_DEVICES to avoid "busy" devices, and found only "busy" devices when this job was run?
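For example, something along these lines, submitted with the same GPU request your template uses, would show what CUDA_VISIBLE_DEVICES is set to inside an allocation and which GPUs the job can actually reach (just a sketch; adjust the partition and --gres options to match your cluster_script.sh):

# run an interactive one-off job with one GPU and report what it sees
srun --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=[$CUDA_VISIBLE_DEVICES]"; nvidia-smi -L'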
Please can you e-mail us the job report.

Hmm, it's possible, as we are using CUDA_VISIBLE_DEVICES in the script template, but it's strange that this never happened to us on V3. It also seems to happen even when I'm sure a GPU is free and no one else is using the system.
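If the template builds CUDA_VISIBLE_DEVICES itself from an idle-GPU scan (as some example templates do) and that scan ever came up empty, the variable would be exported as an empty string, which as far as I understand is enough on its own to make cuInit fail this way. Something I can try in the same worker environment as the pycuda check above (just a sketch):

# With nothing visible, cuInit typically fails with the same
# "no CUDA-capable device is detected" message; with a GPU visible,
# it should print the device count instead.
export CUDA_VISIBLE_DEVICES=""
python -c "import pycuda.driver as cuda; cuda.init(); print(cuda.Device.count(), 'device(s) visible')"
unset CUDA_VISIBLE_DEVICES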

I will collect this log and email it as requested – what address should I send this to?

Thank you

I sent you a forum direct message.