RuntimeError: cuInit failed: no CUDA-capable device is detected

Hi,
I’m having a weird issue where cryoSPARC cannot see the graphics cards on the node(s).
Here’s the error message:

Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 82, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 61, in cryosparc2_compute.jobs.template_picker_gpu.run.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 25, in cryosparc2_compute.engine.cuda_core.initialize
RuntimeError: cuInit failed: no CUDA-capable device is detected

Here’s the output of cryosparcw gpulist:
$ bin/cryosparcw gpulist
Detected 4 CUDA devices.

  id      pci-bus       name
   0      0000:02:00.0  GeForce GTX 1080 Ti
   1      0000:03:00.0  GeForce GTX 1080 Ti
   2      0000:81:00.0  GeForce GTX 1080 Ti
   3      0000:82:00.0  GeForce GTX 1080 Ti

After installing the new NVIDIA driver and the corresponding CUDA version (9.2), I updated the CUDA path with the “newcuda” option (the pycuda-2019.1 re-installation ran without any problem), and the GPU list shows up correctly, as above.

However, actual job execution repeatedly fails with the error message shown above.
My guess is that the worker is using a specific Python interpreter that is not updated by cryosparcw, but I still have no clue how to fix this.
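
One way to test that guess is to call the CUDA driver directly from whichever Python interpreter the job actually runs under. The following is only a minimal sketch, not part of cryoSPARC: `probe_cuda_driver` is a hypothetical helper name, and the library name `libcuda.so.1` is an assumption about a typical Linux driver install. It uses only the standard library and the driver API calls `cuInit` and `cuDeviceGetCount`, and it avoids Python 3 only syntax so that it can also run under an older interpreter.

```python
import ctypes
import os
import sys

CUDA_SUCCESS = 0
CUDA_ERROR_NO_DEVICE = 100  # driver code behind "no CUDA-capable device is detected"


def probe_cuda_driver(libname="libcuda.so.1"):
    """Report what the CUDA driver sees from the current interpreter.

    `libname` is an assumption (usual name of the driver library on Linux).
    """
    report = {
        "python": sys.executable,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "loaded": False,       # could the driver library be loaded at all?
        "cuInit": None,        # raw return code of cuInit(0)
        "device_count": None,  # filled in only if cuInit succeeded
    }
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return report
    report["loaded"] = True
    report["cuInit"] = lib.cuInit(0)
    if report["cuInit"] == CUDA_SUCCESS:
        count = ctypes.c_int(0)
        if lib.cuDeviceGetCount(ctypes.byref(count)) == CUDA_SUCCESS:
            report["device_count"] = count.value
    return report


if __name__ == "__main__":
    for key, value in sorted(probe_cuda_driver().items()):
        print("%s: %s" % (key, value))
```

If the same script reports four devices from a login shell but returns `cuInit` code 100 inside a job, the difference is more likely in the job environment than in the interpreter itself.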

If anyone has any insight on this, I would very much appreciate it.

Thanks.

best,
hee jong kim

Hi @heejongkim,

Could you check whether your issue is related to this one?

Unfortunately not.
We don’t even have conda on this cluster.
Others that do have conda have no issue with it.

Hey @heejongkim,

Can we see the cluster_submission.sh that you used to connect to your cluster? You can run the command cryosparcm cluster dump to get these scripts.
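
One reason the submission script matters: if it, or the scheduler, sets CUDA_VISIBLE_DEVICES to an empty or invalid value, the CUDA driver hides the GPUs from the job even though they are listed from a normal shell. Whether that is what happens here is an assumption, not something confirmed in this thread. The sketch below (hypothetical helper `visible_gpu_indices`) mimics how the driver interprets integer indices in that variable; UUID-style entries are not handled and are treated as invalid.

```python
import os


def visible_gpu_indices(value, n_gpus):
    """Return the GPU indices CUDA would expose for a CUDA_VISIBLE_DEVICES value.

    None (variable unset) exposes every device. Otherwise entries are read
    left to right, and the first invalid entry hides itself and everything
    after it. An empty string therefore hides all devices.
    """
    if value is None:
        return list(range(n_gpus))
    visible = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= n_gpus:
            break  # invalid entry: stop here, later entries are ignored
        visible.append(int(token))
    return visible


if __name__ == "__main__":
    current = os.environ.get("CUDA_VISIBLE_DEVICES")
    # 4 matches the number of cards in the gpulist output above.
    print("CUDA_VISIBLE_DEVICES=%r -> visible: %s"
          % (current, visible_gpu_indices(current, 4)))
```

Running this inside a submitted job, rather than on the login node, shows the value the job really receives.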

After trying several different approaches, I have tentatively got it working again.

Here’s the brief summary for the record:

  1. Initially, the worker was installed under NVIDIA driver 396.37 with CUDA 8 (4x 1080 Ti).
  2. After running newcuda with CUDA 9.2, the symptom started.
  3. Whatever I tried, the command line reported all GPUs with no issue, and the web GUI resource tab also listed all the CPUs and GPUs correctly, but jobs kept failing with the same error as above.
  4. Reverting to CUDA 8 with cryosparcw newcuda did resolve the issue.
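
Since CUDA 8 works where CUDA 9.2 fails, one thing worth ruling out is that the driver loaded on the compute node supports a lower CUDA version than the toolkit the worker was rebuilt against. This is a possibility to check, not a confirmed cause. The driver API call `cuDriverGetVersion` reports the highest supported CUDA version as an integer of the form 1000*major + 10*minor; the sketch below decodes it and compares it with a toolkit version. The helper names and the sample values 8000 and 9020 are illustrative only.

```python
def decode_cuda_version(raw):
    """Split the integer from cuDriverGetVersion (1000*major + 10*minor)."""
    return raw // 1000, (raw % 1000) // 10


def driver_supports_toolkit(raw_driver_version, toolkit):
    """True if the driver's supported CUDA version is at least `toolkit`.

    `toolkit` is a (major, minor) tuple, e.g. (9, 2) for CUDA 9.2.
    """
    return decode_cuda_version(raw_driver_version) >= tuple(toolkit)


if __name__ == "__main__":
    # Illustrative values, not measurements from the node in this thread.
    for raw in (8000, 9020):
        print("%d -> %s, supports CUDA 9.2: %s"
              % (raw, decode_cuda_version(raw), driver_supports_toolkit(raw, (9, 2))))
```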

My suspicion is that there might be a “Python environment referencing issue” deep inside cryoSPARC.
Because the node is in high demand, I can’t do much more than this for now, but I’m planning to restore the node image that was made before the cryoSPARC installation to narrow the issue down to either the master or the worker.

If you have any other suggestions, please enlighten me.

Thanks.