New cluster nodes, CUDA detection error

Greetings.

We added 4 new nodes to our cluster with NVIDIA RTX A6000 cards. The installed driver version is 470.57.02.

We are able to run GPU jobs for other software, and nvidia-smi detects the cards and displays the proper driver version.

However, CryoSPARC jobs fail with the following error:

[CPU: 2.63 GB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 332, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/alignment.py", line 113, in align_symmetry
    cuda_core.initialize([cuda_dev])
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 29, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
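
The traceback ends in cuda_core.initialize, where pycuda's cuInit call fails. A minimal sketch for reproducing that step outside of a CryoSPARC job is given here, assuming it is run with the worker's Python environment (after eval $(cryosparcw env)). The function name probe_cuda and the fields it returns are made up for illustration and are not part of CryoSPARC. Reporting CUDA_VISIBLE_DEVICES is my own addition, since a job environment can set it differently from an interactive shell.

```python
import os


def probe_cuda():
    """Try to initialize CUDA through pycuda and list the visible devices.

    Hypothetical helper for illustration; it is not part of CryoSPARC.
    """
    report = {
        "ok": False,
        "stage": None,
        "error": None,
        "devices": [],
        # Assumption: worth recording, because a scheduler may set this
        # variable for jobs but not for an interactive shell.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import pycuda.driver as cuda
    except Exception as exc:  # pycuda missing or its libraries not loadable
        report["stage"] = "import"
        report["error"] = str(exc)
        return report
    try:
        cuda.init()  # wraps cuInit, the call that fails in the traceback
    except Exception as exc:
        report["stage"] = "cuInit"
        report["error"] = str(exc)
        return report
    for index in range(cuda.Device.count()):
        device = cuda.Device(index)
        report["devices"].append(
            (index, device.name(), device.compute_capability())
        )
    report["ok"] = True
    report["stage"] = "done"
    return report


if __name__ == "__main__":
    print(probe_cuda())
```

If this fails at the cuInit stage in a batch job but succeeds in an interactive shell on the same node, the difference is more likely in the job environment than in the toolkit.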

Could you please open a shell on one of the new cluster nodes and run the following under the Linux account that owns the CryoSPARC instance:

Sure – here you go…

[cryosparc_user@node05 bin]$ eval $(/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw env)

[cryosparc_user@node05 bin]$ echo $CRYOSPARC_CUDA_PATH
/cm/shared/apps/cuda10.2/toolkit/10.2.89

[cryosparc_user@node05 bin]$ ${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

[cryosparc_user@node05 bin]$ python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(10, 2, 0)

[cryosparc_user@node05 bin]$ uname -a && free -g && nvidia-smi
Linux node05 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            503           3         499           0           0         497
Swap:            15           0          15
Thu Oct 6 10:11:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                  Off |
| 30%   26C    P8     7W / 300W |      0MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
| 30%   28C    P8     8W / 300W |      0MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You may need to point CRYOSPARC_CUDA_PATH at a newer version of the CUDA toolkit, one that supports the A6000 cards (see the guide). The currently configured toolkit, 10.2, predates these Ampere-generation cards.
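
To illustrate why the toolkit version matters: each GPU generation has a compute capability, and a toolkit can only build native code for capabilities it knows about. The sketch below encodes a few minimum toolkit versions from memory of NVIDIA's release notes (the RTX A6000 is compute capability 8.6, first targeted by CUDA 11.1). Treat the table as an assumption to verify against NVIDIA's documentation; the names are hypothetical.

```python
# Minimum CUDA toolkit release able to build native code for a given
# compute capability. Assumption: values recalled from NVIDIA release
# notes; verify before relying on them.
MIN_TOOLKIT_FOR_CC = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (e.g. RTX A6000)
}


def toolkit_targets(compute_capability, toolkit_version):
    """Return True/False if known, or None for an unlisted capability."""
    required = MIN_TOOLKIT_FOR_CC.get(compute_capability)
    if required is None:
        return None
    return toolkit_version >= required


A6000_CC = (8, 6)
print("CUDA 10.2 targets A6000:", toolkit_targets(A6000_CC, (10, 2)))
print("CUDA 11.8 targets A6000:", toolkit_targets(A6000_CC, (11, 8)))
```

Under these assumptions the first check prints False and the second True, which is consistent with the suggestion to move to an 11.x toolkit.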

I’ll test this and report back.

What is the latest CUDA version supported by CryoSPARC?

The latest CUDA release today is 11.8, but the documentation seems to suggest that CryoSPARC only supports up to version 11.2. Is this accurate?

Thanks!

We recently ran test workflows with 11.8. For CentOS-7 I remember an isolated occurrence of cufftInvalidPlan during Patch Motion Correction, which we could not reproduce in follow-up testing. My recommendation would be to:

  1. run the installation using a v11.8 runfile
    • on a CentOS-7 box
    • with the NVIDIA driver installed
    • as a non-root user, with the --toolkit, --toolkitpath=, and --defaultroot= (matching toolkitpath) options
  2. run cryosparcw newcuda .. as needed
  3. test with a workflow, with “Run all job types” enabled
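
The first two steps above can be sketched as a dry run that only prints the commands, so they can be reviewed before anything is executed. The runfile name and the target path below are placeholders, not values from this thread, and the --silent flag and the path argument to cryosparcw newcuda are my assumptions about typical usage. Only --toolkit, --toolkitpath=, and --defaultroot= come from the recommendation itself.

```python
import shlex


def build_toolkit_install_command(runfile, toolkit_path):
    """Build a toolkit-only install command for a CUDA runfile (not executed)."""
    return [
        "sh",
        runfile,
        "--silent",                        # assumption: non-interactive install
        "--toolkit",                       # install the toolkit only, no driver
        f"--toolkitpath={toolkit_path}",
        f"--defaultroot={toolkit_path}",   # matches toolkitpath, as recommended
    ]


def build_newcuda_command(cryosparcw, toolkit_path):
    """Build the cryosparcw call that points the worker at the new toolkit."""
    return [cryosparcw, "newcuda", toolkit_path]


def render(command):
    """Quote a command list so it can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in command)


# Hypothetical values; adjust to the actual download and install location.
RUNFILE = "cuda_11.8.0_linux.run"
TOOLKIT_PATH = "/path/to/cuda-11.8"
CRYOSPARCW = "/opt/cryoem/cryosparc/cryosparc2_worker/bin/cryosparcw"

print(render(build_toolkit_install_command(RUNFILE, TOOLKIT_PATH)))
print(render(build_newcuda_command(CRYOSPARCW, TOOLKIT_PATH)))
```

Printing rather than running keeps the sketch safe to try on a login node; the printed lines would then be run as the non-root CryoSPARC user on a CentOS-7 box with the driver installed.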

Excellent. I will try this once we get our installation stabilized after the issue in my other open ticket.

Appreciate the guidance.