We added 4 new nodes to our cluster with NVIDIA RTX A6000 cards. The installed driver version is 470.57.02.
We are able to run GPU jobs for other software, and nvidia-smi detects the cards and displays the proper driver version.
However, CryoSPARC jobs fail with the following error:
[CPU: 2.63 GB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 332, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/alignment.py", line 113, in align_symmetry
    cuda_core.initialize([cuda_dev])
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 29, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
[cryosparc_user@node05 bin]$ uname -a && free -g && nvidia-smi
Linux node05 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            503           3         499           0           0         497
Swap:            15           0          15
Thu Oct 6 10:11:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                  Off |
| 30%   26C    P8     7W / 300W |      0MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
| 30%   28C    P8     8W / 300W |      0MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
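For reference, the pycuda initialization that fails inside the job can also be exercised by hand, outside of a scheduled job, from the worker's bin/ directory shown in the prompt above. This is only a sketch; the cryosparcw subcommand assumes a v2/v3-era worker install, and the paths are illustrative:

# device nodes must exist and be readable by the cryosparc_user account
ls -l /dev/nvidia*

# an empty or stale value here would hide the GPUs from CUDA
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# enumerate GPUs with the same pycuda that the jobs use
./cryosparcw gpulist

If these checks pass interactively but jobs still fail, the difference is more likely in the job environment (scheduler GPU binding, cgroup device restrictions, or a CUDA_VISIBLE_DEVICES value set by the queue) than in the driver itself.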
We recently ran test workflows with CUDA 11.8. On CentOS 7 I recall an isolated occurrence of cufftInvalidPlan during Patch Motion Correction, which we could not reproduce in follow-up testing. My recommendation (an example invocation is sketched after this list) would be to:
- run the installation using a v11.8 runfile
- on a CentOS-7 box
- with the NVIDIA driver already installed
- as a non-root user, with the --toolkit, --toolkitpath=, and --defaultroot= (matching toolkitpath) options
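For illustration only, such an install might look like the sketch below; the runfile name and the /opt/cryoem/cuda-11.8 target path are placeholders rather than required values, and --silent is included only so the path options take effect without the interactive menu:

# toolkit-only install of CUDA 11.8 as a non-root user (driver already installed system-wide)
sh cuda_11.8.x_linux.run --silent --toolkit \
    --toolkitpath=/opt/cryoem/cuda-11.8 \
    --defaultroot=/opt/cryoem/cuda-11.8

The worker would then still need to be pointed at the new toolkit (on v2/v3-era installs, cryosparcw newcuda <path to cuda>, if that applies to your setup).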