Greetings.
We upgraded one of our workstations from the latest V3 build to the latest V4 build. All appeared to go well.
We are getting the following error with jobs:
[CPU: 392.0 MB]
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 630, in cryosparc_compute.jobs.helix.run_refine.run
  File "/executor/opt/cryoem/cryosparc/cryosparc_worker/cryosparc_compute/alignment.py", line 113, in align_symmetry
    cuda_core.initialize([cuda_dev])
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 29, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
But cryosparcw does see the GPUs:
cryosparcw gpulist
Detected 2 CUDA devices.
id pci-bus name
0 0000:21:00.0 GeForce RTX 2080 Ti
1 0000:31:00.0 GeForce RTX 2080 Ti
Of course, glad to provide this…
It’s a single system, but it runs SLURM for multi-user scheduling, so it is configured more like a cluster, using cluster_script.sh.
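For reference, the template is a minimal SLURM wrapper along these lines; the partition name below is a placeholder rather than our exact value, and the {{ ... }} fields are CryoSPARC’s cluster template variables:

#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --partition=any            # placeholder partition name
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

{{ run_cmd }}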
Steps to update:
cryosparcm stop
cryosparcm update
cd cryosparc_worker/
cp …/cryosparc_master/cryosparc_worker.tar.gz .
cryosparcw update
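For what it’s worth, a quick sanity check that the master and worker builds agree after an update, assuming the usual layout where each install directory ships a plain-text version file, is:

cryosparcm status | grep -i version
cat /executor/opt/cryoem/cryosparc/cryosparc_worker/version

Both should print the same v4.x build string.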
Troubleshooting feedback:
echo $CRYOSPARC_CUDA_PATH
/opt/local/cuda-10.0
${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(10, 0, 0)
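Note that pycuda.driver.get_version() only reports the CUDA version PyCUDA was compiled against; it never calls into the driver. To exercise the same cuInit path that fails in the job, a one-liner like the following can be run from the same worker Python environment (get_driver_version() and Device.count() are standard PyCUDA calls):

python -c "import pycuda.driver as drv; drv.init(); print('driver CUDA version:', drv.get_driver_version()); print('devices visible:', drv.Device.count())"

If this fails with the same cuInit error in an interactive shell, the environment itself is the problem (e.g. an empty or stale CUDA_VISIBLE_DEVICES); if it succeeds interactively but the error only appears inside SLURM jobs, that points at the submission environment.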
uname -a && free -g && nvidia-smi
Linux executor.structbio.pitt.edu 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            503         131          74           0         297         370
Swap:            15           0          15
Tue Dec 6 11:59:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:21:00.0 Off |                  N/A |
| 27%   34C    P8    19W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:31:00.0 Off |                  N/A |
| 27%   32C    P8    20W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I wonder whether your cluster script might be setting CUDA_VISIBLE_DEVICES to steer jobs away from “busy” devices, and found only “busy” devices at the moment this job ran?
Could you please e-mail us the job report?
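For reference, the pattern I have in mind selects only idle GPUs by querying nvidia-smi and exporting the result, roughly like the sketch below (paraphrased from memory, so treat the exact details as approximate):

available_devs=""
for devidx in $(seq 0 15); do
    # treat a device as free only if nvidia-smi reports no compute processes on it
    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]]; then
        available_devs=${available_devs:+$available_devs,}$devidx
    fi
done
export CUDA_VISIBLE_DEVICES=$available_devs

If that loop happens to find no free device (or nvidia-smi misbehaves when it runs), CUDA_VISIBLE_DEVICES ends up empty, and cuInit inside the job then fails with exactly the “no CUDA-capable device is detected” error above.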
Hmm, it’s possible, as we are setting CUDA_VISIBLE_DEVICES in the script template, but it’s strange that this never happened to us on V3. It also seems to happen even when I’m sure a GPU is available and no one else is using the system.
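In the meantime, I’ll add some logging to the script template right before the worker command, so the job log records what actually gets exported when a failure happens, something like this (here {{ run_cmd }} just stands for the worker command line in our template):

echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"
nvidia-smi -L
{{ run_cmd }}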
I will collect this log and email it as requested – what address should I send this to?
Thank you
I sent you a forum direct message.