Issue launching GPU jobs

Dear Team,

We were following the Introductory tutorial. We completed Steps 1 through 6 successfully but failed at Step 7 (Particle Picking: Blob Picker), as the job could not detect the GPU on the cluster server and produced the following error:

"cryosparc_master/cryosparc_compute/jobs/template_picker_gpu/run.py

", line 59, in cryosparc_compute.jobs.template_picker_gpu.run.run?
File “cryosparc_master/cryosparc_compute/engine/cuda_core.py”, lin
e 29, in cryosparc_compute.engine.cuda_core.initialize?pycuda._driv
er.RuntimeError: cuInit failed: no CUDA-capable device is detected

I have attached the job event log to this post. Please help us complete the tutorial so we can start processing our actual data.

Thanks,
Varun Jha
HPC Team, IIT Delhi

Please can you provide additional information:

  1. Which version of CryoSPARC are you using?
  2. Is the CryoSPARC instance set up as
    • a combined master/worker on a single host
    • a master host with separate connected workers
    • a master host with a connected cluster
  3. Does the “cluster server” have a compatible GPU, compatible GPU driver and, if the CryoSPARC version is below 4.4, a compatible CUDA toolkit (guide)? What are the outputs of the commands
    nvidia-smi
    /path/to/cryosparc_worker/bin/cryosparcw gpulist
    
    ?
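
If it helps, here is a minimal check that exercises the same CUDA initialization path the job uses (a sketch; adjust the worker path, and note this relies on the pycuda that pre-v4.4 releases use):

    /path/to/cryosparc_worker/bin/cryosparcw call python -c "import pycuda.driver as cu; cu.init(); print(cu.Device.count(), 'CUDA device(s) visible')"

If this raises the same "cuInit failed" error, the problem lies at the driver/device level (for example, the job landed on a node without a visible GPU) rather than in CryoSPARC itself.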

Dear Team,

Please find the answers below, with the corresponding command outputs.

  1. Which version of CryoSPARC are you using?

CryoSPARC v4.1.2

  2. Is the CryoSPARC instance set up as
  • a master host with a connected cluster
  3. Does the “cluster server” have a compatible GPU, compatible GPU driver and, if the CryoSPARC version is below 4.4, a compatible CUDA toolkit (guide)? What are the outputs of the commands

Yes, the cluster server has compatible GPUs (Kepler K40, Volta V100, Ampere A100).

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE…    Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    37W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE…    Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0    40W / 250W |      0MiB / 32768MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
./cryosparcw gpulist

Detected 2 CUDA devices.

   id     pci-bus       name

   0      0000:86:00.0  Tesla V100-PCIE-32GB
   1      0000:D8:00.0  Tesla V100-PCIE-32GB

  4. Additional information: when I try to manually add a test GPU node to the master node with the command:

./cryosparcw connect --worker <node_name> --master <master_node_name> --port <port_number> --nossd

Autodetecting available GPUs…
Traceback (most recent call last):
  File "bin/connect.py", line 221, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 91, in check_gpus
    num_devs = print_gpu_list()
  File "bin/connect.py", line 23, in print_gpu_list
    import pycuda.driver as cudrv
  File "/scratch/cc/vfaculty/varunj.vfaculty/cryosparcv4.1.2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcurand.so.10.0: cannot open shared object file: No such file or directory

I have attached the output of the required commands. Please assist us further in rectifying these problems so we can move ahead with the Introductory tutorial and process real data.

Thanks,
Varun Jha
HPC Team, IIT Delhi

Did you run this command on the GPU node? If so, please can you post the output of these commands, run on the same GPU node:

/path/to/cryosparc_worker/bin/cryosparcw call which nvcc
/path/to/cryosparc_worker/bin/cryosparcw call nvcc --version
/path/to/cryosparc_worker/bin/cryosparcw call python -c "import pycuda.driver; print(pycuda.driver.get_version())"

Is the directory /scratch/cc/vfaculty/varunj.vfaculty/cryosparcv4.1.2/cryosparc_worker shared between cluster nodes?

Did you consider:

  1. ensuring the NVIDIA driver version is ≥ 520 on all cluster GPU nodes
  2. updating to CryoSPARC v4.4, which bundles the CUDA toolkit and may thereby help avoid certain CUDA dependency mismatches
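
If the libcurand ImportError persists, one way to see which CUDA libraries the compiled pycuda extension cannot resolve is an ldd check along these lines (a sketch; cryosparcw call runs the command inside the worker environment, and the module path is looked up dynamically):

    /path/to/cryosparc_worker/bin/cryosparcw call bash -c 'ldd "$(python -c "import pycuda._driver as m; print(m.__file__)")" | grep "not found"'

Any line reporting libcurand.so.10.0 => not found would confirm that the CUDA 10 toolkit libraries are missing from the worker's library path, a class of mismatch the bundled toolkit in v4.4 is designed to avoid.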

Dear Team,

UPDATE:
We successfully integrated CryoSPARC v4.4 with our HPC cluster and could run the initial tutorials available on the CryoSPARC website.

While launching some jobs, we are currently facing the error below with some cryo-EM .mrc files during Patch Motion Correction.

Here is the error from job.log:

ValueError: 4294967295 is not a valid CUresult

BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

FYI, we are working with the GPU driver version given below:

| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |

Please help us rectify this issue in launching the jobs; your early response will be appreciated.

Thanks,
Varun Jha
HPC Team IIT Delhi

@varun_jha Please can you run this command after substituting the actual project and job IDs for the

  • Patch Motion Correction
  • upstream Import Movies

jobs:

cryosparcm cli "get_job('PX', 'JX', 'job_type', 'params_spec', 'version', 'instance_information')"
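
For example, with hypothetical project ID P3 and job ID J42, the call for the Patch Motion Correction job would look like:

    cryosparcm cli "get_job('P3', 'J42', 'job_type', 'params_spec', 'version', 'instance_information')"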