Issue launching GPU jobs

Dear Team,

We were following the Introductory tutorial. We completed Steps 1 through 6 successfully but failed at Step 7 (Particle Picking: Blob Picker), as the job could not detect the GPU on the cluster server and produced the following error:

"cryosparc_master/cryosparc_compute/jobs/template_picker_gpu/run.py

", line 59, in cryosparc_compute.jobs.template_picker_gpu.run.run?
File “cryosparc_master/cryosparc_compute/engine/cuda_core.py”, lin
e 29, in cryosparc_compute.engine.cuda_core.initialize?pycuda._driv
er.RuntimeError: cuInit failed: no CUDA-capable device is detected

I have attached the job event log to this post. Please help us complete the tutorial so we can start processing our actual data.

Thanks,
Varun Jha
HPC Team, IIT Delhi

Please can you provide additional information:

  1. Which version of CryoSPARC are you using?
  2. Is the CryoSPARC instance set up as
    • a combined master/worker on a single host
    • a master host with separate connected workers
    • a master host with a connected cluster
  3. Does the “cluster server” have a compatible GPU, compatible GPU driver and, if the CryoSPARC version is below 4.4, a compatible CUDA toolkit (guide)? What are the outputs of the commands
    nvidia-smi
    /path/to/cryosparc_worker/bin/cryosparcw gpulist
    
    ?
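
If it helps, here is a minimal check that exercises the same CUDA initialization path the job uses (a sketch; adjust the worker path, and note this relies on the pycuda that pre-v4.4 releases use):

    /path/to/cryosparc_worker/bin/cryosparcw call python -c "import pycuda.driver as cu; cu.init(); print(cu.Device.count(), 'CUDA device(s) visible')"

If this raises the same "cuInit failed" error, the problem lies at the driver/device level (for example, the job landed on a node without a visible GPU) rather than in CryoSPARC itself.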

Dear Team,

Please find the answers below, with the corresponding command outputs.

  1. Which version of CryoSPARC are you using?

CryoSPARC v4.1.2

  2. Is the CryoSPARC instance set up as
  • a master host with a connected cluster
  3. Does the “cluster server” have a compatible GPU, compatible GPU driver and, if the CryoSPARC version is below 4.4, a compatible CUDA toolkit (guide)? What are the outputs of the commands

Yes, the cluster server has compatible GPUs (Kepler K40, Volta V100, Ampere A100).

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE…    Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    37W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE…    Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0    40W / 250W |      0MiB / 32768MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
./cryosparcw gpulist

Detected 2 CUDA devices.

   id     pci-bus       name

   0      0000:86:00.0  Tesla V100-PCIE-32GB
   1      0000:D8:00.0  Tesla V100-PCIE-32GB

  4. Additional information: when I try to manually add a test GPU node to the master node with the command:

./cryosparcw connect --worker <node_name> --master <master_node_name> --port <port_number> --nossd

Autodetecting available GPUs…
Traceback (most recent call last):
  File "bin/connect.py", line 221, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 91, in check_gpus
    num_devs = print_gpu_list()
  File "bin/connect.py", line 23, in print_gpu_list
    import pycuda.driver as cudrv
  File "/scratch/cc/vfaculty/varunj.vfaculty/cryosparcv4.1.2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcurand.so.10.0: cannot open shared object file: No such file or directory

I have attached the output of the required commands. Please assist us further in rectifying these problems so we can move ahead with the Introductory tutorial and process real data.

Thanks,
Varun Jha
HPC Team, IIT Delhi

Did you run this command on the GPU node? If so, please can you post the output of these commands, run on the same GPU node:

/path/to/cryosparc_worker/bin/cryosparcw call which nvcc
/path/to/cryosparc_worker/bin/cryosparcw call nvcc --version
/path/to/cryosparc_worker/bin/cryosparcw call python -c "import pycuda.driver; print(pycuda.driver.get_version())"

Is the directory /scratch/cc/vfaculty/varunj.vfaculty/cryosparcv4.1.2/cryosparc_worker shared between cluster nodes?

Did you consider:

  1. ensuring the NVIDIA driver version is ≥ 520 on all cluster GPU nodes
  2. updating to CryoSPARC v4.4, which bundles the CUDA toolkit and may thereby help avoid certain CUDA dependency mismatches
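
If the libcurand ImportError persists, one way to see which CUDA libraries the compiled pycuda extension cannot resolve is an ldd check along these lines (a sketch; cryosparcw call runs the command inside the worker environment, and the module path is looked up dynamically):

    /path/to/cryosparc_worker/bin/cryosparcw call bash -c 'ldd "$(python -c "import pycuda._driver as m; print(m.__file__)")" | grep "not found"'

Any line reporting libcurand.so.10.0 => not found would confirm that the CUDA 10 toolkit libraries are missing from the worker's library path, a class of mismatch the bundled toolkit in v4.4 is designed to avoid.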

Dear Team,

UPDATE:
We successfully integrated CryoSPARC v4.4 with our HPC cluster and could run the initial tutorials available on the CryoSPARC website.

While launching some jobs, we are currently facing the error below with some cryo-EM .mrc files during Patch Motion Correction.

Here is the error from job.log:

ValueError: 4294967295 is not a valid CUresult

BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/bioschool/irdstaff/ird600491/cryosparc_install/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

FYI, we are working with the GPU driver version given below:

| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |

Please help us rectify this issue in launching the jobs; your early response will be appreciated.

Thanks,
Varun Jha
HPC Team IIT Delhi

@varun_jha Please can you run this command after substituting the actual project and job IDs for the

  • Patch Motion Correction
  • upstream Import Movies

jobs:

cryosparcm cli "get_job('PX', 'JX', 'job_type', 'params_spec', 'version', 'instance_information')"
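
For example, with hypothetical project ID P3 and job ID J42, the call for the Patch Motion Correction job would look like:

    cryosparcm cli "get_job('P3', 'J42', 'job_type', 'params_spec', 'version', 'instance_information')"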