Multiple GPUs but always GPU 0 assigned

closed

#1

Hey guys,
I am currently struggling with a GPU assignment issue in CryoSPARC v2.11.0 on CentOS 7.
My standalone workstation has two GPUs:

cryosparcw gpulist
Detected 2 CUDA devices.

id  pci-bus       name
 0  0000:3B:00.0  GeForce RTX 2080 Ti
 1  0000:AF:00.0  GeForce RTX 2080 Ti

with
export CRYOSPARC_CUDA_PATH="/usr/local/cuda-10.1"

and both GPUs are enabled in cryosparcw connect

However, parallel jobs are always assigned to GPU ID 0 only, i.e. two refinement jobs that should run in parallel on IDs 0 and 1 both end up on ID 0. This leads to the following error message:

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 1481, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.process.work
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 87, in cryosparc2_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/gpuarray.py", line 549, in fill
    func = elementwise.get_fill_kernel(self.dtype)
  File "<decorator-gen-13>", line 2, in get_fill_kernel
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/elementwise.py", line 496, in get_fill_kernel
    "fill")
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/elementwise.py", line 161, in get_elwise_kernel
    arguments, operation, name, keep, options, **kwargs)
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/elementwise.py", line 147, in get_elwise_kernel_and_types
    keep, options, **kwargs)
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/elementwise.py", line 75, in get_elwise_module
    options=options, keep=keep)
  File "/Local/app/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/compiler.py", line 294, in __init__
    self.module = module_from_buffer(cubin)
LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered

Even if I connect only GPU ID 1, the job still gets assigned to GPU ID 0 and crashes with the same error message. I have tried rebooting the machine and restarting cryosparcm. If I connect only GPU ID 0, all jobs run fine. I can use GPU ID 1 with other programs, e.g. crYOLO and RELION, so I assume something is wrong in my CryoSPARC configuration. Everything runs fine on an identical workstation next to it.

Any help is greatly appreciated

Cheers,
Dan


#2

Do you have CUDA_VISIBLE_DEVICES set? I know the default cluster scripts set this (e.g. export CUDA_VISIBLE_DEVICES=0,1) but I don’t know whether it should be necessary on single-user installations.
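
One detail worth keeping in mind here (the helper function below is mine, not part of CryoSPARC; the renumbering rule itself is documented CUDA behavior): when CUDA_VISIBLE_DEVICES masks devices, the visible ones are re-enumerated from 0 inside the process. So a job that logs "GPU 0" under CUDA_VISIBLE_DEVICES=1 is actually running on physical GPU 1, which can make logs look like everything lands on GPU 0. A minimal sketch of the mapping:

```python
import os

def visible_to_physical(local_id, env=os.environ):
    """Map a process-local CUDA device index to the physical GPU index.

    CUDA_VISIBLE_DEVICES renumbering rule: the devices listed in the
    variable are re-enumerated from 0, in the order they appear.
    """
    vis = env.get("CUDA_VISIBLE_DEVICES")
    if vis is None or vis.strip() == "":
        return local_id  # no masking: local and physical indices coincide
    visible = [int(tok) for tok in vis.split(",") if tok.strip()]
    return visible[local_id]

# With CUDA_VISIBLE_DEVICES=1, "device 0" inside the process is physical GPU 1
print(visible_to_physical(0, {"CUDA_VISIBLE_DEVICES": "1"}))    # -> 1
print(visible_to_physical(1, {"CUDA_VISIBLE_DEVICES": "0,1"}))  # -> 1
```

This is only a possible source of confusion in the logs, of course, not an explanation for the illegal-memory-access crash itself.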


#3

Yeah, I had CUDA_VISIBLE_DEVICES set, but thanks for your answer!

It turned out the affected GPU fails several NVIDIA memory tests with cudaErrorIllegalAddress. I didn't notice this immediately because the error appears only intermittently.
I physically swapped the two GPUs and the problem travelled with the card, so it's a hardware fault and CryoSPARC did everything right. I will return the GPU.

Thanks for your help!