Job stuck on a particular GPU (GPU1)

Greetings.

I have a workstation with two GPUs (GPU0 and GPU1). Jobs run fine on GPU0, but remain stuck when queued to GPU1.

Here is the job log for the stuck job:

================= CRYOSPARCW ======= 2022-08-09 14:25:47.139386 =========
Project P7 Job J64
Master drake.structbio.pitt.edu Port 39002

========= monitor process now starting main process
MAINPROCESS PID 13867
MAIN PID 13867
extract.run cryosparc_compute.jobs.jobregister
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_worker/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
  File "/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1969, in get_gpu_info
    } for devid in devs ]
  File "/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1969, in <listcomp>
    } for devid in devs ]
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal


Process Process-1:1:
Traceback (most recent call last):
  File "/data/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/data/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 69, in process_pipeline_work
    process_params = process_setup(proc_idx) # do any setup you want on a per-process basis
  File "/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 384, in process_setup
    cuda_core.initialize([cuda_dev])
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 34, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal
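
For reference, this looks like the generic pycuda error for requesting a device index that the driver does not expose to the process. A minimal sketch, assuming only pycuda and nothing CryoSPARC-specific, that would fail the same way on a machine where just one GPU is visible:

import pycuda.driver as cuda

cuda.init()
print(cuda.Device.count())   # number of GPUs the driver exposes to this process

cuda.Device(0)               # succeeds as long as at least one device is visible
cuda.Device(1)               # raises pycuda._driver.LogicError:
                             #   cuDeviceGet failed: invalid device ordinal
                             # whenever only one device is enumerable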

@yodamoppet Please can you post the outputs of:
cryosparcm cli "get_scheduler_targets()"
and
/data/cryosparc/cryosparc_worker/bin/cryosparcw gpulist

Sure, the output is below. I find it interesting that GPU1 isn't listed, since I can see it in both nvidia-smi and in the CryoSPARC GUI.

./cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/data/cryosparc/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 2097086464, 'name': 'Quadro P620'}], 'hostname': 'drake.structbio.pitt.edu', 'lane': 'default', 'monitor_port': None, 'name': 'drake.structbio.pitt.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4]}, 'ssh_str': 'cryosparcuser@drake.structbio.pitt.edu', 'title': 'Worker node drake.structbio.pitt.edu', 'type': 'node', 'worker_bin_path': '/data/cryosparc/cryosparc_worker/bin/cryosparcw'}]

/data/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
Detected 1 CUDA devices.

id    pci-bus       name
 0    0000:03:00.0  Quadro P620

nvidia-smi
Wed Aug 10 15:58:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:03:00.0 Off |                  N/A |
| 34%   36C    P8    N/A /  N/A |    224MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:04:00.0 Off |                  N/A |
| 27%   31C    P8    21W / 250W |      1MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2250      G   /usr/bin/X                                   123MiB |
|    0      3740      G   /usr/bin/gnome-shell                          98MiB |
+-----------------------------------------------------------------------------+

GPUs can be hidden from CUDA applications in some environments, for example via the CUDA_VISIBLE_DEVICES environment variable.
What is the output from this sequence of commands:

eval $(/data/cryosparc/cryosparc_worker/bin/cryosparcw env)
env | grep CUDA

?
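
If that variable is set to "0" in the environment the worker inherits, CUDA enumerates only the first GPU, and any job scheduled to GPU 1 fails with exactly the invalid device ordinal error above. A minimal sketch of the effect, using plain pycuda rather than any CryoSPARC command:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # must be set before the CUDA driver initializes

import pycuda.driver as cuda

cuda.init()
print(cuda.Device.count())                 # 1: the second GPU is hidden from this process
print(cuda.Device(0).name())               # the only device pycuda can hand out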

Indeed, you are correct. It looks like someone set CUDA_VISIBLE_DEVICES in their startup script:

[root@drake ~]# env | grep CUDA
CUDA_VISIBLE_DEVICES=0

I corrected this and restarted CryoSPARC, and everything works now.

Thanks so much!