cryoSPARC v3.3.1 does not recognize all GPU cards

I have a GPU server with 2 RTX 8000 and 2 RTX 2080 Ti cards, but cryoSPARC only recognizes the 2 RTX 8000 cards.

I tried to reconnect the worker with:
./bin/cryosparcw connect --master G05 --worker G05 --port 39000 --ssdpath /ssd/cryosparc_cache/ --gpus 0,1,2,3
This fails with the error:
Traceback (most recent call last):
  File "bin/connect.py", line 221, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 95, in check_gpus
    assert all([v in range(num_devs) for v in gpu_devidxs]), "Some specified devices do not exist."
AssertionError: Some specified devices do not exist.

I followed the post "Adding GPU to system - #11 by stephan". The command there runs without errors, but still only 2 GPU cards are recognized.

What do I need to do to add 2 2080Ti cards to cryosparc?

Here are my steps to execute the Python command – Adding GPU to system - #11 by stephan

Welcome to the forum @saberli. Please can you post the output of this (slightly modified from above) command on the worker:
./bin/cryosparcw gpulist && nvidia-smi
and also let us know the cryoSPARC and CUDA (as configured for this cryoSPARC worker) versions?

./bin/cryosparcw gpulist
Detected 2 CUDA devices.

  id      pci-bus       name

   0      0000:18:00.0  Quadro RTX 8000
   1      0000:3B:00.0  Quadro RTX 8000

nvidia-smi
Sun Jan 23 19:20:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:18:00.0 Off |                  Off |
| 34%   30C    P8     8W / 260W |    863MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:3B:00.0 Off |                  Off |
| 35%   30C    P8     3W / 260W |    360MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:86:00.0 Off |                  N/A |
| 27%   31C    P8     1W / 250W |    352MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 27%   30C    P8    20W / 250W |    352MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

cryosparc version 3.3.1, cuda version 11.4

@saberli Please can you check the output of:
./bin/cryosparcw gpulist && echo XX${CUDA_VISIBLE_DEVICES}YY
executed as a single command?


./bin/cryosparcw gpulist
Detected 2 CUDA devices.

  id      pci-bus       name

   0      0000:18:00.0  Quadro RTX 8000
   1      0000:3B:00.0  Quadro RTX 8000

echo XX${CUDA_VISIBLE_DEVICES}YY
XXGPU-db332699-0b98-b109-6b9b-cfb1c5e2ae18,GPU-7a48ef4f-f511-ff5a-f7ff-9ff0ed5f56cd,-1,GPU-c90d3536-8343-38f9-4b55-30a10f983255,GPU-187653eb-8df0-8758-84a8-0f72c67f1771:YY

CUDA_VISIBLE_DEVICES shows that 4 GPUs are available.

@saberli $CUDA_VISIBLE_DEVICES in this case includes -1 in the third position (and, for some reason, ends with a colon ‘:’). Could the -1 (third) entry prevent recognition of the 4th and 5th entries of the device list?

I don’t quite understand what you mean. The server has only 4 GPU cards, and $CUDA_VISIBLE_DEVICES has already displayed 4 GPU cards.
Please see the picture below.

Very interesting.

According to the CUDA Toolkit documentation on CUDA_VISIBLE_DEVICES:

“…Only the devices whose index is present in the sequence are visible to CUDA applications and they are enumerated in the order of the sequence. If one of the indices is invalid, only the devices whose index precedes the invalid index are visible to CUDA applications. For example, setting CUDA_VISIBLE_DEVICES to 2,1 causes device 0 to be invisible and device 2 to be enumerated before device 1. Setting CUDA_VISIBLE_DEVICES to 0,2,-1,1 causes devices 0 and 2 to be visible and device 1 to be invisible.”
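To make the quoted rule concrete, here is a minimal Python sketch that applies it to the value posted earlier in this thread. It is a simplified model, not the driver's actual logic: `visible_entries` is a hypothetical helper, and it assumes an entry counts as valid if it is a non-negative integer or starts with `GPU-`, whereas the real driver also checks that the device exists.

```python
def visible_entries(value):
    """Return the entries of a CUDA_VISIBLE_DEVICES string that stay visible.

    Simplified model of the documented rule: scanning stops at the first
    invalid entry. "Invalid" here only means "neither a non-negative
    integer nor a GPU-<uuid> string" (an assumption for illustration).
    """
    visible = []
    for entry in value.split(","):
        entry = entry.strip()
        if entry.isdigit() or entry.startswith("GPU-"):
            visible.append(entry)
        else:
            break  # everything after the first invalid entry is ignored
    return visible


# Value copied from the echo output posted above (including the trailing ':').
posted = (
    "GPU-db332699-0b98-b109-6b9b-cfb1c5e2ae18,"
    "GPU-7a48ef4f-f511-ff5a-f7ff-9ff0ed5f56cd,-1,"
    "GPU-c90d3536-8343-38f9-4b55-30a10f983255,"
    "GPU-187653eb-8df0-8758-84a8-0f72c67f1771:"
)

print(visible_entries("0,2,-1,1"))   # the documentation's example: ['0', '2']
print(len(visible_entries(posted)))  # 2
```

Under this model only the first two UUIDs survive, which would match the "Detected 2 CUDA devices." line from gpulist.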

@saberli what happens if you try to reconnect with the option --gpus 0,1,3,4?
Or what happens if you repeat the steps with the Python command, but with other IDs, e.g. 0,1,3,4?

Might changing the variable $CUDA_VISIBLE_DEVICES help to get rid of the -1?

If I use the option --gpus 0,1,3,4, the same error is raised.

By the way, with version 3.2, all 4 GPU cards were recognized.
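The identical error for both index lists follows from the assertion shown in the traceback, which tests every requested index against range(num_devs). A minimal sketch of that test is below; `missing_devices` is a hypothetical helper (not part of cryoSPARC), and it assumes num_devs equals the count reported by gpulist.

```python
def missing_devices(gpu_devidxs, num_devs):
    """Return the requested indices that fall outside range(num_devs)."""
    return [v for v in gpu_devidxs if v not in range(num_devs)]


num_devs = 2  # assumption: taken from "Detected 2 CUDA devices."

for requested in ([0, 1, 2, 3], [0, 1, 3, 4], [0, 1]):
    # A non-empty result corresponds to the AssertionError in the traceback.
    print(requested, "->", missing_devices(requested, num_devs))
```

With only two devices visible, any index above 1 fails the check, so changing the requested indices cannot help until more devices become visible to CUDA.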

@saberli Have you also changed the definition of the CUDA_VISIBLE_DEVICES variable? According to the CUDA_VISIBLE_DEVICES setting you posted earlier, 2 GPUs will (on purpose or by accident) be “hidden” from cryoSPARC.

Very good, after setting CUDA_VISIBLE_DEVICES, cryosparc can recognize 4 cards.
Thank you so much.
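The thread does not record which value was finally set. As one possible approach, the Python sketch below builds a cleaned value from the one posted earlier by dropping the -1 entry and the trailing colon. `cleaned_device_list` is a hypothetical helper, and treating the trailing ':' as stray is an assumption.

```python
def cleaned_device_list(value):
    """Keep only entries that look like device indices or GPU-<uuid> strings."""
    kept = []
    for entry in value.split(","):
        entry = entry.strip().rstrip(":")  # assumption: a trailing ':' is stray
        if entry.isdigit() or entry.startswith("GPU-"):
            kept.append(entry)
    return ",".join(kept)


# Value copied from the echo output posted earlier in the thread.
posted = (
    "GPU-db332699-0b98-b109-6b9b-cfb1c5e2ae18,"
    "GPU-7a48ef4f-f511-ff5a-f7ff-9ff0ed5f56cd,-1,"
    "GPU-c90d3536-8343-38f9-4b55-30a10f983255,"
    "GPU-187653eb-8df0-8758-84a8-0f72c67f1771:"
)

# Print a shell line that could be pasted into the worker's environment.
print("export CUDA_VISIBLE_DEVICES=" + cleaned_device_list(posted))
```

It is also worth finding where the variable is defined (for example a shell profile or a scheduler prolog), since a value set there would be reapplied on the next login.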
