No GPU available failed to get GPU info

dotan · September 30, 2024, 9:23pm

Hi, I need help figuring out why the GPUs are not available for cryoSPARC. I read the previous posts on the same topic but couldn’t identify the cause myself. I also tried to upgrade cryoSPARC to 4.6, reconnect the worker, and restart cryoSPARC, but nothing worked.

The errors from the output of “cryosparcm log command_core” are pasted below.

2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    | Failed to get GPU info on lab3.pharm.sunysb.edu
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    | Traceback (most recent call last):
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    |   File "/mnt/data0/cryosparc_v2/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1520, in get_gpu_info_run
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    |     value = subprocess.check_output(full_command, stderr=subprocess.STDOUT, shell=shell, timeout=JOB_LAUNCH_TIMEOUT_SECONDS).decode()
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    |   File "/mnt/data0/cryosparc_v2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    |   File "/mnt/data0/cryosparc_v2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    |     raise CalledProcessError(retcode, process.args,
2024-09-30 16:10:41,193 get_gpu_info_run     ERROR    | subprocess.CalledProcessError: Command '['bash -c "eval $(/mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw env); python /mnt/data0/cryosparc_v2/cryosparc2_worker/cryosparc_compute/get_gpu_info.py"']' returned non-zero exit status 2.
2024-09-30 16:10:51,176 update_all_job_sizes_run INFO     | Finished updating all job sizes (0 jobs updated, 0 projects updated)
........
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    | Failed to get GPU info on llab3.pharm.sunysb.edu
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    | Traceback (most recent call last):
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    |   File "/mnt/data0/cryosparc_v2/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1520, in get_gpu_info_run
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    |     value = subprocess.check_output(full_command, stderr=subprocess.STDOUT, shell=shell, timeout=JOB_LAUNCH_TIMEOUT_SECONDS).decode()
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    |   File "/mnt/data0/cryosparc_v2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    |   File "/mnt/data0/cryosparc_v2/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    |     raise CalledProcessError(retcode, process.args,
2024-09-30 16:25:23,647 get_gpu_info_run     ERROR    | subprocess.CalledProcessError: Command '['bash -c "eval $(/mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw env); python /mnt/data0/cryosparc_v2/cryosparc2_worker/cryosparc_compute/get_gpu_info.py"']' returned non-zero exit status 2.

The output of cryosparcm cli "get_scheduler_targets()" is below.
[{‘cache_path’: ‘/home/xxx/cryosparc2_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 100urce_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], ‘GPU’: [title’: ‘Worker node lab3.pharm.sunysb.edu’, ‘type’: ‘node’, ‘worker_bin_path’: '/mnt/d

This is a single workstation with cryoSPARC version: v4.6.0

Thanks in advance!

wtempel · October 1, 2024, 2:31pm

The output from get_scheduler_targets() seems to be truncated or otherwise corrupted.
Could you please run the command again and post its output?
Please also post the output of these commands:

hostname -f
nvidia-smi

dotan · October 1, 2024, 3:22pm

Here it is.

cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/home/hng/cryosparc2_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'xxxlab3.pharm.sunysb.edu', 'lane': 'default', 'name': 'xxxlab3.pharm.sunysb.edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'xxx@xxxlab3.pharm.sunysb.edu', 'title': 'Worker node xxxlab3.pharm.sunysb.edu', 'type': 'node', 'worker_bin_path': '/mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw'}]

hostname -f
dhcp-129-49-144-26

nvidia-smi
Tue Oct 1 11:20:49 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   37C    P8    10W / 250W |    139MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   37C    P8    11W / 250W |      2MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   39C    P8    11W / 250W |      2MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:84:00.0 Off |                  N/A |
| 23%   39C    P8    11W / 250W |      2MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1825      G   /usr/bin/X                         20MiB |
|    0   N/A  N/A     17380      G   /usr/bin/X                         57MiB |
|    0   N/A  N/A     17436      G   /usr/bin/gnome-shell               57MiB |
+-----------------------------------------------------------------------------+

wtempel · October 2, 2024, 6:08pm

Thanks @dotan.
What is the output of the following commands?

/mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw gpulist
cat /mnt/data0/cryosparc_v2/cryosparc2_worker/version
ls -l /mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw

dotan · October 2, 2024, 8:18pm

Detected 4 CUDA devices.

  id  pci-bus  name

   0        2  NVIDIA GeForce GTX 1080 Ti
   1        3  NVIDIA GeForce GTX 1080 Ti
   2      131  NVIDIA GeForce GTX 1080 Ti
   3      132  NVIDIA GeForce GTX 1080 Ti

cat /mnt/data0/cryosparc_v2/cryosparc_worker/version
v4.6.0

ls -l /mnt/data0/cryosparc_v2/cryosparc_worker/bin/cryosparcw
-rwxr-xr-x. 1 hng hng 14496 Sep 10 10:34 /mnt/data0/cryosparc_v2/cryosparc_worker/bin/cryosparc

wtempel · October 2, 2024, 9:39pm

Thanks @dotan. Could you please also try the following in a fresh shell:

/mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw call python /mnt/data0/cryosparc_v2/cryosparc2_worker/cryosparc_compute/get_gpu_info.py

dotan · October 2, 2024, 10:20pm

Here is the output.

[{"id": 0, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714756608}, {"id": 1, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714887680}, {"id": 2, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714887680}, {"id": 3, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714887680}]
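
As a cross-check, the GPU list printed by get_gpu_info.py can be compared against the GPU slots registered in the scheduler target posted earlier in this thread. The following is only a minimal Python sketch, not part of CryoSPARC: the helper name gpu_ids_match is hypothetical, and both inputs are copied by hand from the outputs above (with straight quotes so the JSON parses).

```python
import json


def gpu_ids_match(gpu_info_json: str, registered_slots: list) -> bool:
    """Return True if the GPU ids reported in the JSON equal the registered slots.

    Hypothetical helper for illustration; CryoSPARC does not ship this function.
    """
    reported = sorted(gpu["id"] for gpu in json.loads(gpu_info_json))
    return reported == sorted(registered_slots)


# Output of get_gpu_info.py as posted in this thread.
gpu_info_json = (
    '[{"id": 0, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714756608}, '
    '{"id": 1, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714887680}, '
    '{"id": 2, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714887680}, '
    '{"id": 3, "name": "NVIDIA GeForce GTX 1080 Ti", "mem": 11714887680}]'
)

# 'GPU' entry of resource_slots from the get_scheduler_targets() output.
registered_gpu_slots = [0, 1, 2, 3]

if __name__ == "__main__":
    for gpu in json.loads(gpu_info_json):
        # "mem" is reported in bytes; convert to GiB for readability.
        print(f'{gpu["id"]}: {gpu["name"]}, {gpu["mem"] / 2**30:.2f} GiB')
    print("matches scheduler target:",
          gpu_ids_match(gpu_info_json, registered_gpu_slots))
```

With the values from this thread, the script run by hand reports the same four GPU ids that the target has registered, which suggests the worker environment itself can see the GPUs when the script is invoked directly.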

wtempel · October 11, 2024, 3:09pm

@dotan May I ask a few more questions?

1. Is this a “standalone” instance (CryoSPARC master and worker combined on a single host)?

2. dotan wrote: “why the GPUs are not available for cryoSPARC.”
   How does this unavailability manifest itself? What error messages do you see, and in which log?

3. dotan wrote: “dhcp-129-49-144-26”
   Is this hostname assigned persistently to the CryoSPARC master host, that is, will the hostname stay the same across reboots? Could it be a problem that the hostname matches neither the ssh_str nor the hostname value of the scheduler target configuration?

4. dotan wrote: “reconnect the worker”
   Did you remove the target node (remove_scheduler_target_node() or remove_scheduler_lane()) prior to reconnection?

5. Could you please send us the tgz file created with the command cryosparcm snaplogs? I will send you a private message about the email address.
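
The hostname question above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the helper name hostname_matches_target is hypothetical (it is not a CryoSPARC function), the target values are copied from the redacted get_scheduler_targets() output in this thread, and socket.getfqdn() is used as a stand-in for hostname -f, which may not return the same string on every system.

```python
import socket


def hostname_matches_target(fqdn: str, target: dict) -> bool:
    """Return True if fqdn equals the target's hostname or the host part of ssh_str.

    Hypothetical helper for illustration only.
    """
    # "user@host" -> "host"; a string without "@" is returned unchanged.
    ssh_host = target.get("ssh_str", "").rpartition("@")[2]
    return fqdn in {target.get("hostname"), ssh_host}


# Relevant fields of the scheduler target, redacted as posted in this thread.
target = {
    "hostname": "xxxlab3.pharm.sunysb.edu",
    "ssh_str": "xxx@xxxlab3.pharm.sunysb.edu",
}

if __name__ == "__main__":
    # Hostname reported by `hostname -f` in this thread.
    print("posted hostname matches:",
          hostname_matches_target("dhcp-129-49-144-26", target))
    # Hostname of the machine this script runs on.
    print("local hostname matches:",
          hostname_matches_target(socket.getfqdn(), target))
```

A mismatch here does not prove that the hostname is the cause of the GPU error; it only confirms that the name the machine reports differs from the name stored in the target configuration.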

dotan · October 11, 2024, 7:59pm

Hi,

1. Yes, it is a single workstation.
2. In the cryoSPARC Builder tab, it shows “No GPUs available” like this.
   [Screenshot 2024-10-11 at 12.13.09 PM]
3. When I installed cryoSPARC, the hostname was set to “phmtanlab3.pharm.sunysb.edu”. dhcp-129-49-144-26 contains the IP address of the machine, but I am not sure why that is the hostname.
4. No.
5. I will send you the tgz file momentarily.

Thank you!

wtempel · October 11, 2024, 8:44pm

You may want to reach out to your IT support team and ask for instructions on how to ensure a persistent hostname that can be resolved by other computers on your lab’s network. This would be particularly important if you ever wanted to add an additional GPU worker node to this CryoSPARC instance. If your IT support can assign a permanent hostname, but cannot preserve the phmtanlab3.pharm.sunysb.edu hostname, some reconfiguration of your CryoSPARC installation may be required. If you have any questions in this regard, please feel free to post them on this forum.
For now, you may try these commands (these commands apply only to this particular CryoSPARC installation in its current, unusual state):

cryosparcm cli "remove_scheduler_lane('default')"
/mnt/data0/cryosparc_v2/cryosparc2_worker/bin/cryosparcw connect --worker phmtanlab3.pharm.sunysb.edu --master 127.0.0.1 --port 39000 --ssdpath /home/hng/cryosparc2_cache

Does this help?

dotan · October 12, 2024, 2:15pm

This is very helpful! Thank you very much!