Failed to get GPU info

The log file is probably too large to attach. I think the cause may be that the worker was not properly connected, so I reconnected it with cryosparcw connect --worker localhost --master localhost --port 39000. Most of my jobs were imported, though some with errors, but I still could not run any jobs that need a GPU, and the command_core log kept reporting GPU errors. The reconnect command I used and the relevant errors are pasted below:
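For reference, this is the reconnect command I ran (using the cryosparcw script under /app/apps/rhel7/cryosparc/cryosparc2_worker/bin; the hostnames and port are specific to my setup):

cryosparcw connect --worker localhost --master localhost --port 39000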

 2023-06-21 12:01:03,676 COMMAND.SCHEDULER    get_gpu_info         INFO     | UPDATING WORKER GPU INFO
2023-06-21 12:01:03,677 COMMAND.JOBS         update_all_job_sizes INFO     | UPDATING ALL JOB SIZES IN 10s
2023-06-21 12:01:03,678 COMMAND.DATA         export_all_projects  INFO     | EXPORTING ALL PROJECTS IN 60s...
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    | Failed to get GPU info on worker.cryosparc.localhost.com
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    | Traceback (most recent call last):
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    |   File "/app/apps/rhel7/cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 1173, in get_gpu_info_run
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    |     value = subprocess.check_output(full_command, stderr=subprocess.STDOUT, shell=shell).decode()
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    |   File "/app/apps/rhel7/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    |     **kwargs).stdout
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    |   File "/app/apps/rhel7/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    |     output=stdout, stderr=stderr)
2023-06-21 12:01:03,732 COMMAND.SCHEDULER    get_gpu_info_run     ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cryosparc@worker.cryosparc.localhost.com', 'bash -c "eval $(/app/apps/rhel7/cryosparc/cryosparc2_worker/bin/cryosparcw env); timeout 30 python /app/apps/rhel7/cryosparc/cryosparc2_worker/cryosparc_compute/get_gpu_info.py"']' returned non-zero exit status 255.
2023-06-21 12:01:13,985 COMMAND.DATA         dump_project         INFO     | Exporting project P1
2023-06-21 12:01:13,987 COMMAND.DATA         dump_project         INFO     | Exported project P1 to /DATA01/cryosparc_hz/P1/project.json in 0.00s
2023-06-21 12:01:13,990 COMMAND.DATA         dump_project         INFO     | Exporting project P2
2023-06-21 12:01:13,992 COMMAND.DATA         dump_project         INFO     | Exported project P2 to /DATA01/cryosparc_hz/P48/project.json in 0.00s
2023-06-21 12:01:14,044 COMMAND.DATA         dump_project         INFO     | Exporting project P3
2023-06-21 12:01:14,046 COMMAND.DATA         dump_project         INFO     | Exported project P3 to /data2/cryosparc_home/P13/project.json in 0.00s
2023-06-21 12:01:14,073 COMMAND.DATA         dump_project         INFO     | Exporting project P4
2023-06-21 12:01:14,075 COMMAND.DATA         dump_project         INFO     | Exported project P4 to /data2/cryosparc_home/P25/project.json in 0.00s
2023-06-21 12:01:14,082 COMMAND.DATA         dump_project         INFO     | Exporting project P5
2023-06-21 12:01:14,124 COMMAND.DATA         dump_project         INFO     | Exported project P5 to /DATA01/cryosparc_hz/P48/project.json in 0.04s
2023-06-21 12:06:03,419 COMMAND.MAIN         start                INFO     |  === EXITED === 
2023-06-21 12:06:04,467 COMMAND.MAIN         start                INFO     |  === STARTED === 
2023-06-21 12:06:04,469 COMMAND.BG_WORKER    background_worker    INFO     |  === STARTED === 
2023-06-21 12:06:04,469 COMMAND.CORE         run                  INFO     | === STARTED TASKS WORKER ===
 * Serving Flask app "command_core" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off

@haomingz I have opened a new topic for this particular error in your command_core log.
If you have not yet resolved this worker connection issue, please provide the following information:

  1. Does this CryoSPARC instance include GPU computers that are separate from the computer that runs cryosparc_master processes?
  2. What is the output of the command
    cryosparcm status | grep HOSTNAME
    

Hi Wtempel,

  1. They are not separate; the four GPUs are part of the computer that runs the cryosparc_master processes.
  2. The output of cryosparcm status | grep HOSTNAME is: export CRYOSPARC_MASTER_HOSTNAME="localhost"

Thanks!

Thanks. Could you please also post the output of the following command?

cryosparcm cli "get_scheduler_targets()"


My fault. Could you please try again with the modified command above (where I added quotes)?

Here it comes:

[cryosparc@lnx00013 run]$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/data2/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 50953846784, 'name': 'Quadro RTX 8000'}, {'id': 1, 'mem': 50950963200, 'name': 'Quadro RTX 8000'}, {'id': 2, 'mem': 50953846784, 'name': 'Quadro RTX 8000'}, {'id': 3, 'mem': 50953846784, 'name': 'Quadro RTX 8000'}], 'hostname': 'worker.cryosparc.localhost.com', 'lane': 'default', 'monitor_port': None, 'name': 'worker.cryosparc.localhost.com', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'cryosparc@worker.cryosparc.localhost.com', 'title': 'Worker node worker.cryosparc.localhost.com', 'type': 'node', 'worker_bin_path': '/app/apps/rhel7/cryosparc/cryosparc2_worker/bin/cryosparcw'}, {'cache_path': '/data2/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 50953846784, 'name': 'Quadro RTX 8000'}, {'id': 1, 'mem': 50950963200, 'name': 'Quadro RTX 8000'}, {'id': 2, 'mem': 50953846784, 'name': 'Quadro RTX 8000'}, {'id': 3, 'mem': 50953846784, 'name': 'Quadro RTX 8000'}], 'hostname': 'localhost', 'lane': 'default', 'monitor_port': None, 'name': 'localhost', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, 'ssh_str': 'cryosparc@localhost', 'title': 'Worker node localhost', 'type': 'node', 'worker_bin_path': '/app/apps/rhel7/cryosparc/cryosparc2_worker/bin/cryosparcw'}]

According to this output, there are two connected workers:

  1. worker.cryosparc.localhost.com
  2. localhost
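(As a side note, this single-line output is easier to inspect if you pretty-print it, for example with a small Python one-liner. This is just a sketch and assumes the cli subcommand prints only the Python literal shown above:)

cryosparcm cli "get_scheduler_targets()" | python3 -c 'import ast, pprint, sys; pprint.pprint(ast.literal_eval(sys.stdin.read()))'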

Based on your answers above (the GPUs are inside the computer that runs the cryosparc_master processes, and CRYOSPARC_MASTER_HOSTNAME is set to localhost), both targets refer to the same computer. I would suggest you remove the worker.cryosparc.localhost.com record (details):

cryosparcm cli "remove_scheduler_target_node('worker.cryosparc.localhost.com')"
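Afterwards, you can confirm that only the localhost target remains by re-running the command from earlier in this thread:

cryosparcm cli "get_scheduler_targets()"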

Yes, this seems to have fixed the GPU problem. Thank you very much!