Hi CryoSPARC team,
I am trying to add a new workstation as a worker to our existing single-workstation instance. My aim is to use the current workstation (workstation1) as the master and first worker, and the new workstation (workstation2) as the second worker.
Both workstations have storage pools for project directories (/data1 on workstation1, /data2 on workstation2). I cross-mounted these pools over NFS so each machine can see both (e.g. on workstation1, both /data1 and /data2 are available).
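For reference, the cross-mounts are along these lines (the hostnames and mount options here are just placeholders for illustration; the real exports may use different options):
# on workstation1: mount workstation2's pool
sudo mount -t nfs workstation2:/data2 /data2
# on workstation2: mount workstation1's pool
sudo mount -t nfs workstation1:/data1 /data1
So both project paths resolve to the same data on either machine.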
I then installed the worker package on workstation2 following the guide: set up SSH access, tested the ports (ufw is inactive on both systems), and installed the package under /home/cryosparc/cryosparc_worker. Then I ran cryosparcw connect (a rough sketch of the command is included after the output below); the output is:
Final configuration for Workstation2_IP
cache_path : /ssd
cache_quota_mb : None
cache_reserve_mb : 10000
desc : None
gpus : [{'id': 0, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 1, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 2, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 3, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 4, 'mem': 4090626048, 'name': 'NVIDIA T400 4GB'}]
hostname : Workstation1_IP
lane : hostname
monitor_port : None
name : Workstation1_IP
resource_fixed : {'SSD': True}
resource_slots : {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}
ssh_str : user@Workstation2_IP
title : Worker node Workstation2_IP
type : node
worker_bin_path : /home/cryosparc/cryosparc_worker/bin/cryosparcw
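For completeness, the connect command was roughly of this form (I am quoting the standard worker-connect flags rather than my exact shell history, and the hostnames are the same redacted placeholders as above):
/home/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker <Workstation2_IP> \
    --master <master hostname> \
    --port 39000 \
    --ssdpath /ssd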
I do see the new lane added after this. But when I launched a job on the new lane, it stopped at:
License is valid.
Launching job on lane workstation2 target Workstation2_IP ...
Running job on remote worker node hostname Workstation2_IP
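(For what it's worth, the port test mentioned above was essentially the following, run from workstation2 against the master; the two ports are the web and command_core ports that appear in the logs, and both were reachable:
nc -zv spgpu 39000
nc -zv spgpu 39002
)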
Then I ran cryosparcm log command_core on the master (workstation1) and got:
2024-04-28 19:44:39,551 dump_job_database INFO | Request to export P29 J2
2024-04-28 19:44:39,559 dump_job_database INFO | Exporting job to /data2/yw/CS-workstation-test/J2
2024-04-28 19:44:39,560 dump_job_database INFO | Exporting all of job's images in the database to /data2/yw/CS-workstation-test/J2/gridfs_data...
2024-04-28 19:44:39,561 dump_job_database INFO | Done. Exported 0 images in 0.00s
2024-04-28 19:44:39,561 dump_job_database INFO | Exporting all job's streamlog events...
2024-04-28 19:44:39,625 scheduler_run_core INFO | Running...
2024-04-28 19:44:39,625 scheduler_run_core INFO | Jobs Queued: [('P29', 'J2')]
2024-04-28 19:44:39,629 scheduler_run_core INFO | Licenses currently active : 0
2024-04-28 19:44:39,629 scheduler_run_core INFO | Now trying to schedule J2
2024-04-28 19:44:39,629 scheduler_run_job INFO | Scheduling job to Workstation2_IP
2024-04-28 19:44:39,692 dump_job_database INFO | Done. Exported 1 files in 0.13s
2024-04-28 19:44:39,693 dump_job_database INFO | Exporting job metafile...
2024-04-28 19:44:39,722 dump_job_database INFO | Done. Exported in 0.03s
2024-04-28 19:44:39,723 dump_job_database INFO | Updating job manifest...
2024-04-28 19:44:39,738 dump_job_database INFO | Done. Updated in 0.02s
2024-04-28 19:44:39,738 dump_job_database INFO | Exported P29 J2 in 0.19s
2024-04-28 19:44:39,740 run INFO | Completed task in 0.18901276588439941 seconds
2024-04-28 19:44:40,695 scheduler_run_job INFO | Not a commercial instance - heartbeat set to 12 hours.
2024-04-28 19:44:40,761 scheduler_run_job INFO | Launchable! -- Launching.
2024-04-28 19:44:40,768 set_job_status INFO | Status changed for P29.J2 from queued to launched
2024-04-28 19:44:40,769 app_stats_refresh INFO | Calling app stats refresh url http://spgpu:39000/api/actions/stats/refresh_job for project_uid P29, workspace_uid None, job_uid J2 with body {'projectUid': 'P29', 'jobUid': 'J2'}
2024-04-28 19:44:40,776 app_stats_refresh INFO | code 200, text {"success":true}
2024-04-28 19:44:40,799 run_job INFO | Running P29 J2
2024-04-28 19:44:40,800 run_job INFO | Running job using: /home/cryosparc/cryosparc_worker/bin/cryosparcw
2024-04-28 19:44:40,800 run_job INFO | Running job on remote worker node hostname Workstation2_IP
2024-04-28 19:44:40,802 run_job INFO | cmd: bash -c "nohup /home/cryosparc/cryosparc_worker/bin/cryosparcw run --project P29 --job J2 --master_hostname spgpu --master_command_core_port 39002 > /data2/yw/CS-workstation-test/J2/job.log 2>&1 & "
2024-04-28 19:44:41,419 run_job INFO |
2024-04-28 19:44:41,419 scheduler_run_core INFO | Finished
Then I manually ran the same command on the worker (workstation2): /home/cryosparc/cryosparc_worker/bin/cryosparcw run --project P29 --job J2 --master_hostname spgpu --master_command_core_port 39002, and got:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "cryosparc_master/cryosparc_compute/run.py", line 177, in cryosparc_master.cryosparc_compute.run.run
File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 134, in connect
cli.test_authentication(project_uid, job_uid)
File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 121, in func
raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://spgpu:39002, code 400) Encountered ServerError from JSONRPC function "test_authentication" with params ('P29', 'J2'):
ServerError: P29 J2 does not exist.
Traceback (most recent call last):
File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 155, in wrapper
res = func(*args, **kwargs)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 658, in test_authentication
job_status = get_job_status(project_uid, job_uid)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 186, in wrapper
return func(*args, **kwargs)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 7421, in get_job_status
return get_job(project_uid, job_uid, 'status')['status']
File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 186, in wrapper
return func(*args, **kwargs)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 5980, in get_job
raise ValueError(f"{project_uid} {job_uid} does not exist.")
ValueError: P29 J2 does not exist.
What could be the issue?
Thank you for your help!