Hi CryoSPARC team,
I am trying to add a new workstation as a worker to our existing single-workstation instance. My aim is to use the current workstation (workstation1) as the master and first worker, and the new workstation (workstation2) as the second worker.
Both workstations have storage pools for project directories (/data1 on workstation1, /data2 on workstation2). I cross-mounted these pools over NFS so each machine can see both (e.g. on workstation1, both /data1 and /data2 are available).
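For reference, the cross-mounts are along these lines (the hostnames and mount options here are just placeholders for illustration; the real exports may use different options):
# on workstation1: mount workstation2's pool
sudo mount -t nfs workstation2:/data2 /data2
# on workstation2: mount workstation1's pool
sudo mount -t nfs workstation1:/data1 /data1
So both project paths resolve to the same data on either machine.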
I then installed the worker package on workstation2 following the guide: set up SSH access, tested the ports (ufw is inactive on both systems), and installed the package under /home/cryosparc/cryosparc_worker. Then I ran cryosparcw connect (a rough sketch of the command is included after the output below); the output is:
Final configuration for Workstation2_IP
cache_path : /ssd
cache_quota_mb : None
cache_reserve_mb : 10000
desc : None
gpus : [{'id': 0, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 1, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 2, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 3, 'mem': 51010207744, 'name': 'NVIDIA RTX 6000 Ada Generation'}, {'id': 4, 'mem': 4090626048, 'name': 'NVIDIA T400 4GB'}]
hostname : Workstation1_IP
lane : hostname
monitor_port : None
name : Workstation1_IP
resource_fixed : {'SSD': True}
resource_slots : {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]}
ssh_str : user@Workstation2_IP
title : Worker node Workstation2_IP
type : node
worker_bin_path : /home/cryosparc/cryosparc_worker/bin/cryosparcw
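For completeness, the connect command was roughly of this form (I am quoting the standard worker-connect flags rather than my exact shell history, and the hostnames are the same redacted placeholders as above):
/home/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker <Workstation2_IP> \
    --master <master hostname> \
    --port 39000 \
    --ssdpath /ssd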
I do see the new lane added after this. But when I launched a job on the new lane, it stopped at:
License is valid.
Launching job on lane workstation2 target Workstation2_IP ...
Running job on remote worker node hostname Workstation2_IP
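(For what it's worth, the port test mentioned above was essentially the following, run from workstation2 against the master; the two ports are the web and command_core ports that appear in the logs, and both were reachable:
nc -zv spgpu 39000
nc -zv spgpu 39002
)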
Then I ran cryosparcm log command_core on the master (workstation1) and got:
2024-04-28 19:44:39,551 dump_job_database INFO | Request to export P29 J2
2024-04-28 19:44:39,559 dump_job_database INFO | Exporting job to /data2/yw/CS-workstation-test/J2
2024-04-28 19:44:39,560 dump_job_database INFO | Exporting all of job's images in the database to /data2/yw/CS-workstation-test/J2/gridfs_data...
2024-04-28 19:44:39,561 dump_job_database INFO | Done. Exported 0 images in 0.00s
2024-04-28 19:44:39,561 dump_job_database INFO | Exporting all job's streamlog events...
2024-04-28 19:44:39,625 scheduler_run_core INFO | Running...
2024-04-28 19:44:39,625 scheduler_run_core INFO | Jobs Queued: [('P29', 'J2')]
2024-04-28 19:44:39,629 scheduler_run_core INFO | Licenses currently active : 0
2024-04-28 19:44:39,629 scheduler_run_core INFO | Now trying to schedule J2
2024-04-28 19:44:39,629 scheduler_run_job INFO | Scheduling job to Workstation2_IP
2024-04-28 19:44:39,692 dump_job_database INFO | Done. Exported 1 files in 0.13s
2024-04-28 19:44:39,693 dump_job_database INFO | Exporting job metafile...
2024-04-28 19:44:39,722 dump_job_database INFO | Done. Exported in 0.03s
2024-04-28 19:44:39,723 dump_job_database INFO | Updating job manifest...
2024-04-28 19:44:39,738 dump_job_database INFO | Done. Updated in 0.02s
2024-04-28 19:44:39,738 dump_job_database INFO | Exported P29 J2 in 0.19s
2024-04-28 19:44:39,740 run INFO | Completed task in 0.18901276588439941 seconds
2024-04-28 19:44:40,695 scheduler_run_job INFO | Not a commercial instance - heartbeat set to 12 hours.
2024-04-28 19:44:40,761 scheduler_run_job INFO | Launchable! -- Launching.
2024-04-28 19:44:40,768 set_job_status INFO | Status changed for P29.J2 from queued to launched
2024-04-28 19:44:40,769 app_stats_refresh INFO | Calling app stats refresh url http://spgpu:39000/api/actions/stats/refresh_job for project_uid P29, workspace_uid None, job_uid J2 with body {'projectUid': 'P29', 'jobUid': 'J2'}
2024-04-28 19:44:40,776 app_stats_refresh INFO | code 200, text {"success":true}
2024-04-28 19:44:40,799 run_job INFO | Running P29 J2
2024-04-28 19:44:40,800 run_job INFO | Running job using: /home/cryosparc/cryosparc_worker/bin/cryosparcw
2024-04-28 19:44:40,800 run_job INFO | Running job on remote worker node hostname Workstation2_IP
2024-04-28 19:44:40,802 run_job INFO | cmd: bash -c "nohup /home/cryosparc/cryosparc_worker/bin/cryosparcw run --project P29 --job J2 --master_hostname spgpu --master_command_core_port 39002 > /data2/yw/CS-workstation-test/J2/job.log 2>&1 & "
2024-04-28 19:44:41,419 run_job INFO |
2024-04-28 19:44:41,419 scheduler_run_core INFO | Finished
Then I manually ran the same command on the worker (workstation2): /home/cryosparc/cryosparc_worker/bin/cryosparcw run --project P29 --job J2 --master_hostname spgpu --master_command_core_port 39002, and got:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "cryosparc_master/cryosparc_compute/run.py", line 177, in cryosparc_master.cryosparc_compute.run.run
File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 134, in connect
cli.test_authentication(project_uid, job_uid)
File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 121, in func
raise CommandError(
cryosparc_tools.cryosparc.errors.CommandError: *** (http://spgpu:39002, code 400) Encountered ServerError from JSONRPC function "test_authentication" with params ('P29', 'J2'):
ServerError: P29 J2 does not exist.
Traceback (most recent call last):
File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 155, in wrapper
res = func(*args, **kwargs)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 658, in test_authentication
job_status = get_job_status(project_uid, job_uid)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 186, in wrapper
return func(*args, **kwargs)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 7421, in get_job_status
return get_job(project_uid, job_uid, 'status')['status']
File "/opt/cryosparc/cryosparc_master/cryosparc_command/commandcommon.py", line 186, in wrapper
return func(*args, **kwargs)
File "/opt/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 5980, in get_job
raise ValueError(f"{project_uid} {job_uid} does not exist.")
ValueError: P29 J2 does not exist.
What could be the issue?
Thank you for your help!