I’m trying to add another workstation as a worker to an existing cryoSPARC 2 installation. The new worker has all the required NFS mounts (including the cryosparc2 worker installation and cache), and password-less SSH works in both directions. The worker has no SSD cache of its own and is configured as its own lane.
If I launch a job on this lane without disabling the SSD cache, the job stays queued forever and command_core.log shows that it is hung because it can’t cache. If I then re-launch the job with the SSD cache disabled, the job status changes to “started” but the job never advances. The only error in command_core.log is “failed to connect link”:
Setting parameter J22.compute_use_ssd with value False of type <type 'bool'>
---------- Scheduler running ---------------
Lane chihiro node : Jobs Queued (nonpaused, inputs ready): [u'J22']
Total slots: {u'chihiro': {u'GPU': set([0, 1]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7])}}
Available slots: {u'chihiro': {u'GPU': set([0, 1]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7])}}
Available licen: 10000
Now trying to schedule J22
Need slots : {u'GPU': 2, u'RAM': 3, u'CPU': 2}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on chihiro
Launchable: True
Alloc slots : {u'GPU': [0, 1], u'RAM': [0, 1, 2], u'CPU': [0, 1]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
---- Running project UID P28 job UID J22
failed to connect link
License Data:
License Signature:
Running job on worker type node
Running job using: /mnt/foo/cryosparc2/cryosparc2_worker/bin/cryosparcw
Running job on remote worker node hostname chihiro
cmd: bash -c "nohup /mnt/chihiro-data/cryosparc2/cryosparc2_worker /bin/cryosparcw run --project P28 --job J22 --master_hostname ishtar --master_command_core_port 39002 > /mnt/foo/cryoSPARC/P28/J22/job.log 2>&1 & "
Changed job P28.J22 status started
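
The launch command in the log passes --master_hostname ishtar and --master_command_core_port 39002 to the worker, so one thing I can check from the worker side is whether that host and port are reachable at all. Below is a minimal sketch of such a check in plain Python. The helper name can_connect is my own (hypothetical, not part of cryoSPARC), and a successful TCP connection only shows that the port is reachable, not that the job will run. I am also not assuming that this is what “failed to connect link” refers to.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for a quick reachability check; it does not speak
    any cryoSPARC protocol, it only opens and closes a TCP socket.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example, to be run on the worker (hostname and port taken from the log):
#   print(can_connect("ishtar", 39002))
```

If this returns False on the worker, the next things to look at would be name resolution for the master hostname and any firewall between the two machines.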