Failed to connect link / [Errno 113] No route to host


#1

I’m trying to add another workstation as a worker to an existing cryoSPARC 2 installation. The new worker has all the required NFS mounts (including the cryosparc2 worker installation and cache), and there is 2-way password-less SSH. The worker has no SSD cache of its own, and is configured as its own lane.
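For reference, a cache-less worker on its own lane is typically registered from the worker node with `cryosparcw connect`. A hedged sketch using this thread's hostnames and paths (the exact flags and default master port are assumptions; check `cryosparcw connect --help` on your install):

```shell
# Hypothetical registration of chihiro as a worker with no SSD cache,
# on its own lane. Hostnames/paths are the ones from this thread;
# flag names are assumed from cryosparcw's connect interface.
/mnt/chihiro-data/cryosparc2/cryosparc2_worker/bin/cryosparcw connect \
    --worker chihiro \
    --master ishtar \
    --port 39000 \
    --nossd \
    --newlane --lane chihiro
```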

If I launch a job on this lane without disabling SSD cache, the job shows as queued forever, and command_core.log shows it’s hung because it can’t cache. If I then re-launch the job with SSD cache disabled, the job becomes “started” but never advances. The command_core.log only reveals a “failed to connect link” error:

Setting parameter J22.compute_use_ssd with value False of type <type 'bool'>
---------- Scheduler running --------------- 
Lane  chihiro node : Jobs Queued (nonpaused, inputs ready):  [u'J22']
Total slots:  {u'chihiro': {u'GPU': set([0, 1]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7]), 
u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7])}}
Available slots:  {u'chihiro': {u'GPU': set([0, 1]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 
7]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7])}}
Available licen:  10000
Now trying to schedule J22
  Need slots :  {u'GPU': 2, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on chihiro
    Launchable:  True
    Alloc slots :  {u'GPU': [0, 1], u'RAM': [0, 1, 2], u'CPU': [0, 1]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P28 job UID J22 
failed to connect link
License Data:
License Signature:
     Running job on worker type node
     Running job using:  /mnt/foo/cryosparc2/cryosparc2_worker/bin/cryosparcw
     Running job on remote worker node hostname chihiro
     cmd: bash -c "nohup /mnt/chihiro-data/cryosparc2/cryosparc2_worker/bin/cryosparcw run --project P28 --job J22 --master_hostname ishtar --master_command_core_port 39002 > /mnt/foo/cryoSPARC/P28/J22/job.log 2>&1 & "
Changed job P28.J22 status started

#2

Hi @DanielAsarnow,

Thanks for reaching out. Could you reply with (or PM me) the output of cryosparcm cli "get_worker_nodes()"? Feel free to censor any personal information.


#3

Thanks for the reply, here’s the output of the command:

[{u'lane': u'default', u'name': u'ishtar', u'title': u'Worker node ishtar', u'resource_slots': {u'GPU': [0, 1, 2, 3], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, u'hostname': u'ishtar', u'worker_bin_path': u'/mnt/ishtar-data/cryosparc2/cryosparc2_worker/bin/cryosparcw', u'cache_path': u'/mnt/scratch/cryosparc2', u'cache_quota_mb': None, u'resource_fixed': {u'SSD': True}, u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'cryosparc@ishtar', u'desc': None}, {u'lane': u'chihiro', u'name': u'chihiro', u'title': u'Worker node chihiro', u'resource_slots': {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3, 4, 5, 6, 7], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7]}, u'hostname': u'chihiro', u'worker_bin_path': u'/mnt/chihiro-data/cryosparc2/cryosparc2_worker/bin/cryosparcw', u'cache_path': None, u'cache_quota_mb': None, u'resource_fixed': {u'SSD': False}, u'cache_reserve_mb': 10000, u'type': u'node', u'ssh_str': u'cryosparc@chihiro', u'desc': None}]


#4

Everything seems like it’s in order there… can you send the output of cryosparcm joblog P28 J22?


#5

Running that command gives the [Errno 113] No route to host error (cryoSPARC is otherwise working normally). However, the instance was not on the most recent version, so I will update it, then try running the job again and save the job log afterwards.


#6

Hey @DanielAsarnow, is this cleared up?


#7

After updating, dumping the joblog still gives ServerSelectionTimeoutError: ishtar:39001: [Errno 113] No route to host.

Ishtar is the cryoSPARC master node. It has DNS entries, and in the hosts file it is mapped to localhost (127.0.0.1) as well as its static IP.
The running job also hangs at Running job on remote worker node hostname chihiro as before.

No hints in the systemd journal, either.


#8

Very strange… are the other worker nodes registered with ishtar as the master, or ishtar.ucsf.edu (or whatever the FQDN is)?
It seems that the master was able to SSH to the worker chihiro correctly to launch the job process, but the job fails when the worker attempts to connect back to the database at port 39001 on the master node. This failure is “silent” because the worker can’t even write its error traceback to the database.

Could there be something with firewalls/security permissions?
Can the worker ping the master at hostname ishtar?
What if you run
curl http://ishtar:39002
You should get Hello World from cryosparc command core.
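The reachability checks above can be scripted. A minimal sketch using bash’s /dev/tcp redirection (hostname and ports are this thread’s; a “closed” result covers connection refused, filtered, and No route to host alike, so follow up with nmap to tell them apart):

```shell
# Probe a TCP port from the worker; prints "open" or "closed".
probe() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed"   # refused, filtered, or no route to host
  fi
}

probe ishtar 39001   # master database port
probe ishtar 39002   # master command_core port
```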


#9

ishtar itself is the only other worker; this is the first time I’ve started adding remote nodes. curl http://ishtar:39002 from the remote worker works as expected.

I looked more closely at the firewall rules, though, and used nmap to port-scan the master from the worker. It turns out 39001 was still filtered because the zones had changed. I opened it up completely and now it’s running! Thank you!

P.S. It also fixed the joblog Errno 113 error, even though that command was running on the master node. The CentOS firewall defaults are way too restrictive.
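For anyone hitting the same filtered-port problem on a CentOS master, the fix described above can be done with firewalld instead of disabling the firewall entirely. A hedged sketch (zone name and port range are assumptions; adjust to the ports your instance actually uses):

```shell
# Hypothetical firewalld commands on the master node to open the
# cryoSPARC port range to workers. Requires root; zone is assumed "public".
sudo firewall-cmd --permanent --zone=public --add-port=39000-39010/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --zone=public --list-ports   # verify the range is listed
```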