New hostname and failed worker nodes

We ordered a Linux workstation with CryoSPARC preinstalled. Everything worked fine until the hostname and IP address were changed. I can log in, but jobs fail with:
ssh: could not resolve hostname c115108: Name or service not known.
To resolve this, from the cryosparc_worker/bin directory I tried:
./cryosparcw connect --worker c115108 --master "hostname.edu" --port 39000 --ssdpath /scr/cryosparc_cache --update

However, when I rerun jobs, I still get the initial error "could not resolve hostname c115108". I did update to version 4.1.1.

Welcome to the forum @rmcnulty.
In a typical setup,

  • the CRYOSPARC_MASTER_HOSTNAME value defined inside /path/to/cryosparc_master/config.sh should match the output of the command
    hostname -f and the --master parameter in cryosparcw connect
  • host $(hostname -f) confirms that the hostname resolves to an IP address assigned to the master computer (resolution must work on the master and all workers)
  • the value of the --worker parameter in cryosparcw connect must resolve to the IP address of the correct worker, and must match the --master parameter in the case of a combined master/worker “standalone” instance
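The resolution checks above can be sketched in Python (a minimal illustration using only the standard socket module; `c115108` is the short worker name from the error message, used here as a placeholder):

```python
import socket

def resolves_to_ip(hostname):
    """Return the IP address the hostname resolves to, or None if
    resolution fails (the equivalent of a failing `host` lookup)."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# Resolution must work on the master and on every worker:
# run this on each machine for the master's and the worker's hostnames.
for name in (socket.getfqdn(), "c115108"):
    ip = resolves_to_ip(name)
    print(f"{name}: {ip if ip else 'Name or service not known'}")
```

If `resolves_to_ip` returns None for a name, the fix belongs in DNS or /etc/hosts, not in CryoSPARC itself.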

If you require additional help, please describe your CryoSPARC instance

  • “standalone” (combined master/worker) or master with separate worker(s)
  • output of
    cryosparcm cli "get_scheduler_targets()"
  • on the master, outputs of
    1. hostname -f
    2. host $(hostname)

Config.sh has:
export CRYOSPARC_MASTER_HOSTNAME="redacted.edu"

  • Master with separate workers.

  • [{'cache_path': '/scr/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25434587136, 'name': 'NVIDIA RTX A5000'}, {'id': 1, 'mem': 25434587136, 'name': 'NVIDIA RTX A5000'}, {'id': 2, 'mem': 25434587136, 'name': 'NVIDIA RTX A5000'}, {'id': 3, 'mem': 25434193920, 'name': 'NVIDIA RTX A5000'}], 'hostname': 'c115108', 'lane': 'default', 'monitor_port': None, 'name': 'c115108', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@c115108', 'title': 'Worker node c115108', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/software/cryosparc/cryosparc_worker/bin/cryosparcw'}]

On the master

  1. [cryosparc_user@defcon2 cryosparc_worker]$ hostname -f
    redacted.edu

  2. [cryosparc_user@defcon2 cryosparc_worker]$ host $(hostname)
    redacted.edu has address redacted

Assuming that the hostname assignment host1.dept.univ.edu (substitute the actual hostname, which was redacted in your post) is permanent, you may want to try:

  1. cryosparcm cli "remove_scheduler_target_node('c115108')" (run on the master, guide)
  2. run on worker
    ./cryosparcw connect --worker "host1.dept.univ.edu" --master "host1.dept.univ.edu" --port 39000 --ssdpath /scr/cryosparc_cache
    

Please ensure that ports 39001-39010 are accessible only to this host itself (and, potentially, future additional CryoSPARC workers).

The suggested worker connection procedure assumes that the master and worker run on the same host, consistent with the CRYOSPARC_MASTER_HOSTNAME you provided and the single worker definition in the get_scheduler_targets() output, and that hostname -f was executed on the worker. If the worker to be connected is in fact a different computer from the master, please substitute the --worker and --master parameters accordingly.

We are back up and running. Your solution worked!

I edited the port range. Assuming a CRYOSPARC_BASE_PORT=39000 assignment, port 39000 is used for access to the CryoSPARC web application, but not currently used for communication between the master and worker(s) (see guide).

netstat -tuplen | grep :3900
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:39006 0.0.0.0:* LISTEN 1001 627552 17095/node
tcp 0 0 0.0.0.0:39001 0.0.0.0:* LISTEN 1001 627458 16867/mongod
tcp 0 0 0.0.0.0:39002 0.0.0.0:* LISTEN 1001 627486 16973/python
tcp 0 0 0.0.0.0:39003 0.0.0.0:* LISTEN 1001 661060 17003/python
tcp 0 0 0.0.0.0:39005 0.0.0.0:* LISTEN 1001 661043 17038/python
tcp6 0 0 :::39000 :::* LISTEN 1001 628080 17072/node

nc -zv "host1.dept.univ.edu" 39000-39010

Connection to "host1.dept.univ.edu" port 39000 [tcp/*] succeeded!

nc: connectx to "host1.dept.univ.edu" port 39001 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39002 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39003 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39004 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39005 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39006 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39007 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39008 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39009 (tcp) failed: Connection refused

nc: connectx to "host1.dept.univ.edu" port 39010 (tcp) failed: Connection refused
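For reference, the same connectivity probe can be done without nc, e.g. with this short Python sketch (the hostname is the placeholder used in this thread; substitute your own):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Attempt a TCP connect, mimicking `nc -zv host port`.
    Returns True on success, False on refusal, filtering, or
    a resolution failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the CryoSPARC port range, matching the nc loop above.
for port in range(39000, 39011):
    state = "succeeded" if port_open("host1.dept.univ.edu", port) else "refused/filtered"
    print(f"port {port}: {state}")
```

A "refused/filtered" result from outside the master, paired with a LISTEN entry in netstat on the master itself, is the desired state for ports 39001-39010.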