SSH error after IP/subdomain move, also upgraded

Upgraded from v4.4.0 to v4.4.1 and also moved to another network/VLAN on site.
The IP subnet changed, as did the subdomain name.

When users try to start a job they see the error:
Failed to launch! 255 ssh: connect to host suraj port 22: Connection timed out

I have checked cryosparc_master/config.sh; it shows an FQDN of suraj.esplab.wadsworth.org, which is the new, correct setting. CRYOSPARC_FORCE_HOSTNAME is also set to true.

I ran the cryosparc_worker command to connect the worker to the master, specifying "localhost" for both the worker
and master nodes.

Added SSH host keys by running "ssh suraj.esplab" and "ssh localhost"; that is working. I also
created a public/private key pair, and passwordless login is working.
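As an aside, a per-user SSH client config entry can also make the short name usable while the DNS search domain is broken. This is only a sketch (the Host alias and user below are taken from this thread; adjust the key path if yours differs):

```
# ~/.ssh/config (hypothetical entry for this setup)
Host suraj
    HostName suraj.esplab.wadsworth.org
    User cryosparc_user
```

With this in place, "ssh suraj" expands to the FQDN regardless of what the resolver's search domain is set to.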

I don’t know what I missed; if anyone can point me at the right answer, or the best way to correct it, I’d appreciate it.

Oh, the platform is Ubuntu 20.04.6 LTS.

thanks in advance,
Brian

I would recommend the CRYOSPARC_FORCE_HOSTNAME setting only under exceptional circumstances.
Please can you

  • post the outputs of these commands
    grep -v LICENSE /path/to/cryosparc_master/config.sh
    cryosparcm cli "get_scheduler_targets()"
    host suraj.esplab.wadsworth.org
    
  • let us know if you intend to run CryoSPARC jobs on any nodes other than suraj.esplab.wadsworth.org now or in the foreseeable future
    [Edited 2024-01-05]

cryosparc_user@suraj:~$ grep -v LICENSE cryosparc_master/config.sh

export CRYOSPARC_MASTER_HOSTNAME="suraj.esplab.wadsworth.org"
export CRYOSPARC_DB_PATH="/home/cryosparc_user/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_FORCE_HOSTNAME=true

cryosparc_user@suraj:~/cryosparc_master$ bin/cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11543379968, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'suraj', 'lane': 'default', 'monitor_port': None, 'name': 'suraj', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@suraj', 'title': 'Worker node suraj', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11543379968, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'localhost', 'lane': 'default', 'monitor_port': None, 'name': 'localhost', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryosparc_user@localhost', 'title': 'Worker node localhost', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}]

cryosparc_user@suraj:~/cryosparc_master$ host suraj.esplab.wadsworth.org
suraj.esplab.wadsworth.org has address 10.50.148.212

While the IP did change, so did the domain name: it was originally esp.wadsworth.org rather than
esplab.wadsworth.org.

thanks,
Brian

Continued to dig; it appears the issue is that my CryoSPARC node, a DHCP client, is not setting its DNS domain name. The field is blank.

While I can connect the worker to the master as localhost, I cannot do so with the node name, only with localhost. The domain name is an issue, but I'd have thought it would work with localhost, since we do connect that way.

Will continue to pursue this as a DHCP client issue; if there are suggestions regarding the CryoSPARC config, please let me know.
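If the missing domain does turn out to be the DHCP client's doing, one possible stopgap on Ubuntu's ISC dhclient is to supersede the value locally rather than wait for the server fix. A sketch only; the domain value below is the one from this thread:

```
# /etc/dhcp/dhclient.conf - force the DNS domain the DHCP server is not sending
supersede domain-name "esplab.wadsworth.org";
supersede domain-search "esplab.wadsworth.org";
```

After editing, renewing the lease (or rebooting) should make the search domain appear and let the short name resolve again.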

FYI, if I unset CRYOSPARC_FORCE_HOSTNAME, I'm asked to set CRYOSPARC_HOSTNAME_CHECK, which seems to want an FQDN rather than a CNAME.
This is probably related to the domain name issue.

Ran connect with the FQDN; will test.

It resolves correctly using the domain name AD (Active Directory) uses, so I'd think the CNAME suraj would work fine, or localhost.

thanks,
Brian

@BrianCuttler Please can you comment on

and post the output of the command
hostname -f
The most straightforward setup would be one where the
hostname -f output, $CRYOSPARC_MASTER_HOSTNAME and, for a combined master/worker host, the scheduler target's 'hostname': value all match exactly.
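To make the comparison concrete, here is a small hypothetical shell helper (check_match is not a CryoSPARC command; the names in the example call are taken from the outputs posted above):

```shell
#!/bin/sh
# check_match: compare the three names that should agree on a combined
# master/worker host: the `hostname -f` output, $CRYOSPARC_MASTER_HOSTNAME,
# and the scheduler target's 'hostname' value.
check_match() {
    if [ "$1" = "$2" ] && [ "$2" = "$3" ]; then
        echo "MATCH"
    else
        echo "MISMATCH: hostname-f=$1 master=$2 target=$3"
    fi
}

# Values as posted in this thread: the first scheduler target still uses the
# short name 'suraj', so this reports a mismatch.
check_match suraj.esplab.wadsworth.org suraj.esplab.wadsworth.org suraj
```

Running cryosparcw connect again with matching FQDNs (or updating the target) would make all three agree.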

Thanks for the help. I kept digging and found that the domain name is wrong (missing, actually).
I worked around it by updating the /etc/hosts file to include the (sub)domain name, which reconnected
master and worker (master and worker are the same system; there will not be any other workers). This seems to
have caused a naming failure in the /scratch directory, resolved by renaming host:39001 to host.domain:39001, which we will revert, if needed, after correcting the DNS naming.
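For reference, the workaround amounts to an /etc/hosts entry along these lines (IP taken from the host output above; the FQDN is listed first so hostname -f picks it up):

```
10.50.148.212   suraj.esplab.wadsworth.org   suraj
```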

The subdomain should have been set via DHCP; I am pursuing this as a non-CryoSPARC issue. It just took a
while to dig out that the underlying networking wasn't working as expected.

Thanks for your help. I am still puzzled by the networking issue, but the CryoSPARC errors were only the symptom
and never where the problem actually resided.
thanks - Brian