Job halts at queue state due to hostname issue

Before I get to the question, let me explain what happened. After an attempt to update cryoSPARC from v3.2.2 to v4.2.1 failed, I did a clean installation of v4.2.1. cryoSPARC was able to start. However, jobs stay in the queued state forever. I saw the same or similar problems discussed on this board, indicating that the worker node may not have been updated and/or may not be connected to the master. So,

    1. I used "bin/cryosparcw update --override" to update the worker node. It still did not work.
    2. Then, I used "./bin/cryosparcw connect" to connect the worker and master nodes.

After this operation, the job started. However, I got the following error message:

[2023-05-22 19:49:13.21] Launching job on lane default target /home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker …

[2023-05-22 19:49:13.25] Running job on remote worker node hostname /home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker

[2023-05-22 19:49:13.26]

Failed to launch! 255 ssh: Could not resolve hostname /home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker: Name or service not known

It is very strange that the hostname is the path to the worker installation. The hostname should be "takagiws". I ran the command cryosparcm cli "get_scheduler_targets()", which was discussed on this board, and got the following output:

cryosparcm cli "get_scheduler_targets()"

[{'cache_path': '/home/takagilab/diskarray/software/cryoem/cryosparc/scratch/cryosparc_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 25664684032, 'name': 'Quadro M6000 24GB'}], 'hostname': '/home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker', 'lane': 'default', 'monitor_port': None, 'name': '/home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}, 'ssh_str': 'takagilab@/home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker', 'title': 'Worker node /home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker', 'type': 'node', 'worker_bin_path': '/home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker/bin/cryosparcw'}]

Again, it looks like the hostname is totally screwed up. It's possible that I made a mistake when I ran the "./bin/cryosparcw connect" command.
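For anyone who wants to check their own targets for the same symptom, here is a minimal sketch that pulls the 'hostname' values out of the get_scheduler_targets() output and flags any that contain a slash. The function names extract_target_hostnames and looks_like_path are made up for illustration, and the sketch assumes the output uses plain single quotes, as a terminal prints it. The sample string is a shortened copy of the output above.

```bash
# Print every 'hostname' value found in get_scheduler_targets() output read from stdin.
extract_target_hostnames() {
  grep -o "'hostname': '[^']*'" | sed "s/^'hostname': '//; s/'\$//"
}

# Succeed (exit status 0) when the value contains a slash, i.e. looks like a path.
looks_like_path() {
  case "$1" in
    */*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Shortened sample of the output posted in this thread.
sample="[{'hostname': '/home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker', 'lane': 'default'}]"

printf '%s\n' "$sample" | extract_target_hostnames | while IFS= read -r host; do
  if looks_like_path "$host"; then
    echo "suspicious hostname (looks like a path): $host"
  fi
done
```

On a live system the sample string could be replaced by piping the real cryosparcm cli "get_scheduler_targets()" output into extract_target_hostnames.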

My question is: how can I fix this hostname issue?

Thanks for your help

Best

Yuro

Your hostname should be in cryosparc_master/config.sh

Start by double checking there that it’s correct. Also make sure you can ping the hostname from the cryosparc_master computer.
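A rough sketch of that check, assuming config.sh defines the variable on a line of the form export CRYOSPARC_MASTER_HOSTNAME="...". The helper names read_master_hostname and check_master_hostname are hypothetical, and the path in the usage line is a placeholder.

```bash
# Print the value of CRYOSPARC_MASTER_HOSTNAME from the config.sh given as $1.
# If the variable is defined more than once, the last definition wins.
read_master_hostname() {
  sed -n 's/^[[:space:]]*export[[:space:]][[:space:]]*CRYOSPARC_MASTER_HOSTNAME=//p' "$1" \
    | tail -n 1 | tr -d "\"'"
}

# Show the configured hostname and report whether it answers a single ping.
check_master_hostname() {
  host=$(read_master_hostname "$1")
  echo "CRYOSPARC_MASTER_HOSTNAME is: $host"
  if ping -c 1 "$host" >/dev/null 2>&1; then
    echo "ping OK"
  else
    echo "ping FAILED"
  fi
}

# Usage (placeholder path, run on the cryosparc_master computer):
# check_master_hostname /path/to/cryosparc_master/config.sh
```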

It could just be the way the error is written, but it looks like it’s using the executable path for the hostname which is odd.

This looks like cryosparcw connect has been run with an incorrect --worker specification.
I am not sure cryosparcw connect with the --update flag can correct this problem.
You could try:

cryosparcm cli "remove_scheduler_target_node('/home/takagilab/diskarray/software/cryoem/cryosparc/cryosparc_worker')"

The most suitable cryosparcw connect parameters depend on your situation (and plans):

  • the value of CRYOSPARC_MASTER_HOSTNAME defined inside /path/to/cryosparc_master/config.sh
  • are master and worker combined on the same host?
  • will there be additional worker nodes in the foreseeable future?
  • can you ping the hostname defined by CRYOSPARC_MASTER_HOSTNAME from the cryosparc_master computer?

Dear Wtempel
Thanks for your reply. A few things here:

I ran the cryosparcm cli "remove_scheduler_target_node(...)" command, and it appears to have deleted the setting, since running cryosparcm cli "get_scheduler_targets()" afterwards returned nothing.

  • the value of CRYOSPARC_MASTER_HOSTNAME defined inside /path/to/cryosparc_master/config.sh
  • are master and worker combined on the same host?
  • will there be additional worker nodes in the foreseeable future?

The config.sh file indicates that the hostname is "Takagiws". Running ping takagiws shows a network connection.

For this particular workstation, I won't add any additional worker nodes. It looks like when I ran "./bin/cryosparcw connect", I put the worker path where I should have put the hostname. If I run the connect command again, I need to specify the hostname for "worker". Is the hostname of the worker and the master the same?

Thanks for your help
Best
Yuro

Capital letters in the value of the CRYOSPARC_MASTER_HOSTNAME variable are problematic.
May I suggest:

  1. stop CryoSPARC:
    cryosparcm stop
  2. modify the definition inside /path/to/cryosparc_master/config.sh to:
    export CRYOSPARC_MASTER_HOSTNAME="takagiws"
  3. start CryoSPARC again:
    cryosparcm start
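Since capital letters in the value are the concern here, a small sketch like the following can flag them before you edit config.sh and suggest the lowercase form. The helper names has_uppercase and to_lowercase are made up for illustration, and the value "Takagiws" is the one quoted earlier in this thread.

```bash
# Succeed (exit status 0) when $1 contains at least one uppercase letter.
has_uppercase() {
  case "$1" in
    *[[:upper:]]*) return 0 ;;
    *)             return 1 ;;
  esac
}

# Print the lowercase form of $1.
to_lowercase() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

configured="Takagiws"   # value reported from config.sh in this thread
if has_uppercase "$configured"; then
  echo "suggested value: $(to_lowercase "$configured")"
fi
```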

Yes. Specify --worker takagiws and --master takagiws, as well as other required parameters.
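To make that concrete, here is a sketch that only prints a candidate command so it can be reviewed before running it from the cryosparc_worker directory. I am assuming that --port and --ssdpath are the other parameters needed in this case, that 39000 (the default base port) applies to this installation, and that /path/to/ssd/cache stands in for the real cache directory. The function name print_connect_command is hypothetical.

```bash
# Print (do not run) a cryosparcw connect command for a combined master/worker host.
print_connect_command() {
  host=$1      # hostname used for both --worker and --master
  port=$2      # base port of the CryoSPARC master (assumed, check your installation)
  ssdpath=$3   # cache directory on the worker (placeholder)
  echo "./bin/cryosparcw connect --worker $host --master $host --port $port --ssdpath $ssdpath"
}

print_connect_command takagiws 39000 /path/to/ssd/cache
```

The point of the sketch is that --worker takes a hostname, not the path to the cryosparc_worker directory.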

Dear wtempel

Sorry, I misspelled it. The hostname is all lowercase: "takagiws". I ran the "connect" command with the worker hostname set correctly. Finally, cryoSPARC is running!

Thanks for your help
Best

Yuro