Worker connection timed out in v4.5.1

Hi there,
I just upgraded to v4.5.1 and got a timeout error (see below). How can I change this 15-second limit to a longer time? Please let me know if any additional information is needed. Thanks!
Best,
Wei

Error message:
“Running job on remote worker node hostname xxxxx
Failed to launch! Command […] timed out after 15 seconds”

@wxh180 Please provide additional details

  1. Was the job submitted to a node-type or cluster-type lane?
  2. What was the command that timed out?

@wtempel See my answers below. Let me know if any additional information is needed. Thanks!

  1. It is a node-type
  2. Here is the complete log:
    "License is valid.

Launching job on lane pp2a_4u target c109864.phrm.CWRU.Edu …

Running job on remote worker node hostname c109864.phrm.CWRU.Edu

Failed to launch! Command ['ssh', 'cryosparc_user@worker1', 'bash -c "nohup /data/Programs/cryosparc_08242023/cryosparc_worker/bin/cryosparcw run --project P6 --job J334 --master_hostname tayloret.phrm.cwru.edu --master_command_core_port 39002 > /taylorNAS3/cryosparc_projects/4U_PP2A/P46/J334/job.log 2>&1 & "'] timed out after 15 seconds"

Thanks @wxh180. Some additional questions:

  1. Did you run jobs on the c109864 worker node using this CryoSPARC installation before the upgrade to v4.5.1?
  2. Please can you run these commands on tayloret.phrm.cwru.edu and post their outputs:
    cryosparcm cli "get_scheduler_targets()"
    ls -al /taylorNAS3/cryosparc_projects/4U_PP2A/P46/J334/
    

@wtempel

  1. Yes. It was running okay before the upgrade, but there was a lag in communication between the master and the worker. I guess extending the 15 s time-out limit could solve the problem.
  2. Running these commands shows that c109864 is still a worker and that CryoSPARC has access to the job directory.
    Let me know if any additional information is needed.
    Thanks!

You may want to investigate and resolve the cause of that lag. Did you confirm that, after a lag of more than 15 seconds, commands similar to the ssh launch command in your job log would succeed instead of eventually failing anyway? How long would that lag be? A lag as long as 15 seconds indicates a problem that would likely affect data processing in other ways.

Seeing the ls and cryosparcm commands’ outputs may lead to some follow-up questions.

@wtempel If we can get this lag issue resolved, that would be great. If we ping the machine directly, it looks normal. The lag seems to occur within CryoSPARC.

Here are the outputs of the two commands you suggested. Let me know if there is anything else you need.
###########################################################################

(base) [whuang@tayloret ~]$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/scr/scratch/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 42314694656, 'name': 'NVIDIA A100-PCIE-40GB'}, {'id': 1, 'mem': 42314694656, 'name': 'NVIDIA A100-PCIE-40GB'}], 'hostname': '129.22.208.53', 'lane': 'a100', 'monitor_port': None, 'name': '129.22.208.53', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]}, 'ssh_str': 'whuang@129.22.208.53', 'title': 'Worker node 129.22.208.53', 'type': 'node', 'worker_bin_path': '/data/Programs/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '/data/cryosparc_dir', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11544035328, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 4, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 5, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 6, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 7, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}], 'hostname': 'c109864.phrm.CWRU.Edu', 'lane': 'pp2a_4u', 'monitor_port': None, 'name': 'c109864.phrm.CWRU.Edu', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]}, 'ssh_str': 'whuang@pp2a', 'title': 'Worker node c109864.phrm.CWRU.Edu', 'type': 'node', 'worker_bin_path': '/data/Programs/cryosparc_08242023/cryosparc_worker/bin/cryosparcw'}]
(base) [whuang@tayloret ~]$ ls -al /taylorNAS3/cryosparc_projects/4U_PP2A/P46/J334/
total 37
drwxrwxr-x.   3 whuang whuang     5 May  9 04:56 .
drwxrwxr-x. 341 whuang whuang   347 May  9 04:56 ..
-rw-rw-r--.   1 whuang whuang    18 May  9 04:56 events.bson
drwxrwxr-x.   2 whuang whuang     2 May  9 04:56 gridfs_data
-rw-rw-r--.   1 whuang whuang 61359 May  9 04:56 job.json

Is there a reason for this custom ssh string? Please can you post the outputs of these commands (run on tayloret):

host c109864.phrm.CWRU.Edu
host pp2a
grep -i -A 2 pp2a ~whuang/.ssh/config
time ssh whuang@c109864.phrm.CWRU.Edu sleep 5
time ssh whuang@pp2a sleep 5
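
(For the last two commands: with a 5-second sleep, any real time well above 5 seconds is connection setup overhead.)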

@wtempel Not really. “pp2a” is just an alias defined in ~/.ssh/config

Here are the outputs:
(base) [whuang@tayloret ~]$ time ssh whuang@c109864.phrm.CWRU.Edu sleep 5

real 0m53.076s
user 0m0.025s
sys 0m0.017s
(base) [whuang@tayloret ~]$ time ssh whuang@pp2a sleep 5

real 0m50.749s
user 0m0.017s
sys 0m0.016s

This delay is unexpectedly long.
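
A delay that long usually comes from the ssh connection setup itself, for example a slow reverse DNS lookup or a stalled authentication attempt on the worker (an assumption; your outputs do not show the cause). A verbose connection attempt, run on tayloret, can show which handshake step stalls:

# -vvv prints each handshake step, so the slow one stands out
time ssh -vvv whuang@c109864.phrm.CWRU.Edu true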

The timeout limit can be extended by adding a definition like

export CRYOSPARC_JOB_LAUNCH_TIMEOUT_SECONDS=120

to the file cryosparc_master/config.sh and restarting CryoSPARC (when no CryoSPARC job is running).
Setting this variable does not resolve the underlying cause of the delay, which may affect data processing in other ways. You or your IT support may want to look into the cause.
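
For reference, a minimal sketch of the change on the master node (the installation path below is an assumption; substitute your actual cryosparc_master directory):

# append the override to the master configuration file
echo 'export CRYOSPARC_JOB_LAUNCH_TIMEOUT_SECONDS=120' >> /data/Programs/cryosparc/cryosparc_master/config.sh
# restart CryoSPARC once no jobs are running
cryosparcm restart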