Unresponsive worker nodes after installation

Hi everyone,

I installed cryosparc_worker on 2 new worker nodes and verified the following:

  • the shared partition is accessible
  • the master node can connect via ssh to all worker nodes without a password
  • ports 39000-39010 on the master node are reachable from all worker nodes
  • the worker version is the same everywhere (v4.1.2)
  • the nodes can be connected to the master (using cryosparcw connect ...)
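For reference, these checks can be scripted; below is a minimal dry-run sketch that only prints the check commands (the hostnames are the ones from this setup, and the nc port-range syntax assumes traditional netcat):

```shell
# Print the pre-flight checks as runnable commands (a dry run; pipe the
# output to `sh` from the master node to actually execute them).
MASTER=cmm-1
WORKERS="cmm2 cmm3 dragon"

gen_checks() {
    for w in $WORKERS; do
        # passwordless ssh from master to each worker
        echo "ssh -o BatchMode=yes $w true"
        # master ports 39000-39010 reachable from each worker
        echo "ssh $w nc -z -w5 $MASTER 39000-39010"
    done
}

gen_checks
```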

However, when I submit jobs to the newly connected lanes, they don’t actually start – they get stuck in the “launched” state.

Also, cryosparcm test w P10 shows errors (see screenshot):

Since the nodes were successfully connected, I don’t see any obvious direction for troubleshooting. The job logs referenced in the screenshot above also don’t exist:

(base) cryosparcuser@cmm-1:~$ cryosparcm joblog P19 J47
/data/cryosparc_projects/P19/J47/job.log: No such file or directory

Any suggestions?

Just to confirm: Did you run cryosparcw connect on the workers (not the master)?
What are the outputs of

  1. cryosparcm cli "get_scheduler_targets()"
  2. cryosparcm eventlog P19 J47

Did you run cryosparcw connect on the workers (not the master)?
Yes, and I also see these nodes among the available ones under “Queue job” in the cryoSPARC web GUI.

  • cryosparcm cli "get_scheduler_targets()"
----------------------------------------
	cache_path: /data/cryosparc_cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	gpus: [{'id': 0, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}]
	hostname: cmm-1
	lane: default
	monitor_port: None
	name: cmm-1
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
	ssh_str: cryosparcuser@cmm-1
	title: Worker node cmm-1
	type: node
	worker_bin_path: /opt/cryosparc/cryosparc_worker/bin/cryosparcw
----------------------------------------
	cache_path: /home/cryosparcuser/cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	gpus: [{'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}]
	hostname: cmm2
	lane: slow_lane
	monitor_port: None
	name: cmm2
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
	ssh_str: cryosparcuser@cmm2
	title: Worker node cmm2
	type: node
	worker_bin_path: /opt/cryosparc/cryosparc_worker/bin/cryosparcw
----------------------------------------
	cache_path: /storage/cryosparcuser/cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	hostname: cmm3
	lane: cpu_only
	monitor_port: None
	name: cmm3
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
	ssh_str: cryosparcuser@cmm3
	title: Worker node cmm3
	type: node
	worker_bin_path: /storage/apps/cryosparc/cryosparc_worker/bin/cryosparcw
----------------------------------------
	cache_path: /data/cryosparc_cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	gpus: [{'id': 1, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 2, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 3, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}]
	hostname: dragon
	lane: gtx1080
	monitor_port: None
	name: dragon
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87], 'GPU': [1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
	ssh_str: cryosparcuser@dragon
	title: Worker node dragon
	type: node
	worker_bin_path: /home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw
  • cryosparcm eventlog P19 J47
(base) cryosparcuser@cmm-1:~$ cryosparcm eventlog P19 J47
License is valid.
Launching job on lane cpu_only target cmm3 ...
Running job on remote worker node hostname cmm3
**** Kill signal sent by unknown user ****

I can also confirm that the jobs indeed never reach the freshly installed worker nodes – at least I don’t see them with pgrep -af cryosparc even a few minutes after clicking “submit” in the GUI.

Another update: restarting the master node and re-connecting all the nodes doesn’t solve the issue either.

@marinegor Please can you

  1. check if any of the “failing to start” GPU jobs other than cryosparcm test w wrote useful information to its respective job.log
  2. email us the tgz file produced by the command
    cryosparcm snaplogs
  • check if any of the “failing to start” GPU jobs other than cryosparcm test w wrote useful information to its respective job.log

No logs were produced.

  • email us the tgz file produced by the command
    cryosparcm snaplogs

I had to kill the test job with Ctrl+C after a few minutes, but I sent the archive nevertheless.

If P19 J81 has not been deleted and did not produce any useful data, please can you try as cryosparcuser on host cmm-1:

ssh dragon bash -c "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"

and post the command’s output?

Hi, here’s the output:

(base) cryosparcuser@cmm-1:~$ ssh dragon bash -c "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"
Unknown cryosparcw command
(base) cryosparcuser@cmm-1:~$

Thanks @marinegor. What about the similar (without bash -c):

ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"
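(The earlier “Unknown cryosparcw command” is likely a quoting artifact: ssh joins its arguments into a single remote command string, so on the remote side bash -c receives only the cryosparcw path as its command string, and the remaining words become positional parameters rather than a subcommand. A local sketch of that behavior, using placeholder arguments:)

```shell
# bash -c takes only its first non-option argument as the command
# string; subsequent words become $0, $1, ... -- mirroring what
# happens when ssh flattens a quoted command line.
result=$(bash -c 'echo "script got $# args, \$0=$0, \$1=$1"' cryosparcw run --project P19)
echo "$result"
```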
(base) cryosparcuser@cmm-1:~$ ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"


================= CRYOSPARCW =======  2023-03-29 01:53:28.823130  =========
Project P19 Job J81
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process at 2023-03-29 01:53:28.823230
MAINPROCESS PID 29294
Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 126, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/client.py", line 36, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 91, in __init__
    self._reload()  # attempt connection immediately to gather methods
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 118, in _reload
    system = self._get_callable("system.describe")()
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandClient.Error(
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://cmm-1:39002) Did not receive a JSON response from method "system.describe" with params ()
*** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known
*** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known
Process Process-1:
Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/run.py", line 32, in cryosparc_compute.run.main
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 126, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/client.py", line 36, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 91, in __init__
    self._reload()  # attempt connection immediately to gather methods
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 118, in _reload
    system = self._get_callable("system.describe")()
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandClient.Error(
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://cmm-1:39002) Did not receive a JSON response from method "system.describe" with params ()