Unresponsive worker node after installation

Hi everyone,

I installed cryosparc_worker on 2 new worker nodes and ensured the following:

  • the shared partition is accessible
  • the master node can connect to all worker nodes via passwordless ssh
  • ports 39000-39010 on the master node are accessible from all worker nodes (a quick way to verify this is sketched after the list)
  • the worker version is the same everywhere (v4.1.2)
  • the nodes can be connected to the master (using cryosparcw connect ...)
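
For reference, a quick way to run this kind of port check from a worker is something along these lines (just a sketch: it assumes nc is available on the worker, and <master_ip_or_hostname> is a placeholder for however the master was addressed):

for p in $(seq 39000 39010); do
  nc -z -w 2 <master_ip_or_hostname> "$p" && echo "port $p open" || echo "port $p CLOSED"
done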

However, when I submit jobs to the newly connected lanes, they never actually start – they get stuck in the “launched” state.

Also, cryosparcm test w P19 shows errors:

Since the nodes were connected successfully, I can’t see an obvious direction for troubleshooting. Also, the job logs from the screenshot above don’t exist:

(base) cryosparcuser@cmm-1:~$ cryosparcm joblog P19 J47
/data/cryosparc_projects/P19/J47/job.log: No such file or directory

Any suggestions?

Just to confirm: Did you run cryosparcw connect on the workers (not the master)?
What are the outputs of

  1. cryosparcm cli "get_scheduler_targets()"
  2. cryosparcm eventlog P19 J47

Did you run cryosparcw connect on the workers (not the master)?
Yes, and I also see these nodes among the available ones in the “Queue job” dialog in the CryoSPARC web GUI.

  • cryosparcm cli "get_scheduler_targets()"
----------------------------------------
	cache_path: /data/cryosparc_cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	gpus: [{'id': 0, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}]
	hostname: cmm-1
	lane: default
	monitor_port: None
	name: cmm-1
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
	ssh_str: cryosparcuser@cmm-1
	title: Worker node cmm-1
	type: node
	worker_bin_path: /opt/cryosparc/cryosparc_worker/bin/cryosparcw
----------------------------------------
	cache_path: /home/cryosparcuser/cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	gpus: [{'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}]
	hostname: cmm2
	lane: slow_lane
	monitor_port: None
	name: cmm2
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'GPU': [1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
	ssh_str: cryosparcuser@cmm2
	title: Worker node cmm2
	type: node
	worker_bin_path: /opt/cryosparc/cryosparc_worker/bin/cryosparcw
----------------------------------------
	cache_path: /storage/cryosparcuser/cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	hostname: cmm3
	lane: cpu_only
	monitor_port: None
	name: cmm3
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 'GPU': [], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]}
	ssh_str: cryosparcuser@cmm3
	title: Worker node cmm3
	type: node
	worker_bin_path: /storage/apps/cryosparc/cryosparc_worker/bin/cryosparcw
----------------------------------------
	cache_path: /data/cryosparc_cache
	cache_quota_mb: None
	cache_reserve_mb: 10000
	desc: None
	gpus: [{'id': 1, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 2, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 3, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}]
	hostname: dragon
	lane: gtx1080
	monitor_port: None
	name: dragon
	resource_fixed: {'SSD': True}
	resource_slots: {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87], 'GPU': [1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
	ssh_str: cryosparcuser@dragon
	title: Worker node dragon
	type: node
	worker_bin_path: /home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw
  • cryosparcm eventlog P19 J47
(base) cryosparcuser@cmm-1:~$ cryosparcm eventlog P19 J47
License is valid.
Launching job on lane cpu_only target cmm3 ...
Running job on remote worker node hostname cmm3
**** Kill signal sent by unknown user ****

I can also confirm that the jobs indeed don’t get started on the freshly installed worker nodes – at least I don’t see them with pgrep -af cryosparc even a few minutes after I click “submit” in the GUI.
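
Roughly, the check (run against each newly connected node; <worker_hostname> is a placeholder):

ssh <worker_hostname> "pgrep -af cryosparc"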

Another update: restarting the master node and re-connecting all the nodes doesn’t solve the issue either.

@marinegor Please can you

  1. check if any of the “failing to start” GPU jobs other than cryosparcm test w wrote useful information to its respective job.log
  2. email us the tgz file produced by the command
    cryosparcm snaplogs
  • check if any of the “failing to start” GPU jobs other than cryosparcm test w wrote useful information to its respective job.log

No logs were produced.

  • email us the tgz file produced by the command
    cryosparcm snaplogs

I had to kill the test job with Ctrl+C after a few minutes, but I sent the archive nevertheless.

If P19 J81 has not been deleted and did not produce any useful data, please can you try as cryosparcuser on host cmm-1:

ssh dragon bash -c "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"

and post the command’s output?

Hi, here’s the output:

(base) cryosparcuser@cmm-1:~$ ssh dragon bash -c "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"
Unknown cryosparcw command
(base) cryosparcuser@cmm-1:~$
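
The bash -c form most likely fails because ssh joins its arguments with spaces before the remote shell runs them: on dragon, bash -c then receives only the cryosparcw path as its command string, while run, --project, etc. become positional parameters, so cryosparcw is invoked with no subcommand and reports “Unknown cryosparcw command”. A minimal illustration of the pitfall, using a hypothetical echo command:

ssh dragon 'bash -c "echo one two three"'   # prints: one two three
ssh dragon bash -c "echo one two three"     # prints an empty line: the -c script is
                                            # just "echo"; one/two/three become $0 $1 $2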

Thanks @marinegor. What about the similar command, without bash -c:

ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"
(base) cryosparcuser@cmm-1:~$ ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J81 --master_hostname cmm-1 --master_command_core_port 39002"


================= CRYOSPARCW =======  2023-03-29 01:53:28.823130  =========
Project P19 Job J81
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process at 2023-03-29 01:53:28.823230
MAINPROCESS PID 29294
Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc_master/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 126, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/client.py", line 36, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 91, in __init__
    self._reload()  # attempt connection immediately to gather methods
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 118, in _reload
    system = self._get_callable("system.describe")()
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandClient.Error(
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://cmm-1:39002) Did not receive a JSON response from method "system.describe" with params ()
*** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known
*** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known
Process Process-1:
Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://cmm-1:39002/api) URL Error [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc_master/cryosparc_compute/run.py", line 32, in cryosparc_compute.run.main
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 126, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port), service="command_core")
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_compute/client.py", line 36, in __init__
    super().__init__(service, host, port, url, timeout, headers, cls=NumpyEncoder)
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 91, in __init__
    self._reload()  # attempt connection immediately to gather methods
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 118, in _reload
    system = self._get_callable("system.describe")()
  File "/home/cryosparcuser/cryosparc_app/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 107, in func
    raise CommandClient.Error(
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://cmm-1:39002) Did not receive a JSON response from method "system.describe" with params ()

Interesting. The “Name or service not known” error suggests dragon cannot resolve the master hostname cmm-1.
What about (as cryosparcuser@cmm-1)

curl 127.0.0.1:39002
ssh dragon "curl cmm-1:39002"
ssh dragon "host cmm-1"

?

(base) cryosparcuser@cmm-1:~$ curl 127.0.0.1:39002
Hello World from cryosparc command core.

then

(base) cryosparcuser@cmm-1:~$ ssh dragon "curl cmm-1:39002"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: cmm-1; Unknown error

and

(base) cryosparcuser@cmm-1:~$ ssh dragon "host cmm-1"
Host cmm-1 not found: 3(NXDOMAIN)
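
So the master hostname needs to be resolvable on the workers. The entry added to /etc/hosts on dragon looks roughly like this (the IP shown below is a placeholder, not the real address):

<master_ip_address>    cmm-1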

However, adding cmm-1 to /etc/hosts doesn’t solve the problem just yet – cryosparcm test w P19 still fails for the two new nodes, even though ssh dragon "curl cmm-1:39002" now works:

(base) cryosparcuser@cmm-1:~$ ssh dragon "curl cmm-1:39002"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    41  100    41    0     0   3159      0 --:--:-- --:--:-- --:--:--  3416
Hello World from cryosparc command core.

And here is the output of the failed test:

(base) cryosparcuser@cmm-1:~$ cryosparcm test w P19
Using project P19
Running worker tests...
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL | Worker test results
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL | cmm-1
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL |   ✓ LAUNCH
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL |   ✓ SSD
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL |   ✓ GPU
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL | cmm2
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Error:
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     See P19 J95 for more information
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL | cmm3
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Error:
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     See P19 J98 for more information
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL | dragon
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Error:
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     See P19 J97 for more information
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2023-03-29 18:31:50,396 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,396 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2023-03-29 18:31:50,396 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed

I ran this:

(base) cryosparcuser@cmm-1:~$ ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J97 --master_hostname cmm-1 --master_command_core_port 39002"


================= CRYOSPARCW =======  2023-03-29 18:40:10.991691  =========
Project P19 Job J97
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process at 2023-03-29 18:40:10.991836
MAINPROCESS PID 47099
========= monitor process now waiting for main process
MAIN PID 47099
instance_testing.run cryosparc_compute.jobs.jobregister
***************************************************************
***************************************************************
========= main process now complete at 2023-03-29 18:40:20.386394.
========= monitor process now complete at 2023-03-29 18:40:20.405444.

which changed the status of J97 in the web GUI to “Completed” – although the node still doesn’t actually run any compute jobs.

And just in case, I tried reconnecting the dragon node using --sshstr cryosparcuser@<actual_ip_address> – it didn’t help either.

If your target list still contains this entry, what will be shown in the event and job logs when you send a (non-test) GPU job to dragon?

It just hangs at “started”.

@wtempel sorry, I was actually wrong – it hangs at “launched”:

[screenshot: the job stuck in the “launched” state]

The job log is empty in the GUI.

I still do not have a complete picture of the instance’s state, and therefore cannot suggest a path to recovery.
What, if anything, does the job log show?

I am surprised that dragon could have been connected under these circumstances. Do you recall the full cryosparcw connect command you used?
Is it now ensured that all workers can access CryoSPARC master ports using the cmm-1 hostname?
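
For example, a quick check along these lines from the master would confirm it for every worker (hostnames taken from the get_scheduler_targets() output above; only the command_core port is probed here, but the other master ports in the 39000-39010 range can be checked the same way):

for h in cmm2 cmm3 dragon; do
  echo "== $h =="
  ssh "$h" "curl -s --max-time 5 cmm-1:39002"   # each should print: Hello World from cryosparc command core.
done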