Unresponsive worker node after installation

(base) cryosparcuser@cmm-1:~$ curl 127.0.0.1:39002
Hello World from cryosparc command core.

then

(base) cryosparcuser@cmm-1:~$ ssh dragon "curl cmm-1:39002"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: cmm-1; Unknown error

and

(base) cryosparcuser@cmm-1:~$ ssh dragon "host cmm-1"
Host cmm-1 not found: 3(NXDOMAIN)

However, adding cmm-1 to /etc/hosts doesn’t solve the problem just yet: cryosparcm test w P19 still fails for the two new nodes, although ssh dragon "curl cmm-1:39002" now works:

(base) cryosparcuser@cmm-1:~$ ssh dragon "curl cmm-1:39002"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    41  100    41    0     0   3159      0 --:--:-- --:--:-- --:--:-Hello World from cryosparc command core.
-  3416
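For reference, the entry added to /etc/hosts on the workers is of this form (a sketch; the IP is the one that shows up in the ping output later in this thread, and the cmm1 alias is the one getent reports, so treat both as illustrative):

# sketch of the /etc/hosts entry on each worker node (IP and alias as they appear elsewhere in this thread)
10.55.229.12    cmm-1 cmm1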

Here is the output of the failed test:

(base) cryosparcuser@cmm-1:~$ cryosparcm test w P19
Using project P19
Running worker tests...
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL | Worker test results
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL | cmm-1
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL |   ✓ LAUNCH
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL |   ✓ SSD
2023-03-29 18:31:50,382 WORKER_TEST          log                  CRITICAL |   ✓ GPU
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL | cmm2
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Error:
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     See P19 J95 for more information
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL | cmm3
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Error:
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     See P19 J98 for more information
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL | dragon
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ✕ LAUNCH
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     Error:
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |     See P19 J97 for more information
2023-03-29 18:31:50,395 WORKER_TEST          log                  CRITICAL |   ⚠ SSD
2023-03-29 18:31:50,396 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
2023-03-29 18:31:50,396 WORKER_TEST          log                  CRITICAL |   ⚠ GPU
2023-03-29 18:31:50,396 WORKER_TEST          log                  CRITICAL |     Did not run: Launch test failed
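As an aside, the per-job details the test points to ("See P19 J95 for more information", etc.) should be retrievable on the master with cryosparcm joblog, the same command that appears further down in this thread; a sketch, assuming the failed jobs actually wrote a job.log:

# sketch: inspect the launch failures flagged by the worker test (job IDs from the output above)
cryosparcm joblog P19 J95
cryosparcm joblog P19 J97
cryosparcm joblog P19 J98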

I ran this:

(base) cryosparcuser@cmm-1:~$ ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J97 --master_hostname cmm-1 --master_command_core_port 39002"


================= CRYOSPARCW =======  2023-03-29 18:40:10.991691  =========
Project P19 Job J97
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process at 2023-03-29 18:40:10.991836
MAINPROCESS PID 47099
========= monitor process now waiting for main process
MAIN PID 47099
instance_testing.run cryosparc_compute.jobs.jobregister
***************************************************************
***************************************************************
========= main process now complete at 2023-03-29 18:40:20.386394.
========= monitor process now complete at 2023-03-29 18:40:20.405444.

which changed the status of J97 in the web GUI to “Completed”, although the node is still unresponsive as far as actual computation goes.

And just in case, I tried reconnecting the dragon node using --sshstr cryosparcuser@<actual_ip_address>; it didn’t help either.
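Roughly along these lines (a sketch, not the exact command; paths and addresses are placeholders):

# sketch only: re-register dragon with an explicit ssh string; using --update rather than a fresh
# connect is an assumption here, and <master_ip>/<actual_ip_address> are placeholders
./cryosparc_app/cryosparc_worker/bin/cryosparcw connect --worker dragon --master <master_ip> \
    --port 39000 --sshstr cryosparcuser@<actual_ip_address> --update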

If your target list still contains this entry, what will be shown in the event and job logs when you send a (non-test) GPU job to dragon?

it just hangs at “started”.

@wtempel sorry, I was actually wrong: it hangs at “launched”.

The job log is empty in the GUI.

I still do not have a complete picture of the instance’s state, and therefore cannot suggest a path to recovery.
What, if anything, does the job log show?

I am surprised that dragon could have been connected under these circumstances. Do you recall the full cryosparcw connect command you used?
Is it now ensured that all workers can access CryoSPARC master ports using the cmm-1 hostname?

I still do not have a complete picture of the instance’s state, and therefore cannot suggest a path to recovery.
If that would be fine with you, could I suggest a ~15-minute live debug session? It might be faster.

What, if anything, does the job log show?

(base) cryosparcuser@cmm-1:~$ ssh dragon "host cmm-1"
Host cmm-1 not found: 3(NXDOMAIN)

but that might be a problem with the host utility on CentOS; the other lookup tool works just fine:

(base) cryosparcuser@cmm-1:~$ ssh dragon "getent hosts cmm-1"
<the real IP here>    cmm-1 cmm1
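The same lookup can be repeated across the other nodes with a quick loop; a sketch using the worker hostnames from the test output above:

# sketch: verify that every worker resolves the master hostname cmm-1
for h in cmm2 cmm3 dragon; do echo "== $h"; ssh $h "getent hosts cmm-1"; done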

I am surprised that dragon could have been connected under these circumstances. Do you recall the full cryosparcw connect command you used?

[cryosparcuser@dragon ~]$ ./cryosparc_app/cryosparc_worker/bin/cryosparcw connect --worker dragon --master <correct ip here> --port 39000 --ssdpath /data/cryosparc_cache --gpus "1,2,3" --ssdquota 500000 --lane gtx1080 --sshstr cryosparcuser@dragon --newlane

 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker dragon to command <correct ip here>:39002
  Connecting as unix user cryosparcuser
  Will register using ssh string: cryosparcuser@<real ip here>
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string>
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    cmm-1
    cmm2
 ---------------------------------------------------------------
  Autodetecting available GPUs...
  Detected 4 CUDA devices.

   id           pci-bus  name
   ---------------------------------------------------------------
       0      0000:03:00.0  GeForce GTX 1080 Ti
       1      0000:04:00.0  GeForce GTX 1080 Ti
       2      0000:81:00.0  GeForce GTX 1080 Ti
       3      0000:82:00.0  GeForce GTX 1080 Ti
   ---------------------------------------------------------------
   Devices specified: 1, 2, 3
   Devices 1, 2, 3 will be enabled now.
   This can be changed later using --update
 ---------------------------------------------------------------
  Worker will be registered with SSD cache location /data/cryosparc_cache
 ---------------------------------------------------------------
  Autodetecting the amount of RAM available...
  This machine has 128.65GB RAM .
 ---------------------------------------------------------------
 ---------------------------------------------------------------
  Registering worker...
  Done.

  You can now launch jobs on the master node and they will be scheduled
  on to this worker node if resource requirements are met.
 ---------------------------------------------------------------
  Final configuration for dragon
               cache_path :  /data/cryosparc_cache
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 1, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 2, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 3, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}]
                 hostname :  dragon
                     lane :  gtx1080
             monitor_port :  None
                     name :  dragon
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87], 'GPU': [1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
                  ssh_str :  cryosparcuser@<real ip here>
                    title :  Worker node dragon
                     type :  node
          worker_bin_path :  /home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------

Is it now ensured that all workers can access CryoSPARC master ports using the cmm-1 hostname?

yes, I can e.g. ping them:

[cryosparcuser@dragon ~]$ ping cmm1
PING cmm-1 (10.55.229.12) 56(84) bytes of data.
64 bytes from cmm-1 (10.55.229.12): icmp_seq=1 ttl=61 time=0.476 ms
64 bytes from cmm-1 (10.55.229.12): icmp_seq=2 ttl=61 time=0.512 ms

or check if the ports are accessible:

[cryosparcuser@dragon ~]$ for p in `seq 39000 1 39010`; do printf "$p "; if $(nc -zv 10.55.229.12 $p &> /dev/null); then echo available; else echo unavailable; fi; done
39000 available
39001 available
39002 available
39003 available
39004 unavailable
39005 available
39006 available
39007 unavailable
39008 unavailable
39009 unavailable
39010 unavailable

Ports 39004 and 39007-39010 are actually allowed on the cmm-1 host too, but I believe no services are listening on them.

I see. Even if specifying an IP address may work for cryosparcw connect, jobs will still fail if the worker node cannot resolve the hostname defined via CRYOSPARC_MASTER_HOSTNAME (most likely inside /path/to/cryosparc_master/config.sh).
Please ensure on all worker nodes that the curl command does not fail like it did on dragon.

Please ensure on all worker nodes that the curl command does not fail like it did on dragon

Done, it doesn’t: I get Hello World from cryosparc command core from all worker nodes (1 good and 2 flawed).
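The per-node check amounts to something like this (a sketch with the worker hostnames used elsewhere in this thread, not the literal commands I ran):

# sketch: confirm the command_core port answers from every worker using the master hostname
for h in cmm2 cmm3 dragon; do echo "== $h"; ssh $h "curl -s cmm-1:39002"; echo; done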

I suggest additional tests, like running the job directly with cryosparcw run, to see if additional errors are encountered.

I chose a job that was submitted to the dragon node and “Killed” during the cryosparcm test command. This is the output:

(base) cryosparcuser@cmm-1:~$ ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P19 --job J103 --master_hostname cmm-1 --master_command_core_port 39002"


================= CRYOSPARCW =======  2023-03-31 23:22:47.788532  =========
Project P19 Job J103
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process at 2023-03-31 23:22:47.788628
MAINPROCESS PID 36572
========= monitor process now waiting for main process
MAIN PID 36572
instance_testing.run cryosparc_compute.jobs.jobregister
***************************************************************
***************************************************************
========= main process now complete at 2023-03-31 23:22:55.329973.
========= monitor process now complete at 2023-03-31 23:22:55.354094.

and the job status changes to “Completed” in the GUI.

Please can you try something similar with the clone of a non-test job, like 2D classification:

  1. clone a 2D classification job.
  2. queue the clone via the GUI
  3. kill (via GUI) the cloned job immediately after it transitioned to “launched”
  4. run the job via
    ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run ..
    
  5. observe terminal output

if I do that, it’s able to run!

(base) cryosparcuser@cmm-1:~$ ssh dragon "/home/cryosparcuser/cryosparc_app/cryosparc_worker/bin/cryosparcw run --project P74 --job J69 --master_hostname cmm-1 --master_command_core_port 39002"


================= CRYOSPARCW =======  2023-04-01 00:03:07.343361  =========
Project P74 Job J69
Master cmm-1 Port 39002
===========================================================================
========= monitor process now starting main process at 2023-04-01 00:03:07.343432
MAINPROCESS PID 45199
========= monitor process now waiting for main process
MAIN PID 45199
class2D.run cryosparc_compute.jobs.jobregister
========= sending heartbeat at 2023-04-01 00:03:24.097975
...

And I also see a meaningful job log: it imported the module and started caching particles on the SSD, and I can also see the particles appearing there.

What are event and job logs now if you queue and run the job as usual (via GUI)?

Nothing happens; it stays at “Launched”:

(base) cryosparcuser@cmm-1:~$ cryosparcm joblog P74 J70
/slowdata/cryosparc/cryosparc_projects/<project_name>/J70/job.log: No such file or directory
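Since no job.log exists for the stuck job, the event log may be the only place with information; a sketch, assuming the eventlog subcommand is available in this CryoSPARC version:

# sketch: dump the event log for the stuck job (eventlog subcommand availability is an assumption)
cryosparcm eventlog P74 J70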

OK, a (bad) update: the same is now true for the only lane that was previously functional.
Moreover, if I queue a job to this lane, it renders the CryoSPARC instance unresponsive (see here).

I’ve stopped connecting nodes for now, but I’d really appreciate a way to get the lanes back :frowning_face:

Is this the same CryoSPARC instance as in “Unknown 504 error when dealing with jobs (after reboot)”?

Yes, this is the same instance as before. I initially had the issue with the newly connected lanes, but then, after rebooting the master, I started having trouble with the previously connected lane that worked fine earlier.