Job Halts in Launched State

Hi,

I have a very similar problem to the one @yliucj described. I have run all the commands as suggested and unfortunately cannot tell where the problem is.

On the master node, I ran the command that corresponds to my setup:

ssh cryosparc_user@10.0.90.38 /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname 10.0.90.57 --master_command_core_port 39002

As a result, I get the following on the command line:

================= CRYOSPARCW =======  2021-03-31 19:44:30.222932  =========
Project P29 Job J22
Master 10.0.90.57 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 187193
========= monitor process now waiting for main process
MAIN PID 187193
helix.run_refine cryosparc_compute.jobs.jobregister
***************************************************************
Running job  J22  of type  helix_refine
Running job on hostname %s joao
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'joao', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/mnt/SSD1/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554324480, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'joao', 'lane': 'default', 'monitor_port': None, 'name': 'joao', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@10.0.90.38', 'title': 'Worker node joao', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

In the project view in the browser, I got this error:

[CPU: 213.8 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 220, in cryosparc_compute.jobs.helix.run_refine.run
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/particles.py", line 31, in __init__
    self.from_dataset(d) # copies in all data
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/dataset.py", line 473, in from_dataset
    if len(other) == 0: return self
TypeError: object of type 'NoneType' has no len()

Interestingly, if I schedule the job from the browser, nothing happens and the job just stays halted in the launched state.

EDIT:

I checked the cryosparcm log command_core output, and this is what I get:

Jobs Queued: [('P29', 'J22')]
Licenses currently active : 0
Now trying to schedule J22
Need slots : {'CPU': 4, 'GPU': 1, 'RAM': 3}
Need fixed : {'SSD': True}
Master direct : False
Running job directly on GPU id(s): [0] on joao
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
Launchable! -- Launching.
Changed job P29.J22 status launched
Running project UID P29 job UID J22
Running job on worker type node
Running job using: /home/cryosparc_user/cryosparc_worker/bin/cryosparcw
Running job on remote worker node hostname joao
cmd: bash -c "nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname chris --master_command_core_port 39002 > /mnt/12T_HDD1/P29/J22/job.log 2>&1 & "

Is it possible that the master_hostname is not correct and that the IP address should be used there instead?
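
A quick way to test that guess (just my own sketch, adjust names and IPs to your setup) would be to check on the worker whether the name "chris" from the scheduler's command resolves at all:

# run on the worker node (joao)
getent hosts chris          # does "chris" resolve to 10.0.90.57?
ping -c 1 chris             # and is it reachable by that name?
# if not, either add "10.0.90.57 chris" to /etc/hosts on the worker,
# or (I believe) change CRYOSPARC_MASTER_HOSTNAME in cryosparc_master/config.sh
# to a name or IP the worker can resolve, then restart cryosparc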

If you have any idea how to solve this, please let me know!

EDIT2:
I was following cryosparcm log command_core while the master was sending the job to the worker, but it seems that the worker never receives the command, or there is a different issue. I tried to run the corresponding command directly on the worker, but an error occurred:

nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname 10.0.90.57 --master_command_core_port 39002 > /mnt/12T_HDD1/P29/J22/job.log 2>&1 
-bash: /mnt/12T_HDD1//P29/J22/job.log: No such file or directory
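
Since bash refuses the redirect, it looks like the job directory might not even exist from the worker's side. I will check with something like this (paths taken from the command above):

# run on the worker (joao)
ls -ld /mnt/12T_HDD1/P29/J22     # does the job directory exist here at all?
df -h /mnt/12T_HDD1              # is the drive actually mounted on this machine?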

When I removed the redirection to job.log, the command ran further but still crashed:

================= CRYOSPARCW =======  2021-04-01 14:31:10.613405  =========
Project P29 Job J22
Master 10.0.90.57 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 199827
MAIN PID 199827
========= monitor process now waiting for main process
helix.run_refine cryosparc_compute.jobs.jobregister
***************************************************************
Running job  J22  of type  helix_refine
Running job on hostname %s joao
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'joao', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/mnt/SSD1/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554324480, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'joao', 'lane': 'default', 'monitor_port': None, 'name': 'joao', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@10.0.90.38', 'title': 'Worker node joao', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 220, in cryosparc_compute.jobs.helix.run_refine.run
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/particles.py", line 31, in init
    self.from_dataset(d) # copies in all data
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/dataset.py", line 473, in from_dataset
    if len(other) == 0: return self
TypeError: object of type 'NoneType' has no len()
========= main process now complete.
========= monitor process now complete.

The same job run on the master node works perfectly fine.

@dzyla I don't know if you're still having this issue, but I came across your post because I ran into the exact same problems. Some of the ports were already open, but opening the remaining ones from 39000-39005 on the master fixed everything.
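
In case it helps, checking and opening them can look roughly like this (firewalld shown only as an example; substitute ufw or iptables for your distro, and widen the range if your base port differs):

# from the worker: which of the master's ports actually accept a connection?
for p in $(seq 39000 39005); do nc -zv 10.0.90.57 $p; done

# on the master: open the range (firewalld example)
sudo firewall-cmd --permanent --add-port=39000-39005/tcp
sudo firewall-cmd --reload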

Hi @RyanFeathers,

Thank you so much for the tip. I have opened ports 39000-39008 on both machines, but unfortunately the problem remains: the job launches and then halts forever. Passwordless SSH between the two machines has been set up and tested, but somehow cryoSPARC still does not reach the worker.
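
For reference, these are the kinds of connectivity checks I can run from each side (rough sketch with my IPs); if anyone spots something I am missing, please say so:

# from the master: passwordless SSH to the worker plus a trivial cryosparcw call
ssh cryosparc_user@10.0.90.38 /home/cryosparc_user/cryosparc_worker/bin/cryosparcw gpulist

# from the worker: is the master's command_core port reachable at all?
nc -zv 10.0.90.57 39002
curl http://10.0.90.57:39002    # should return some response rather than "connection refused"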

@dzyla I'm sorry to hear that didn't work for you. My error logs were almost identical to everything you posted, and as soon as I opened the last port, the stalled job started.

One last thing I noticed, though, was the error about the path. Are you sure the drive where the data is located is accessible (r/w/x) to both machines? That was an earlier issue I had as well.
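
A quick way to confirm is something like this, run as the cryosparc user on both the master and the worker (path taken from your logs):

# owner/permissions should look the same from both machines
ls -ld /mnt/12T_HDD1/P29 /mnt/12T_HDD1/P29/J22
# and the cryosparc user should be able to write there
touch /mnt/12T_HDD1/P29/J22/.write_test && rm /mnt/12T_HDD1/P29/J22/.write_test && echo "write OK"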

Just noting that I had this exact problem with a worker. The issue in my case was that the central storage drive had not properly mounted on that worker after a restart; everything was restored to working order once the drive was remounted. cryosparcm joblog didn't provide much information (the log file was never created, since the worker couldn't find the directory), but I eventually diagnosed it by copy-pasting the command from cryosparcm log command_core and trying to run it from the master directly in the terminal.
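
For anyone hitting the same thing, the checks that made it obvious for me looked roughly like this (paths are just the examples from this thread):

# on the worker: is the central storage actually mounted?
findmnt /mnt/12T_HDD1 || echo "not mounted"
df -h /mnt/12T_HDD1
# if it is listed in /etc/fstab but missing, remount everything
sudo mount -a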