Hi,
I have a very similar problem to @yliucj's. I have run all the commands as suggested and unfortunately cannot tell where the problem is.
On the master node, I ran the command that corresponds to my setup:
ssh cryosparc_user@10.0.90.38 /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname 10.0.90.57 --master_command_core_port 39002
As a result, I get the following on the command line:
================= CRYOSPARCW ======= 2021-03-31 19:44:30.222932 =========
Project P29 Job J22
Master 10.0.90.57 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 187193
========= monitor process now waiting for main process
MAIN PID 187193
helix.run_refine cryosparc_compute.jobs.jobregister
***************************************************************
Running job J22 of type helix_refine
Running job on hostname %s joao
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'joao', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/mnt/SSD1/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554324480, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'joao', 'lane': 'default', 'monitor_port': None, 'name': 'joao', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@10.0.90.38', 'title': 'Worker node joao', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.
In the project view in the browser, I got this error:
[CPU: 213.8 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 220, in cryosparc_compute.jobs.helix.run_refine.run
File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/particles.py", line 31, in __init__
self.from_dataset(d) # copies in all data
File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/dataset.py", line 473, in from_dataset
if len(other) == 0: return self
TypeError: object of type 'NoneType' has no len()
Interestingly, if I schedule the job from the browser, nothing happens and the job just stays halted.
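If I read the traceback right, the worker gets back an empty (None) particle dataset when it loads the job's inputs, so my guess (possibly wrong) is a worker-to-master connection problem rather than a data problem. As a quick sanity check, assuming the master's command_core really listens on port 39002 as in the command above, something like this run from the worker should show whether the port is reachable at all:

# run on the worker node: is the master's command_core port reachable?
nc -vz 10.0.90.57 39002
# command_core speaks HTTP, so this should also get some response if the port is open
curl -sv http://10.0.90.57:39002/ -o /dev/null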
EDIT:
I checked the cryosparcm log command_core output, and this is what I get:
Jobs Queued: [('P29', 'J22')]
Licenses currently active : 0
Now trying to schedule J22
Need slots : {'CPU': 4, 'GPU': 1, 'RAM': 3}
Need fixed : {'SSD': True}
Master direct : False
Running job directly on GPU id(s): [0] on joao
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
Launchable! – Launching.
Changed job P29.J22 status launched
Running project UID P29 job UID J22
Running job on worker type node
Running job using: /home/cryosparc_user/cryosparc_worker/bin/cryosparcw
Running job on remote worker node hostname joao
cmd: bash -c "nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname chris --master_command_core_port 39002 > /mnt/12T_HDD1/P29/J22/job.log 2>&1 & "
Is it possible that the master_hostname is not correct and that the IP address should be used there instead?
If you have any idea how to solve this, please let me know!
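One thing I want to verify, since the launch command above uses --master_hostname chris rather than the IP, is whether the worker can resolve the name chris at all, and whether it resolves to 10.0.90.57. Something like this, run on the worker, should show it (just my own check, not an official CryoSPARC procedure):

# on the worker: does the hostname "chris" resolve, and to the expected address?
getent hosts chris     # should print 10.0.90.57 if name resolution is correct
ping -c 1 chris        # quick reachability test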
EDIT2:
I was following the cryosparcm log command_core output while the master was sending the job to the worker, but it seems the worker never receives the command, or there is a different issue. I tried to run the corresponding command on the worker myself, but an error occurred:
nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname 10.0.90.57 --master_command_core_port 39002 > /mnt/12T_HDD1/P29/J22/job.log 2>&1
-bash: /mnt/12T_HDD1//P29/J22/job.log: No such file or directory
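That error makes me suspect the project directory is simply not visible on the worker. A quick check, assuming the project is supposed to live under /mnt/12T_HDD1 on the worker just as on the master:

# on the worker: is the job directory mounted and writable?
ls -ld /mnt/12T_HDD1/P29/J22
touch /mnt/12T_HDD1/P29/J22/write_test && rm /mnt/12T_HDD1/P29/J22/write_test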
When I removed the 2>&1 redirection, the command went further but crashed anyway:
================= CRYOSPARCW ======= 2021-04-01 14:31:10.613405 =========
Project P29 Job J22
Master 10.0.90.57 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 199827
MAIN PID 199827
========= monitor process now waiting for main process
helix.run_refine cryosparc_compute.jobs.jobregister
***************************************************************
Running job J22 of type helix_refine
Running job on hostname %s joao
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'joao', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/mnt/SSD1/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554324480, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'joao', 'lane': 'default', 'monitor_port': None, 'name': 'joao', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@10.0.90.38', 'title': 'Worker node joao', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 220, in cryosparc_compute.jobs.helix.run_refine.run
File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/particles.py", line 31, in __init__
self.from_dataset(d) # copies in all data
File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/dataset.py", line 473, in from_dataset
if len(other) == 0: return self
TypeError: object of type 'NoneType' has no len()
========= main process now complete.
========= monitor process now complete.
The same job run on the master node works perfectly fine.