Job Halts in Launched State

@nfrasser

I think the job wasn’t passed to the worker successfully, which may explain why there is no job log file.

I double-checked the connection between my worker and master. SSH works well (no password needed). All ports are open. The worker can be correctly connected to the master, and the GPUs were detected correctly.

Is there anything I can do? Or should I ask for a new license and install it as a standalone worker?

Best,
Yang

Hi @yliucj, apologies for the delay on this one. Here’s one last thing to try:

  1. Clear the stuck job so that it goes back into “Building” mode
  2. On the master machine, open a terminal and run this command to start logging
    cryosparcm log command_core
    
    Leave this running in the background as you run the next steps
  3. Back in the cryoSPARC interface, queue the job onto the failing workstation
  4. When it gets stuck in “Launched” mode, wait about 15 seconds
  5. Go back to the terminal and press “Ctrl+C” to stop logging
  6. Send over the full output to me

Let me know if you have any trouble with that.
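If it’s easier to send the output as a file, here’s a minimal sketch for keeping a copy while the log streams (assuming cryosparcm is on your PATH; the /tmp path is just a placeholder):

# Stream the command_core log and keep a copy to send later
cryosparcm log command_core | tee /tmp/command_core_capture.log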


You mentioned something about a cluster - do you have cryoSPARC set up in master/worker mode or in cluster mode? If the latter, what kind of cluster are you using? e.g., SLURM or PBS?

If you have a cluster system set up, can you send me the file called queue_sub_script.sh in the job directory? e.g., it’ll be located at /path/to/projects/P3/J42/queue_sub_script.sh for a job with ID J42 in project P3's selected directory.
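For reference, a rendered queue_sub_script.sh for a SLURM cluster usually looks roughly like the sketch below; the exact directives, resource counts, and paths come from your own cluster_script.sh template, so treat everything here as a placeholder:

#!/usr/bin/env bash
# Illustrative only -- values are filled in from the cluster template at queue time
#SBATCH --job-name=cryosparc_P3_J42
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --output=/path/to/projects/P3/J42/slurm-%j.out

/path/to/cryosparc2_worker/bin/cryosparcw run --project P3 --job J42 \
    --master_hostname <master_hostname> --master_command_core_port 39002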

Hi @yliucj, any update on this?

Hi @nfrasser,

Thanks for your suggestions. I’m going to work on that this Friday.

Our workstation technically is not a cluster. I set it up in master/worker mode. I will try to log command_core first.

I will let you know as soon as possible.


@nfrasser

Here is the full output.
It looks like it failed to connect to the worker?

"---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway"   

Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
[EXPORT_JOB] : Request to export P35 J2
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J2
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J2/gridfs_data...
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85'), (u'P35', u'J10')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
Licenses currently active : 0
Now trying to schedule J10
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to c07095.dhcp.swmed.org
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P35.J10 status launched
      Running project UID P35 job UID J10
        Running job on worker type node
        Running job using:  /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw
        Running job on remote worker node hostname c07095.dhcp.swmed.org
        cmd: bash -c "nohup /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002 > /run/media/xiaochun/Data63/111320_HW267/P35/J10/job.log 2>&1 & "

---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---- Killing project UID P35 job UID J10
     Killing job on worker type node c07095.dhcp.swmed.org
     Killing job on another worker node hostname c07095.dhcp.swmed.org
Changed job P35.J10 status killed
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85'), (u'P35', u'J10')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
Licenses currently active : 0
Now trying to schedule J10
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to c07095.dhcp.swmed.org
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P35.J10 status launched
      Running project UID P35 job UID J10
        Running job on worker type node
        Running job using:  /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw
        Running job on remote worker node hostname c07095.dhcp.swmed.org
        cmd: bash -c "nohup /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002 > /run/media/xiaochun/Data63/111320_HW267/P35/J10/job.log 2>&1 & "

---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------

Okay, thanks for trying all of that.

Do you have passwordless SSH set up for the cryosparc user between the master and worker nodes?

To check this, do the following:

  1. Log into the master node via terminal, switch to the cryosparc user (or whichever user account starts cryoSPARC).
    su cryosparc
    
  2. Try to log into the worker node via SSH
    ssh -T c105053.dhcp.swmed.org
    
  3. Send me the output of this command; press ctrl+c to exit if required.

If you don’t see any errors or verification prompts, try scheduling the job manually with this command:

ssh c105053.dhcp.swmed.org /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002

Wait for it to finish and send me the output.

The master node on my side is c105053.dhcp.swmed.org.
The worker node on my side is c07095.dhcp.swmed.org.

I did set up passwordless SSH between the master and worker. But the username on the master node is cryosparc_user, while the username on the worker node is cryosparc.

Could this be a problem?

So I tried ssh -T c07095 instead of c105053.

The output is here:

[cryosparc_user@c105053 xiaochun]$ ssh -T c07095.dhcp.swmed.org
cryosparc_user@c07095.dhcp.swmed.org's password:
Permission denied, please try again.

It looks like it is trying to connect as cryosparc_user@c07095 instead of with the correct username, cryosparc@c07095.

Should I try renaming the worker node?

By the way, I tried to SSH from the master node to the worker node using

ssh cryosparc@c07095.dhcp.swmed.org

The output was "Name or service not known."

So I tried ssh cryosparc@129.112 (the IP address). It connected successfully with passwordless login.

I connected my worker node to my master node using the following command:

bin/cryosparcw connect --worker c07095.dhcp.swmed.org --master c105053.dhcp.swmed.org --port 39000 --ssdpath /scratch --lane cluster3 --newlane	

Should I try the following one?

bin/cryosparcw connect --worker cryosparc@ip_address --master c105053.dhcp.swmed.org --port 39000 --ssdpath /scratch --lane cluster3 --newlane

It does look like you have to reconnect the worker with the new username: you just have to specify the --sshstr argument with the full SSH string (leave the --worker flag the same as you had it before):

bin/cryosparcw connect \
    --worker ip_address \
    --master master_ip_address \
    --port 39000 \
    --ssdpath /scratch \
    --lane cluster3 \
    --sshstr cryosparc@ip_address \
    --newlane

You may also change the --newlane argument to --update if you want to update an existing worker hostname.
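To confirm the change took effect, one hedged way to inspect the registered worker entry from the master (assuming the cryosparcm CLI is available on your version) is:

# Print the scheduler targets and check the ssh_str field for your worker
cryosparcm cli "get_scheduler_targets()"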


Let me know how that goes.

I tried specifying the sshstr as cryosparc@ip_address, but the job still won’t run.

I went back to check the command_core log, which shows the same problem.
---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway

I double-checked the passwordless ssh connection which works fine from my master node.

I also checked for a proxy; there is no proxy that could block the connection between the two machines.

Is there a way to find out which ssh command the master uses to communicate with the worker? I’m pretty sure that ssh cryosparc@ipaddress works.

The “Failed to connect link” message you see is not related to the worker connection but to cryoSPARC’s license verification server, so that should be fine.

The only thing I can think of that would cause this disparity is that the Linux user account running cryoSPARC is different from the user account you’re using to test the ssh cryosparc@ipaddress connection, and that user does not have SSH access.

What is the output of these commands from the master machine?

whoami
ps aux | grep supervisord

And now that you’ve updated your configuration, please also send me a screenshot of the Resource Manager > Instance Information page.
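For illustration, a hedged way to repeat the SSH test as the exact Linux account that runs cryoSPARC (the username and IP below are placeholders for your setup):

# From the master, switch to the account that owns the cryoSPARC processes and test SSH
su - cryosparc_user -c "ssh -T cryosparc@<worker_ip_address>"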

Here is the screenshot for lane “cluster3”

The output on the master machine is this:

[cryosparc_user@c105053 xiaochun]$ ps aux | grep supervisord
cryospa+  31100  0.0  0.0 139344  3192 ?        Ss   Oct29  31:11 /home/cryosparc_user/Cryosparc/cryosparc2_master/deps/anaconda/bin/python /home/cryosparc_user/Cryosparc/cryosparc2_master/deps/anaconda/bin/supervisord -c supervisord.conf
cryospa+ 214325  0.0  0.0 112728  1000 pts/2    S+   13:57   0:00 grep --color=auto supervisord

Here is the output on the worker:
[cryosparc@c07095 xiaochun]$ ps aux | grep supervisord
cryospa+ 198546  0.0  0.0 112724   984 pts/12   S+   11:05   0:00 grep --color=auto supervisord

Hi @yliucj,

Can you double check if this location exists on the worker node:
/run/media/xiaochun/Data63/111320_HW267/P35/

Can you also run this command now to see if there is any error that shows up:
ssh cryosparc@129.112.52.36 /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002
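If it helps, a quick hedged check for both of those over SSH from the master, reusing the IP and path from above:

ssh cryosparc@129.112.52.36 "ls -ld /run/media/xiaochun/Data63/111320_HW267/P35/"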

Hi Stephan,

I guess I found the reason. I connected the hard drive to the master node but not the worker node.

Let me try to run it. I will let you know whether it works!

Best,
Yang

Hi,

I have a very similar problem to @yliucj’s. I have run all the commands as suggested and unfortunately cannot tell where the problem is.

I ran the command on the master node that corresponds to my setup:

ssh cryosparc_user@10.0.90.38 /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname 10.0.90.57 --master_command_core_port 39002

As a result, I get the following in the command line:

================= CRYOSPARCW =======  2021-03-31 19:44:30.222932  =========
Project P29 Job J22
Master 10.0.90.57 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 187193
========= monitor process now waiting for main process
MAIN PID 187193
helix.run_refine cryosparc_compute.jobs.jobregister
***************************************************************
Running job  J22  of type  helix_refine
Running job on hostname %s joao
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'joao', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/mnt/SSD1/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554324480, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'joao', 'lane': 'default', 'monitor_port': None, 'name': 'joao', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@10.0.90.38', 'title': 'Worker node joao', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

In the project in the browser, I got this error:

[CPU: 213.8 MB] Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 220, in cryosparc_compute.jobs.helix.run_refine.run
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/particles.py", line 31, in __init__
    self.from_dataset(d) # copies in all data
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/dataset.py", line 473, in from_dataset
    if len(other) == 0: return self
TypeError: object of type 'NoneType' has no len()

What is interesting is that if I schedule the job from the browser, nothing happens and the job is just halted.

EDIT:

I checked cryosparcm log command_core and this is what I get:

Jobs Queued:  [('P29', 'J22')]
Licenses currently active : 0
Now trying to schedule J22
  Need slots :  {'CPU': 4, 'GPU': 1, 'RAM': 3}
  Need fixed :  {'SSD': True}
  Master direct :  False
   Running job directly on GPU id(s): [0] on joao
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P29.J22 status launched
      Running project UID P29 job UID J22
        Running job on worker type node
        Running job using:  /home/cryosparc_user/cryosparc_worker/bin/cryosparcw
        Running job on remote worker node hostname joao
        cmd: bash -c "nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname chris --master_command_core_port 39002 > /mnt/12T_HDD1/P29/J22/job.log 2>&1 & "

Is it possible that master_hostname is not correct and the IP address should be there instead?
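One hedged way to check which hostname the master is configured to advertise (assuming a standard install where cryosparcm status prints the CRYOSPARC_MASTER_HOSTNAME setting) is shown below; whatever it prints must be resolvable from the worker, e.g. via DNS or /etc/hosts:

# Run on the master node
cryosparcm status | grep -i hostname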

If you have any idea how to solve this, please let me know!

EDIT2:
I was following cryosparcm log command_core while the master was sending the job to the worker, but it seems like the worker never receives the command, or there is a different issue. I tried to run the corresponding command on the worker, but an error occurred:

nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P29 --job J22 --master_hostname 10.0.90.57 --master_command_core_port 39002 > /mnt/12T_HDD1/P29/J22/job.log 2>&1 
-bash: /mnt/12T_HDD1//P29/J22/job.log: No such file or directory

When I removed 2>&1, the command went further but crashed anyway:

================= CRYOSPARCW =======  2021-04-01 14:31:10.613405  =========
Project P29 Job J22
Master 10.0.90.57 Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 199827
MAIN PID 199827
========= monitor process now waiting for main process
helix.run_refine cryosparc_compute.jobs.jobregister
***************************************************************
Running job  J22  of type  helix_refine
Running job on hostname %s joao
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'joao', 'lane': 'default', 'lane_type': 'default', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/mnt/SSD1/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11554324480, 'name': 'GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'GeForce RTX 2080 Ti'}], 'hostname': 'joao', 'lane': 'default', 'monitor_port': None, 'name': 'joao', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc_user@10.0.90.38', 'title': 'Worker node joao', 'type': 'node', 'worker_bin_path': '/home/cryosparc_user/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/helix/run_refine.py", line 220, in cryosparc_compute.jobs.helix.run_refine.run
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/particles.py", line 31, in init
    self.from_dataset(d) # copies in all data
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/dataset.py", line 473, in from_dataset
    if len(other) == 0: return self
TypeError: object of type 'NoneType' has no len()
========= main process now complete.
========= monitor process now complete.

The same job run on the master node works perfectly fine.

@dzyla I don’t know if you’re still having this issue, but I came across your post because I ran into the exact same problems. Some of the ports were already open, but opening the remaining ones from 39000-39005 on the master fixed everything.

Hi @RyanFeathers,

Thank you so much for the tip. I have opened ports 39000-39008 on both machines, but unfortunately the problem remains: the job starts but halts forever. Both machines have a tested passwordless connection, but somehow cryoSPARC does not reach the worker.

@dzyla I’m sorry to hear that didn’t work for you. My error logs were almost identical to everything you posted and as soon as I opened the last port the stalled job started.

One last thing I noticed though was the error about the path. Are you sure that the drive where the data is located is accessible (w/r/x) to both machines? That was an earlier issue I had as well.

Just noting that I had this exact problem with a worker; the issue in my case was that the central storage drive had not properly mounted on that worker after a restart. Everything was restored to working order once this drive was remounted properly. cryosparcm joblog didn’t provide a lot of information (the log file was never created since the worker couldn’t find the directory), but I eventually diagnosed it by copy-pasting the command from cryosparcm log command_core and trying to run it from the master directly in the terminal.
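For anyone hitting the same thing, a hedged sketch of the check I mean (the hostname and paths are placeholders; adapt them to your own project and cache mounts):

# From the master, confirm the worker can actually see the central project storage
ssh cryosparc_user@<worker_hostname> "df -h /path/to/central/storage && ls -ld /path/to/projects/PXX/JYY"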