Job Halts in Launched State

Hi,

I have installed cryoSPARC v2.15.0 on two machines with different setups.
The first is a workstation with 6x GTX 2080 Ti GPUs / 384 GB RAM and runs flawlessly.
The second is a workstation with 8x GTX 1080 Ti GPUs / 384 GB RAM and shows the odd behavior described below.

The master program is on the first one and the worker program is on both of them.

As soon as I submit jobs that need to run on the second cluster (e.g. 2D classification, 3D refinement), they get stuck in the launched state forever.

Any thoughts on what might be happening here? There is no error message at all, but the jobs never actually run.

Thanks in advance for your help.

Regards,
Antony

Hi Antony, this usually happens when there’s a configuration issue with the worker package. You can get more info by checking the internal job log for the stuck job with the joblog command:

First clear and re-run the job. After it has been stuck for a few seconds, kill it. Then run this command on the master workstation (replace PX and JY with the Project ID and Job ID of the stuck job, respectively):

cryosparcm joblog PX JY
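
If it’s easier to attach a file, you can capture the output as you watch it (a minimal sketch; PX and JY are still placeholders, and the filename is arbitrary):

    # Save a copy of the job log output to attach to your reply.
    # Press Ctrl+C to stop if the command keeps following the log.
    cryosparcm joblog PX JY | tee joblog_PX_JY.txt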

Send the output to me for further troubleshooting or let me know if you run into any trouble with this!

Nick

Hi Nick,

Thanks for getting back to me.

I tried re-running the 2D classification and killed the job once it was stuck.
However, when I tried to run the joblog command, it showed this:

Traceback (most recent call last):
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/cryosparc2_compute/client.py", line 83, in <module>
    print eval("cli."+command)
  File "<string>", line 1, in <module>
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/cryosparc2_compute/client.py", line 57, in func
    assert False, res['error']
AssertionError: {u'message': u"OtherError: argument of type 'NoneType' is not iterable", u'code': 500, u'data': None, u'name': u'OtherError'}

Best,
Antony

@yliucj before you enter the joblog command, please replace PX and JY with the target project and job IDs.

For example, if the job is in the project with ID P2 and the job has ID J748 (both of which you can read from the cryoSPARC interface), then run this command:

cryosparcm joblog P2 J748

Hi Nick,

I tried running cryosparcm joblog P23 J19, but it failed with “no files remaining”:

tail: cannot open ‘/run/media/xiaochun/Data60/101720G5G8ED/P23/J19/job.log’ for reading: No such file or directory
tail: no files remaining

I tried joblog on a job that had run successfully, and that worked fine.

Is there anything else I could try?

@nfrasser

I think the job isn’t being handed off to the worker successfully, which may explain why there is no job log file.

I double-checked the connection between my worker and master. SSH works fine without a password, all ports are open, the worker is correctly bound to the master, and the GPUs were detected correctly.
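
(For reference, the kind of checks I mean, sketched with placeholder hostnames; the worker path is the one from the launch logs below, and 39002 is the default command_core port, so adjust if your base port differs:)

    # Passwordless SSH from the master to the worker:
    ssh cryosparc@<worker_hostname> 'hostname'
    # The worker can reach the master's command_core port:
    curl http://<master_hostname>:39002
    # GPUs are detected on the worker:
    /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw gpulist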

Is there anything I can do? Or should I request a new license and set the second machine up as a standalone instance?

Best,
Yang

Hi @yliucj, apologies for the delay on this one. Here’s one last thing to try:

  1. Clear the stuck job so that it goes back into “Building” mode
  2. On the master machine, open a terminal and run this command to start logging:
    cryosparcm log command_core

    Leave this running in the background while you work through the next steps
  3. Back in the cryoSPARC interface, queue the job onto the failing workstation
  4. When it gets stuck in “Launched” mode, wait about 15 seconds
  5. Go back to the terminal and press Ctrl+C to stop logging
  6. Send me the full output (a capture sketch follows this list)
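
If it helps, you can save a copy of the log while watching it (a minimal sketch using standard tee; the filename is arbitrary):

    # Stream the command_core log and keep a copy to send over:
    cryosparcm log command_core | tee command_core_capture.txt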

Let me know if you have any trouble with that.

You mentioned something about a cluster: do you have cryoSPARC set up in master/worker mode or in cluster mode? If the latter, what kind of cluster are you using, e.g. SLURM or PBS?

If you have a cluster setup, can you send me the file called queue_sub_script.sh from the job directory? E.g., it’ll be located at /path/to/projects/P3/J42/queue_sub_script.sh for a job with ID J42 in project P3.
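
For context: in cluster mode, cryoSPARC renders queue_sub_script.sh from the cluster_script.sh template registered for your lane. A minimal SLURM-flavoured sketch of such a template (the {{ ... }} variables follow cryoSPARC’s cluster configuration conventions; the #SBATCH options are illustrative):

    #!/usr/bin/env bash
    #SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
    #SBATCH --cpus-per-task={{ num_cpu }}
    #SBATCH --gres=gpu:{{ num_gpu }}
    #SBATCH --mem={{ ram_gb }}G
    # The command cryoSPARC asks the cluster to execute:
    {{ run_cmd }}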

Hi @yliucj, any update on this?

Hi @nfrasser,

Thanks for your suggestions. I’m going to work on that this Friday.

Our workstation technically is not a cluster; I set it up in master/worker mode. I will try logging command_core first.

I will let you know as soon as possible.

@nfrasser

Here is the full output.
It looks like it’s failing to connect to the worker?

"---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway"   

Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
[EXPORT_JOB] : Request to export P35 J2
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J2
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J2/gridfs_data...
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85'), (u'P35', u'J10')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
Licenses currently active : 0
Now trying to schedule J10
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to c07095.dhcp.swmed.org
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P35.J10 status launched
      Running project UID P35 job UID J10
        Running job on worker type node
        Running job using:  /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw
        Running job on remote worker node hostname c07095.dhcp.swmed.org
        cmd: bash -c "nohup /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002 > /run/media/xiaochun/Data63/111320_HW267/P35/J10/job.log 2>&1 & "

---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---- Killing project UID P35 job UID J10
     Killing job on worker type node c07095.dhcp.swmed.org
     Killing job on another worker node hostname c07095.dhcp.swmed.org
Changed job P35.J10 status killed
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85'), (u'P35', u'J10')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
Licenses currently active : 0
Now trying to schedule J10
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to c07095.dhcp.swmed.org
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P35.J10 status launched
      Running project UID P35 job UID J10
        Running job on worker type node
        Running job using:  /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw
        Running job on remote worker node hostname c07095.dhcp.swmed.org
        cmd: bash -c "nohup /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002 > /run/media/xiaochun/Data63/111320_HW267/P35/J10/job.log 2>&1 & "

---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------