Job Halts in Launched State

Hi,

I have installed cryoSPARC v2.15.0 on two machines with different setups.
The first is a workstation with 6x GTX 2080 Ti GPUs / 384 GB RAM and runs flawlessly.
The second is a workstation with 8x GTX 1080 Ti GPUs / 384 GB RAM and shows the odd behavior described below.

The master program is on the first one and the worker program is on both of them.

As soon as I submit jobs that need to run on the second cluster (e.g. 2D classification, 3D refinement), they get stuck in the launched state forever.

Any thoughts on what might be happening here? There is no error message at all, but the jobs never actually run.

Thanks in advance for your help.

Regards,
Antony

Hi Antony, this usually happens when there’s a configuration issue with the worker package. You can get more info by checking the internal job log for the stuck job with the joblog command:

First clear and re-run the job. After it has been stuck for a few seconds, kill it. Then run this command on the master workstation (replace PX and JY with the Project ID and Job ID of the stuck job, respectively):

cryosparcm joblog PX JY
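
If it’s easier to attach a file, you can capture the output as you watch it (a minimal sketch; PX and JY are still placeholders, and the filename is arbitrary):

    # Save a copy of the job log output to attach to your reply.
    # Press Ctrl+C to stop if the command keeps following the log.
    cryosparcm joblog PX JY | tee joblog_PX_JY.txt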

Send the output to me for further troubleshooting or let me know if you run into any trouble with this!

Nick

Hi Nick,

Thanks for getting back to me.

I tried re-running the 2D classification and killed the job once it was stuck.
However, when I tried to run the joblog command, it showed this:

Traceback (most recent call last):
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/cryosparc2_compute/client.py", line 83, in <module>
    print eval("cli."+command)
  File "<string>", line 1, in <module>
  File "/home/cryosparc_user/Cryosparc/cryosparc2_master/cryosparc2_compute/client.py", line 57, in func
    assert False, res['error']
AssertionError: {u'message': u"OtherError: argument of type 'NoneType' is not iterable", u'code': 500, u'data': None, u'name': u'OtherError'}

Best,
Antony

@yliucj before you enter the joblog command, please replace PX and JY with the target project and job IDs.

For example, if the job is in the project with ID P2 and the job has ID J748 (both of which you can read from the cryoSPARC interface), then run this command:

cryosparcm joblog P2 J748

Hi Nick,

I tried running cryosparcm joblog P23 J19, but it failed with “no files remaining”:

tail: cannot open ‘/run/media/xiaochun/Data60/101720G5G8ED/P23/J19/job.log’ for reading: No such file or directory
tail: no files remaining

I tried joblog on a job that had run successfully, and that worked fine.

Is there anything else I could try?

@nfrasser

I think the job isn’t being handed off to the worker successfully, which may explain why there is no job log file.

I double-checked the connection between my worker and master. SSH works fine without a password, all ports are open, the worker is correctly bound to the master, and the GPUs were detected correctly.
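
(For reference, the kind of checks I mean, sketched with placeholder hostnames; the worker path is the one from the launch logs below, and 39002 is the default command_core port, so adjust if your base port differs:)

    # Passwordless SSH from the master to the worker:
    ssh cryosparc@<worker_hostname> 'hostname'
    # The worker can reach the master's command_core port:
    curl http://<master_hostname>:39002
    # GPUs are detected on the worker:
    /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw gpulist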

Is there anything I can do? Or should I request a new license and set the second machine up as a standalone instance?

Best,
Yang

Hi @yliucj, apologies for the delay on this one. Here’s one last thing to try:

  1. Clear the stuck job so that it goes back into “Building” mode
  2. On the master machine, open a terminal and run this command to start logging:
    cryosparcm log command_core

    Leave this running in the background while you work through the next steps
  3. Back in the cryoSPARC interface, queue the job onto the failing workstation
  4. When it gets stuck in “Launched” mode, wait about 15 seconds
  5. Go back to the terminal and press Ctrl+C to stop logging
  6. Send me the full output (a capture sketch follows this list)
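
If it helps, you can save a copy of the log while watching it (a minimal sketch using standard tee; the filename is arbitrary):

    # Stream the command_core log and keep a copy to send over:
    cryosparcm log command_core | tee command_core_capture.txt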

Let me know if you have any trouble with that.

You mentioned something about a cluster: do you have cryoSPARC set up in master/worker mode or in cluster mode? If the latter, what kind of cluster are you using, e.g. SLURM or PBS?

If you have a cluster setup, can you send me the file called queue_sub_script.sh from the job directory? E.g., it’ll be located at /path/to/projects/P3/J42/queue_sub_script.sh for a job with ID J42 in project P3.
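
For context: in cluster mode, cryoSPARC renders queue_sub_script.sh from the cluster_script.sh template registered for your lane. A minimal SLURM-flavoured sketch of such a template (the {{ ... }} variables follow cryoSPARC’s cluster configuration conventions; the #SBATCH options are illustrative):

    #!/usr/bin/env bash
    #SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
    #SBATCH --cpus-per-task={{ num_cpu }}
    #SBATCH --gres=gpu:{{ num_gpu }}
    #SBATCH --mem={{ ram_gb }}G
    # The command cryoSPARC asks the cluster to execute:
    {{ run_cmd }}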

Hi @yliucj, any update on this?

Hi @nfrasser,

Thanks for your suggestions. I’m going to work on that this Friday.

Our workstation technically is not a cluster; I set it up in master/worker mode. I will try logging command_core first.

I will let you know as soon as possible.

@nfrasser

Here is the full output.
It looks like it’s failing to connect to the worker?

"---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway"   

Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
[EXPORT_JOB] : Request to export P35 J2
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J2
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J2/gridfs_data...
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85'), (u'P35', u'J10')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
Licenses currently active : 0
Now trying to schedule J10
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to c07095.dhcp.swmed.org
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P35.J10 status launched
      Running project UID P35 job UID J10
        Running job on worker type node
        Running job using:  /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw
        Running job on remote worker node hostname c07095.dhcp.swmed.org
        cmd: bash -c "nohup /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002 > /run/media/xiaochun/Data63/111320_HW267/P35/J10/job.log 2>&1 & "

---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
Failed to connect link: HTTP Error 502: Bad Gateway
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---- Killing project UID P35 job UID J10
     Killing job on worker type node c07095.dhcp.swmed.org
     Killing job on another worker node hostname c07095.dhcp.swmed.org
Changed job P35.J10 status killed
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
[EXPORT_JOB] : Request to export P35 J10
[EXPORT_JOB] :    Exporting job to /run/media/xiaochun/Data63/111320_HW267/P35/J10
[EXPORT_JOB] :    Exporting all of job's images in the database to /run/media/xiaochun/Data63/111320_HW267/P35/J10/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.00s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P35 J10 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85'), (u'P35', u'J10')]
Licenses currently active : 0
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
Licenses currently active : 0
Now trying to schedule J10
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to c07095.dhcp.swmed.org
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P35.J10 status launched
      Running project UID P35 job UID J10
        Running job on worker type node
        Running job using:  /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw
        Running job on remote worker node hostname c07095.dhcp.swmed.org
        cmd: bash -c "nohup /home/cryosparc/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P35 --job J10 --master_hostname c105053.dhcp.swmed.org --master_command_core_port 39002 > /run/media/xiaochun/Data63/111320_HW267/P35/J10/job.log 2>&1 & "

---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------
---------- Scheduler running ---------------
Jobs Queued:  [(u'P27', u'J85')]
Licenses currently active : 1
Now trying to schedule J85
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status :
    Queue message :
---------- Scheduler finished ---------------