Error while running a job: "non-zero exit status 255"

Hi,

I'm running the tutorial dataset, with the master and worker on the same computer.
The import works, but on the 2nd step I get this error:

Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P1 --job J6 --master_hostname cryogpu2 --master_command_core_port 39002 > /net/cryogpu/data/cryosparc_output_data/J6/job.log 2>&1 & ']' returned non-zero exit status 255

Can you help me find a fix for it?

Thanks,

Hi,

Please take a look at the post I made here.
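
One thing worth checking in the meantime: exit status 255 is what ssh itself returns when the connection or authentication fails (otherwise ssh passes along the exit status of the remote command), and your error shows the master launching the job over ssh. A minimal manual check, using the account and hostname taken from your error message, would be:

ssh emguest@cryogpu2 'echo connection ok'    # should print "connection ok" without any password prompt

If this asks for a password or fails outright, the master cannot reach the worker over ssh, which would explain the 255.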

Thanks,
Stephan

Thanks for your help.
I have done the quick install, and my master/worker are on the same node. Do I need the SSH key?

I have set up the SSH key and now get this error:

Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P1 --job J10 --master_hostname cryogpu2 --master_command_core_port 39002 > /net/cryogpu/data/cryosparc_output_data/J10/job.log 2>&1 & ']' returned non-zero exit status 1

Could you please show me the output of cryosparcm log command_core?

Is your default shell bash, or tcsh?
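
The reason I ask: the launch command redirects output with "> job.log 2>&1", which is bash/sh syntax. If the worker account's login shell is csh/tcsh, the remote side rejects that redirection with an "Ambiguous output redirect." error and the job never starts. A quick way to check (a sketch, using the emguest account and cryogpu2 hostname from your logs):

echo $SHELL                            # shell of your current interactive session
getent passwd emguest | cut -d: -f7    # login shell that ssh will use for non-interactive commands
ssh emguest@cryogpu2 'echo $SHELL'     # what the remote side reports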

bash-4.2$ cryosparcm log command_core
  File "/home/emguest/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P1 --job J10 --master_hostname cryogpu2 --master_command_core_port 39002 > /net/cryogpu/data/cryosparc_output_data/J10/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
[JSONRPC ERROR  2018-08-21 12:50:37.226600  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 814, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 951, in scheduler_run_core
    run_job(j['project_uid'], j['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 100, in wrapper
    raise e
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P1 --job J10 --master_hostname cryogpu2 --master_command_core_port 39002 > /net/cryogpu/data/cryosparc_output_data/J10/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
[JSONRPC ERROR  2018-08-21 12:50:37.226949  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2456, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 100, in wrapper
    raise e
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P1 --job J10 --master_hostname cryogpu2 --master_command_core_port 39002 > /net/cryogpu/data/cryosparc_output_data/J10/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
4 -1 [u'', u'ssd1', u'emguest', u'empiar_10025_subset', u'*.tif']
Indexing /ssd1/emguest/empiar_10025_subset/*.tif ----
Base path:  /ssd1/emguest/empiar_10025_subset
Setting parameter J3.blob_paths with value /ssd1/emguest/empiar_10025_subset/*.tif of type <type 'str'>
4 -1 [u'', u'ssd1', u'emguest', u'empiar_10025_subset', u'*.tif']
Indexing /ssd1/emguest/empiar_10025_subset/*.tif ----
Base path:  /ssd1/emguest/empiar_10025_subset
4 -1 [u'', u'ssd1', u'emguest', u'empiar_10025_subset', u'*.mrc']
Indexing /ssd1/emguest/empiar_10025_subset/*.mrc ----
Base path:  /ssd1/emguest/empiar_10025_subset
Indexing /ssd1/emguest/empiar_10025_subset/ ----
Base path:  /ssd1/emguest/empiar_10025_subset/
4 -1 [u'', u'ssd1', u'emguest', u'empiar_10025_subset', u'*.mrc']
Indexing /ssd1/emguest/empiar_10025_subset/*.mrc ----
Base path:  /ssd1/emguest/empiar_10025_subset
4 -1 [u'', u'ssd1', u'emguest', u'empiar_10025_subset', u'*.mrc']
Indexing /ssd1/emguest/empiar_10025_subset/*.mrc ----
Base path:  /ssd1/emguest/empiar_10025_subset
Indexing /ssd1/emguest/empiar_10025_subset/norm-amibox05-0.mrc ----
Base path:  /ssd1/emguest/empiar_10025_subset
Setting parameter J3.gainref_path with value /ssd1/emguest/empiar_10025_subset/norm-amibox05-0.mrc of type <type 'str'>
Setting parameter J3.psize_A with value 0.6575 of type <type 'float'>
Setting parameter J3.accel_kv with value 300 of type <type 'int'>
Setting parameter J3.cs_mm with value 2.7 of type <type 'float'>
Setting parameter J3.total_dose_e_per_A2 with value None of type <type 'NoneType'>
Setting parameter J3.total_dose_e_per_A2 with value 53 of type <type 'int'>
---------- Scheduler running ---------------
Lane  default node : Jobs Queued (nonpaused, inputs ready):  [u'J3']
Total slots:  {u'cryogpu2': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])}}
Available slots:  {u'cryogpu2': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])}}
Available licen:  10000
Now trying to schedule J3
  Need slots :  {}
  Need fixed :  {}
  Need licen :  False
  Master direct :  True
---- Running project UID P2 job UID J3
License Data:  {"token": "xxxxxxxxxxxxxxxxxxx", "token_valid": true, "request_date": 1534875360, "license_valid": true}
License Signature:  xxxxxxxxxxxxxxxx
     Running job on master node directly
     Running job using:  /home/emguest/software/cryosparc2_master/bin/cryosparcm
Changed job P2.J3 status launched
---------- Scheduler done ------------------
Changed job P2.J3 status started
Changed job P2.J3 status running
Changed job P2.J3 status completed
---------- Scheduler running ---------------
Lane  default node : Jobs Queued (nonpaused, inputs ready):  [u'J4']
Total slots:  {u'cryogpu2': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])}}
Available slots:  {u'cryogpu2': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])}}
Available licen:  10000
Now trying to schedule J4
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 6}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on cryogpu2
    Launchable:  True
    Alloc slots :  {u'GPU': [0], u'RAM': [0, 1], u'CPU': [0, 1, 2, 3, 4, 5]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P2 job UID J4
failed to connect link
License Data:  {"token": "xxxxxxxxxxxxxxxxxxxx", "token_valid": true, "request_date": 1534875455, "license_valid": true}
License Signature:  xxxxxxxxxxxxxxxxxxxx
     Running job on worker type node
     Running job using:  /home/emguest/software/cryosparc2_worker/bin/cryosparcw
     Running job on remote worker node hostname cryogpu2
     cmd: /home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J4 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J4/job.log 2>&1 &
Changed job P2.J4 status failed
**************
FAILED TO LAUNCH ON WORKER NODE return code 1
Ambiguous output redirect.

[JSONRPC ERROR  2018-08-21 13:17:38.454887  at  run_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1115, in run_job
    print subprocess.check_output(['ssh', ssh_str, 'nohup', cmd], stderr=subprocess.STDOUT)
  File "/home/emguest/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J4 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J4/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
[JSONRPC ERROR  2018-08-21 13:17:38.465043  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 814, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 951, in scheduler_run_core
    run_job(j['project_uid'], j['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 100, in wrapper
    raise e
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J4 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J4/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
[JSONRPC ERROR  2018-08-21 13:17:38.466132  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2456, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 100, in wrapper
    raise e
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J4 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J4/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
---- Deleting project UID P2 job UID J4
     Now clearing job..
*** LINK WORKER START
COMMAND CORE STARTED ===  2018-08-21 16:44:04.423759  ==========================
*** BG WORKER START
---------- Scheduler running ---------------
Lane  default node : Jobs Queued (nonpaused, inputs ready):  [u'J5']
Total slots:  {u'cryogpu2': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])}}
Available slots:  {u'cryogpu2': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]), u'CPU': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])}}
Available licen:  10000
Now trying to schedule J5
  Need slots :  {u'GPU': 1, u'RAM': 2, u'CPU': 6}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on cryogpu2
    Launchable:  True
    Alloc slots :  {u'GPU': [0], u'RAM': [0, 1], u'CPU': [0, 1, 2, 3, 4, 5]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P2 job UID J5
failed to connect link
License Data:  {"token": "xxxxxxxxxxxxxxxxx", "token_valid": true, "request_date": 1534887957, "license_valid": true}
License Signature:  xxxxxxxxxxxxxxxxxxxxxx
     Running job on worker type node
     Running job using:  /home/emguest/software/cryosparc2_worker/bin/cryosparcw
     Running job on remote worker node hostname cryogpu2
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
     cmd: /home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J5 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J5/job.log 2>&1 &
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1115, in run_job
    print subprocess.check_output(['ssh', ssh_str, 'nohup', cmd], stderr=subprocess.STDOUT)
  File "/home/emguest/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J5 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J5/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
[JSONRPC ERROR  2018-08-21 16:46:00.608672  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 814, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 951, in scheduler_run_core
    run_job(j['project_uid'], j['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 100, in wrapper
    raise e
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J5 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J5/job.log 2>&1 & ']' returned non-zero exit status 1
-----------------------------------------------------
[JSONRPC ERROR  2018-08-21 16:46:00.608892  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 91, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2456, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 100, in wrapper
    raise e
CalledProcessError: Command '['ssh', u'emguest@cryogpu2', 'nohup', u'/home/emguest/software/cryosparc2_worker/bin/cryosparcw run --project P2 --job J5 --master_hostname cryogpu2 --master_command_core_port 39002 > /ssd1/emguest/J5/job.log 2>&1 & ']' returned non-zero exit status 1

This question did not get answered: if the master and worker are on the same node, do we need the SSH key?

Please take a look at this post here.

Setting up SSH keys would be your last resort.
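
If it does come to that, the standard passwordless setup is short. A minimal sketch, assuming OpenSSH, run as the cryoSPARC user on the master (emguest and cryogpu2 are taken from the logs above):

ssh-keygen -t rsa                 # accept the defaults; leave the passphrase empty
ssh-copy-id emguest@cryogpu2      # install the public key for the worker account
ssh emguest@cryogpu2 hostname     # should now run without a password prompt

Once that last command runs without prompting, the master can launch worker jobs over ssh the way the scheduler log above shows.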