LSF Cluster: "Redirect Input" operator doesn't work

I have added a multi-GPU cluster as a lane, but I am unable to submit jobs to it from cryoSPARC. I can pull the command out of cryoSPARC and submit it manually, but submitting through cryoSPARC fails with an [Errno 2] No such file or directory error. Logs are attached below; any help would be appreciated.

[cryosparc_user@pearl ~]$ cryosparcm log command_core
     Launchable! -- Launching.
Changed job P2.J8 status launched
      Running project UID P2 job UID J8
       Running job on worker type cluster
        cmd: source /admin/lsflilac/lsf/conf/profile.lsf; source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < /data/hite/cryosparc/P2/J8/queue_sub_script.sh
[JSONRPC ERROR  2020-05-29 01:27:07.428737  at  run_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2092, in run_job
    res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-05-29 01:27:07.430072  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1640, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 1862, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 124, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-05-29 01:27:07.430237  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 4585, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 124, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
---- Killing project UID P2 job UID J8
Changed job P2.J8 status killed
[EXPORT_JOB] : Request to export P2 J8
[EXPORT_JOB] :    Exporting job to /data/hite/cryosparc/P2/J8
[EXPORT_JOB] :    Exporting all of job's images in the database to /data/hite/cryosparc/P2/J8/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P2 J8 in 0.03s
[EXPORT_JOB] : Request to export P2 J8
[EXPORT_JOB] :    Exporting job to /data/hite/cryosparc/P2/J8
[EXPORT_JOB] :    Exporting all of job's images in the database to /data/hite/cryosparc/P2/J8/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P2 J8 in 0.01s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P2', u'J8')]
Licenses currently active : 8
Now trying to schedule J8
  Need slots :  {u'GPU': 8, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to lilac
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P2.J8 status launched
      Running project UID P2 job UID J8
       Running job on worker type cluster
        cmd: source /admin/lsflilac/lsf/conf/profile.lsf; source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < /data/hite/cryosparc/P2/J8/queue_sub_script.sh
[JSONRPC ERROR  2020-05-29 01:30:42.034596  at  run_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2092, in run_job
    res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-05-29 01:30:42.035631  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1640, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 1862, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 124, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-05-29 01:30:42.035758  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 4585, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 124, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
---- Killing project UID P2 job UID J8
Changed job P2.J8 status killed
[EXPORT_JOB] : Request to export P2 J8
[EXPORT_JOB] :    Exporting job to /data/hite/cryosparc/P2/J8
[EXPORT_JOB] :    Exporting all of job's images in the database to /data/hite/cryosparc/P2/J8/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.05s
[EXPORT_JOB] : Exported P2 J8 in 0.09s
[EXPORT_JOB] : Request to export P2 J8
[EXPORT_JOB] :    Exporting job to /data/hite/cryosparc/P2/J8
[EXPORT_JOB] :    Exporting all of job's images in the database to /data/hite/cryosparc/P2/J8/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P2 J8 in 0.02s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P2', u'J8')]
Licenses currently active : 8
Now trying to schedule J8
  Need slots :  {u'GPU': 8, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to lilac
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P2.J8 status launched
      Running project UID P2 job UID J8
       Running job on worker type cluster
        cmd: source /admin/lsflilac/lsf/conf/profile.lsf; source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < /data/hite/cryosparc/P2/J8/queue_sub_script.sh
[JSONRPC ERROR  2020-06-01 10:50:15.874862  at  run_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2092, in run_job
    res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/admin/opt/common/cryosparc/software/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-06-01 10:50:15.877183  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1640, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 1862, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 124, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-06-01 10:50:15.877454  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 4585, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 124, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------

Hi @neeraj,

It looks like one of the files listed here was not found by cryoSPARC when it attempted to submit the job to the cluster. Could you check that each of these files exists? Please note that the file /data/hite/cryosparc/P2/J8/queue_sub_script.sh is generated dynamically from the provided cluster_info.json and cluster_script.sh for each job you run.
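
If it is easier, here is a minimal sketch (paths copied from the cmd: line in the log above; adjust them for your setup) that checks whether each piece of that submission command is present and readable on the master node:

    import os

    # Paths copied from the "cmd:" line in the log above; adjust as needed.
    paths = [
        "/admin/lsflilac/lsf/conf/profile.lsf",
        "/admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub",
        "/data/hite/cryosparc/P2/J8/queue_sub_script.sh",
    ]

    for p in paths:
        # isfile catches typos and unmounted filesystems; access(..., os.R_OK)
        # additionally confirms the cryosparc_user account can read the file.
        ok = os.path.isfile(p) and os.access(p, os.R_OK)
        print("%s: %s" % (p, "exists" if ok else "MISSING or unreadable"))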

Stephan,

Thank you so much for your response. I can confirm that those files exist and that queue_sub_script.sh is created dynamically. I am also able to run the job manually using the command above. I made a few modifications to cluster_info.json so it's clearer. Please see below:

---------- Scheduler running ---------------
Jobs Queued:  [(u'P1', u'J11')]
Licenses currently active : 1
Now trying to schedule J11
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 2}
  Need fixed :  {u'SSD': False}
  Master direct :  False
   Scheduling job to lilac
Insecure mode - no SSL in license check
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P1.J11 status launched
      Running project UID P1 job UID J11
       Running job on worker type cluster
        cmd: source /admin/lsflilac/lsf/conf/profile.lsf; bsub < /lila/data/hite/cryosparc/P2/J11/queue_sub_script.sh
[JSONRPC ERROR  2020-06-09 13:08:45.055479  at  run_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 115, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 2092, in run_job
    res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT)
  File "/admin/opt/common/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/admin/opt/common/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/admin/opt/common/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
If I run the above command manually, the job does start. See below:
[cryosparc_user@pearl ~]$ source /admin/lsflilac/lsf/conf/profile.lsf; bsub < /lila/data/hite/cryosparc/P2/J11/queue_sub_script.sh
Job <14813473> is submitted to queue <gpuqueue>.
[cryosparc_user@pearl ~]$ bjobs
       JOBID       USER     JOB_NAME   STAT      QUEUE  FROM_HOST    EXEC_HOST   SUBMIT_TIME    START_TIME  TIME_LEFT
    14813473 cryosparc_ *rc_username   PEND   gpuqueue      pearl      -        Jun  9 13:10       -           -
[cryosparc_user@pearl ~]$ bjobs
       JOBID       USER     JOB_NAME   STAT      QUEUE  FROM_HOST    EXEC_HOST   SUBMIT_TIME    START_TIME  TIME_LEFT
    14813473 cryosparc_ *rc_username    RUN   gpuqueue      pearl       8*lw02  Jun  9 13:10  Jun  9 13:10     36:0 L
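
For reference, here is a minimal sketch of the difference between the manual run and what the traceback suggests cryoSPARC does with the same string (splitting it and executing it directly, without a shell):

    import shlex
    import subprocess

    cmd = ("source /admin/lsflilac/lsf/conf/profile.lsf; "
           "bsub < /lila/data/hite/cryosparc/P2/J11/queue_sub_script.sh")

    # Splitting and executing directly: there is no executable named "source"
    # (it is a shell builtin) and the "<" redirection is never interpreted,
    # so Popen fails with OSError: [Errno 2] No such file or directory.
    try:
        subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT)
    except OSError as e:
        print("direct exec failed: %s" % e)

    # Handing the whole string to a shell (what the manual run does) works,
    # because the shell understands both "source" and "<".
    # Left commented out so this sketch does not actually submit a job:
    # subprocess.check_output(cmd, shell=True)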

Hi @neeraj,

Thanks for trying that! This helps. It might be a permissions or location issue.
Can you run cryosparcm status and paste the output (please censor your License ID)?

[cryosparc_user@pearl ~]$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/admin/opt/common/cryosparc/cryosparc2_master
Current cryoSPARC version: v2.15.0
----------------------------------------------------------------------------

cryosparcm process status:

app                              STOPPED   Not started
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 42757, uptime 1 day, 1:53:12
command_proxy                    RUNNING   pid 42834, uptime 1 day, 1:53:05
command_rtp                      STOPPED   Not started
command_vis                      RUNNING   pid 42825, uptime 1 day, 1:53:07
database                         RUNNING   pid 42677, uptime 1 day, 1:53:15
watchdog_dev                     STOPPED   Not started
webapp                           RUNNING   pid 42840, uptime 1 day, 1:53:04
webapp_dev                       STOPPED   Not started

----------------------------------------------------------------------------

global config variables:

export CRYOSPARC_LICENSE_ID=""
export CRYOSPARC_MASTER_HOSTNAME="pearl.hpc.private"
export CRYOSPARC_DB_PATH="/opt/common/cryosparc/cryosparc2_database/"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=true
export CRYOSPARC_CLICK_WRAP=true

Thanks
Neeraj

Hi @neeraj,

Now I have a feeling it has to do with the actual submission command. Could I also take a look at your cluster_info.json and cluster_script.sh? You can get the latest versions of these files by running cryosparcm cluster dump (the files will be written to the current working directory). Thanks!
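
As an aside, once the dumped files are in your working directory, a few lines of Python can print just the queue-related templates for a quick look (assuming the standard key names, which also appear later in this thread):

    import json

    # Reads the cluster_info.json written by `cryosparcm cluster dump` into
    # the current working directory and prints the queue command templates.
    with open("cluster_info.json") as f:
        info = json.load(f)

    for key in ("qsub_cmd_tpl", "qstat_cmd_tpl", "qdel_cmd_tpl", "qinfo_cmd_tpl"):
        print("%s: %s" % (key, info.get(key)))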

Stephan,
Please see below
cluster_info.json

    "qdel_cmd_tpl": "source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bkill {{ cluster_job_id }}",
    "worker_bin_path": "/opt/common/cryosparc/cryosparc2_worker/bin/cryosparcw",
    "title": "lilac",
    "cache_path": "/scratch/",
    "qinfo_cmd_tpl": "source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bqueue",
    "qsub_cmd_tpl": "source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < {{ script_path_abs }}",
    "qstat_cmd_tpl": "source /admin/lsflilac/lsf/conf/profile.lsf; /admin/lsflilac/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}",
    "cache_quota_mb": null,
    "send_cmd_tpl": "source /admin/lsflilac/lsf/conf/profile.lsf; {{ command }}",
    "cache_reserve_mb": 10000,
    "name": "lilac"

cluster_script.sh

    #!/bin/bash
    #### cryoSPARC cluster submission script template for LSF
    ## Available variables:
    ## {{ run_cmd }}            - the complete command string to run the job
    ## {{ num_cpu }}            - the number of CPUs needed
    ## {{ num_gpu }}            - the number of GPUs needed.
    ##                            Note: the code will use this many GPUs starting from dev id 0
    ##                                  the cluster scheduler or this script have the responsibility
    ##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
    ##                                  using the correct cluster-allocated GPUs.
    ## {{ ram_gb }}             - the amount of RAM needed in GB
    ## {{ job_dir_abs }}        - absolute path to the job directory
    ## {{ project_dir_abs }}    - absolute path to the project dir
    ## {{ job_log_path_abs }}   - absolute path to the log file for the job
    ## {{ worker_bin_path }}    - absolute path to the cryosparc worker command
    ## {{ run_args }}           - arguments to be passed to cryosparcw run
    ## {{ project_uid }}        - uid of the project
    ## {{ job_uid }}            - uid of the job
    ## {{ job_creator }}        - name of the user that created the job (may contain spaces)
    ## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)
    ##
    ## What follows is a simple LSF script:
    #BSUB -J cryosparc_{{ project_uid }}_{{ job_uid }}_{{ cryosparc_username }}
    #BSUB -q gpuqueue
    ###BSUB -e {{ job_dir_abs }}/%J.err
    ###BSUB -o {{ job_dir_abs }}/%J.out
    #BSUB -n 8
    #BSUB -R "span[ptile=8]"
    #BSUB -R "rusage[mem={{ ram_gb }}]"
    #BSUB -gpu "num=8:j_exclusive=yes:mode=shared"
    #BSUB -W 36:00
    ##BSUB -m lp-gpu ls-gpu lt-gpu


    ##Load modules

    {{ run_cmd }}

Hi, has your problem been solved? I am having the same problem as well.
Thanks,
Yahui

No, we are still having the issue. Let us know if you make any progress.

Thanks
Neeraj

Hi @neeraj,

I wasn’t able to figure out what’s going wrong; this issue may be specific to your environment’s configuration. If you’d like, I can hop on a call to help you sort this out; send me an email at [address redacted] if you’re interested.

Hi @yy314,

We unknowingly continued this conversation in a new thread; @neeraj posted a workaround there that you can use for now: