Failure in v2.14.2 upgrade

To try to fix the “I/O error when starting job” problem we were seeing in v2.12.4, I attempted to update CryoSPARC.

We had a failure updating the worker, and I’m unclear on where the tar.gz download lives, though I should be able to download and install/configure it separately.
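
(For what it’s worth, here is a sketch of how I believe the tarballs can be fetched by hand, assuming the get.cryosparc.com download endpoint from the install docs; LICENSE_ID stands in for our actual license ID:)

    $ export LICENSE_ID="<our-license-id>"
    # endpoint assumed from the CryoSPARC v2 install instructions
    $ curl -L https://get.cryosparc.com/download/master-latest/$LICENSE_ID -o cryosparc2_master.tar.gz
    $ curl -L https://get.cryosparc.com/download/worker-latest/$LICENSE_ID -o cryosparc2_worker.tar.gz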

The good news is that the update to the master appears to have fixed the original problem.

cryosparc_user@shiva:~$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/install/cryosparc_user/software/cryosparc/cryosparc2_master
Current cryoSPARC version: v2.12.4
----------------------------------------------------------------------------

cryosparcm process status:

app                              STOPPED   Not started
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 1795, uptime 0:22:19
command_proxy                    RUNNING   pid 1822, uptime 0:22:16
command_rtp                      STOPPED   Not started
command_vis                      STARTING
database                         RUNNING   pid 1719, uptime 0:22:21
watchdog_dev                     STOPPED   Not started
webapp                           RUNNING   pid 1826, uptime 0:22:14
webapp_dev                       STOPPED   Not started

----------------------------------------------------------------------------

global config variables:

export CRYOSPARC_MASTER_HOSTNAME="shiva.wadsworth.org"
export CRYOSPARC_DB_PATH="/data"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false

cryosparc_user@shiva:~$ cryosparcm ?
Unknown command ?
cryosparc_user@shiva:~$ cryosparcm upgrade
Unknown command upgrade
cryosparc_user@shiva:~$ cryosparcm update
CryoSPARC current version v2.12.4
          update starting on Mon Feb 24 11:40:42 EST 2020

No version specified - updating to latest version.

=============================
Updating to version v2.14.2.
=============================
CryoSPARC is running.
Stopping cryosparc.
command_proxy: stopped
command_vis: stopped
webapp: stopped
command_core: stopped
database: stopped
Shut down
  Downloading master update...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  555M  100  555M    0     0  14.8M      0  0:00:37  0:00:37 --:--:-- 26.2M
  Downloading worker update...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  624M  100  624M    0     0  21.3M      0  0:00:29  0:00:29 --:--:-- 16.4M
  Done.

 Update will now be applied to the master installation,
 followed by worker installations on other node.

  Deleting old files...
  Extracting...
  Done.
  Updating dependencies...
  Checking dependencies...
  Dependencies for python have not changed.
  Currently checking hash for mongodb
  Dependencies for mongodb have not changed.
  Completed dependency check.

===================================================
Successfully updated master from version v2.12.4 to version v2.14.2.
===================================================

Starting cryoSPARC System master process..
CryoSPARC is not already running.
database: started
command_core: started
  cryosparc command core startup complete.
command_vis: started
command_proxy: started
webapp: started
-----------------------------------------------------

CryoSPARC master started.
 From this machine, access cryoSPARC at
    http://localhost:39000

 From other machines on the network, access cryoSPARC at
    http://shiva.wadsworth.org:39000


Startup can take several minutes. Point your browser to the address
and refresh until you see the cryoSPARC web interface.
CryoSPARC is running.
Stopping cryosparc.
command_proxy: stopped
command_vis: stopped
webapp: stopped
command_core: stopped
database: stopped
Shut down
Starting cryoSPARC System master process..
CryoSPARC is not already running.
database: started
command_core: started
  cryosparc command core startup complete.
command_vis: started
command_proxy: started
webapp: started
-----------------------------------------------------

CryoSPARC master started.
 From this machine, access cryoSPARC at
    http://localhost:39000

 From other machines on the network, access cryoSPARC at
    http://shiva.wadsworth.org:39000


Startup can take several minutes. Point your browser to the address
and refresh until you see the cryoSPARC web interface.

 ===================================================
 Now updating worker nodes.

All workers:
shiva.wadsworth.org cryosparc_user@shiva.wadsworth.org
 -------------------------------------------------
Updating worker shiva.wadsworth.org
Direct update
\cp -f ./cryosparc2_worker.tar.gz /home/cryosparc_user/software/cryosparc/cryosparc2_worker
cp: cannot create regular file '/home/cryosparc_user/software/cryosparc/cryosparc2_worker': No such file or directory
Failed to update shiva.wadsworth.org! Skipping...
 -------------------------------------------------
 ---------------------------------------------------
 Done updating all worker nodes.
 If any nodes failed to update, you can manually update them.
 Cluster worker installations must be manually updated.

 To update manually, simply copy the cryosparc2_worker.tar.gz
 file into the cryosparc worker installation directory, and then run
    $ bin/cryosparcw update
 from inside the worker installation directory.
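
(Following those instructions with our paths: the freshly downloaded worker tarball should still be sitting in the cryosparc2_master directory where the updater left it; the Jan 22 tarballs in the listing below are from the original install. A sketch of the manual update, untested as yet:)

    $ cd /install/cryosparc_user/software/cryosparc/cryosparc2_master
    $ cp -f ./cryosparc2_worker.tar.gz ../cryosparc2_worker/
    $ cd ../cryosparc2_worker
    $ bin/cryosparcw update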

cryosparc_user@shiva:~$ pwd
/install/cryosparc_user
cryosparc_user@shiva:~$ cd sof*
cryosparc_user@shiva:~/software$ ls
cryosparc
cryosparc_user@shiva:~/software$ cd cryo*
cryosparc_user@shiva:~/software/cryosparc$ ls -l
total 1206688
drwxrwxr-x 11 cryosparc_user syslog      4096 Feb 24 11:41 cryosparc2_master
-rw-r--r--  1 cryosparc_user syslog 581659450 Jan 22 16:47 cryosparc2_master.tar.gz
drwxrwxr-x  8 cryosparc_user syslog      4096 Feb 21 10:36 cryosparc2_worker
-rw-r--r--  1 cryosparc_user syslog 653965368 Jan 22 16:48 cryosparc2_worker.tar.gz
-rw-r--r--  1 root           root         344 Feb 21 10:22 install-cryo-command-howto

Completely clean install of v2.14.2

This was my build script, with user credentials replaced by placeholders:

./install.sh    --standalone \
                --worker_path /install/cryosparc_user/software/cryosparc/cryosparc2_worker \
                --cudapath /usr/local/cuda \
                --hostname shiva.wadsworth.org \
                --ssdpath /cryo_tmp \
                --dbpath /data \
                --initial_email "cryo_manager@our-site.edu" \
                --initial_password cryopass \
                --initial_name "Cryo Manager"

but we are failing to run jobs and get the result shown in the command_core log below.

As you can see from the log, there is an issue with the cryosparc2_worker path. The user’s login directory in /etc/passwd is /install/cryosparc_user, so either the --worker_path we supplied at build time is not being recorded, or some additional substitution is happening during the build.
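
(Here is what we plan to check: whether the master’s registered worker_bin_path still points under /home. A sketch; the connect flags are our reading of cryosparcw connect, so treat the exact invocation as an assumption:)

    $ cryosparcm cli "get_scheduler_targets()"
    # if worker_bin_path still shows /home/cryosparc_user/..., re-register
    # the worker from the real install directory:
    $ cd /install/cryosparc_user/software/cryosparc/cryosparc2_worker
    $ bin/cryosparcw connect --worker shiva.wadsworth.org --master shiva.wadsworth.org --port 39000 --ssdpath /cryo_tmp --update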

Please help us work around this issue.

thanks - Brian

cryosparc_user@shiva:~/software/cryosparc/cryosparc2_master$ cryosparcm log command_core
COMMAND CORE STARTED ===  2020-02-25 10:48:43.353781  ==========================
*** BG WORKER START
[GPU_INFO]: Failed calling the python function to get GPU info on shiva.wadsworth.org: Command '[u'bash -c "eval $(/home/cryosparc_user/software/cryosparc/cryosparc2_worker/bin/cryosparcw env); timeout 10 python /ho2
Failed to connect link: <urlopen error timed out>
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/d950c6e0-f1c9-11e9-a843-e7cc136975b6 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection obje)
Error connecting to cryoSPARC license server during instance heartbeat.
[EXPORT_PROJECT] : Exporting project P1...
[EXPORT_PROJECT] : Exported project P1 to /usr16/data/leith/cryosparcjnk/Test/P1/project.json in 0.02s
[EXPORT_PROJECT] : Exporting project P2...
[EXPORT_PROJECT] : Exported project P2 to /usr16/data/leith/cryosparcjnk/P2/project.json in 0.03s
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/d950c6e0-f1c9-11e9-a843-e7cc136975b6 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection obje)
Error connecting to cryoSPARC license server during instance heartbeat.
---- Killing project UID P1 job UID J16
     Killing job on worker type node shiva.wadsworth.org
     Killing job on worker on same node as master, not using ssh
Changed job P1.J16 status killed
[EXPORT_JOB] : Request to export P1 J16
[EXPORT_JOB] :    Exporting job to /usr16/data/leith/cryosparcjnk/Test/P1/J16
[EXPORT_JOB] :    Exporting all of job's images in the database to /usr16/data/leith/cryosparcjnk/Test/P1/J16/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.02s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J16 in 0.05s
[EXPORT_JOB] : Request to export P1 J16
[EXPORT_JOB] :    Exporting job to /usr16/data/leith/cryosparcjnk/Test/P1/J16
[EXPORT_JOB] :    Exporting all of job's images in the database to /usr16/data/leith/cryosparcjnk/Test/P1/J16/gridfs_data...
[EXPORT_JOB] :    Done. Exported 0 images in 0.00s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.00s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Done. Exported in 0.01s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P1 J16 in 0.08s
---------- Scheduler running ---------------
Jobs Queued:  [(u'P1', u'J16')]
Licenses currently active : 0
Now trying to schedule J16
  Need slots :  {u'GPU': 1, u'RAM': 1, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
   Scheduling job to shiva.wadsworth.org
Failed to connect link: HTTP Error 502: Bad Gateway
Not a commercial instance - heartbeat set to 12 hours.
     Launchable! -- Launching.
Changed job P1.J16 status launched
      Running project UID P1 job UID J16
        Running job on worker type node
        Running job using:  /home/cryosparc_user/software/cryosparc/cryosparc2_worker/bin/cryosparcw
[JSONRPC ERROR  2020-02-25 10:54:34.021839  at  run_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 114, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1936, in run_job
    close_fds = True )
  File "/install/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/install/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-02-25 10:54:34.024190  at  scheduler_run ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 114, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1541, in scheduler_run
    scheduler_run_core(do_run)
  File "cryosparc2_command/command_core/__init__.py", line 1763, in scheduler_run_core
    run_job(job['project_uid'], job['uid']) # takes care of the cluster case and the node case
  File "cryosparc2_command/command_core/__init__.py", line 123, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------
[JSONRPC ERROR  2020-02-25 10:54:34.025840  at  enqueue_job ]
-----------------------------------------------------
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 114, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 4352, in enqueue_job
    scheduler_run()
  File "cryosparc2_command/command_core/__init__.py", line 123, in wrapper
    raise e
OSError: [Errno 2] No such file or directory
-----------------------------------------------------

Looks like the worker is OK, but the master says the GPU resource is missing.

  Updating..
  Done.
 ---------------------------------------------------------------
  Final configuration for shiva.wadsworth.org
             monitor_port :  None
                     lane :  default
                     name :  shiva.wadsworth.org
                    title :  Worker node shiva.wadsworth.org
           resource_slots :  {u'GPU': [0], u'RAM': [0, 1], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7]}
                 hostname :  shiva.wadsworth.org
          worker_bin_path :  /install/cryosparc_user/software/cryosparc/cryosparc2_worker/bin/cryosparcw
               cache_path :  /cryo_tmp
           cache_quota_mb :  None
           resource_fixed :  {u'SSD': True}
                     gpus :  [{u'mem': 4230807552, u'id': 0, u'name': u'GeForce GTX 980'}]
         cache_reserve_mb :  10000
                     type :  node
                  ssh_str :  cryosparc_user@shiva.wadsworth.org
                     desc :  None
 ---------------------------------------------------------------


---------- Scheduler running ---------------
Jobs Queued:  [(u'P1', u'J16')]
Licenses currently active : 1
Now trying to schedule J16
  Need slots :  {u'GPU': 1, u'RAM': 1, u'CPU': 2}
  Need fixed :  {u'SSD': True}
  Master direct :  False
    Queue status : waiting_resources
    Queue message : GPU not available
---------- Scheduler finished ---------------
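
(One more check on our side before asking: whether another process is already holding the GPU; nvidia-smi is the standard tool for that:)

    $ nvidia-smi    # lists the processes currently occupying GPU 0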

Hi @BrianCuttler,

From the logs, there doesn’t seem to be anything wrong - it looks like, at the time, there was already a job running that was using the GPU. If you look at the resource manager tab, it will indicate which jobs are currently running and in the queue.

Can you confirm you were able to run that job once the other one completed?

Thanks,
Suhail

Would you mind sharing how you fixed the above error? We are having the same error now and are stuck.
Thanks