Stalled process during restart

I have a problem with an installation on a cluster, which I upgraded to 3.3.1 at the beginning of January. For the last few days, when I run “cryosparcm restart”, the process stalls after a while. The symptoms are similar to this issue: https://discuss.cryosparc.com/t/webapp-not-starting-after-v3-0-1-update/5721

Here are the symptoms and my tests:

  • The startup sequence stalls after the message “command_rtp: started”.
  • If I cancel using Ctrl-C and run “cryosparcm status”, it shows that the database, command_core, command_rtp and command_vis are running, but not app, app_dev, liveapp, liveapp_dev, webapp and webapp_dev.
  • Also, the “cryosparcm status” command stalls before showing the information about variables etc.
  • I reinstalled cryosparc_master and cryosparc_worker. The symptoms are still there. I cannot patch the installation (the command “cryosparcm patch” also stalls).
  • I am able to start the app, liveapp and webapp by running “cryosparcm start app” etc., but not the *_dev apps, which show “app_dev: ERROR (no such file)” etc. See log outputs from these apps below.
  • We had problems with the DNS server recently, and got an error about not being able to connect to get.cryosparc.com. Now that works, but could this be some lingering network issue?
  • One indication that there is a network problem is that when I run with “export DEBUG=true” in cryosparc_master/config.sh I get these error messages:
Attempt 1/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb8b218ea50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 2/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb8b21267d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 3/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb8b215ee50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Failed to GET http://a001:29445

cryosparcm log app

(node:20794) Warning: Accessing non-existent property 'count' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
(node:20794) Warning: Accessing non-existent property 'findOne' of module exports inside circular dependency
(node:20794) Warning: Accessing non-existent property 'remove' of module exports inside circular dependency
(node:20794) Warning: Accessing non-existent property 'updateOne' of module exports inside circular dependency

cryosparcm log liveapp

Ready to serve GridFS files
cryoSPARC v2 Application Server Started

cryosparcm log webapp

ESC[34mcryoSPARCESC[39m
(node:20610) DeprecationWarning: current Server Discovery and Monitoring engine is deprecated, and will be removed in a future version. To use the new Server Discover and Monitoring engine, pass option { useUnifiedTopology: true } to the MongoClient constructor.
ESC[32mReady to serve GridFSESC[39m
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [workspace] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [jobs] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [workspace] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [jobs] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [workspace] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [workspace] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [jobs] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [workspace] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [jobs] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
set_user_viewed_project
["616eacf9742a11a07af29c82","P27"]
==== [projects] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [workspace] project query user  616eacf9742a11a07af29c82 Leonardo false
==== [projects] project query user  6043f5bb590be8eaa57e4932 larsson true
==== [workspace] project query user  6043f5bb590be8eaa57e4932 larsson true
==== [jobs] project query user  6043f5bb590be8eaa57e4932 larsson true
==== [projects] project query user  6043f5bb590be8eaa57e4932 larsson true
==== [projects] project query user  6043f5bb590be8eaa57e4932 larsson true

Now I patched to version 220118 manually using these instructions, but the behavior is still the same:

@daniel.s.d.larsson As you mentioned

Does the command host a001 print out an IP address that matches your cryoSPARC master’s?

Yes, it gives the correct IP

Is the port accessible via
telnet a001 29445
?

That command gives the reply:

> telnet a001 29445
Trying 192.168.177.11...
telnet: connect to address 192.168.177.11: Connection refused

That was with CryoSPARC not running. Trying with other ports in the range 2944[0-9] gives the same results.

When I first ran “cryosparcm restart” and then pressed Ctrl-C when it stalled, ports 29441 and 29442 gave a connection.
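
For anyone repeating the port test above without telnet, here is a minimal Python sketch that tries a TCP connection to each port in a range and reports which ones accept it. The function name `probe_ports` is hypothetical and not part of CryoSPARC; the host a001 and the range 29440-29449 are taken from this thread, so adjust them to your own master hostname and base port.

```python
import socket


def probe_ports(host, ports, timeout=2.0):
    """Return a dict mapping each port to True if a TCP connect succeeded.

    Hypothetical helper for illustration; it only tests whether something
    is listening, not whether the service behind the port is healthy.
    """
    results = {}
    for port in ports:
        try:
            # create_connection resolves the hostname and opens a TCP socket.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers connection refused, timeouts and name resolution errors.
            results[port] = False
    return results


# Example call (assumes the master host from this thread):
# probe_ports("a001", range(29440, 29450))
```

A port that reports False here corresponds to the “Connection refused” reply from telnet: nothing is listening on it at that moment, which on its own does not say why the service failed to start.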

I can also add that I get additional error messages for port 29445. These go on “forever”. Here is a full transcript:

cryosparcuser@a001:~/cryosparc/cryosparc_master cryosparcm restart
CryoSPARC is not already running.
If you would like to restart, use cryosparcm restart
Starting cryoSPARC System master process..
CryoSPARC is not already running.
database: started
Database configuration is OK.
command_core: started
Attempt 1/3 to GET http://a001:29442 failed with exception: HTTPConnectionPool(host='a001', port=29442): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcc8c3edbd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 2/3 to GET http://a001:29442 failed with exception: HTTPConnectionPool(host='a001', port=29442): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcc8c388890>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 3/3 to GET http://a001:29442 failed with exception: HTTPConnectionPool(host='a001', port=29442): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcc8c3bff90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Failed to GET http://a001:29442
    command_core connection succeeded
    command_core startup successful
command_vis: started
command_rtp: started
Attempt 1/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb9e6059a90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 2/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb9e5ff3850>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 3/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb9e602ae90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Failed to GET http://a001:29445
Attempt 1/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff152dc5b90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 2/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff152d5f910>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 3/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff152f12f10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Failed to GET http://a001:29445
Attempt 1/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9cc57eeb10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 2/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9cc5788890>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 3/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9cc57bfe90>: Failed to establish a new connection: [Errno 111] Connection refused'))
Failed to GET http://a001:29445
Attempt 1/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff54ea189d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 2/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff54e9b2750>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying...
Attempt 3/3 to GET http://a001:29445 failed with exception: HTTPConnectionPool(host='a001', port=29445): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff54e9e9dd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Failed to GET http://a001:29445
<snip>

@daniel.s.d.larsson With cryoSPARC “running” (to the extent it can run right now on your machine), does any process show up with this command (on the cryoSPARC master):
netstat -ap | grep 29445
?

Not that I can see (I don’t have root on this machine).

cryosparcuser@a001:~ netstat -ap | grep 29445
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)

@daniel.s.d.larsson Is there a hint in the command_rtp log?
cryosparcm log command_rtp
?

The issue has been resolved. The problem was that the distributed file storage was unresponsive, which caused these symptoms. Thanks for all the help!
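
Since the cause turned out to be storage rather than networking, a quick way to test for a hung file system may be useful to others. The following is only a sketch under stated assumptions: `path_responds` is a hypothetical helper, not a CryoSPARC command, and the path you pass in (for example a project directory on the shared storage) is yours to choose. It runs `os.stat` in a background thread and reports whether the call returned within a timeout.

```python
import os
import threading


def path_responds(path, timeout=5.0, probe=os.stat):
    """Return True if probe(path) finishes within `timeout` seconds.

    Hypothetical helper for illustration. A missing path still counts as
    a response, because the file system answered the request.
    """
    def worker():
        try:
            probe(path)
        except OSError:
            # Not found or permission denied: the storage did reply.
            pass

    # Daemon thread so a stuck call does not keep the interpreter alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    # If the thread is still running, the call has not returned in time.
    return not thread.is_alive()
```

If this returns False for a directory on the shared storage, commands that touch that storage can be expected to stall in the same way. The thread is used instead of a child process because a process blocked on unresponsive storage may not react to being killed, which could make the check itself hang.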

Excellent. Thanks for confirming.