cryoSPARC is running but supervisorctl doesn't think so

#1

This has happened on several standalone machines and on one machine that runs the master process with the worker processes on separate machines. The information below is from the machine running only the master process. They are running v2.11.0, but I have seen this issue on all versions of cryoSPARC2.

After some unknown amount of time, the command ‘cryosparcm status’ reports that cryoSPARC is not running, yet users are still able to use it and run jobs.

$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/usr/local/cryosparc2/cryosparc2_master
Current cryoSPARC version: v2.11.0
----------------------------------------------------------------------------

cryosparcm is not running.

----------------------------------------------------------------------------

global config variables:

export CRYOSPARC_LICENSE_ID="my_license_id"
export CRYOSPARC_MASTER_HOSTNAME="my_fqdn"
export CRYOSPARC_DB_PATH="/scratch/cryosparc2_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false

However, when I look at the running processes, I see many with the string cryosparc in them:

$ ps -ef | grep cryosparc
UID        PID  PPID  C STIME TTY          TIME CMD
user 11525     1  0 Sep24 ?        00:41:52 /usr/local/cryosparc2/cryosparc2_master/deps/anaconda/bin/python /usr/local/cryosparc2/cryosparc2_master/deps/anaconda/bin/supervisord -c supervisord.conf
user 11527 11525  0 Sep24 ?        05:27:48 mongod --dbpath /scratch/cryosparc2_database --port 39001 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4
user 11603 11525  0 Sep24 ?        02:04:58 python -c import cryosparc2_command.command_core as serv; serv.start(port=39002)
user 11632 11525  0 Sep24 ?        00:00:04 python -c import cryosparc2_command.command_proxy as serv; serv.start(port=39004)
user 11637 11525  0 Sep24 ?        01:14:50 /usr/local/cryosparc2/cryosparc2_master/cryosparc2_webapp/nodejs/bin/node ./bundle/main.js
user 11931 11525 99 17:04 ?        00:00:01 python -c import cryosparc2_command.command_vis as serv; serv.start(port=39003)

I did notice that the socket, /tmp/cryosparc-supervisor*.sock, is missing, so I am led to believe that the socket was removed for some reason and that there is now no way to communicate with the supervisord process that manages the various pieces of the software.
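A quick way to confirm this state (supervisord alive, socket gone) might look like the sketch below. The function names are made up for illustration, the socket glob and supervisord path are taken from the output above, and it assumes `pgrep` is available on the master node.

```shell
# Hypothetical helpers, not part of cryosparcm.

# Succeeds if at least one existing path matches the given glob.
# $1 is deliberately left unquoted in the loop so the shell expands the glob.
socket_present() {
    for f in $1; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Prints whether the socket exists and whether supervisord is running.
# $1: socket glob, $2: pattern matched against full command lines by pgrep -f.
report_supervisor_state() {
    sock_glob=${1:-/tmp/cryosparc-supervisor*.sock}
    proc_pat=${2:-cryosparc2_master/deps/anaconda/bin/supervisord}
    if socket_present "$sock_glob"; then sock=present; else sock=missing; fi
    if pgrep -f "$proc_pat" >/dev/null 2>&1; then proc=running; else proc=stopped; fi
    echo "socket=$sock supervisord=$proc"
}
```

In the situation described here, calling `report_supervisor_state` with no arguments should report the socket as missing and supervisord as running.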

When this happens what is the recommended course of action?


#2

Hi @clil16,

Thanks for reporting this. This is an issue we have also seen for a long time on cluster/enterprise systems, but we have never been able to replicate it.
The core issue (as you identified) seems to be that after some indeterminate amount of time, the /tmp directory where supervisord creates the .sock file is emptied, or at least that one file is destroyed. We have never seen this happen on our own systems, and the people who have seen it happen have not been able to identify any mechanism (cron job, etc.) that would have emptied /tmp.
Before v2.11, the .sock file had 777 permissions so that other users could use cryosparcm. Since v2.11, that has changed to 600 permissions, to try and stop other processes from deleting the file, but clearly as in your case, this has not helped.

If you somehow come across the reason for the sock file disappearing, please let us know!

Unfortunately, all you can do is manually kill the remaining cryosparc processes and then run cryosparcm start once again.
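As a rough sketch of that manual cleanup, the following lists the supervisord PID and its direct children from `ps -ef` output so they can be reviewed before anything is killed. The function name and the default pattern are assumptions for illustration, based on the process listing earlier in this thread; this is not an official cryoSPARC tool.

```shell
# Hypothetical helper, not part of cryosparcm.
# Reads `ps -ef` output on stdin and prints the PID of every process whose
# command line matches the pattern in $1, plus the PIDs of its direct children.
# The pattern is passed through the environment so that awk's own command
# line does not contain it and cannot match itself.
supervisor_tree_pids() {
    SUP_PAT=${1:-cryosparc2_master/deps/anaconda/bin/supervisord} awk '
        { pid[NR] = $2; ppid[NR] = $3 }            # columns 2 and 3 of ps -ef
        $0 ~ ENVIRON["SUP_PAT"] { sup[$2] = 1 }    # remember supervisord PIDs
        END {
            for (i = 1; i <= NR; i++)
                if ((pid[i] in sup) || (ppid[i] in sup)) print pid[i]
        }'
}

# Example usage (review the list before killing anything):
#   ps -ef | supervisor_tree_pids
#   kill <pids from the list>
#   cryosparcm start
```

Only direct children are listed, which matches the process tree shown in the first post, where every cryoSPARC process has supervisord as its parent.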