OtherError: "node is not in primary or recovering state" when connecting cluster

Hello,

I want to install cryoSPARC on a cluster. When I run "cryosparcm cluster connect", the following error occurs.

 Traceback (most recent call last):
   File "<stdin>", line 8, in <module>
   File "cryosparc2_compute/client.py", line 57, in func
     assert False, res['error']
AssertionError: {u'message': u'OtherError: node is not in primary or recovering state', u'code': 500, u'data': None, u'name': u'OtherError'}

What does "node is not in primary or recovering state" mean, and how can I solve this problem?
Thank you!

Hey @kortal,

Thanks for posting.
This type of error message usually arises when the database wasn't initialized properly, which can happen when you run cryosparcm start for the first time after installing. The easy fix is to restart cryoSPARC with cryosparcm restart, which will run the initialization function again. After that, try running the cryosparcm cluster connect command again to see if it works. If it doesn't, attach the logs of command_core and the database (cryosparcm log command_core, cryosparcm log database), and we can see what's wrong.
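
Concretely, that sequence would look something like this:

cryosparcm restart              # re-runs the database initialization
cryosparcm cluster connect      # retry registering the cluster
cryosparcm log command_core     # if it still fails, capture this log
cryosparcm log database         # and this one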

Hey Stephan, thank you for replying, but unfortunately it doesn't work…
Let me describe my process first:
I want to install the software on a remote cluster: graham.computecanada.ca (which is a ComputeCanada cluster). The first problem is that every time I log in with user@graham.computecanada.ca, the hostname inside the system becomes gra-login3.graham.sharcnet, gra-login2.graham.sharcnet, or gra-login1.graham.sharcnet; it seems to change at random. Therefore, I still use graham.computecanada.ca as the hostname for both the worker hostname and the master hostname.
This is my second installation. In the first installation, cryosparcm start initially worked well, but when I executed cryosparcm cluster dump, it said that no such cluster existed. I tried many different names that I could find for the cluster, but it still didn't work. So I ran cryosparcm cluster connect directly and manually edited the .json file (see the commands below). After that, I couldn't restart cryoSPARC, so I deleted all the cryoSPARC files and reinstalled the software.
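
For reference, what I ran looked roughly like this ("graham" is just the cluster name I guessed, and the .json file I edited is the cluster_info.json that the connect command reads):

cryosparcm cluster dump graham     # complained that no such cluster existed
cryosparcm cluster connect         # run from the folder containing my edited cluster_info.json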

Now, when I run cryosparcm restart in the worker folder, there's an error:

CryoSPARC is running.
Stopping cryosparc.
unix:///tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock refused connection
ERROR: unix:///tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock refused connection (already shut down?)
Starting cryoSPARC System master process..
CryoSPARC is already running.
If you would like to restart, use cryosparcm restart

Here’s the cryosparcm status:

Current cryoSPARC version: v2.14.2

cryosparcm process status:

unix:///tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock refused connection
global config variables:

export CRYOSPARC_LICENSE_ID="xxxx"
export CRYOSPARC_MASTER_HOSTNAME="xxxx"
export CRYOSPARC_DB_PATH="xxxx"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true

Here’s the cryosparcm log command_core

Scheduler Failed
Heartbeat check failed
[JSONRPC ERROR 2020-05-03 12:10:43.656686 at get_num_active_licenses ]

Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 114, in wrapper
    res = func(*args, **kwargs)
  File "cryosparc2_command/command_core/__init__.py", line 1421, in get_num_active_licenses
    for j in jobs_running:
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/cursor.py", line 1114, in next
    if len(self.__data) or self._refresh():
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/cursor.py", line 1036, in _refresh
    self.__collation))
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/cursor.py", line 928, in __send_message
    helpers._check_command_response(doc['data'][0])
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/helpers.py", line 210, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
OperationFailure: node is not in primary or recovering state


Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 198, in background_worker
    concurrent_job_monitor()
  File "cryosparc2_command/command_core/__init__.py", line 1428, in concurrent_job_monitor
    current_concurrent_licenses_deque.append(get_num_active_licenses())
  File "cryosparc2_command/command_core/__init__.py", line 123, in wrapper
    raise e
OperationFailure: node is not in primary or recovering state
Traceback (most recent call last):
  File "cryosparc2_command/command_core/__init__.py", line 203, in background_worker
    heartbeat_manager()
  File "cryosparc2_command/command_core/__init__.py", line 1472, in heartbeat_manager
    active_jobs = get_active_licenses()
  File "cryosparc2_command/command_core/__init__.py", line 1437, in get_active_licenses
    for j in jobs_running:
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/cursor.py", line 1114, in next
    if len(self.__data) or self._refresh():
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/cursor.py", line 1036, in _refresh
    self.__collation))
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/cursor.py", line 928, in __send_message
    helpers._check_command_response(doc['data'][0])
  File "/home/pangguot/cryosparc/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/pymongo/helpers.py", line 210, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
OperationFailure: node is not in primary or recovering state

Then it keeps repeating this information.

Here’s the cryosparcm log database

2020-05-03T12:17:10.461-0400 I NETWORK [thread1] connection accepted from 199.241.166.2:37002 #5561 (6 connections now open)
2020-05-03T12:17:10.461-0400 I NETWORK [conn5561] received client metadata from 199.241.166.2:37002 conn5561: { driver: { name: "nodejs", version: "2.2.34" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "3.10.0-957.12.2.el7.x86_64" }, platform: "Node.js v8.9.4, LE, mongodb-core: 2.1.18" }
2020-05-03T12:17:10.485-0400 I - [conn5559] end connection 199.241.166.2:36998 (6 connections now open)
2020-05-03T12:17:10.485-0400 I - [conn5560] end connection 199.241.166.2:37000 (6 connections now open)
2020-05-03T12:17:10.485-0400 I - [conn5561] end connection 199.241.166.2:37002 (6 connections now open)
2020-05-03T12:17:11.883-0400 I NETWORK [thread1] connection accepted from 199.241.166.2:37004 #5562 (4 connections now open)
2020-05-03T12:17:11.883-0400 I NETWORK [conn5562] received client metadata from 199.241.166.2:37004 conn5562: { driver: { name: "PyMongo", version: "3.4.0" }, os: { type: "Linux", name: "CentOS Linux 7.5.1804 Core", architecture: "x86_64", version: "3.10.0-957.12.2.el7.x86_64" }, platform: "CPython 2.7.15.final.0" }
2020-05-03T12:17:11.948-0400 I - [conn5562] end connection 199.241.166.2:37004 (4 connections now open)
2020-05-03T12:17:12.026-0400 I NETWORK [thread1] connection accepted from 199.241.166.2:37014 #5563 (4 connections now open)
2020-05-03T12:17:12.031-0400 I NETWORK [conn5563] received client metadata from 199.241.166.2:37014 conn5563: { driver: { name: "nodejs", version: "2.2.34" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "3.10.0-957.12.2.el7.x86_64" }, platform: "Node.js v8.9.4, LE, mongodb-core: 2.1.18" }

Hi @kortal,

The changing login-node hostname might be the cause of your original issue. It can also cause the error message you posted, since the UNIX sock file exists on the filesystem, but the node trying to execute the cryosparcm command doesn't have access to the actual process itself.

Please take a look at this post:

Deleting the CRYOSPARC_MASTER_HOSTNAME variable from the config.sh file will allow you to use cryoSPARC in an environment where the hostname is not guaranteed to stay the same across SSH sessions. Your workflow would be to turn on cryoSPARC, do some processing, then turn it off once you're done.
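
The relevant part of cryosparc2_master/config.sh would end up looking roughly like this (a sketch; the gra-login hostname is just an example from your post, and the other variables stay as they are):

# export CRYOSPARC_MASTER_HOSTNAME="gra-login1.graham.sharcnet"   # removed so the current node's hostname is used
export CRYOSPARC_LICENSE_ID="xxxx"
export CRYOSPARC_DB_PATH="xxxx"
export CRYOSPARC_BASE_PORT=39000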

I realize this may be tedious, but your other option would be to request a dedicated node where you can install cryoSPARC and keep it running for long periods of time.

For your current problem, try deleting the sock file mentioned in the error, then kill off any cryoSPARC-related processes using

ps -ax | grep "supervisor"
kill <pid of process>
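
Put together, the cleanup would look roughly like this (the sock file name is taken from the error you posted; yours may differ):

ps -ax | grep "supervisor"                                            # find the PID of cryoSPARC's supervisord
kill <pid of process>                                                 # stop it
rm /tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock   # remove the stale sock file
cryosparcm start                                                      # start cryoSPARC again on the current node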

Thanks, I reinstalled the software, and now it can connect to the cluster.
But now I have two other problems:
(1) When I run cryosparcm restart, I get output like this:

CryoSPARC is running.
Stopping cryosparc.
unix:///tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock refused connection
ERROR: unix:///tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock refused connection (already shut down?)
Starting cryoSPARC System master process…
CryoSPARC is already running.
If you would like to restart, use cryosparcm restart

I deleted the CRYOSPARC_MASTER_HOSTNAME line in the config.sh file, but the error is still there.

(2) I can't access the remote UI from my Linux (Debian-based) laptop.
When I run "ssh -N -f -L localhost:39000:localhost:39000 graham.computecanada.ca",

I get this output:

bind: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 39000
Could not request local forwarding.

When I check cryosparcm log webapp, I see something like this:

cryoSPARC v2
(node:3750) DeprecationWarning: current Server Discovery and Monitoring engine is deprecated, and will be removed in a future version. To use the new Server Discover and Monitoring engine, pass option { useUnifiedTopology: true } to the MongoClient constructor.
Ready to serve GridFS
events.js:183
throw er; // Unhandled 'error' event
^

Error: listen EADDRINUSE 0.0.0.0:39000
at Object._errnoException (util.js:1022:11)
at _exceptionWithHostPort (util.js:1044:20)
at Server.setupListenHandle [as _listen2] (net.js:1351:14)
at listenInCluster (net.js:1392:12)
at doListen (net.js:1501:7)
at _combinedTickCallback (internal/process/next_tick.js:141:11)
at process._tickDomainCallback (internal/process/next_tick.js:218:9)

Thank you so much!

Solved:

Just an update:

(1) I deleted the file /tmp/cryosparc-supervisor-f97bde01964489ba6e140782f612b326.sock, as suggested in other topics, and then I could start cryoSPARC.
(2) I still can't access the remote UI from Linux, but I tried using PuTTY and ran the command

lsof -ti:39000 | xargs kill -9

to kill whatever is using port 39000

Then I could access the remote UI in Chrome on Windows 10.
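
Put together, what worked for me was roughly this (run over the PuTTY session to the cluster, with the port 39000 tunnel in place; the exact steps may differ on other setups):

lsof -ti:39000 | xargs kill -9     # kill whatever is holding port 39000 on the login node
cryosparcm restart                 # restart so the webapp can bind to the freed port
# then open http://localhost:39000 in Chrome on the Windows side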

One more note: don't run WinSCP if you want to use the remote UI on Windows. It seems that if WinSCP is running, you can't open the remote UI on Windows 10.
