"Our replica set config is invalid or we are not a member of " error when reusing CS Session

npavlovikj · April 3, 2024, 7:38pm

Hi,

I have a question about an issue I hit when reusing an existing CryoSPARC directory, where the CryoSPARC database and configuration files from a previous session are stored.
I am using CryoSPARC 4.4.1, and all these files were generated with the same version.
On the other hand, starting a new Session works great and as expected.

As of now, I am using the same hostname as the previous session, just a different port number.
After I change the port number in config.sh I run:

cryosparcm start
cryosparcm fixdbport

However, the error I get is:

+ cryosparcm start
Starting CryoSPARC System master process...
CryoSPARC is not already running.
configuring database...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/cryosparc/cryosparc_master/cryosparc_compute/database_management.py", line 47, in configure_mongo
    initialize_replica_set()
  File "/opt/cryosparc/cryosparc_master/cryosparc_compute/database_management.py", line 85, in initialize_replica_set
    admin_db.command('replSetGetStatus')  # check replset
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/database.py", line 828, in command
    return self._command(
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/database.py", line 703, in _command
    return sock_info.command(
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/pool.py", line 740, in command
    return command(
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/network.py", line 177, in command
    helpers._check_command_response(
  File "/opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/helpers.py", line 180, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Our replica set config is invalid or we are not a member of it, full error: {'state': 10, 'stateStr': 'REMOVED', 'uptime': 3, 'optime': {'ts': Timestamp(1712170293, 4), 't': 4}, 'optimeDate': datetime.datetime(2024, 4, 3, 18, 51, 33), 'lastHeartbeatMessage': '', 'syncingTo': '', 'syncSourceHost': '', 'syncSourceId': -1, 'infoMessage': '', 'ok': 0.0, 'errmsg': 'Our replica set config is invalid or we are not a member of it', 'code': 93, 'codeName': 'InvalidReplicaSetConfig', 'operationTime': Timestamp(1712170293, 4), '$clusterTime': {'clusterTime': Timestamp(1712170293, 4), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}}
[2024-04-03T13:53:03-05:00] Error configuring database. Most recent database log lines:
...

I have tried changing the database port before/after starting/restarting the Master, but the issue persists.
I haven’t been able to find much information on the meaning of “Our replica set config is invalid or we are not a member of it”, and I have tried other suggestions provided here for similar issues, but without success.

I would really appreciate your help in fixing this issue.
If you need any additional information, please let me know.

Thank you,
Natasha

wtempel · April 4, 2024, 4:53pm

Welcome to the forum @npavlovikj .

Do you mean a pre-existing CryoSPARC installation?

Please can you post approx. 20 lines with timestamps just before 2024-04-03T13:53:03-05:00 from the file

/opt/cryosparc/cryosparc_master/run/database.log

and the output of the command

ps -eo user,pid,ppid,start,cmd | grep -e mongo -e cryosparc_

npavlovikj · April 4, 2024, 9:03pm

Hi @wtempel , thank you for the prompt reply!

I am using the same CryoSPARC installation and version, so there is no database migration between different versions. When I say “previous session”, I mean a previous, terminated run of CryoSPARC, where all the *.wt, *.lock and *.log files are stored. I want to restart CryoSPARC using this non-empty directory instead of creating a new one. I am not a user of CryoSPARC, but I have been told by researchers that they sometimes need to run CryoSPARC from a directory with previously generated database and configuration files available for reuse.

After I start CryoSPARC from the preexisting directory, this is the output in database.log:

2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] MongoDB starting : pid=2927759 port=18884 dbpath=/home/npavlovikj/cryosparc 64-bit host=2420
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] db version v3.6.23
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] git version: d352e6a4764659e0d0350ce77279de3c1f243e5c
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] modules: none
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] build environment:
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten]     distarch: x86_64
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2024-04-04T15:46:20.470-0500 I CONTROL  [initandlisten] options: { net: { port: 18884 }, replication: { oplogSizeMB: 64, replSet: "meteor" }, storage: { dbPath: "/home/npavlovikj/cryosparc" } }
2024-04-04T15:46:20.470-0500 W -        [initandlisten] Detected unclean shutdown - /home/npavlovikj/cryosparc/mongod.lock is not empty.
2024-04-04T15:46:20.471-0500 I -        [initandlisten] Detected data files in /home/npavlovikj/cryosparc created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2024-04-04T15:46:20.471-0500 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2024-04-04T15:46:20.472-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=95258M,cache_overflow=(file_max=0M),session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),compatibility=(release="3.0",require_max="3.0"),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2024-04-04T15:46:21.220-0500 I STORAGE  [initandlisten] WiredTiger message [1712263581:220095][2927759:0x1540141de500], txn-recover: Main recovery loop: starting at 4/777600
2024-04-04T15:46:21.220-0500 I STORAGE  [initandlisten] WiredTiger message [1712263581:220754][2927759:0x1540141de500], txn-recover: Recovering log 4 through 5
2024-04-04T15:46:21.271-0500 I STORAGE  [initandlisten] WiredTiger message [1712263581:271054][2927759:0x1540141de500], file:collection-10-922672420683722901.wt, txn-recover: Recovering log 5 through 5
2024-04-04T15:46:21.312-0500 I STORAGE  [initandlisten] WiredTiger message [1712263581:312758][2927759:0x1540141de500], file:collection-10-922672420683722901.wt, txn-recover: Set global recovery timestamp: 0
2024-04-04T15:46:21.323-0500 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2024-04-04T15:46:21.323-0500 I STORAGE  [initandlisten] The size storer reports that the oplog contains 364 records totaling to 2560655 bytes
2024-04-04T15:46:21.323-0500 I STORAGE  [initandlisten] Scanning the oplog to determine where to place markers for truncation
2024-04-04T15:46:21.328-0500 I STORAGE  [initandlisten] WiredTiger record store oplog processing took 4ms
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] ** WARNING: This server is bound to localhost.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          Remote systems will be unable to connect to this server. 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          Start the server with --bind_ip <address> to specify which IP 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          addresses it should serve responses from, or with --bind_ip_all to
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          bind to all interfaces. If this behavior is desired, start the
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          server with --bind_ip 127.0.0.1 to disable this warning.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] ** WARNING: You are running on a NUMA machine.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **              numactl --interleave=all mongod [other options]
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 65536 files. Number of processes should be at least 32768 : 0.5 times number of files.
2024-04-04T15:46:21.329-0500 I CONTROL  [initandlisten] 
2024-04-04T15:46:21.354-0500 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/home/npavlovikj/cryosparc/diagnostic.data'
2024-04-04T15:46:21.360-0500 I REPL     [initandlisten] Rollback ID is 1
2024-04-04T15:46:21.363-0500 I REPL     [initandlisten] No oplog entries to apply for recovery. appliedThrough and checkpointTimestamp are both null.
2024-04-04T15:46:21.363-0500 I NETWORK  [initandlisten] listening via socket bound to 127.0.0.1
2024-04-04T15:46:21.363-0500 I NETWORK  [initandlisten] listening via socket bound to /tmp/mongodb-18884.sock
2024-04-04T15:46:21.363-0500 I NETWORK  [initandlisten] waiting for connections on port 18884
2024-04-04T15:46:21.363-0500 W NETWORK  [replexec-0] Failed to connect to 127.0.0.1:17142, in(checking socket for error after poll), reason: Connection refused
2024-04-04T15:46:21.363-0500 W REPL     [replexec-0] Locally stored replica set configuration does not have a valid entry for the current node; waiting for reconfig or remote heartbeat; Got "NodeNotFound: No host described in new configuration 1 for replica set meteor maps to this node" while validating { _id: "meteor", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "localhost:17142", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('660f10355478bad9946224a4') } }
2024-04-04T15:46:21.363-0500 I REPL     [replexec-0] New replica set config in use: { _id: "meteor", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "localhost:17142", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('660f10355478bad9946224a4') } }
2024-04-04T15:46:21.363-0500 I REPL     [replexec-0] This node is not a member of the config
2024-04-04T15:46:21.363-0500 I REPL     [replexec-0] transition to REMOVED from STARTUP
2024-04-04T15:46:21.364-0500 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for meteor/localhost:17142
2024-04-04T15:46:21.364-0500 W NETWORK  [LogicalSessionCacheRefresh] Failed to connect to 127.0.0.1:17142, in(checking socket for error after poll), reason: Connection refused
2024-04-04T15:46:21.364-0500 I CONTROL  [LogicalSessionCacheReap] Sessions collection is not set up; waiting until next sessions reap interval: config.system.sessions does not exist
2024-04-04T15:46:21.373-0500 W NETWORK  [LogicalSessionCacheRefresh] Unable to reach primary for set meteor
2024-04-04T15:46:21.373-0500 I NETWORK  [LogicalSessionCacheRefresh] Cannot reach any nodes for set meteor. Please check network connectivity and the status of the set. This has happened for 1 checks in a row.
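
The log above shows mongod listening on port 18884 while the stored replica set configuration still lists localhost:17142 as its only member, which is why the node ends up in the REMOVED state. A minimal sketch of a check for this mismatch follows; it only parses lines of the form shown in this log, the function names are hypothetical, and other MongoDB versions may word these lines differently.

```shell
#!/usr/bin/env bash
# Sketch: compare the port mongod is listening on with the port stored in the
# replica set configuration, using only lines like those in database.log above.
# All function names here are illustrative, not part of CryoSPARC or MongoDB.

# Print the last port mongod reported it was waiting on.
listening_port() {
    grep -o 'waiting for connections on port [0-9]*' "$1" | tail -n 1 | grep -oE '[0-9]+$'
}

# Print the last member host recorded in the stored replica set config.
stored_member_host() {
    grep -o 'host: "[^"]*"' "$1" | tail -n 1 | sed 's/^host: "//; s/"$//'
}

# Return 0 when the stored member port equals the listening port.
replset_port_matches() {
    local log="$1" port host
    port="$(listening_port "$log")"
    host="$(stored_member_host "$log")"
    echo "listening on port ${port:-unknown}; stored member is ${host:-unknown}"
    [ -n "$port" ] && [ "${host##*:}" = "$port" ]
}

# Example (path taken from earlier in this thread; adjust to your setup):
# replset_port_matches /opt/cryosparc/cryosparc_master/run/database.log
```

A non-zero return from replset_port_matches suggests that the stored configuration still points at the port of an earlier run.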

Each CryoSPARC job on our cluster is terminated via Slurm, so there are no existing Mongo/CryoSPARC processes running on the node. When I restart CryoSPARC from the preexisting directory, this is the grep output I see:

[npavlovikj@2420~]$ ps -eo user,pid,ppid,start,cmd | grep -e mongo -e cryosparc_
npavlov+ 2927723       1 15:46:18 python /opt/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /opt/cryosparc/cryosparc_master/supervisord.conf
npavlov+ 2931613 2927094 15:50:01 bash /opt/cryosparc/cryosparc_worker/bin/cryosparcw connect --worker 2420 --master 2420 --port 18883 --ssdpath /tmp --gpus 0 --rams 1 --cpus 1
npavlov+ 2932443 2928900 15:51:01 grep --color=auto -e mongo -e cryosparc_

In my current setup, the node is the same, just the port number changes.
Once I am able to reuse the directory on the same node, I would like to be able to do the same across different nodes as well.

Please let me know if you need any additional information.

Thank you,
Natasha

wtempel · April 5, 2024, 2:07pm

CryoSPARC is designed around a master-worker pattern. The most intuitive use case, to me at least, would include long-running master processes with rare configuration changes. Other use cases are feasible, but would require special management.
I infer from your posts so far that in your specific case (please correct me as needed):

- CryoSPARC master processes are controlled by Slurm. Please be aware that one should avoid sending SIGKILL to CryoSPARC master processes like the database.
- The CryoSPARC master host name and port (actually a range of consecutive ports) may change between shutdown and restart of a CryoSPARC instance.

Are these assumptions correct?
Are the data processing jobs initiated by the CryoSPARC master intended to run only on the same host as the CryoSPARC master processes (“single workstation” type instance), or are these data processing jobs individually submitted to Slurm (see Clusters)?

In the example you showed, you may prepare CryoSPARC for a “session” with a modified port range by running the command

cryosparcm changeport 18883

assuming that ports 18883 through 18892 are not already in use.
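
One way to verify that assumption before changing ports is sketched below. It is only an illustration: the function names are hypothetical, it relies on bash's /dev/tcp pseudo-device, and it only detects services that accept connections on 127.0.0.1, so a port that is reserved but not yet listening would still appear free.

```shell
#!/usr/bin/env bash
# Sketch: check that a range of consecutive ports is unused on this host.
# Requires bash (for /dev/tcp). Function names are illustrative only.

# Return 0 if nothing accepts TCP connections on 127.0.0.1:$1.
port_is_free() {
    ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Check $2 consecutive ports starting at $1 (10 ports if $2 is omitted).
# Returns 0 only if every port in the range is free.
check_port_range() {
    local base="$1" count="${2:-10}" offset=0 busy=0 port
    while [ "$offset" -lt "$count" ]; do
        port=$((base + offset))
        if ! port_is_free "$port"; then
            echo "port $port is already in use" >&2
            busy=1
        fi
        offset=$((offset + 1))
    done
    return "$busy"
}

# Example for the range mentioned above (18883 through 18892):
# check_port_range 18883 && cryosparcm changeport 18883
```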

npavlovikj · April 5, 2024, 4:11pm

Hi @wtempel ,

Yes, your assumptions are correct.
We run CryoSPARC on our HPC Cluster, so each time CryoSPARC is launched, there is a new master node, worker node, and port assigned.

I have tried combinations of cryosparcm fixdbport and cryosparcm changeport (as I have seen in some similar topics here), but none worked for me.

I was finally able to update the hostname and the port correctly when CryoSPARC is started from an existing data directory using:

cryosparcm start database &
sleep 30
cryosparcm fixdbport &
cryosparcm restart &
sleep 30

in that particular order only.
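
The fixed 30-second sleeps could in principle be replaced by waiting until the database actually accepts connections. The sketch below keeps the same command order but is untested against a real CryoSPARC instance. The function names are hypothetical, it needs bash for /dev/tcp, and it assumes that an open port means the database is ready for cryosparcm fixdbport and that fixdbport may run in the foreground before the restart. The database port is passed in explicitly; in the log earlier in this thread, mongod listened on 18884.

```shell
#!/usr/bin/env bash
# Sketch: same order as the working sequence (start database, fixdbport,
# restart), but polling the database port instead of sleeping for 30 seconds.
# Requires bash (for /dev/tcp). Function names are illustrative only.

# Wait until 127.0.0.1:$1 accepts TCP connections; give up after $2 seconds.
wait_for_port() {
    local port="$1" timeout="${2:-60}" waited=0
    until (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# $1: port mongod listens on; $2: optional timeout in seconds.
restart_with_fixed_port() {
    local db_port="$1" timeout="${2:-60}"
    cryosparcm start database &   # backgrounded, as in the original sequence
    if ! wait_for_port "$db_port" "$timeout"; then
        echo "database not reachable on port $db_port after ${timeout}s" >&2
        return 1
    fi
    cryosparcm fixdbport          # assumption: safe to run in the foreground
    cryosparcm restart
}

# Example, using the database port seen in the log in this thread:
# restart_with_fixed_port 18884
```

If the database never comes up, the function stops before fixdbport rather than continuing blindly, which a fixed sleep cannot do.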

Since I was able to find a solution to my question, please go ahead and close this topic.

Thank you for your help!

Thank you,
Natasha