Warning: Could not get database status (attempt 1/3) - issue after storage outage

YvesT · July 7, 2023, 11:59am

Hi,
after a recent storage outage on our cluster it appears that our cryosparc DB has some issues.
cryosparcm status returns:

----------------------------------------------------------------------------
CryoSPARC System master node installed at
/XXX/cryosparc_master
Current cryoSPARC version: v4.2.1
----------------------------------------------------------------------------

CryoSPARC is not running.

----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_LICENSE_ID="XXXX"
export CRYOSPARC_MASTER_HOSTNAME="XXXXX"
export CRYOSPARC_DB_PATH="YYYYY/db"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true
export CRYOSPARC_FORCE_HOSTNAME=true

Apart from the short-term outage, nothing has changed. It is no longer possible to start the CryoSPARC services:

CryoSPARC is not already running.
If you would like to restart, use cryosparcm restart
Starting cryoSPARC System master process..
CryoSPARC is not already running.
configuring database
Warning: Could not get database status (attempt 1/3)
Warning: Could not get database status (attempt 2/3)
Warning: Could not get database status (attempt 3/3)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/XXX/cryosparc_master/cryosparc_compute/database_management.py", line 48, in configure_mongo
    initialize_replica_set()
  File "/XXX/cryosparc_master/cryosparc_compute/database_management.py", line 87, in initialize_replica_set
    admin_db = try_get_pymongo_admin_db(mongo_client)
  File "/XXX/cryosparc_master/cryosparc_compute/database_management.py", line 249, in try_get_pymongo_admin_db
    admin_db.command(({'serverStatus': 1}))
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/database.py", line 827, in command
    with self.__client._socket_for_reads(read_preference, session) as (sock_info, secondary_ok):
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1478, in _socket_for_reads
    server = self._select_server(read_preference, session)
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1436, in _select_server
    server = topology.select_server(server_selector)
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/topology.py", line 250, in select_server
    return random.choice(self.select_servers(selector, server_selection_timeout, address))
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/topology.py", line 211, in select_servers
    server_descriptions = self._select_servers_loop(selector, server_timeout, address)
  File "/XXX/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/topology.py", line 226, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: localhost:39001: [Errno 111] Connection refused, Timeout: 20.0s, Topology Description: <TopologyDescription id: 64a7f5e1311935e61b62bf93, topology_type: Single, servers: [<ServerDescription ('localhost', 39001) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:39001: [Errno 111] Connection refused')>]>
[2023-07-07T13:25:28+0200] Error configuring database. Most recent database log lines:
 mongod(wiredtiger_open+0x1BBA) [0x5556f4192c8a]
 mongod(_ZN5mongo18WiredTigerKVEngineC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_PNS_11ClockSourceES8_mmbbbb+0x8D6) [0x5556f415fcf6]
 mongod(+0xA25AEC) [0x5556f4141aec]
 mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x266) [0x5556f4351fb6]
 mongod(+0xA025B8) [0x5556f411e5b8]
 mongod(_ZN5mongo11mongoDbMainEiPPcS1_+0x26C) [0x5556f412163c]
 mongod(main+0x9) [0x5556f40a7bc9]
 libc.so.6(__libc_start_main+0xF5) [0x7f9c887303d5]
 mongod(+0x9ED741) [0x5556f4109741]
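
The traceback above ends in "[Errno 111] Connection refused" for localhost:39001, which suggests that no mongod process is listening on the database port (39001 in this thread, one above CRYOSPARC_BASE_PORT=39000). A minimal Python sketch to confirm this independently of cryosparcm follows. The function name port_accepts_connections is hypothetical, and only the standard library is used:

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing is listening or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (port taken from the traceback in this thread):
# port_accepts_connections("localhost", 39001)
```

A result of False only confirms that the database process is not accepting connections; the reason has to come from the database log.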

ps xww | grep -e cryosparc -e mongo
returns no running processes on all nodes that could run cryosparc related jobs

I did, however, find a mongod.lock file in our db folder, timestamped with the last scheduled reboot of cryosparc. Running fuser on it returns nothing.

Based on the thread "Help! I seem to have broken cryosparc by moving the cyrosparc_user home directory to a new location and then moving it back again!", I assume the way forward is to delete the mongod.lock file and then restart the cryosparc services.

Would you suggest running the MongoDB recovery before or after trying to restart cryosparc? The last resort will of course be restoring the DB from a backup to a state prior to the outage.

wtempel · July 7, 2023, 2:27pm

Welcome to the forum @YvesT.

I would not remove the mongod.lock file at this time, but only (and possibly, as I do not recall encountering such a situation) after confirming that the mere presence of mongod.lock is a problem and there are no other, underlying problems.
Could you please check /XXX/cryosparc_master/run/database.log for additional error messages?
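
One way to pull only the error and fatal entries out of a long database.log is sketched below. It assumes the MongoDB 3.x log layout of "timestamp severity component [context] message", where the severity is a single letter (F, E, W, I or D). The function name severe_lines is hypothetical:

```python
import re
from typing import Iterable, List, Tuple

# "<timestamp> <severity> <component> [<context>] <message>"
LOG_LINE = re.compile(r"^(\S+)\s+([FEWID])\s+(\S+)\s+\[([^\]]+)\]\s*(.*)$")


def severe_lines(
    lines: Iterable[str], severities: str = "EF"
) -> List[Tuple[str, str, str, str]]:
    """Return (timestamp, severity, component, message) for matching entries.

    Lines that do not follow the assumed layout (e.g. backtrace addresses)
    are skipped.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group(2) in severities:
            timestamp, severity, component, _context, message = match.groups()
            hits.append((timestamp, severity, component, message))
    return hits


# Example (path as given in this thread, with its placeholder left as-is):
# with open("/XXX/cryosparc_master/run/database.log", errors="replace") as fh:
#     for entry in severe_lines(fh):
#         print(*entry)
```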

YvesT · July 7, 2023, 3:27pm

Thanks!

Since the failure, the database.log file has shown essentially the same error on every attempt to start cryosparc:

2023-07-06T22:05:02.489+0200 I CONTROL  [initandlisten] MongoDB starting : pid=54893 port=39001 dbpath=XXXXXX/db 64-bit host=YYYYY
2023-07-06T22:05:02.489+0200 I CONTROL  [initandlisten] db version v3.6.23
2023-07-06T22:05:02.489+0200 I CONTROL  [initandlisten] git version: d352e6a4764659e0d0350ce77279de3c1f243e5c
2023-07-06T22:05:02.489+0200 I CONTROL  [initandlisten] allocator: tcmalloc
2023-07-06T22:05:02.490+0200 I CONTROL  [initandlisten] modules: none
2023-07-06T22:05:02.490+0200 I CONTROL  [initandlisten] build environment:
2023-07-06T22:05:02.490+0200 I CONTROL  [initandlisten]     distarch: x86_64
2023-07-06T22:05:02.490+0200 I CONTROL  [initandlisten]     target_arch: x86_64
2023-07-06T22:05:02.490+0200 I CONTROL  [initandlisten] options: { net: { port: 39001 }, replication: { oplogSizeMB: 64, replSet: "meteor" }, storage: { dbPath: "XXXXX/db", journal: { enabled: false } } }
2023-07-06T22:05:02.496+0200 W -        [initandlisten] Detected unclean shutdown - /XXX/mongod.lock is not empty.
2023-07-06T22:05:02.498+0200 I -        [initandlisten] Detected data files in /XXX/db created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2023-07-06T22:05:02.498+0200 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2023-07-06T22:05:02.509+0200 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=31612M,cache_overflow=(file_max=0M),session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),compatibility=(release="3.0",require_max="3.0"),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),,log=(enabled=false),
2023-07-06T22:05:02.699+0200 E STORAGE  [initandlisten] WiredTiger error (-31803) [1688673902:699614][54893:0x7fd5a8a5da40], file:WiredTiger.wt, connection: __wt_turtle_read, 336: WiredTiger.turtle: fatal turtle file read error: WT_NOTFOUND: item not found Raw: [1688673902:699614][54893:0x7fd5a8a5da40], file:WiredTiger.wt, connection: __wt_turtle_read, 336: WiredTiger.turtle: fatal turtle file read error: WT_NOTFOUND: item not found
2023-07-06T22:05:02.699+0200 E STORAGE  [initandlisten] WiredTiger error (-31804) [1688673902:699687][54893:0x7fd5a8a5da40], file:WiredTiger.wt, connection: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1688673902:699687][54893:0x7fd5a8a5da40], file:WiredTiger.wt, connection: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
2023-07-06T22:05:02.699+0200 F -        [initandlisten] Fatal Assertion 50853 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 420
2023-07-06T22:05:02.699+0200 F -        [initandlisten] \n\n***aborting after fassert() failure\n\n
2023-07-06T22:05:02.715+0200 F -        [initandlisten] Got signal: 6 (Aborted).
 0x55e3f4289f21 0x55e3f4289139 0x55e3f428961d 0x7fd5a77a05d0 0x7fd5a73fa207 0x7fd5a73fb8f8 0x55e3f296ddec 0x55e3f2a48d76 0x55e3f2abaad1 0x55e3f290aa94 0x55e3f290aeb4 0x55e3f2a7e2e6 0x55e3f2a7c484 0x55e3f2a5c140 0x55e3f2ab93bd 0x55e3f2ab999d 0x55e3f2ab9c2c 0x55e3f2b2be52 0x55e3f2ab4ec8 0x55e3f2a7b97e 0x55e3f2a7ba5b 0x55e3f2a5ac8a 0x55e3f2a27cf6 0x55e3f2a09aec 0x55e3f2c19fb6 0x55e3f29e65b8 0x55e3f29e963c 0x55e3f296fbc9 0x7fd5a73e63d5 0x55e3f29d1741

How would you suggest proceeding?

wtempel · July 10, 2023, 5:16pm

The database seems to be corrupt. As a first step, I recommend eliminating, to the extent possible, the risk of a future storage outage.
For pre-requisites of running cryosparcm restore, please see the guide. For an important limitation of cryosparcm restore, please see below.
Ensure all CryoSPARC processes are terminated by running (under the CryoSPARC-related Linux account)

cryosparcm stop
then
ps -u $USER -opid,ppid,cmd | grep -e cryosparc -e mongo
to identify potentially left-over CryoSPARC-related processes, which should be terminated with
kill -TERM
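
The check for left-over processes can also be scripted. The sketch below assumes that ps is asked for a command column (pid,ppid,args), since the grep on "cryosparc" and "mongo" can only match if the command line is part of the output. The function names parse_leftovers and list_leftovers are hypothetical:

```python
import getpass
import subprocess
from typing import List, Sequence, Tuple


def parse_leftovers(
    ps_output: str, keywords: Sequence[str] = ("cryosparc", "mongo")
) -> List[Tuple[int, int, str]]:
    """Return (pid, ppid, command) for rows whose command contains a keyword."""
    found = []
    for line in ps_output.splitlines()[1:]:  # first row is the ps header
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, command = parts
        if any(keyword in command for keyword in keywords):
            found.append((int(pid), int(ppid), command))
    return found


def list_leftovers() -> List[Tuple[int, int, str]]:
    """Run ps for the current user and filter its output."""
    result = subprocess.run(
        ["ps", "-u", getpass.getuser(), "-o", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    )
    return parse_leftovers(result.stdout)
```

Each PID reported by list_leftovers() is a candidate for kill -TERM, after checking that the process really belongs to this CryoSPARC instance.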

If additional processing was performed on the CryoSPARC instance after the most recent database backup, a hypothetical database restoration may result in the overwriting of more current information in project directories by out-of-date information from the restored database. If this is a concern, I recommend

updating the CRYOSPARC_DB_PATH= definition inside /XXX/cryosparc_master/config.sh with a new, suitable path.
then running cryosparcm start, which should start CryoSPARC with a blank database.
re-creating CryoSPARC users in the database with cryosparcm createuser commands (guide).
registering CryoSPARC workers in the database with cryosparcw connect commands (on the worker(s), guide) or cryosparcm cluster connect command(s) (guide), as applicable.
attaching project directories. You may have to delete the cs.lock file from any given project directory before attachment because the instance id stored in the new, blank database differs from the old instance id stored in the old database. This is an exception to the rule that cs.lock files should generally not be deleted.
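
The first of these steps, pointing CRYOSPARC_DB_PATH at a new location, is a one-line change in config.sh. As a sketch only, assuming the line has the form export CRYOSPARC_DB_PATH="..." as in the cryosparcm status output earlier in this thread, it could be rewritten like this. The function name set_db_path and the example path are hypothetical:

```python
import re

# Matches the whole "export CRYOSPARC_DB_PATH=..." line; group 1 keeps the prefix.
DB_PATH_LINE = re.compile(r"^([ \t]*export[ \t]+CRYOSPARC_DB_PATH=).*$", re.MULTILINE)


def set_db_path(config_text: str, new_path: str) -> str:
    """Return config_text with the CRYOSPARC_DB_PATH definition replaced."""
    new_text, count = DB_PATH_LINE.subn(
        lambda match: f'{match.group(1)}"{new_path}"', config_text
    )
    if count != 1:
        # Refuse to guess if the definition is missing or duplicated.
        raise ValueError(f"expected exactly one CRYOSPARC_DB_PATH line, found {count}")
    return new_text


# Example: read config.sh, keep a backup copy, then write the result back.
# text = open("config.sh").read()
# open("config.sh", "w").write(set_db_path(text, "/new/location/db"))
```

The old database directory is left untouched by this change, so it stays available for later repair or restore attempts.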

[edited for spelling 2024-04-12]

YvesT · July 12, 2023, 12:58pm

After some trial and error we managed to restore our DB using a snapshot of the DB folder taken before the outage. As the snapshot was not a proper mongodb dump this required another step of running

mongod --dbpath /XXX/db --repair

before it could be used in cryosparc again, similar to what is described here.
The mongod executable shipped with cryosparc can be found here:

XXX/cryosparc_master/deps/external/mongodb/bin/mongod
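
The recovery described above (take a file-level snapshot of the db folder, then run mongod --repair on it) can be sketched in Python as follows. Copying the snapshot to a fresh directory before repairing is an added precaution of this sketch, not something stated in this post. The function names and the example paths are hypothetical, and CryoSPARC should be stopped while the repair runs:

```python
import shutil
import subprocess
from typing import Callable, List


def repair_command(mongod: str, dbpath: str) -> List[str]:
    """Build the repair invocation used in this thread."""
    return [mongod, "--dbpath", dbpath, "--repair"]


def repair_copy(
    snapshot_dir: str,
    work_dir: str,
    mongod: str,
    run: Callable = subprocess.run,
):
    """Copy the snapshot to work_dir and run mongod --repair on the copy.

    copytree raises FileExistsError if work_dir already exists, so an
    existing database directory is never overwritten by accident.
    """
    shutil.copytree(snapshot_dir, work_dir)
    return run(repair_command(mongod, str(work_dir)), check=True)


# Example with placeholder paths:
# repair_copy("/snapshots/db", "/data/db_repaired",
#             "XXX/cryosparc_master/deps/external/mongodb/bin/mongod")
```

After a successful repair, CRYOSPARC_DB_PATH would have to point at the repaired copy before starting CryoSPARC.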

Thanks for the help.

Side note:

I would not remove the mongod.lock file at this time, but only (and possibly, as I do not recall encountering such a situation) after confirming that the mere presence of mongod.lock is a problem and there are no other, underlying problems.

As suggested by @wtempel in the first reply, removing or emptying the mongod.lock did not help.