What to do in the case of unrecoverable MongoDB corruption?

pgoetz · November 17, 2021, 2:48pm

We suffered what is essentially the worst-case scenario for data corruption: the hardware RAID controller failed catastrophically with ECC memory errors while several cryosparc jobs were running on multiple nodes. Unfortunately, I do not have a current backup of the database (will address that as soon as we’ve recovered). Meanwhile, I am unable to recover the corrupted database.

I’ve seen several posts in this forum where the last entry is something like

 mongod --dbpath run/db --repair

didn’t work – what do I do now? Hoping to get an answer this time. Presumably one possibility would be to just delete the database entirely and start from scratch?

cd cryosparc_database
rm -rf *

What would be the consequences of this? That is, does it delete DB metadata created at install time, and so necessitate a re-install of cryosparc? Based on this document:
https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/tutorial-data-management-in-cryosparc
Projects and jobs are now self-contained and can repopulate the database from internal metadata. So what would be lost is information about users and passwords? Anything else?

Edit: I should probably mention that when I try to repair the database, mongod core dumps:

 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55dd452e7ac1]
 mongod(+0x1532CD9) [0x55dd452e6cd9]
 mongod(+0x15331BD) [0x55dd452e71bd]
 libpthread.so.0(+0x153C0) [0x7fce067e13c0]
 libc.so.6(gsignal+0xCB) [0x7fce0662018b]
 libc.so.6(abort+0x12B) [0x7fce065ff859]
 mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x55dd445bbe97]
 mongod(+0x126AB66) [0x55dd4501eb66]
 mongod(__wt_eventv+0x3D7) [0x55dd445c5b46]
 mongod(__wt_err+0x9D) [0x55dd445c5d62]
 mongod(__wt_panic+0x2E) [0x55dd445c5fc4]
 mongod(__wt_block_extlist_read+0x8F) [0x55dd45bebc9f]
 mongod(__wt_block_extlist_read_avail+0x2B) [0x55dd45bec1eb]
 mongod(__wt_block_checkpoint_load+0x26D) [0x55dd45be87dd]
 mongod(+0x1E393B7) [0x55dd45bed3b7]
 mongod(__wt_btree_open+0xB43) [0x55dd45c06ff3]
 mongod(__wt_conn_btree_open+0x19B) [0x55dd45c406cb]
 mongod(__wt_session_get_btree+0xFB) [0x55dd45ccb50b]
 mongod(__wt_session_get_btree+0x63D) [0x55dd45ccba4d]
 mongod(__wt_session_get_btree_ckpt+0x14C) [0x55dd45ccbd0c]
 mongod(__wt_curfile_open+0x161) [0x55dd45c4edf1]
 mongod(+0x1F0D838) [0x55dd45cc1838]
 mongod(__wt_metadata_cursor_open+0x6E) [0x55dd45c8b7fe]
 mongod(__wt_metadata_cursor+0x4B) [0x55dd45c8b8db]
 mongod(wiredtiger_open+0x1659) [0x55dd45c3c8b9]
 mongod(_ZN5mongo18WiredTigerKVEngineC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_PNS_11ClockSourceES8_mbbbb+0x73A) [0x55dd4500388a]
 mongod(+0x12480C5) [0x55dd44ffc0c5]
 mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x697) [0x55dd44eee5d7]
 mongod(+0x7F480C) [0x55dd445a880c]
 mongod(main+0x96B) [0x55dd445c707b]
 libc.so.6(__libc_start_main+0xF3) [0x7fce066010b3]
 mongod(+0x86DC41) [0x55dd44621c41]
-----  END BACKTRACE  -----
Aborted (core dumped)

stephan · November 17, 2021, 7:45pm

Hi @pgoetz,

Sorry to hear the repair command isn’t working. There’s not really much you can do to get around situations like these unless you keep daily backups.

Importing your projects would be your best bet. I’d recommend making a backup of every single project before you start.
You lose users, but also worker node connections (“lanes”). You can easily re-connect them by running the cryosparcw connect command.
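
As a rough sketch of the re-connect step: the wrapper below just forwards its arguments to `cryosparcw connect`. The function name `reconnect_worker` and all hostnames, ports, and paths are hypothetical placeholders. It assumes `cryosparcw` is on the `PATH` of the worker node (it normally lives under `cryosparc_worker/bin`) and that the flags match your installed version.

```shell
#!/usr/bin/env bash
# reconnect_worker WORKER MASTER PORT SSD_PATH
# Re-registers one worker node with the master after the database was recreated.
# All four values are site-specific; nothing here is read from cryoSPARC itself.
reconnect_worker() {
    local worker="$1" master="$2" port="$3" ssd_path="$4"
    cryosparcw connect \
        --worker "$worker" \
        --master "$master" \
        --port "$port" \
        --ssdpath "$ssd_path"
}

# Example call, run on the worker node (hypothetical values):
#   reconnect_worker node01.example.org master.example.org 39000 /scratch/cryosparc_cache
```

This would be repeated once per worker node, since each node registers itself.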

Before you do all that, you can first just specify a new database folder in cryosparc_master/config.sh, no need to reinstall cryoSPARC. When you run cryosparcm start the process supervisor will spawn MongoDB, which will detect an empty folder and initialize a new database directory. Once cryoSPARC finishes turning on, you’ll see an empty instance. At this point, you can start importing each of your projects (keep backups of all your projects since you don’t have a database yet!), create user accounts, connect worker nodes, and set up a cron job to execute database backups.
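
A minimal sketch of the fresh-database step, assuming cryoSPARC has been stopped with `cryosparcm stop` and that the database location is set by `CRYOSPARC_DB_PATH` in `cryosparc_master/config.sh`. The helper name `set_aside_db` and the example path are hypothetical. Instead of deleting the corrupted directory, it renames it with a timestamp and creates an empty directory in its place for MongoDB to initialize.

```shell
#!/usr/bin/env bash
# set_aside_db DB_DIR
# Renames DB_DIR to DB_DIR.corrupt-<timestamp>, recreates DB_DIR empty,
# and prints the path of the renamed copy.
set_aside_db() {
    local db_dir="$1"
    local stamp aside
    stamp="$(date +%Y%m%d-%H%M%S)"
    aside="${db_dir}.corrupt-${stamp}"
    mv -- "$db_dir" "$aside" || return 1
    mkdir -p -- "$db_dir" || return 1
    printf '%s\n' "$aside"
}

# Example sequence (hypothetical path; run as the cryoSPARC user):
#   cryosparcm stop
#   set_aside_db /data/cryosparc_database
#   # config.sh keeps: export CRYOSPARC_DB_PATH="/data/cryosparc_database"
#   cryosparcm start
```

Keeping the renamed copy costs disk space, but it leaves the corrupted files available in case a later repair attempt becomes possible.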

pgoetz · November 17, 2021, 9:28pm

Thanks. I ended up emptying the cryosparc_database folder (we have a backup of this, but it’s useless) and we’ve started the process of recreating users and re-importing jobs. I’m going to do daily DB backups after this.
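
For the daily backups, one possible sketch is a small wrapper around `cryosparcm backup` driven by cron. The function name `daily_backup`, the file naming scheme, and the paths are hypothetical, and the sketch assumes `cryosparcm` is on the `PATH` of the user running the job and that cryoSPARC is running when the backup fires.

```shell
#!/usr/bin/env bash
# daily_backup BACKUP_DIR
# Writes a dated database dump into BACKUP_DIR via `cryosparcm backup`.
daily_backup() {
    local backup_dir="$1"
    local file
    file="cryosparc_db_$(date +%Y_%m_%d).archive"
    mkdir -p -- "$backup_dir" || return 1
    cryosparcm backup --dir="$backup_dir" --file="$file"
}

# Hypothetical crontab entry, assuming this file is saved as
# /opt/cryosparc/daily_backup.sh with a final line `daily_backup /backups/cryosparc`:
#   0 2 * * * /bin/bash /opt/cryosparc/daily_backup.sh
```

Writing the dumps to a different disk or host than the database itself would avoid losing both to the same controller failure.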