Hardware failure caused database corruption

The RAID controller on our cryoSPARC master server crashed, leaving the database in a corrupted state. From a previous bout with this, I know the recovery steps are roughly as follows (concrete commands below):

  • make a backup copy of the database directory
  • cd into the database directory
  • mongod --dbpath . --repair
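
Concretely, I’m planning on something along these lines (assuming the database still lives at ~/cryosparc_database and that cryoSPARC has been stopped first, e.g. with cryosparcm stop, so nothing is writing to it):

cryosparcm stop
cp -a ~/cryosparc_database ~/cryosparc_database_backup
cd ~/cryosparc_database
mongod --dbpath . --repair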

While attempting to back up the database, I noticed it’s taking up almost 1 TB of space, which doesn’t seem right. On further investigation, it seems the folder contains a couple of very large backup files that I didn’t create:

cryosparc_user@cerebro:~/cryosparc_database$ pwd
/local/home/cryosparc_user/cryosparc_database
cryosparc_user@cerebro:~/cryosparc_database$ ls -lh backup
total 413G
-rw-rw-r-- 1 cryosparc_user cryosparc_user 206G Jun 12 19:15 cryosparc_backup_2021_06_12_14h03.archive
-rw-rw-r-- 1 cryosparc_user cryosparc_user 207G Jun 15 19:02 cryosparc_backup_2021_06_15_13h51.archive

First, is there some process that would have created these files automatically?
Second, one of the .wt files looks to have grown out of control:

-rw-r--r-- 1 cryosparc_user cryosparc_user  339G Nov  6 18:19 collection-36--7747720921166270324.wt

I’m guessing the controller crashed while cryoSPARC was attempting to write something, started spinning its wheels, and filled the database with cruft before we could catch it.

So, my question is: should I just proceed with the recovery method above? Backing up the current database folder is going to take a lot of time under the circumstances.

Any other suggestions? Understood that this is kind of a worst case scenario for software.

Looks like the MongoDB repair is failing. Any thoughts on anything else I can try?

cryosparc_user@cerebro:~$ mongod --dbpath ./cryosparc_database --repair
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] MongoDB starting : pid=22382 port=27017 dbpath=./cryosparc_database 64-bit host=cerebro
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] db version v3.4.10
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] git version: 078f28920cb24de0dd479b5ea6c66c644f6326e9
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] modules: none
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] build environment:
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten]     distarch: x86_64
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2021-11-09T20:38:21.847+0000 I CONTROL  [initandlisten] options: { repair: true, storage: { dbPath: "./cryosparc_database" } }
2021-11-09T20:38:21.847+0000 W -        [initandlisten] Detected unclean shutdown - ./cryosparc_database/mongod.lock is not empty.
2021-11-09T20:38:21.876+0000 I -        [initandlisten] Detected data files in ./cryosparc_database created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2021-11-09T20:38:21.876+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2021-11-09T20:38:21.876+0000 I STORAGE  [initandlisten]
2021-11-09T20:38:21.876+0000 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2021-11-09T20:38:21.876+0000 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2021-11-09T20:38:21.876+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=47147M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),            checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),,log=(enabled=false),
2021-11-09T20:38:21.894+0000 E STORAGE  [initandlisten] WiredTiger error (0) [1636490301:894083][22382:0x7f3e12e73d00], file:WiredTiger.wt, connection: read checksum error for 4096B block at offset 290816: block header checksum of 1030775143 doesn't match expected checksum of 664109259
2021-11-09T20:38:21.894+0000 E STORAGE  [initandlisten] WiredTiger error (0) [1636490301:894117][22382:0x7f3e12e73d00], file:WiredTiger.wt, connection: WiredTiger.wt: encountered an illegal file format or internal value
2021-11-09T20:38:21.894+0000 E STORAGE  [initandlisten] WiredTiger error (-31804) [1636490301:894132][22382:0x7f3e12e73d00], file:WiredTiger.wt, connection: the process must exit and restart: WT_PANIC: WiredTiger library panic
2021-11-09T20:38:21.894+0000 I -        [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2021-11-09T20:38:21.894+0000 I -        [initandlisten]

***aborting after fassert() failure

OK, giving up on getting a response here. Can someone tell me which version of MongoDB is included with the current version of cryoSPARC?

mongod --version should get you your version of MongoDB.
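
For example, based on the repair log you posted, you should see something like:

$ mongod --version
db version v3.4.10
git version: 078f28920cb24de0dd479b5ea6c66c644f6326e9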

Re: your actual data recovery problem, this looks pretty complex, and the RAID controller appears to have done something it’s never supposed to do. The first thing I would do is check free space on the drives; I’ve seen full drives crash RAID controllers before.
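
For example, something like this (substitute whichever filesystem actually holds the database and project directories; I’m guessing /local from the path you posted):

df -h /local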

What type of RAID are you running?

It’s a RAID6 array on a standard LSI controller from Thinkmate. There’s plenty of free space in the partition; the controller diagnostics indicate an unrecoverable ECC memory error. I’m pretty sure the database isn’t coming back from that. Moving forward, I need to start backing this up aggressively, and will post a separate ticket about that.