Backing up the MongoDB database

pgoetz · November 10, 2021, 6:28pm

Given that we just experienced what appears to be unrecoverable mongo database corruption, I’m keen to start aggressively backing up the Mongo DB. What I have in my notes is:

To back up the MongoDB database:
$ cryosparcm backup

Is this the correct command? Presumably I need to stop cryosparc before running this?

Finally, do the backups go to ~/cryosparc-database/backup by default? Is there a way to have them go elsewhere?

stephan · November 10, 2021, 6:36pm

Hi @pgoetz,

Please check out this section of the guide for detailed instructions:
https://guide.cryosparc.com/setup-configuration-and-management/management-and-monitoring/cryosparcm#cryosparcm-backup

user123 · November 12, 2021, 7:40pm

Is it necessary to also back up the raw data in the /cryosparc_database/project/ directory? It’s not quite clear to me what the cryosparcm backup command does, or whether its output can be used to restore an instance if the hard drive containing cryosparc_database fails. We don’t have a RAID system, so I periodically rsync cryosparc_database to an external drive and also save the cryosparcm backup archive file. Is this enough to restore and recover processing pathways if the main drive is lost? Is there a better failsafe backup solution?

Sorry to hear about your loss, @pgoetz - hopefully the data can be recovered or at least replicated from the raw data

pgoetz · November 12, 2021, 8:21pm

I think it can all be replicated; it’s just a function of time. We run many, many cryosparc jobs on multiple worker nodes. I do have a question about using wt to recover the Mongo database, as per some things I’ve read online, but will turn that into a separate post.

I think the answer to your question is that you do need to back up your raw data if it’s important to preserve it, but my understanding is that cryosparcm backup just backs up the Mongo database. How they’re able to do this and maintain data integrity without stopping cryosparc is a mystery (or perhaps cryosparc is paused during the backup). Pausing would be an issue for us, as we’re running multiple cryosparc jobs 24/7.

stephan · November 18, 2021, 3:50pm

Hi @user123, @pgoetz,

Please note cryoSPARC v3.2.0 uses MongoDB v3.4.10.

Keeping an rsync'd copy, as @user123 describes, is ideal if you’d like to keep a backup of the results that cryoSPARC has created (e.g., micrographs, particles, 3D volumes), but if you only want to keep a backup of the database (users, lanes, projects, workspace and job metadata, plots, event logs, etc.), the cryosparcm backup function is enough.
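For anyone wanting the backup archive to land somewhere other than the default location, here is a minimal sketch. It assumes the --dir and --file options described in the guide section linked earlier in the thread (check cryosparcm's help for your version); BACKUP_DIR, the file name pattern, and both function names are hypothetical.

```shell
# Hypothetical destination; override by setting BACKUP_DIR before calling.
BACKUP_DIR="${BACKUP_DIR:-/mnt/backup/cryosparc_db}"

# Build a dated archive name such as cryosparc_db_2021_11_18.archive.
backup_name() {
    printf 'cryosparc_db_%s.archive\n' "$(date +%Y_%m_%d)"
}

# Create the destination if needed, then dump the database into it.
# Assumption: cryosparcm backup accepts --dir and --file as in the guide.
run_backup() {
    mkdir -p "$BACKUP_DIR" &&
        cryosparcm backup --dir="$BACKUP_DIR" --file="$(backup_name)"
}
```

Calling run_backup from a nightly cron job on the master node would give one archive per day; pruning old archives is left out of this sketch.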

If the drive containing the cryoSPARC database folder is lost, but you have an rsync'd backup of it, you can restore your instance by specifying the path to the backup database directory in cryosparc_master/config.sh : CRYOSPARC_DB_PATH, then restarting cryoSPARC.
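As an illustration of that config change, the relevant line in cryosparc_master/config.sh might look like the following. The path is hypothetical; substitute wherever the rsync'd copy of the database directory actually lives.

```shell
# cryosparc_master/config.sh (excerpt)
# Hypothetical location of the rsync'd copy of the database directory.
export CRYOSPARC_DB_PATH="/mnt/backup/cryosparc_database"
```

After saving the file, restart the instance (for example with cryosparcm restart) so that the new path is picked up.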

The cryosparcm backup function is a wrapper around mongodump. You can read about it in the MongoDB documentation. Relevant excerpts:

mongodump and mongorestore operate against a running mongod process, and can manipulate the underlying data files directly.

When connected to a MongoDB instance, mongodump can adversely affect mongod performance. If your data is larger than system memory, the queries will push the working set out of memory, causing page faults.

Also, a note about making backups of the database using cp or rsync:

Since copying multiple files is not an atomic operation, you must stop all writes to the mongod before copying the files. Otherwise, you will copy the files in an invalid state.

You can “stop all writes” by turning off cryoSPARC before you make a copy: cryosparcm stop.
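The stop, copy, start sequence could be scripted roughly as below. This is a sketch, not an official procedure: the function names and the DRY_RUN switch are made up for illustration, and the rsync options are one reasonable choice (-a preserves attributes, --delete removes files in the destination that no longer exist in the source).

```shell
# Print the command instead of running it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# cold_copy SRC DEST: stop cryoSPARC, mirror SRC into DEST, start it again.
cold_copy() {
    src="$1"
    dest="$2"
    # Do not copy if cryoSPARC could not be stopped (writes may still happen).
    run cryosparcm stop || return 1
    run rsync -a --delete "$src/" "$dest/"
    status=$?
    # Bring cryoSPARC back up even if the copy failed.
    run cryosparcm start
    # Report the rsync result to the caller.
    return "$status"
}
```

Running it once with DRY_RUN=1, for example `DRY_RUN=1 cold_copy /path/to/cryosparc_database /path/to/copy`, prints the three commands so the paths can be checked before any downtime is incurred.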

user123 · November 18, 2021, 10:52pm

Really helpful explanation! Thank you.

user123 · November 22, 2021, 1:14am

Hi @stephan, is it absolutely necessary to stop cryosparc before starting rsync if I want to transfer one project from this database to another system? Even if no jobs are running in the project undergoing rsync?
Also, can I transfer just one (very large) project to a new system with this rsync’d database and the cryosparcm backup file (pointing config.sh CRYOSPARC_DB_PATH to the rsync’d cryosparc_database/project/Pxx)? I don’t particularly need to transfer the other projects (but don’t mind doing so if this is necessary to make it work). I’m not sure I even need the cryosparcm backup file for this. Can’t I just point CRYOSPARC_DB_PATH of the new system to the newly rsync’d single project directory and start cryosparc and work there?