The https web interface does not load when the instance is in this state.
This is a CS v4.2.1 instance running on an HPC cluster with the SLURM scheduler; the filesystem is ceph (I will come back to this).
This sometimes happens seemingly at random, but if we start 5+ jobs it almost always happens within minutes. Quite regularly it also happens during the scheduled database backup.
If I just restart CS (cryosparcm restart or cryosparcm restart database), the database exits again within seconds or minutes. It can be repaired with mongod --dbpath ./ --repair run from inside the database folder.
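Roughly, the recovery sequence looks like this (the database path below is just a placeholder; ours is set in the master config):
cryosparcm stop
# run the repair from inside the database directory
cd /path/to/cryosparc_database
mongod --dbpath ./ --repair
cryosparcm start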
But this is getting tedious and we are also losing processing time… Months ago this cycle actually left the database corrupted beyond repair, and we started fresh with a new database. The same issues popped up again quite soon with the fresh database.
At this point my suspicion (I am not an IT person) is the ceph filesystem. For years CS ran just fine on this cluster, but then this started happening (I am not sure whether there was an update or some other change).
We have CS running on a different cluster which also uses ceph but there it is rock solid.
I have logs if this could help pinpoint the issue. Any ideas?
How can we approach this? I have also recently contacted IT support at the HPC centre, but we are just starting the troubleshooting.
Anyone with similar experience?
I am sorry to learn about the persistent database problems you are experiencing.
Please can you further describe your setup:
Are cryosparc_master processes (and therefore the mongo database) subject to SLURM job management, or do they run independently of SLURM, but submit jobs to a SLURM partition?
Do cryosparc_master processes run on “strained” infrastructure with significant contention for resources like CPU, RAM, network, or storage access?
Under some circumstances that I cannot clearly define, cryosparcm stop, which is part of the cryosparcm restart routine, fails to terminate cryosparc_worker processes. If you frequently experience restart problems, you may want to use a sequence of
cryosparcm stop
ps -eo pid,ppid,cmd | grep -e cryosparc -e mongo # (to confirm no CryoSPARC related processes remain)
cryosparcm start
to confirm that this isn’t the problem in your case.
Be aware of the characterization of mongod --repair as a last resort.
Be sure to mention the timeline of the errors’ emergence to your IT support.
Interesting datapoint. Can you identify any differences between this and the other cluster that could be relevant?
Did you check /path/to/cryosparc_master/run/database.log
(cryosparcm log database)
to see what actual errors led to database exits?
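For example, to inspect the most recent entries around an exit without following the live log (the install path is a placeholder), something like
tail -n 200 /path/to/cryosparc_master/run/database.log
grep -i error /path/to/cryosparc_master/run/database.log
may be helpful.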
@wtempel thank you for the quick response, and I apologise for the late reply. Unfortunately the situation has not yet improved. In the meantime we updated to CS v4.3.0, but the issue persists.
cryosparc_master processes run on a virtual host on the cluster, and this virtual host is not subject to SLURM. The virtual host and CS are always on and available. The cryosparc_worker processes are then queued to the cluster via SLURM.
The hardware that the virtual machine hosting the cryosparc_master processes runs on is, from what we have been told, not strained. And the database errors happen at different times: during the day, during the night, during holidays…
cryosparcm stop seems to work. There are no cryosparc processes left running after this command.
Yeah, I realize that --repair is a last resort, but we have to use it every other day to keep this CS instance usable.
From our point of view both clusters use ceph, but other than that we don’t have any specific information about the differences.
Yes, I have checked the log multiple times. Here is an example from today. It is too long to paste here, so it is available at the link above. I cut out a section that starts about 10 minutes before the error and includes the whole error.
In v4.3.0, you can enable database journalling by specifying
export CRYOSPARC_MONGO_EXTRA_FLAGS=" "
Note the space between the quotes.
inside cryosparc_master/config.sh and subsequently restarting CryoSPARC.
Journalling should make your database more resilient to disruption and consequent corruption.
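Concretely, the change could look like this (install path is a placeholder; adjust to your setup):
# append to /path/to/cryosparc_master/config.sh
export CRYOSPARC_MONGO_EXTRA_FLAGS=" "   # a single space between the quotes
# then relaunch so the database is restarted with the new flags
cryosparcm restart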
Thank you for the suggestion.
I noticed the new MongoDB variable in the latest release notes for 4.3.0 but haven’t enabled it yet.
A couple of days ago the cluster admin updated the OS of the CS master host server. I’ll wait for a while to see how that works out.
Are there any downsides to using this variable?
export CRYOSPARC_MONGO_EXTRA_FLAGS=" "
Do you plan to enable it for everyone in future releases?
The error is back, 6 days after the cryosparc_master host update.
The cluster admin says that there are no obvious networking or ceph errors that could explain the cryosparc database exit.
I suppose it is time to test the CRYOSPARC_MONGO_EXTRA_FLAGS.
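Once it is enabled, I plan to confirm that the running mongod actually picked up the change by comparing its command line before and after, along the lines of
ps -eo pid,ppid,cmd | grep -e mongo
(that check is just my guess at how to verify it; corrections welcome).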