Almost regular occurrence of "database EXITED"

eMKiso · June 20, 2023, 8:20pm

Dear all,

in the last 12 months we have quite regularly experienced the error: database EXITED

command_vis RUNNING pid 217756, uptime 3 days, 2:26:10
database EXITED Jun 20 03:57 PM

License is valid

I see this reported when I run

cryosparcm status

The https interface does not load with this status.

This is a CS 4.2.1 instance running on an HPC cluster with the SLURM scheduler; the filesystem is ceph (I will come back to this).

This sometimes happens randomly, but if we start 5+ jobs it almost certainly happens within minutes. Quite regularly it also happens during the scheduled database backup.

If I just restart CS (cryosparcm restart or cryosparcm restart database) the database will exit in seconds or minutes. It can be repaired with mongod --dbpath ./ --repair from inside the database folder.

But this is getting tedious and we are also losing processing time… Months ago this kind of process actually led to the database being corrupted beyond repair and we started fresh with a new database. The same issues popped up again quite soon with the fresh database.

At this time my prime suspect (I am not an IT person) is the ceph filesystem. For years CS ran just fine on this cluster, but then this started to happen (I am not sure whether there was some update or other change).

We have CS running on a different cluster which also uses ceph but there it is rock solid.

I have logs if this could help pinpoint the issue. Any ideas?

How can we approach this? I have recently also contacted IT support at the HPC but we are just starting the troubleshooting.
Anyone with similar experience?

wtempel · June 20, 2023, 9:49pm

I am sorry to learn about the persistent database problems you are experiencing.

Please can you further describe your setup:

Are cryosparc_master processes (and therefore the mongo database) subject to SLURM job management, or do they run independently of SLURM, but submit jobs to a SLURM partition?
Do cryosparc_master processes run on “strained” infrastructure with significant contention for resources like CPU, RAM, network, or storage access?

Under some circumstances that I cannot clearly define, cryosparcm stop, which is part of the cryosparcm restart routine, fails to terminate cryosparc_worker processes. If you frequently experience restart problems, you may want to use a sequence of

cryosparcm stop
ps -eo pid,ppid,cmd | grep -e cryosparc -e mongo # (to confirm no CryoSPARC related processes remain)
cryosparcm start

to confirm that this isn’t the problem in your case.
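One caveat with the ps | grep check is that the grep process itself can show up in the listing, because its command line contains "cryosparc" and "mongo". A minimal sketch of a filter that hides it follows. The function name cryosparc_leftovers is hypothetical, not part of CryoSPARC, and a wrapper script whose own name contains "cryosparc" or "mongo" would still match.

```shell
#!/usr/bin/env bash
# Read `ps -eo pid,ppid,cmd` output on stdin and print only the lines that
# mention cryosparc or mongo, dropping any line that is a grep command.
# Returns 0 if at least one matching process line was printed, 1 otherwise.
# NOTE: cryosparc_leftovers is a hypothetical helper name for illustration.
cryosparc_leftovers() {
    grep -e cryosparc -e mongo | grep -v -w grep
}

# Possible usage between `cryosparcm stop` and `cryosparcm start`:
# if ps -eo pid,ppid,cmd | cryosparc_leftovers; then
#     echo "CryoSPARC-related processes still running; inspect before starting" >&2
# fi
```

If the function prints nothing and returns non-zero, no matching processes were found and cryosparcm start can follow.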

Be aware of the characterization of mongod --repair as a last resort.
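Because a repair rewrites the data files, it may be worth copying the database directory before each repair so that a failed repair does not leave you worse off. A minimal sketch, assuming the database is stopped and there is enough free space for a full copy. The function name snapshot_db_dir and the ".pre-repair" suffix are hypothetical choices for illustration.

```shell
#!/usr/bin/env bash
# Copy a stopped database directory to a timestamped sibling directory and
# print the path of the copy. The copy preserves permissions and timestamps.
# NOTE: snapshot_db_dir and the ".pre-repair" suffix are illustrative only.
snapshot_db_dir() {
    local src="${1%/}"   # strip a trailing slash, if any
    local dest
    dest="${src}.pre-repair.$(date +%Y%m%d-%H%M%S)"
    cp -a "$src" "$dest" && printf '%s\n' "$dest"
}

# Possible usage, after `cryosparcm stop` and before `mongod --dbpath ... --repair`:
# snapshot_db_dir /path/to/database
```

The copy can be deleted once the repaired database has been confirmed to start and stay up.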

Be sure to mention the timeline of the errors’ emergence to your IT support.

Interesting datapoint. Can you identify any differences between this and the other cluster that could be relevant?

Did you check /path/to/cryosparc_master/run/database.log
(cryosparcm log database)
to see what actual errors led to database exits?
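The database log is long, so it can help to pull out only the error and fatal lines. A minimal sketch, assuming the log is in the plain-text mongod format in which the second whitespace-separated field is a severity letter (F, E, W, I or D). If your log looks different, the filter needs adjusting. The function name mongo_errors is hypothetical.

```shell
#!/usr/bin/env bash
# Print only the error (E) and fatal (F) lines from a mongod log file.
# ASSUMPTION: each log line looks like
#   <timestamp> <severity letter> <component> [<context>] <message>
# NOTE: mongo_errors is a hypothetical helper name for illustration.
mongo_errors() {
    local logfile="$1"
    awk '$2 == "E" || $2 == "F"' "$logfile"
}

# Possible usage (path placeholder as written in the post above):
# mongo_errors /path/to/cryosparc_master/run/database.log | tail -n 20
```

The last few matching lines before an exit are usually the most informative ones to share with IT support.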

eMKiso · August 24, 2023, 4:57am

@wtempel thank you for the quick response and I apologise for the late reply. Unfortunately the situation has not yet improved. In the meantime we updated to CS 4.3.0 but the issue persists.

cryosparc_master processes run on a virtual host on the cluster and this virtual host is not subject to SLURM. The virtual host and CS are always on and available. The cryosparc_worker processes are then queued to the cluster via SLURM.
The hardware on which the virtual OS for cryosparc_master processes is running is, from what we have been told, not strained. And the database errors happen at different times: during the day, during the night, during holidays…
cryosparcm stop seems to work. There are no cryosparc processes left running after this command.
Yeah I realize that --repair is the last resort but we have to use it every other day to keep using this CS instance.
From our point of view both clusters use ceph, but other than that we don’t have any specific info about what the differences are.
Yes, I have checked the log multiple times. Here is an example from today. It is too long to paste here so it is available on the link above. I just cut out a section that starts about 10 minutes before the error and includes the whole error.

I can send any logs that might help.

wtempel · August 24, 2023, 4:14pm

@eMKiso Thanks for providing these details.
Please can you provide additional details on the virtual host running cryosparc_master processes:

The virtualization platform: container? vm? implementation type?

output for these commands on the CryoSPARC master host in a new shell:

uname -a 
free -g
eval $(cryosparcm env)
stat -f $CRYOSPARC_DB_PATH
df -h $CRYOSPARC_DB_PATH
exit

eMKiso · August 24, 2023, 6:13pm

Hi @wtempel
I have sent you the details via private message.

Best!

wtempel · August 30, 2023, 6:07pm

In v4.3.0, you can enable database journalling by specifying

export CRYOSPARC_MONGO_EXTRA_FLAGS=" "

inside cryosparc_master/config.sh and subsequently restarting CryoSPARC. Note the space between the quotes.
Journalling should make your database more resilient to disruption and consequent corruption.

eMKiso · August 30, 2023, 7:45pm

Hi @wtempel,

thank you for the suggestion.
I noticed the new MongoDB variable in the latest release notes for 4.3.0 but haven’t enabled it yet.
A couple of days ago the cluster admin updated the OS of the CS master host server. I’ll wait for a while to see how that works out.

Are there any downsides of using this variable?

export CRYOSPARC_MONGO_EXTRA_FLAGS=" "

Do you plan to enable it for everyone in the future releases?

Best!

eMKiso · September 1, 2023, 3:58pm

Hi,

the error is back 6 days after the cryosparc_master host update.
The cluster admin says that there are no obvious networking or ceph errors that could explain the cryosparc database exit.

I suppose it is time to test the CRYOSPARC_MONGO_EXTRA_FLAGS.

Best!

xzhang2017 · September 18, 2023, 12:56am

Any solution? I recently got the same problem too. Thanks!

xzhang2017 · September 18, 2023, 3:03am

Found the reason in my case: the group quota had reached its limit, and increasing the quota limit solved the problem.

Bests!
Xing
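For anyone wanting to rule out a full disk quickly, here is a minimal sketch that reports how full the filesystem holding the database is. It relies only on POSIX-format df output. It does not see group quotas on most setups, so a quota limit like the one described above may still need a site-specific tool or a question to the cluster admin. The function name fs_used_percent and the 90% threshold are hypothetical.

```shell
#!/usr/bin/env bash
# Print the used-space percentage (integer, without the % sign) of the
# filesystem that holds the given path, based on `df -P` output.
# NOTE: fs_used_percent is a hypothetical helper name for illustration.
fs_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Possible usage; CRYOSPARC_DB_PATH is set by `eval $(cryosparcm env)` as in
# the commands earlier in this thread. The threshold of 90 is arbitrary.
# used=$(fs_used_percent "$CRYOSPARC_DB_PATH")
# [ "$used" -ge 90 ] && echo "database filesystem is ${used}% full" >&2
```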

eMKiso · September 18, 2023, 7:18pm

Now we are testing the CRYOSPARC_MONGO_EXTRA_FLAGS.
Since we enabled this option we had zero cases of database EXITED errors.

For now it seems that this really helps!
I sure hope it stays like that.

eMKiso · September 19, 2023, 7:54pm

Well, after 16 days it happened again.
So it seems that CRYOSPARC_MONGO_EXTRA_FLAGS helps but may not be the perfect solution.

It may just be a bad combination of cluster properties and the cryosparc (MongoDB) software…