Almost regular occurrence of "database EXITED"

Dear all,

in the last 12 months we have quite regularly experienced the error: database EXITED

command_vis RUNNING pid 217756, uptime 3 days, 2:26:10
database EXITED Jun 20 03:57 PM


License is valid

I see this reported when I run

cryosparcm status

The https interface does not load with this status.
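To build a timeline of when the exits occur, the status output shown above can be parsed mechanically. This is only a sketch: the function name db_state and the log file name are made up for illustration, and it assumes the status lines keep the "name STATE ..." layout shown above.

```shell
#!/usr/bin/env bash
# Read `cryosparcm status` output on stdin and print the state reported
# for the "database" process (for example RUNNING or EXITED).
# Assumes lines of the form "<name> <STATE> ..." as in the output above.
db_state() {
    awk '$1 == "database" { print $2 }'
}

# Example, e.g. from a cron job (the log file name is a placeholder):
#   echo "$(date '+%F %T') $(cryosparcm status | db_state)" >> db_state.log
```

A log like this could make it easier to line up the exits with backups, job bursts or storage events on the cluster side.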

This is a CS 4.2.1 instance running on an HPC cluster with the SLURM scheduler; the filesystem is ceph (I will come back to this).

This sometimes happens randomly, but if we start 5+ jobs it almost certainly happens within minutes. Quite regularly it also happens during the scheduled database backup.

If I just restart CS (cryosparcm restart or cryosparcm restart database) the database will exit in seconds or minutes. It can be repaired with mongod --dbpath ./ --repair from inside the database folder.

But this is getting tedious and we are also losing processing time… Months ago this kind of process actually led to the database being corrupted beyond repair and we started fresh with a new database. The same issues popped up again quite soon with the fresh database.

At this time my suspicion (I am not an IT person) is the ceph filesystem. For years CS ran just fine on this cluster but then this started to happen (not sure if there was some update or whatever).

We have CS running on a different cluster which also uses ceph but there it is rock solid.

I have logs if this could help pinpoint the issue. Any ideas?

How can we approach this? I have recently also contacted IT support at the HPC but we are just starting the troubleshooting.
Anyone with similar experience?

I am sorry to learn about the persistent database problems you are experiencing.

Please can you further describe your setup:

  1. Are cryosparc_master processes (and therefore the mongo database) subject to SLURM job management, or do they run independently of SLURM, but submit jobs to a SLURM partition?
  2. Do cryosparc_master processes run on “strained” infrastructure with significant contention for resources like CPU, RAM, network, storage access?

Under some circumstances that I cannot clearly define,
cryosparcm stop, which is part of the cryosparcm restart routine, fails to terminate cryosparc_worker processes. If you frequently experience restart problems, you may want to use a sequence of

  1. cryosparcm stop
  2. ps -eo pid,ppid,cmd | grep -e cryosparc -e mongo # (to confirm no CryoSPARC related processes remain)
  3. cryosparcm start

to confirm that this isn’t the problem in your case.
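The three steps above could be wrapped roughly as follows. This is a sketch, not an official procedure: the function names are hypothetical, and the filter simply reuses the ps/grep pattern from step 2 while dropping the grep helper's own line.

```shell
#!/usr/bin/env bash
# Filter `ps -eo pid,ppid,cmd` style output (read from stdin) down to
# lines mentioning cryosparc or mongo, ignoring the grep helper itself.
filter_cryosparc_procs() {
    grep -e cryosparc -e mongo | grep -v -e ' grep ' || true
}

# Stop CryoSPARC and refuse to start it again while related processes remain.
safe_restart() {
    cryosparcm stop
    local leftovers
    leftovers=$(ps -eo pid,ppid,cmd | filter_cryosparc_procs)
    if [ -n "$leftovers" ]; then
        echo "CryoSPARC-related processes still running:" >&2
        echo "$leftovers" >&2
        return 1
    fi
    cryosparcm start
}
```

Running safe_restart from a script whose own name contains "cryosparc" or "mongo" would make the script match itself, so a neutral file name is assumed here.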

Be aware of the characterization of mongod --repair as a last resort.
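Because a repair rewrites the data files, one cautious pattern is to copy the database directory before each repair attempt. The sketch below assumes there is enough free space for a full copy and that CryoSPARC is stopped first; the function name and paths are placeholders.

```shell
#!/usr/bin/env bash
# Copy a database directory to a timestamped sibling directory and print
# the path of the copy. Run only while CryoSPARC and mongod are stopped.
snapshot_db_dir() {
    local src="${1%/}"
    local dest="${src}.pre-repair.$(date +%Y%m%d-%H%M%S)"
    cp -a "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (paths are placeholders):
#   cryosparcm stop
#   snapshot_db_dir /path/to/cryosparc_database
#   (cd /path/to/cryosparc_database && mongod --dbpath ./ --repair)
```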

Be sure to mention the timeline of the errors’ emergence to your IT support.

Interesting datapoint. Can you identify any differences between this and the other cluster that could be relevant?

Did you check /path/to/cryosparc_master/run/database.log
(cryosparcm log database)
to see what actual errors led to database exits?
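When the log is long, filtering it by severity may help to find the relevant lines. The following sketch assumes the plain-text mongod log layout in which the second whitespace-separated field is a one-letter severity (E for error, F for fatal); if your log uses a different layout, the filter will need adjusting. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Keep only Error (E) and Fatal (F) lines from a plain-text mongod log
# read on stdin. Assumes "<timestamp> <severity> <component> ..." lines.
mongo_errors() {
    awk '$2 == "E" || $2 == "F"'
}

# Example:
#   mongo_errors < /path/to/cryosparc_master/run/database.log | tail -n 50
```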

@wtempel thank you for the quick response and I apologise for the late reply. Unfortunately the situation has not yet improved. In the meantime we updated to CS 4.3.0 but the issue persists.

  1. cryosparc_master processes run on a virtual host on the cluster and this virtual host is not subject to SLURM. The virtual host and CS are always on and available. The cryosparc_worker processes are then queued to the cluster via SLURM.
  2. The hardware on which the virtual OS for the cryosparc_master processes runs is, from what we have been told, not strained. And the database errors happen at different times: during the day, during the night, during holidays…
  3. cryosparcm stop seems to work. There are no cryosparc processes left running after this command.
  4. Yeah I realize that --repair is the last resort but we have to use it every other day to keep using this CS instance.
  5. From our point of view both clusters use ceph, but other than that we don’t have any specific information about the differences.
  6. Yes, I have checked the log multiple times. Here is an example from today. It is too long to paste here so it is available on the link above. I just cut out a section that starts about 10 minutes before the error and includes the whole error.

I can send any logs that might help.

@eMKiso Thanks for providing these details.
Please can you provide additional details on the virtual host running cryosparc_master processes:

  1. The virtualization platform: container? vm? implementation type?
  2. output for these commands on the CryoSPARC master host in a new shell:
    uname -a 
    free -g
    eval $(cryosparcm env)
    stat -f $CRYOSPARC_DB_PATH
    df -h $CRYOSPARC_DB_PATH
    exit
    

Hi @wtempel
I have sent you the details via private message.

Best!

In v4.3.0, you can enable database journalling by specifying

export CRYOSPARC_MONGO_EXTRA_FLAGS=" "

inside cryosparc_master/config.sh and subsequently restarting CryoSPARC.
Note the space between the quotes.
Journalling should make your database more resilient to disruption and consequent corruption.
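Before restarting, it may be worth confirming that the line really is active in config.sh and not commented out. A minimal sketch, with a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Succeed if the given config file contains an uncommented
# `export CRYOSPARC_MONGO_EXTRA_FLAGS=...` line.
has_mongo_extra_flags() {
    grep -Eq '^[[:space:]]*export[[:space:]]+CRYOSPARC_MONGO_EXTRA_FLAGS=' "$1"
}

# Example:
#   has_mongo_extra_flags /path/to/cryosparc_master/config.sh && echo "flag is set"
```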

Hi @wtempel,

thank you for the suggestion.
I noticed the new MongoDB variable in the latest release notes for 4.3.0 but haven’t enabled it yet.
A couple of days ago the cluster admin updated the OS of the CS master host server. I’ll wait for a while to see how that works out.

Are there any downsides of using this variable?

export CRYOSPARC_MONGO_EXTRA_FLAGS=" "

Do you plan to enable it for everyone in the future releases?

Best!

Hi,

the error is back 6 days after the cryosparc_master host update.
The cluster admin says that there are no obvious networking or ceph errors that could explain the cryosparc database exit.

I suppose it is time to test the CRYOSPARC_MONGO_EXTRA_FLAGS.

Best!

Any solution? I recently got the same problem too. Thanks!

Found the reason for my case: it is because the group quota reached the limit, and increasing the quota limit solved the problem.
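A full filesystem or exhausted quota is cheap to rule out. The sketch below only reads filesystem-level usage from df -P; it will not necessarily reflect a group quota, for which the site's own quota tooling is the place to look. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Read `df -P <path>` output on stdin and print the use percentage
# (without the % sign) from the first data line.
df_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example (CRYOSPARC_DB_PATH comes from `eval $(cryosparcm env)`):
#   df -P "$CRYOSPARC_DB_PATH" | df_use_percent
```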

Bests!
Xing


Now we are testing the CRYOSPARC_MONGO_EXTRA_FLAGS.
Since we enabled this option we had zero cases of database EXITED errors.

For now it seems that this really helps!
I sure hope it stays like that. :grinning:

Well, after 16 days it happened again.
So it seems that CRYOSPARC_MONGO_EXTRA_FLAGS helps but may not be the perfect solution.

It may be just a bad combination of cluster properties and cryosparc (mongodb) software…

To revive this thread, we are still having issues with database corruption on one HPC cluster. Still running CS 4.7.1.

Every month or two the database gets into the state EXITED and in most cases we can repair it with the mongod --repair command. In other cases we need to restore the database from a backup.

Would upgrading to CS 5 help? Were there any changes with regard to database management?

We had similar issues on a different cluster (both use CEPH) and we were able to solve the issue by transferring the database to a local disk on the server where the ‘CS master’ is running.
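To see at a glance whether a given database path sits on network storage, the filesystem type reported by stat can be classified. This is a sketch assuming GNU stat; the function name is hypothetical and the list of type names is illustrative, not exhaustive.

```shell
#!/usr/bin/env bash
# Succeed if the filesystem type name given as $1 looks like network or
# cluster storage. Illustrative list only; fuse* can also be local storage.
is_network_fs() {
    case "$1" in
        ceph|nfs*|lustre|gpfs|cifs|smb*|beegfs|fuse*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (GNU stat; CRYOSPARC_DB_PATH comes from `eval $(cryosparcm env)`):
#   is_network_fs "$(stat -f -c %T "$CRYOSPARC_DB_PATH")" \
#       && echo "database is on network storage"
```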

A separate but perhaps connected CEPH issue. We also see errors when we try to use SSD cache that is also mounted via CEPH. In most cases jobs fail because of I/O issues and we just don’t use SSD cache on this cluster for this reason.

It is just my feeling, but the more parallel jobs we are running (the more congested CEPH gets), the more likely it is to fail. But sometimes we have 10+ parallel jobs running and all works just fine (except the SSD cache, which always fails if there is more than 1 CS job using it).

Any ideas, can we do something to improve the situation?

The problem description and the apparent resolution of this issue on a different cluster by moving the database to local storage suggest that the CEPH filesystem in its current configuration is unsuitable for storing the database. Please review these requirements that apply to mongodb on non-local storage. An upgrade to CryoSPARC v5 is unlikely to resolve the problem with database storage.

We don’t officially support particle caching on network file systems like CEPH, so we can’t guarantee a fix. Could you please share the error reports for jobs that failed because they used SSD cache mounted via CEPH? If you can share the reports via your organization’s file sharing platform, you may send me a link via a personal message. We also can make alternative sharing arrangements, if needed. Please let me know.

A related question: On what type of filesystem are CryoSPARC project directories stored?

Thanks for the quick reply.

I’ll get back to you in a couple of days with the CEPH SSD cache errors.

Project directories are also on CEPH. The clusters we use are both exclusively CEPH based.

Improved performance through caching can be expected only if particles can be read from cache much faster than they can be read from the project directory. This might be the case, for example, if cache is implemented on an SSD that is directly attached to the PCIe bus of the GPU worker and projects are stored on networked storage.
If both cache and project storage have similar performance, caching might just add the overhead of preparing the cache, without accelerating particle reads during classification, reconstruction or refinement. When both cache and project storage are implemented on CEPH, one should probably disable caching.

Hi,

I agree that SSD cache over network doesn’t make a big difference.

The idea in our case was to offload the HDD disks by reading the data only once from the HDD array. Once copied to the SSD network array, all the subsequent jobs read the data from the shared network SSD cache. So there is no need to copy the data to a different cluster node for each job. Allegedly the connections are super fast and relatively low latency (sure, still worse than a local SSD on a cluster node).

I sent you the error reports via personal message.