The https interface does not load with this status.
This is a CS 4.2.1 instance running on an HPC cluster with the SLURM scheduler; the filesystem is ceph (I will come back to this).
This sometimes happens randomly but if we start 5+ jobs it happens almost for sure in minutes. Quite regularly it also happens during scheduled database backup.
If I just restart CS (cryosparcm restart or cryosparcm restart database) the database will exit in seconds or minutes. It can be repaired with mongod --dbpath ./ --repair from inside the database folder.
But this is getting tedious and we are also losing processing time… Months ago this kind of process actually led to the database being corrupted beyond repair and we started fresh with a new database. The same issues popped up again quite soon with the fresh database.
At this time my suspicion (I am not an IT person) is the ceph filesystem. For years CS ran just fine on this cluster, but then this started to happen (not sure if there was some update or whatever).
We have CS running on a different cluster which also uses ceph but there it is rock solid.
I have logs if this could help pinpoint the issue. Any ideas?
How can we approach this? I have recently also contacted IT support at the HPC but we are just starting the troubleshooting.
Anyone with similar experience?
I am sorry to learn about the persistent database problems you are experiencing.
Please can you further describe your setup:
Are cryosparc_master processes (and therefore the mongo database) subject to SLURM job management, or do they run independently of SLURM, but submit jobs to a SLURM partition?
Do cryosparc_master processes run on "strained" infrastructure with significant contention for resources like CPU, RAM, network, storage access?
Under some circumstances that I cannot clearly define, cryosparcm stop, which is part of the cryosparcm restart routine, fails to terminate cryosparc_worker processes. If you frequently experience restart problems, you may want to use a sequence of
cryosparcm stop
ps -eo pid,ppid,cmd | grep -e cryosparc -e mongo # (to confirm no CryoSPARC related processes remain)
cryosparcm start
to confirm that this isn't the problem in your case.
Be aware of the characterization of mongod --repair as a last resort.
Be sure to mention the timeline of the errors' emergence to your IT support.
Interesting datapoint. Can you identify any differences between this and the other cluster that could be relevant?
Did you check /path/to/cryosparc_master/run/database.log
(cryosparcm log database)
to see what actual errors led to database exits?
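As a rough sketch of how the log could be narrowed down to the lines around a database exit, the helper below filters a log file for a few keywords. The function name db_log_errors and the keyword pattern are assumptions made for illustration, not an official list of MongoDB failure messages.

```shell
# Print the most recent suspicious lines (with line numbers) from a log file.
# The keyword pattern is a guess at what is worth reading first; adjust as needed.
db_log_errors() {
  local log_file="$1"
  local max_lines="${2:-20}"
  grep -inE 'error|fatal|assert|corrupt' "$log_file" | tail -n "$max_lines"
}
```

For example, db_log_errors /path/to/cryosparc_master/run/database.log 50 would print up to the last 50 matching lines.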
@wtempel thank you for the quick response and I apologise for the late reply. Unfortunately the situation has not yet improved. In the meantime we updated to CS 4.3.0 but the issue persists.
cryosparc_master processes run on a virtual host on the cluster and this virtual host is not subject to SLURM. The virtual host and CS are always on and available. The cryosparc_worker processes are then queued to the cluster via SLURM.
The hardware on which the virtual OS for cryosparc_master processes is running is, from what we have been told, not strained. And the database errors happen at different times: during the day, during the night, during holidays…
cryosparcm stop seems to work. There are no cryosparc processes left running after this command.
Yeah I realize that --repair is the last resort but we have to use it every other day to keep using this CS instance.
From our point of view both clusters use ceph, but other than that we don't have any specific info on what the differences are.
Yes I have checked the log multiple times. Here is an example from today. It is too long to paste here so it is available on the link above. I just cut out a section that starts about 10 minutes before the error and includes the whole error.
In v4.3.0, you can enable database journalling by specifying
export CRYOSPARC_MONGO_EXTRA_FLAGS=" "
inside cryosparc_master/config.sh and subsequently restarting CryoSPARC. Note the space between the quotes.
Journalling should make your database more resilient to disruption and consequent corruption.
thank you for the suggestion.
I noticed the new MongoDB variable in the latest release notes for 4.3.0 but haven't enabled it yet.
A couple of days ago the cluster admin updated the OS of the CS master host server. I'll wait for a while to see how that works out.
Are there any downsides of using this variable?
export CRYOSPARC_MONGO_EXTRA_FLAGS=" "
Do you plan to enable it for everyone in the future releases?
the error is back 6 days after the cryosparc_master host update.
The cluster admin says that there are no obvious networking or ceph errors that could explain the cryosparc database exit.
I suppose it is time to test the CRYOSPARC_MONGO_EXTRA_FLAGS.
To revive this thread, we are still having issues with database corruption on one HPC cluster. Still running CS 4.7.1.
Every month or two the database gets into the state EXITED and in most cases we can repair it with the mongod --repair command. In other cases we need to restore the database from a backup.
Would upgrading to CS 5 help? Were there any changes with regard to database management?
We had similar issues on a different cluster (both use CEPH) and we were able to solve the issue by transferring the database to a local disk on the server where the "CS master" is running.
A separate but perhaps connected CEPH issue: we also see errors when we try to use an SSD cache that is also mounted via CEPH. In most cases jobs fail because of I/O issues, and for this reason we just don't use SSD cache on this cluster.
It is just my feeling, but the more parallel jobs we are running (the more congested CEPH gets), the more likely it is to fail. But sometimes we have 10+ parallel jobs running and all works just fine (except the SSD cache, which always fails if more than 1 CS job is using it).
Any ideas, can we do something to improve the situation?
The problem description and the apparent resolution of this issue on a different cluster by moving the database to local storage suggest that the CEPH filesystem in its current configuration is unsuitable for storing the database. Please review these requirements that apply to mongodb on non-local storage. An upgrade to CryoSPARC v5 is unlikely to resolve the problem with database storage.
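The move to local storage described above might look roughly like the sketch below. It assumes the database location is controlled by CRYOSPARC_DB_PATH in cryosparc_master/config.sh; the OLD_DB and NEW_DB paths and the run and relocate_db helpers are hypothetical and exist only for this illustration. By default the script only prints the commands it would run.

```shell
# Sketch only: copy the CryoSPARC database to a local disk.
# OLD_DB and NEW_DB are hypothetical paths; set them for your installation.
OLD_DB="${OLD_DB:-/ceph/cryosparc/cryosparc_database}"
NEW_DB="${NEW_DB:-/local/cryosparc/cryosparc_database}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

relocate_db() {
  # Stop CryoSPARC (including mongod) before touching the database files.
  run cryosparcm stop
  run mkdir -p "$NEW_DB"
  # Copy the stopped database; the original stays in place as a fallback.
  run rsync -a "$OLD_DB/" "$NEW_DB/"
  # The config edit is left as a manual step on purpose.
  echo "Next, set in cryosparc_master/config.sh:"
  echo "export CRYOSPARC_DB_PATH=\"$NEW_DB\""
  echo "Then run: cryosparcm start"
}
```

Calling relocate_db prints the plan; DRY_RUN=0 relocate_db would execute it. Keeping the original directory until the instance has been confirmed to work from the new location leaves a way back.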
We don't officially support particle caching on network file systems like CEPH, so we can't guarantee a fix. Could you please share the error reports for jobs that failed because they used SSD cache mounted via CEPH? If you can share the reports via your organization's file sharing platform, you may send me a link via a personal message. We also can make alternative sharing arrangements, if needed. Please let me know.
A related question: On what type of filesystem are CryoSPARC project directories stored?
Improved performance through caching can be expected only if particles can be read from cache much faster than they can be read from the project directory. This might be the case, for example, if cache is implemented on an SSD that is directly attached to the PCIe bus of the GPU worker and projects are stored on networked storage.
If both cache and project storage have similar performance, caching might just add the overhead of preparing the cache, without accelerating particle reads during classification, reconstruction or refinement. When both cache and project storage are implemented on CEPH, one should probably disable caching.
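One crude way to check whether the cache location is actually faster than project storage is to time a sequential read of the same large file from each. The sketch below assumes GNU dd on Linux; read_speed is a hypothetical helper name, and the result can be inflated by the operating system's page cache if the file was read recently.

```shell
# Read a file once and print dd's final summary line (bytes, time, throughput).
# Caveat: a recently read file may be served from RAM, not from storage.
read_speed() {
  dd if="$1" of=/dev/null bs=1M 2>&1 | tail -n 1
}
```

Running read_speed on one copy of a particle stack in the project directory and on another copy in the cache location gives a first impression of the difference; comparable numbers would suggest that caching adds little.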
I agree that SSD cache over network doesn't make a big difference.
The idea in our case was to offload the HDD disks by reading the data only once from the HDD array. Once copied to the SSD network array, all the subsequent jobs read the data from the shared network SSD cache. So there is no need to copy the data to a different cluster node for each job. Allegedly the connections are super fast and relatively low latency (sure, still worse than a local SSD on a cluster node).
I sent you the error reports via personal message.