Huge Database Size and Migration to Multiple Instances?

Dear CryoSPARC developers & admins,

Our current CryoSPARC installation has become quite large, with ~30 users, ~300 projects, ~30,000 jobs, and a database size of 1.5 TB (!). I am especially worried that such a huge database may affect performance and stability. Compaction using db.runCommand({compact:…}) did not help, and due to the size, I make only weekly backups of the database. In addition, our master node is also a worker node, which sometimes leads to instabilities when resource-hungry jobs run on the master node.
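For reference, this is roughly how I checked which collections dominate the database and ran the compaction. This is only a pymongo sketch; the port 39001 and the database name "meteor" are assumptions based on a default installation with base port 39000, so please adjust them for your setup:

```python
# Sketch (pymongo): find out which collections dominate the database and
# compact the largest ones. Assumptions: MongoDB listens on base_port + 1
# (39001 for a default base port of 39000) and the CryoSPARC database is
# named "meteor" -- adjust both for your installation.
from pymongo import MongoClient

client = MongoClient("localhost", 39001)
db = client["meteor"]

# Per-collection storage size in MB, largest first
sizes = []
for name in db.list_collection_names():
    stats = db.command({"collStats": name, "scale": 1024 ** 2})
    sizes.append((stats["storageSize"], name))
sizes.sort(reverse=True)

for size_mb, name in sizes:
    print(f"{size_mb:>10.0f} MB  {name}")

# Compact the largest collections (best done while no jobs are running)
for _, name in sizes[:5]:
    db.command({"compact": name})
```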

So, we are currently thinking about:

  1. Buying a new master node with 24 cores, 128 GB RAM and a big SSD for local scratch, which will not be a worker node

  2. Possibly migrating to three new, smaller CryoSPARC instances via separate technical user accounts (cryosparc1-3), separate base ports (41000, 42000, 43000), and the new detach/attach functionality
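Before committing to the three base ports, I would check that the relevant port ranges are still free on the master node. A rough sketch; the assumption that each instance needs roughly the first ten consecutive ports above its base port should be verified against the guide:

```python
# Sketch: check that the planned base port ranges are still free on the
# master node. The assumption that each instance needs roughly the first
# ten consecutive ports above its base port should be verified against
# the guide.
import socket

def port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) != 0

for base in (41000, 42000, 43000):
    busy = [p for p in range(base, base + 10) if not port_free(p)]
    status = "free" if not busy else f"in use: {busy}"
    print(f"base port {base}: {status}")
```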

My questions to you are:

  1. Is the current database size of 1.4 TB really a potential problem for performance and stability, or is it simply a huge database?

  2. Would splitting our big CryoSPARC instance into three smaller ones really make sense, or would it just mean more work and added complexity with no expected benefit in performance and stability?

  3. Are the specs for the master node reasonable or too high?

I am looking forward to your answers!

Best regards,

Dirk

Thanks for your post @dirk.

A dedicated master node is a good idea. Your experience with the current user base and hardware should be a good guide to how much RAM and CPU you need, as long as you allow some room for growth. What would be the use case for the "big SSD for local scratch"?

Your question whether the current database size is really a potential problem has, in part, already been answered by yourself: due to the size, you make only weekly backups of the database.

Splitting up your instance is a plausible approach, particularly if you pick good criteria along which to split the instance. Some ideas:

  1. People who need to share projects need to have CryoSPARC logins on the same instance.
  2. Instances could be split based on project lifecycle:
    * Older, inactive projects could be hosted on a dedicated “legacy” instance that does not have any attached workers. Such projects could be in the archived state. Such a “legacy” instance’s database would have to be backed up once, after all relevant inactive projects have been added, but due to the immutability of the projects, additional database backups would not be needed. A large database on such a “legacy” instance would therefore carry a smaller administrative burden than a large database on an “active” instance (below).
    * An “active” instance’s database would be backed up frequently. The “active” instance would be kept small and agile by:
      * detachment of inactive projects, their removal from the database and, possibly, transfer of the detached project directory to a “legacy” instance (see the sketch after this list)
      * database compaction after some data have been removed from the database
      * application of the data cleanup tool to active projects. Careful:
        1. Do not blindly accept the tool’s default settings.
        2. Understand the difference between final and completed jobs.
  3. One could combine criteria when splitting a large instance.
  4. Create and manage multiple CryoSPARC instances efficiently. For a discussion of what works and what doesn’t, see, for example, the existing forum threads on this topic.
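As a starting point for identifying detachment candidates, one could query the projects collection directly. This is only a sketch, not an official recipe: the collection and field names (“projects”, “uid”, “title”, “last_accessed”, “deleted”) are assumptions that should be verified against your own database, and the port and database name assume the default layout (MongoDB on base port + 1, database “meteor”).

```python
# Sketch: list projects that have not been accessed for a year, as candidates
# for detachment to a "legacy" instance. Collection and field names are
# assumptions -- inspect your own database before relying on them.
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient("localhost", 39001)["meteor"]  # MongoDB on base_port + 1
cutoff = datetime.utcnow() - timedelta(days=365)

for p in db["projects"].find({"deleted": False}):
    last = (p.get("last_accessed") or {}).get("accessed_at")
    if last is None or last < cutoff:
        print(p.get("uid"), p.get("title"), last)
```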

Dear Wolfram,

Many thanks for your reply and the very helpful suggestions on how to set up multiple instances!

Data management and cleanup have really improved since version 4; I will look into these guides more deeply.

However, I still have two questions:

Since I don’t have an overview of all jobs that may run on the master node, I wanted to be prepared in case any such job needs a local scratch disk. Does your reply mean that a local SSD for scratch is not needed on the master node?

Well, I know that backups of the huge database are huge, too. However, my more important question was whether such a huge database poses potential problems for performance and stability.

Best regards,

Dirk

It does. Job types that are restricted to running on the master host do not currently use particle caching.

I cannot rule out the possibility of potential problems with performance and stability.

That said, our team expects that a 1.4 TB database can run with reasonable performance and stability, if one strictly separates stability from resilience against “unfortunate” events. The size of each backup per se is not the only concern. A potentially bigger concern is that the duration of the backup operation or the size of each backup may prevent frequent backups, which in turn increases the potential disruption should an “unfortunate” event occur long after the latest backup.
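If backup size or duration is what limits the backup frequency, one mitigation is to automate the backups and monitor the age of the most recent one. A minimal sketch, assuming a dedicated backup directory (the path below is a placeholder); please confirm the cryosparcm backup options against the guide for your CryoSPARC version:

```python
# Sketch: trigger a database backup only if the newest existing backup is
# older than a threshold. The backup directory is a placeholder; confirm
# the cryosparcm backup options against the guide for your version.
import subprocess
import time
from pathlib import Path

BACKUP_DIR = Path("/data/cryosparc_backups")  # placeholder path
MAX_AGE_HOURS = 24

backups = sorted(BACKUP_DIR.glob("*.archive"), key=lambda p: p.stat().st_mtime)
age_hours = (
    (time.time() - backups[-1].stat().st_mtime) / 3600 if backups else float("inf")
)

if age_hours > MAX_AGE_HOURS:
    stamp = time.strftime("%Y%m%d_%H%M%S")
    subprocess.run(
        ["cryosparcm", "backup", f"--dir={BACKUP_DIR}", f"--file=cryosparc_{stamp}.archive"],
        check=True,
    )
```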