SSD cache waiting to be unlocked & optimized performance

@MHB thanks for this suggestion, we’ve added it to our tracker and should have it out for an upcoming release.

Great thanks! Saves me creating cron jobs…

I am running into the large particle stack size issue again.
Running v3.1

A particle stack after 2D classification and selection gives the following when trying to use the cache in a helical refinement:

[CPU: 518.3 MB]  Using random seed of 1054570268
[CPU: 518.3 MB]  Loading a ParticleStack with 34559 items...
[CPU: 518.5 MB]   SSD cache : cache successfuly synced in_use
[CPU: 519.3 MB]   SSD cache : cache successfuly synced, found 166958.02MB of files on SSD.
[CPU: 519.3 MB]   SSD cache : cache successfuly requested to check 6645 files.
[CPU: 521.6 MB]  Detected file change due to change in file size.
[CPU: 520.1 MB]   SSD cache : cache requires 1867125.75MB more on the SSD for files to be downloaded.
[CPU: 520.1 MB]   SSD cache : cache does not have enough space for download
[CPU: 521.1 MB]   SSD cache :   but there are files that can be deleted, deleting...
[CPU: 521.6 MB]   SSD cache : cache does not have enough space for download
[CPU: 520.4 MB]   SSD cache :   but there are no files that can be deleted. 
[CPU: 520.4 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD.

i.e., an apparent size of 1.8 TB!

If I re-extract the same particle stack with the same parameters, I now get this:

[CPU: 519.1 MB]  Using random seed of 764820665
[CPU: 519.1 MB]  Loading a ParticleStack with 34559 items...
[CPU: 519.2 MB]   SSD cache : cache successfuly synced in_use
[CPU: 519.3 MB]   SSD cache : cache successfuly synced, found 102962.49MB of files on SSD.
[CPU: 519.4 MB]   SSD cache : cache successfuly requested to check 6645 files.
[CPU: 521.0 MB]   SSD cache : cache requires 11871.38MB more on the SSD for files to be downloaded.
[CPU: 521.0 MB]   SSD cache : cache has enough available space.

Now only ~12 GB… a reasonable size.

I could not get the script you pasted above to run to completion for the requested diagnostic.

Hi @MHB, please send me the following information for further troubleshooting:

  1. How many particles went into the original 2D Classification job you used to filter these out?
  2. Were the particles sourced from outside of cryoSPARC or via an “Import Particles Job”?
  3. Were the particles extracted in a job that ran after the update to v3.1?
  4. What is the size on disk of the “Extract from Micrographs” or “Import Particle Stack” job used to originally extract these particles (select the job and look in the sidebar)?
  5. Send me the .cs file for the output of the parent job from which the helical refinement’s particles were sourced (e.g., the particles_selected output of the Select 2D job). Download it from the job’s Outputs tab (see screenshot). Feel free to use a file-sharing service and direct-message me the link.
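If you’d like to sanity-check the .cs file yourself before sending it: .cs outputs are just NumPy structured arrays, so a few lines of Python will show how many particles it holds and how many source files they reference. This is only a rough sketch, not an official tool; the filename is a placeholder and the blob/* field names are the ones I’d expect for extracted particles.

# Rough sketch: inspect a cryoSPARC particle .cs file (a NumPy structured array).
# "particles_selected.cs" is a placeholder path; the blob/* fields are assumed.
import numpy as np
from collections import Counter

cs = np.load("particles_selected.cs")
print("particles:", len(cs))
print("first fields:", cs.dtype.names[:8])

# Count how many distinct source files the selected particles point at.
paths = [p.decode() if isinstance(p, bytes) else p for p in cs["blob/path"]]
print("unique source files referenced:", len(Counter(paths)))
print("particle box shape (first row):", cs["blob/shape"][0])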

These came from imported MotionCor2 micrographs. Here are the steps; every step was done in v3.1 except Patch CTF:

  1. Patch CTF done in v3.0
  2. Template picker
  3. Extract 11,611,488 particles
  4. Three rounds of 2D classification without SSD
  5. Helix Refine or 2D Classification reports 1,867,125.75 MB of SSD required
  6. Re-extract the particles; now only 11,871.38 MB of SSD required

Original extract job size 3.81 TB
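For what it’s worth, a quick back-of-the-envelope check of those numbers is below; it’s only a rough sketch that treats the 3.81 TB extract job as pure particle data and uses binary units.

# Rough arithmetic on the numbers reported in this thread.
original_particles = 11_611_488            # particles in the original extract job
original_size_mb   = 3.81 * 1024 * 1024    # 3.81 TB on disk, expressed in MB
selected_particles = 34_559                # particles kept after Select 2D

per_particle_mb   = original_size_mb / original_particles   # ~0.34 MB each
expected_cache_mb = selected_particles * per_particle_mb    # ~11,900 MB

print(f"per particle: {per_particle_mb:.2f} MB")
print(f"expected cache for subset: {expected_cache_mb:,.0f} MB")
# ~11,900 MB lines up with the 11,871.38 MB reported after re-extraction,
# while the 1,867,125.75 MB request is roughly half of the original 3.81 TB stack.

So the re-extracted figure is about what you’d expect for 34,559 particles, while the 1.8 TB request looks like the cache was accounting for far more than the selected subset.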

Will send the .cs file.

Hi All, I got the same issue with NU Refinement v3.1.0 and my job can’t run. I did not have other jobs running at the same time.

Importing job module for job type nonuniform_refine_new…
[CPU: 534.3 MB] Job ready to run
[CPU: 534.3 MB] ***************************************************************
[CPU: 1.18 GB] Using random seed of 1136935870
[CPU: 1.18 GB] Loading a ParticleStack with 547608 items…
[CPU: 1.19 GB] SSD cache : cache successfuly synced in_use
[CPU: 1.19 GB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
[CPU: 1.19 GB] SSD cache : requested files are locked for past 709s, checking again in 5s

Hi nfrasser,

I have similar trouble when running 2D classification after I updated cryoSPARC from v3.1 to v3.2.

Best,
Chuchu

Hi @nfrasser,

I too am having this problem with 3.2. In my case it basically happens with any job type I try.

Thanks so much,
KIW

Hi @kiwhite @MHB we recently released a patch for v3.2 with a fix for this. Please install it with the following instructions: https://guide.cryosparc.com/setup-configuration-and-management/software-updates#apply-patches

@CleoShen this means there is not enough space on the SSD to cache the required particles. You may have to either (A) manually free up space used by other applications, (B) reconfigure your installation to use a larger SSD, or (C) disable SSD caching in the job parameters before queuing it.
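If option (A) applies, a quick way to see what is occupying the cache drive is sketched below; it is only a rough helper, and the cache path is a placeholder for whatever SSD cache path your worker is configured with.

# Rough sketch: report free space and the largest top-level directories on the
# cache SSD. "/scratch/cryosparc_cache" is a placeholder; use your worker's
# configured cache path instead.
import os
import shutil

cache_root = "/scratch/cryosparc_cache"

usage = shutil.disk_usage(cache_root)
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")

def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files may come and go while an active cache is in use
    return total

sizes = [(dir_size(e.path), e.path) for e in os.scandir(cache_root) if e.is_dir()]
for size, path in sorted(sizes, reverse=True)[:10]:
    print(f"{size / 1e9:8.1f} GB  {path}")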

Great thanks! Will install and see how things work.

Hi nfrasser,

Thank you for your kind reply. I chose option (C) and the SSD cache error was fixed. After applying the patch, however, 2D classification hit a new error, as below; do you have any suggestions?

@nfrasser okay great, that did the trick. Thanks so much!

@nfrasser I still have this problem with v3.2.0 for 2D classification.

License is valid.
Launching job on lane default target localhost …
Running job on master node hostname localhost
[CPU: 69.7 MB] Project P17 Job J8 Started
[CPU: 69.7 MB] Master running v3.2.0, worker running v3.2.0
[CPU: 69.7 MB] Running on lane default
[CPU: 69.7 MB] Resources allocated:
[CPU: 69.7 MB] Worker: localhost
[CPU: 69.7 MB] CPU : [0, 1]
[CPU: 69.7 MB] GPU : [0, 1]
[CPU: 69.7 MB] RAM : [0, 1, 2]
[CPU: 69.7 MB] SSD : True
[CPU: 69.7 MB] --------------------------------------------------------------
[CPU: 69.7 MB] Importing job module for job type class_2D…
[CPU: 196.0 MB] Job ready to run
[CPU: 196.0 MB] ***************************************************************
[CPU: 518.2 MB] Using random seed of 777984025
[CPU: 518.3 MB] Loading a ParticleStack with 481295 items…
[CPU: 522.1 MB] SSD cache : cache successfuly synced in_use
[CPU: 522.1 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
[CPU: 522.1 MB] SSD cache : requested files are locked for past 400s, checking again in 5s

@donghuachen did you also apply the patch?

@nfrasser I am not sure about the patch application because the cluster admin did the installation of v3.2.0.

You can tell whether the patch is installed if you see +210511 after the version number in cryoSPARC’s dashboard.

Hi,

I am currently troubleshooting an issue similar to the one discussed above (in cryoSPARC v2.15). I have to say that cryoSPARC ran very smoothly for a long time (>1 year) and only recently started to have these weird cache sync issues. I therefore suspect a local hardware issue (they pop up from time to time) or a problem related to the DB. Interestingly, we can reproduce this issue by running a NU-refinement, which seems to break caching for an entire project. If we use a fresh project, everything runs fine until the point where we run the NU-refinement again.

My question is: is it safe to clear the cache_files collection in the MongoDB (meteor) database, and does it make sense? It has already collected ~5 million entries for files that are definitely not there anymore. I would of course also physically clear the cache SSD as well.

Upgrading cryoSPARC is not an immediate option, as we are planning to migrate to different hardware soon (and it seems like these issues still persist in later versions). I am rather looking for a temporary fix.

Best,
Chris

Hi @CD1

Thanks for your insight.
I’m not sure why Non-Uniform Refinement would be causing these issues; the same code path is used in all jobs when caching particles. Either way, I’ll take a look.

You can definitely do this, as long as you’ve cleared all the files from the actual cache location to avoid duplication. I’d also recommend creating a backup of your database just in case: https://guide.cryosparc.com/setup-configuration-and-management/management-and-monitoring/cryosparcm#cryosparcm-backup
Run the following commands in a shell on the master node to drop the cache_files collection:

eval $(cryosparcm env)
cryosparcm mongo
> db.cache_files.drop()

Once that’s done, restart cryoSPARC (cryosparcm restart) to recreate the cache_files collection and rebuild the indexes.

Hi @stephan ,

I’m on v3.3.1+211214 and seem to be having a similar problem…

Do you have any idea what could be going on, or whether this might be a similar bug? Sometimes this has occurred when no other cryoSPARC job was running on the compute node in question.

Thank you!

Hi,

I am experiencing the same issue as well with v3.3.1.

It would be nice to get some suggestions.