SSD cache waiting to be unlocked & optimized performance

“cache waiting for requested files to become unlocked”

I know this is a common problem, but it would be really useful if cryoSPARC overrode this error after a while and skipped the SSD cache, because I often submit jobs at night and wake up to find they still haven’t started.

Also, we are using a cluster with GPU boxes, and we’ve noticed that if four people submit jobs, they usually all slow to a near halt, especially for intensive jobs like NU refinement. Is this an issue with cryoSPARC, or is it the way we have our resources set up? Again, I’ll submit jobs at night but wake up to find they are all “running” but still on iteration 0. Ugh.

Anyway, just some feedback! thanks

Hey @orangeboomerang,

There is a bug where this happens endlessly in a specific case, but we’ve fixed it. The fix will be included in the next version of cryoSPARC (coming soon!).

Can you ensure that the workstation these jobs are being assigned to has enough RAM available? If it doesn’t, the system may start swapping, and that can take a very long time to complete.
Also, during testing we came across a similar issue where jobs running at the same time slowed down massively, and we noticed that system RAM was filled with the OS filesystem cache.
We executed sync; echo 1 > /proc/sys/vm/drop_caches to drop the cache, and all jobs continued processing normally.
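
For reference, a minimal sketch of checking for this and dropping the cache on the affected workstation (dropping caches requires root):

    # See how much RAM is currently held by buffers/cache
    free -h
    # Flush dirty pages to disk, then drop the page cache (as root, or via sudo)
    sync
    echo 1 | sudo tee /proc/sys/vm/drop_caches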

Hi @orangeboomerang,

This bug has been fixed in the latest version of cryoSPARC, v2.15.0

This does not seem to be fixed in my hands? I am still seeing it in v2.15 with non-uniform refinement jobs.

Actually, in the new version I am seeing this with a single job, even after clearing the cache. Is there some lock file that I need to delete?

Hi @olibclarke, this definitely looks like a database inconsistency. Can you try the following:

First kill every running cryoSPARC job.

In a terminal, enter cryosparcm mongo to run cryoSPARC’s interactive MongoDB shell.

Enter the following command to see how many cache files are locked.

db.cache_files.find({status: {$nin: ['hit', 'miss']}}).length()

If the printed number is greater than zero, then some files are incorrectly locked.

To unlock them, enter this command:

db.cache_files.updateMany({status: {$nin: ['hit', 'miss']}}, {$set: {status: 'miss'}})

Then try running your jobs.
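
Putting it together, a full session might look like this (the counts shown are just illustrative):

    $ cryosparcm mongo
    > db.cache_files.find({status: {$nin: ['hit', 'miss']}}).length()
    4000
    > db.cache_files.updateMany({status: {$nin: ['hit', 'miss']}}, {$set: {status: 'miss'}})
    { "acknowledged" : true, "matchedCount" : 4000, "modifiedCount" : 4000 }
    > exit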

Note that you will still see the “cache waiting” message if there are multiple jobs trying to cache the same files, but it should now proceed once another job finishes caching. Do not kill cache-waiting jobs while this is happening.

Let me know if you run into any issues with this.

Nick

Ok, I will give that a go, thank you!! For now I am just running without caching; once those jobs are complete, I will try again.

Cheers
Oli

I recently upgraded to v3.0 and am seeing a similar issue. I used a prior particle stack that I could previously load onto the SSD, and it just hangs as mentioned above. I restarted cryoSPARC and now it thinks the 100 GB stack is 3 TB!

*first run message
[CPU: 81.6 MB] Importing job module for job type hetero_refine…
[CPU: 455.5 MB] Job ready to run
[CPU: 455.5 MB] ***************************************************************
[CPU: 651.7 MB] Using random seed of 329036760
[CPU: 651.7 MB] Loading a ParticleStack with 246037 items…
[CPU: 653.7 MB] SSD cache : cache successfuly synced in_use
[CPU: 653.7 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
[CPU: 654.5 MB] SSD cache : requested files are locked for past 97s, checking again in 5s

*second run after restart
[CPU: 81.6 MB] --------------------------------------------------------------
[CPU: 81.6 MB] Importing job module for job type hetero_refine…
[CPU: 457.5 MB] Job ready to run
[CPU: 457.5 MB] ***************************************************************
[CPU: 653.7 MB] Using random seed of 1010616078
[CPU: 653.7 MB] Loading a ParticleStack with 246037 items…
[CPU: 655.8 MB] SSD cache : cache successfuly synced in_use
[CPU: 655.8 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
[CPU: 658.9 MB] SSD cache : cache successfuly requested to check 16106 files.
[CPU: 663.1 MB] SSD cache : cache requires 3215182.60MB more on the SSD for files to be downloaded.
[CPU: 663.1 MB] SSD cache : cache does not have enough space for download
[CPU: 663.1 MB] SSD cache : but there are no files that can be deleted.
[CPU: 663.1 MB] SSD cache : This could be because other jobs are running and using files, or because a different program has used up space on the SSD.
[CPU: 663.1 MB] SSD cache : Waiting 30 seconds for space to become available…

Hi @MHB, are you still getting this consistently? If yes, to help me further troubleshoot, please try the following:

  1. Paste this code into a text editor, replacing <PROJECT ID HERE> with the project ID of the cryoSPARC project where you’re seeing this (e.g., P3)
    import os
    from cryosparc_compute.dataset import Dataset

    # 'cli' is predefined in the cryosparcm icli shell (step 6 below)
    project_uid = '<PROJECT ID HERE>'
    particles_file_path = '<PASTE PATH HERE>'
    project_dir = cli.get_project_dir_abs(project_uid)

    # Load the particle dataset and collect the unique stack paths it references
    particles = Dataset().from_file(particles_file_path)
    paths = set(particles.data['blob/path'])

    # Record the on-disk size of each referenced file, plus the total
    with open('particle-sizes-output.txt', 'w') as f:
        sizes = []
        for path in paths:
            abs_path = os.path.join(project_dir, path)
            size = os.path.getsize(abs_path)
            sizes += [size]
            print(f"{path} {size}", file=f)
        print(f"Total {sum(sizes)}", file=f)
    
  2. In the cryoSPARC interface, open the parent job (e.g., Extract Particles) from which these particles come (if there are multiple parent jobs, repeat these steps for each of them)
  3. Go to the “Outputs” tab and look for the particles output group you connected to the hanging job
  4. Copy the path to the particle.blob output from the cryoSPARC interface, as indicated
  5. In the text editor, replace <PASTE PATH HERE> with the path in your clipboard
  6. In a command line with access to the cryosparcm command line interface, enter the following command:
    cryosparcm icli
    
  7. Paste the code from the text editor in the resulting prompt and press Enter to run it
  8. Press Ctrl + D and confirm to exit
  9. In the cryosparc2_master or cryosparc_master installation directory, look for a file named particle-sizes-output.txt. Send this file to me
  10. Download the particle output file from the cryoSPARC interface, as indicated:

    Send it to me. The file may be a few hundred MB, so feel free to use a file sharing service such as Dropbox, Google Drive, OneDrive, etc.

Let me know if you run into trouble with any of those steps.

Will do… I’m distracted by another project, so I will get back to this shortly.

I am having a similar problem to olibclarke’s. I ran the mongo command above, and my locked files went from 4k to 0. I restarted cryoSPARC and cloned the job again, but got the same problem; I checked, and the 4k files were locked again.

Running v3.0.1

@MHB okay, the cache should now be working as expected in the latest cryoSPARC v3.1. Please update and let me know if you run into further issues.

Thanks. It seems to be working properly. If anything changes, I will let you know.

This seems to work fine now. Thanks.

Also, as a feature that has been suggested before, it would be very helpful to have a “clear cache” option. I can clear the cache myself, but with multiple users there is no way to distinguish whose cache I am clearing. It would be helpful to have an automatic cache clear for a workspace after a certain period of inactivity, or a button to clear the cache for a given workspace.
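
A cron entry along these lines works as a stopgap, but it has no way of knowing whose files it is deleting (the cache path is just an example; substitute your own ssd_path):

    # Nightly at 03:00, delete cached files not accessed in the last 14 days (path is an example)
    0 3 * * * find /scratch/cryosparc_cache -type f -atime +14 -delete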

@MHB thanks for this suggestion, we’ve added it to our tracker and should have it out for an upcoming release.

Great thanks! Saves me creating cron jobs…

I am running into the large particle stack size issue again.
Running v3.1

A particle stack after 2D classification and selection gives the following when trying to use the cache in a helical refinement:

[CPU: 518.3 MB]  Using random seed of 1054570268
[CPU: 518.3 MB]  Loading a ParticleStack with 34559 items...
[CPU: 518.5 MB]   SSD cache : cache successfuly synced in_use
[CPU: 519.3 MB]   SSD cache : cache successfuly synced, found 166958.02MB of files on SSD.
[CPU: 519.3 MB]   SSD cache : cache successfuly requested to check 6645 files.
[CPU: 521.6 MB]  Detected file change due to change in file size.
[CPU: 520.1 MB]   SSD cache : cache requires 1867125.75MB more on the SSD for files to be downloaded.
[CPU: 520.1 MB]   SSD cache : cache does not have enough space for download
[CPU: 521.1 MB]   SSD cache :   but there are files that can be deleted, deleting...
[CPU: 521.6 MB]   SSD cache : cache does not have enough space for download
[CPU: 520.4 MB]   SSD cache :   but there are no files that can be deleted. 
[CPU: 520.4 MB]   SSD cache :   This could be because other jobs are running and using files, or because a different program has used up space on the SSD.

i.e., an apparent size of 1.8 TB!

If I re-extract the same particle stack with the same parameters, I now get this:

[CPU: 519.1 MB]  Using random seed of 764820665
[CPU: 519.1 MB]  Loading a ParticleStack with 34559 items...
[CPU: 519.2 MB]   SSD cache : cache successfuly synced in_use
[CPU: 519.3 MB]   SSD cache : cache successfuly synced, found 102962.49MB of files on SSD.
[CPU: 519.4 MB]   SSD cache : cache successfuly requested to check 6645 files.
[CPU: 521.0 MB]   SSD cache : cache requires 11871.38MB more on the SSD for files to be downloaded.
[CPU: 521.0 MB]   SSD cache : cache has enough available space.

Now only ~11 GB… a reasonable size.

I could not get the script you pasted above to run, so I couldn’t complete the requested diagnostic.

Hi @MHB, please send me the following information for further troubleshooting:

  1. How many particles went into the original 2D Classification job you used to filter these out?
  2. Were the particles sourced from outside of cryoSPARC or via an “Import Particles” job?
  3. Were the particles extracted in a job that ran after the update to v3.1?
  4. What is the size on disk of the “Extract from Micrographs” or “Import Particle Stack” job used to originally extract these particles (select the job and look in the sidebar)?
  5. Send me the .cs file for the particles output of the parent job from which the helical refinement was sourced (e.g., the particles_selected output of the Select 2D job). Download it from the job’s Outputs tab (see screenshot). Feel free to use a file-sharing service and direct-message me the link.
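
Also, as a quick sanity check, you could measure the real on-disk size of the original extraction directly; the project/job path below is just an example:

    # Total on-disk size of the particle stacks written by the extraction job (path is an example)
    du -sh /path/to/project_dir/JXXX/extract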

These came from imported MotionCor2 micrographs. Here are the steps; every step was done in v3.1 except Patch CTF:

  1. Patch CTF (done in v3.0)
  2. Template picker
  3. Extract 11,611,488 particles
  4. 3 rounds of 2D classification without SSD
  5. Helical refinement or 2D classification reports 1,867,125.75 MB of SSD space required
  6. Re-extract the particles; now only 11,871.38 MB of SSD space required

Original extract job size 3.81 TB

I will send the .cs file.

Hi all, I have the same issue with NU Refinement in v3.1.0 and my job can’t run. I did not have other jobs running at the same time.

Importing job module for job type nonuniform_refine_new…
[CPU: 534.3 MB] Job ready to run
[CPU: 534.3 MB] ***************************************************************
[CPU: 1.18 GB] Using random seed of 1136935870
[CPU: 1.18 GB] Loading a ParticleStack with 547608 items…
[CPU: 1.19 GB] SSD cache : cache successfuly synced in_use
[CPU: 1.19 GB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
[CPU: 1.19 GB] SSD cache : requested files are locked for past 709s, checking again in 5s