Cached files of running jobs deleted

Hi,

Our CryoSPARC version is 4.5.3.

Worker 1: Local Nodes (2 nodes; they share the same cryosparc_worker installation)

./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

Worker 2: Cluster Nodes (one installation for all nodes)

./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

The SSD cache is shared between the local nodes and the cluster nodes.

Thanks for all the help. I hope we can resolve it soon.

Best
Rajan

@Rajan @fbeck When you next encounter an error due to a file missing from the cache, could you please run the command
cryosparcm snaplogs and email us the .tgz file that the command produces? I will send you a direct message about the email address.
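
A minimal sketch of that collection step, assuming it is run on the CryoSPARC master node with cryosparcm on your PATH (the command reports where it writes its archive; the listing below is only a convenience and the .tgz filename pattern is an assumption):

    # Run on the master node shortly after the error occurs.
    cryosparcm snaplogs
    # Locate the newest archive in the working directory (filename pattern is an assumption).
    ls -lt *.tgz | head -n 1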

Hi @Rajan and @fbeck, thank you for sharing your logs and your continued patience as we investigate this error. Unfortunately, we have not yet determined the cause of these “File not found” errors despite the following steps:

  • We conducted thorough testing of the SSD cache system on a multi-node BeeGFS installation with export CRYOSPARC_CACHE_LOCK_STRATEGY=master in cryosparc_worker/config.sh (a sample config excerpt follows this list) on the latest CryoSPARC v4.6.
  • We did not observe any errors with up to 5 jobs running on different nodes accessing the cache simultaneously.
    • We did observe “File not found” errors on versions of CryoSPARC older than v4.4.0 or with no CRYOSPARC_CACHE_LOCK_STRATEGY set.
  • We carefully inspected the logs you provided and have not identified any signs of incorrect behaviour.
  • We conducted a thorough audit of our caching and locking code and did not find any gaps in our implementation that could cause these errors.
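
For reference, a minimal sketch of the relevant cryosparc_worker/config.sh excerpt (the license value is a placeholder; any other variables in your config.sh stay unchanged):

    # cryosparc_worker/config.sh (excerpt)
    export CRYOSPARC_LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder
    export CRYOSPARC_CACHE_LOCK_STRATEGY=master  # lock strategy used in the shared-cache tests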

To proceed further with our investigation, could you share any additional relevant details about your system configuration? We’re most interested in the following (a small shell sketch for collecting the basic system details follows the list):

  • The most recent CryoSPARC version in which you have observed this error
  • A job report of the most recent job where the error occurred
  • A summary of the specifications for your compute systems, including:
    • CPU model
    • Total RAM
    • Operating system and kernel version (e.g., output of uname -a)
    • Types of applications that run on this system and access the shared SSD
    • A summary of processes and services that run on this node alongside CryoSPARC (e.g., if the system uses systemd, the output of systemctl --type=service)
    • Mount options for the particle cache filesystem
  • Configuration of your shared cache system, including:
    • Which parallel file system it uses (e.g., GPFS, BeeGFS, Ceph, etc.) and what version.
    • Any configuration files or settings for the parallel file system, particularly any settings that have been overridden from the defaults
    • A summary of node types used (e.g., management, metadata, storage, client, etc.)
    • How many of each type there are
    • Where each node is hosted (e.g., are they running on the GPU compute nodes or on dedicated storage nodes?)
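
To make these details easier to gather, here is a small collection sketch, assuming standard Linux tools (uname, lscpu, free, systemd’s systemctl) and a placeholder cache mount path that you would replace with your own:

    # Collect basic node details; /path/to/ssd_cache is a placeholder for your cache mount.
    {
      echo "== Kernel ==";      uname -a
      echo "== CPU ==";         lscpu
      echo "== RAM ==";         free -h
      echo "== Services ==";    systemctl --type=service
      echo "== Cache mount =="; mount | grep /path/to/ssd_cache
    } > cryosparc_node_report.txt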

As before, you may send us any files you collect by email.

Thank you again!

Hi,

thanks for working on the problem.
I will try to collect the information as soon as possible.
In brief, we use GPFS on dedicated file servers and CryoSPARC 4.5.1 with
"CRYOSPARC_CACHE_LOCK_STRATEGY=master".
Worker OS: SLES 15 SP4

But it might be faster for both sides if we provide an account and
resources so that you can test on our system (I need to check whether that is possible).
Would that be an option for you?

For us the error usually occurs when more than 6 big jobs
(~1 million particles, box size > 128) are started (and caching)
at the same time, e.g. 2D classification and heterogeneous refinement, with extraction running on top.
Could you test with 6-10 big parallel jobs, started at the same time,
accessing a large (50 TB) cache that is 90% full?
It also sometimes takes a very long time (up to 60 min) just to find
an already cached dataset.

thanks

Florian

Hi,

…send the full info to:
[redacted]@[redacted]
…it usually dies while checking the files on the SSD,
which sometimes takes more than 1 hour.

best

Florian


Hi @fbeck, thanks very much for this information. Based on your job reports we were able to find a bug in the SSD cache system that would trigger “File Not Found” errors during the “Checking files on SSD” step. This has been fixed in the latest v4.6.1 release. Please try it out and let me know if the issue is resolved for you.

Regarding the long delays you reported while checking files on the SSD cache:

We did find something similar in our own testing. For context, the cache system was initially designed for fast local SSDs, where we assumed that listing the contents of the cache drive and retrieving their metadata (size, modified time) are relatively fast operations. We did not optimize for parallel file systems such as GPFS, where this may not be the case.
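
As a rough illustration of that metadata pass (not CryoSPARC’s own code), the following times a full list-and-stat of a cache directory, assuming GNU find and a placeholder cache path; on a large, nearly full GPFS volume this can take far longer than on a local SSD:

    # Time listing every cached file together with its size and modification time.
    time find /path/to/ssd_cache -type f -printf '%s %T@ %p\n' | wc -l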

The latest update unfortunately does not include any performance improvements, but we’ve recorded the issue and plan to address it in a future update.


Hi @nfrasser,

Thanks for the efforts, very much appreciated.

I updated my instance today and users have started using it. I will let you know how it goes.

best
rajan