@Rajan @fbeck When you next encounter an error due to a file missing from the cache, could you please run the command cryosparcm snaplogs and email us the tgz file that the command produces? I will send you a direct message about the email address.
Hi @Rajan and @fbeck, thank you for sharing your logs and for your continued patience as we investigate this error. Unfortunately, we have not yet determined the cause of these “File not found” errors, despite the following steps:
- We conducted thorough testing of the SSD cache system on a multi-node BeeGFS installation with export CRYOSPARC_CACHE_LOCK_STRATEGY=master in cryosparc_worker/config.sh on the latest CryoSPARC v4.6.
- We did not observe any errors with up to 5 jobs running on different nodes and accessing the cache simultaneously.
- We did observe “File not found” errors on versions of CryoSPARC older than v4.4.0, or when no CRYOSPARC_CACHE_LOCK_STRATEGY was set.
- We carefully inspected the logs you provided and did not identify any signs of incorrect behaviour.
- We conducted a thorough audit of our caching and locking code and did not find any gaps in our implementation that could cause these errors.
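For anyone else following along with a shared cache setup: the lock-strategy setting referenced above is an environment variable exported from the worker configuration file. A minimal sketch of the relevant excerpt (the surrounding contents of your config.sh will differ per install):

```shell
# cryosparc_worker/config.sh (excerpt)
# Use the "master" cache lock strategy so that multiple worker nodes
# sharing one SSD cache coordinate locking through the master instance.
export CRYOSPARC_CACHE_LOCK_STRATEGY=master
```

After editing config.sh, the setting takes effect for newly launched jobs on that worker.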
To further proceed with our investigation, could you share any additional relevant details about your system configuration? We’re most interested in the following:
- The most recent CryoSPARC version in which you have observed this error
- A job report of the most recent job in which the error occurred
- A summary of the specifications of your compute systems, including:
  - CPU model
  - Total RAM
  - Operating system and kernel version (e.g., the output of uname -a)
  - Types of applications that run on this system and access the shared SSD
  - A summary of processes and services that run on this node alongside CryoSPARC (e.g., if the system uses systemd, the output of systemctl --type=service)
  - Mount options for the particle cache filesystem
- Configuration of your shared cache system, including:
  - Which parallel file system it uses (e.g., GPFS, BeeGFS, Ceph) and which version
  - Any configuration files or settings for the parallel file system, particularly any settings that have been overridden from the defaults
  - A summary of the node types used (e.g., management, metadata, storage, client)
  - How many nodes of each type there are
  - Where each node is hosted (e.g., on the GPU compute nodes or on dedicated storage nodes?)
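If it helps with gathering the system-level items above, a script along these lines collects most of them in one pass. The CACHE_PATH value is a placeholder; point it at your actual particle cache mount, and note that the CPU/RAM lines assume a typical Linux system:

```shell
#!/bin/sh
# Collect the system details requested above.
# CACHE_PATH is a placeholder for the particle cache location on this worker.
CACHE_PATH=${CACHE_PATH:-/scratch/cryosparc_cache}

uname -a                                           # OS and kernel version
grep -m1 'model name' /proc/cpuinfo || true        # CPU model (Linux)
free -h | awk '/^Mem:/ {print "Total RAM:", $2}'   # total RAM
# Services running alongside CryoSPARC (systemd systems only)
command -v systemctl >/dev/null 2>&1 && systemctl --type=service --no-pager || true
# Mount options for the particle cache filesystem
command -v findmnt >/dev/null 2>&1 && \
    findmnt -T "$CACHE_PATH" -o TARGET,SOURCE,FSTYPE,OPTIONS || true
```

Redirecting the output to a file makes it easy to attach to the email along with the snaplogs tgz.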
As before, you may send us any files you collect by email.
Thanks for working on the problem. I will try to collect the information as soon as possible.
In brief: we use GPFS on dedicated file servers, and CryoSPARC v4.5.1 with “CRYOSPARC_CACHE_LOCK_STRATEGY=master”.
Worker OS: SLES 15 SP4
But I think it might be faster for both sides if we provide an account and resources so that you can test on our system (I need to check). Would that be an option for you?
For us, the error usually occurs when more than 6 big jobs (~1 million particles, box size > 128) are started, and caching, at the same time (2D Classification, Heterogeneous Refinement, plus Extraction on top). Could you test with 6-10 big parallel jobs, started at the same time, all accessing a big (50 TB) cache that is 90% full?
It also sometimes takes a very long time (up to 60 min) just to find an already cached dataset.
Hi @fbeck, thanks very much for this information. Based on your job reports we were able to find a bug in the SSD cache system that would trigger “File Not Found” errors during the “Checking files on SSD” step. This has been fixed in the latest v4.6.1 release. Please try it out and let me know if the issue is resolved for you.
Regarding the following:
We did find something similar in our own testing. For context, the cache system was initially designed for fast local SSDs, where we assumed that listing the contents of the cache drive and retrieving their metadata (size, modified time) are relatively fast operations. We did not optimize for parallel file systems such as GPFS where this may not be the case.
The latest update unfortunately does not include any performance improvements, but we’ve recorded the issue and plan to address it in a future update.
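As a rough illustration of the cost involved (not CryoSPARC's actual code), one can time a bare one-level metadata scan of a cache directory; the path below is a placeholder. On a local SSD this finishes almost instantly, while on a parallel file system each file's size/mtime lookup can be a round-trip to a metadata server:

```shell
# Time a one-level listing of cache files with their size and mtime.
# On GPFS/BeeGFS each stat() may hit a remote metadata server, so a
# large, nearly full cache can make this take minutes, not seconds.
CACHE_PATH=${CACHE_PATH:-/tmp}   # placeholder cache location
time find "$CACHE_PATH" -maxdepth 1 -type f -printf '%s %T@ %p\n' | wc -l
```

Running this against the shared cache during a quiet period versus while several jobs are caching can help quantify how much of the slowdown is metadata latency.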