Cached files of running jobs deleted

Will this change in 4.3?

This will likely not change in v4.3.

Hi,

But having an individual cache for each worker would remove all the advantages of a
shared cache (storage size, network traffic, etc.).
What exactly is the problem with the shared cache?
(For smaller datasets it works.)
As we have a lot of big projects, this would really be a problem for us,
so a solution in the near future would really help us.

Is it the same for jobs submitted to a Slurm cluster?
Can they share the cache?

thanks for helping

Florian

I suspect that the problem arises when the cache is full and one job, trying to free some space, deletes files belonging to other jobs, which may still be running.
I didn’t see it become a problem yet, probably because we have quite a large cache quota for each user (10 TB), two-week retention, and we migrated to a new cluster only a few months ago (on the previous machine we didn’t use the cache), but it’s only a matter of time. Huge +1 for supporting cache space shared between workers.
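A minimal shell sketch of the failure mode I suspect (the paths are made up for illustration): a still-running job re-opens a cached file by path after another job has evicted it to free space, and the open fails.

CACHE=/tmp/demo_cache                        # made-up path for illustration
mkdir -p "$CACHE"
echo data > "$CACHE/particles.mrcs"
( sleep 2; cat "$CACHE/particles.mrcs" ) &   # "job B": still running, re-reads the file later
rm "$CACHE/particles.mrcs"                   # "job A": evicts the file to free space
wait                                         # job B fails: No such file or directory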

Our use case for this is NVMe-based Lustre, which is globally much smaller (and has global data retention) than our HDD-based Lustre, so whole projects cannot be stored there.

Hi,

I’m pretty sure that there was still enough space.
Is there a way to check (from the log) why cached
files get removed?
Can I configure it so that only cached files which are not in use get removed?
How does the master process decide which files get removed?

thank you so much for your help

Florian

The problem arises from a combination of how the current cache system tracks cached files and how it clears them for deletion. We are looking for a design that supports the shared-cache case without degrading support for the host-specific cache case.
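As a purely hypothetical illustration of such a mismatch (this is not CryoSPARC’s actual bookkeeping, and the paths are made up): if each worker tracked cached files independently, an eviction by one worker would leave stale entries behind for another worker sharing the same cache directory.

CACHE=/tmp/shared_cache                                    # made-up path for illustration
mkdir -p "$CACHE"
echo data > "$CACHE/J42_particles.mrcs"
echo "$CACHE/J42_particles.mrcs" >> workerB_manifest.txt   # worker B records the cached file
rm "$CACHE/J42_particles.mrcs"                             # worker A evicts it to free space
xargs -a workerB_manifest.txt cat                          # worker B: No such file or directory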


Hello again 🙂

Does the new caching subsystem in 4.4 support the case of a cache location shared between workers? The description of the feature isn’t very detailed. Can you elaborate a bit on what has changed?

@bsobol The new cache system implements different logic for tracking cached files compared to the older cache system. The new cache logic is compatible with cache storage that is shared between workers.


That’s great news. Thanks!

Dear Cryosparc Team,

I tried CRYOSPARC_CACHE_LOCK_STRATEGY="master", but we still have the problem
that we get the “No such file or directory” error as soon as more than 4 big jobs are running in parallel.
We have a central scratch for all workers, with GPFS as the filesystem.

- Is there anything we can do to improve the situation?
- Could you give some insight into where the problem with parallel filesystems is
coming from? (They also provide POSIX locks.)
- Do you know of any site where CryoSPARC is running with a central cache and many parallel jobs/users?

cat cryosparc_worker_hpcl930x/config.sh

export CRYOSPARC_LICENSE_ID="xxxx"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
export CRYOSPARC_IMPROVED_SSD_CACHE=true
export CRYOSPARC_CACHE_NUM_THREADS=12

thanks

Florian

Yes, this happens to us as well. People simply restart jobs, and at times it goes through, but we have also had cases where jobs failed due to a caching error after running for 30 hours.

We have a mix of local GPU nodes as well as cluster nodes, but they share a central scratch for all workers. This happens both on cluster nodes and on local GPU boxes that are not part of the submission system.

Kindly advise how to tackle this issue.

@fbeck @Rajan What version of CryoSPARC do you use? Do the jobs with cache-related errors involve imported particles as described in Particles cache issues - #18 by nfrasser?
@Rajan Do your cluster and “local” nodes share the same cryosparc_worker/ installation? What is the output of the command

/path/to/cryosparc_worker/bin/cryosparcw env | grep LOCK

for each independent cryosparc_worker/ installation?

Hi

We run CryoSPARC version 4.5.1.
Most of the jobs have particles imported from RELION,
but the error also occurs for jobs which run only in CryoSPARC
(no particle import).
We also just get the “particle not found” error, and it takes a very long
time until the job starts.

best

Florian

Worker 1:
./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"
Worker 2:
./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"
Worker 3:
./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

Hi,

Our CryoSPARC version is 4.5.3.

Worker 1: Local nodes (2 nodes; they share the same cryosparc_worker installation)

./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

Worker 2: Cluster nodes (one installation for all nodes)

./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

The SSD cache is shared between the local nodes and the cluster.

Thanks for all the help. I hope we can resolve it soon.

Best
Rajan

@Rajan @fbeck When you next encounter an error due to a file missing from the cache, please run the command
cryosparcm snaplogs and email us the tgz file that the command produces. I will send you a direct message with the email address.

Hi @Rajan and @fbeck, thank you for sharing your logs and for your continued patience as we investigate this error. Unfortunately, we have not yet determined the cause of these “File not found” errors, despite the following steps:

  • We conducted thorough testing of the SSD cache system on a multi-node BeeGFS installation with export CRYOSPARC_CACHE_LOCK_STRATEGY=master in cryosparc_worker/config.sh on the latest CryoSPARC v4.6
  • We did not observe any errors with up to 5 jobs running on different nodes accessing the cache simultaneously.
    • We did observe “File not found” errors on versions of CryoSPARC older than v4.4.0 or with no CRYOSPARC_CACHE_LOCK_STRATEGY set
  • We carefully inspected the logs you provided and have not identified any signs of incorrect behaviour
  • We conducted a thorough audit of our caching and locking code and did not find any gaps in our implementation that could cause these errors

To proceed further with our investigation, could you share any additional relevant details about your system configuration? We’re most interested in the following (a rough collection script sketch follows the list):

  • The most recent CryoSPARC version in which you have observed this error
  • A job report of the most recent job where the error occurred
  • A summary of the specifications for your compute systems, including
    • CPU model
    • Total RAM
    • Operating system and kernel version (e.g., output of uname -a)
    • Types of applications that run on this system and access the shared SSD
    • A summary of processes and services that run on this node alongside CryoSPARC (e.g., if the system uses systemd, what is the output of systemctl --type=service?)
    • Mount options for the particle cache filesystem
  • Configuration of your shared cache system, including
    • Which parallel file system it uses (e.g., GPFS, BeeGFS, Ceph, etc.) and what version.
    • Any configuration files or settings for the parallel file system, particularly any settings that have been overridden from the defaults
    • A summary of node types used (e.g., management, metadata, storage, client, etc.)
    • How many of each type there are
    • Where is each node hosted (e.g., are they running on the GPU compute nodes or on dedicated storage nodes?)
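
If it helps, here is a rough shell sketch that collects most of the per-node details above; CACHE_MOUNT is a placeholder that you would point at your shared particle cache before running it on each node type.

#!/bin/bash
# Rough collection sketch -- CACHE_MOUNT is a placeholder; adjust it for
# your site and run once on each node type (local GPU box, cluster node).
CACHE_MOUNT=/scratch/cryosparc_cache
{
  echo "== kernel ==";      uname -a
  echo "== cpu ==";         lscpu
  echo "== memory ==";      free -h
  echo "== services ==";    systemctl --type=service --state=running
  echo "== cache mount =="; findmnt -T "$CACHE_MOUNT"
  echo "== cache usage =="; df -h "$CACHE_MOUNT"
} > "sysinfo_$(hostname).txt" 2>&1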

As before, you may send us any files you collect by email.

Thank you again!

Hi,

Thanks for working on the problem.
I will try to collect the information as soon as possible.
In brief, we use GPFS on dedicated file servers and CryoSPARC 4.5.1 with
"CRYOSPARC_CACHE_LOCK_STRATEGY=master".
Worker OS: SLES 15 SP4

But I think it may be faster for both sides if we provide an account and
resources so that you can test on our system (I need to check).
Would that be an option for you?

For us the error usually occurs if more than 6 big jobs
(~1 million particles, box size > 128) are started (and caching)
at the same time (2D classification, heterogeneous refinement, plus extraction on top).
Could you test with 6-10 big parallel jobs, started at the same time,
accessing a big cache (50 TB) which is 90% full?
It also sometimes takes a very long time just to find an already cached
dataset (up to 60 min).

thanks

Florian

Hi,

…sent the full info to:
[redacted]@[redacted]
…it usually dies while checking the files on the SSD,
which sometimes takes more than 1 hour.

best

Florian


Hi @fbeck, thanks very much for this information. Based on your job reports we were able to find a bug in the SSD cache system that would trigger “File Not Found” errors during the “Checking files on SSD” step. This has been fixed in the latest v4.6.1 release. Please try it out and let me know if the issue is resolved for you.

Regarding the long time needed to find an already cached dataset:

We did find something similar in our own testing. For context, the cache system was initially designed for fast local SSDs, where we assumed that listing the contents of the cache drive and retrieving file metadata (size, modification time) are relatively fast operations. We did not optimize for parallel filesystems such as GPFS, where this may not be the case.
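
As a rough way to gauge this on a given filesystem (the path below is a placeholder), one can time a full metadata scan of the cache directory; on a local SSD this typically takes seconds, while on a busy parallel filesystem with many files it can take far longer.

CACHE=/scratch/cryosparc_cache   # placeholder -- use your actual cache path
# Walk the cache and stat every file (size + mtime), discarding the output;
# the elapsed time reflects the cost of this kind of metadata scan.
time find "$CACHE" -type f -printf '%s %T@ %p\n' > /dev/null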

The latest update unfortunately does not include any performance improvements, but we’ve recorded the issue and plan to address it in a future update.


Hi @nfrasser,

Thanks for the efforts, very much appreciated.

I have updated my instance today and users have started using it. I will let you know how it goes.

best
rajan