Particles cache issues

Thanks @ebirn for providing these details.
Please can you describe the cache implementation, including details like

  • Does the cache_path as defined for the CryoSPARC target include symbolic links?
  • Is the storage for cache_path local to the compute node or shared between compute nodes?
  • In case of shared cache storage, please can you post the storage server (export) and client (mount) configuration settings/options?

Hi,

  • The configured cache path does not contain any symlinks.
  • It is a shared filesystem between compute nodes.
  • The filesystem type is BeeGFS. It is a shared filesystem for multiple users, mounted on all compute nodes in the cluster, and constantly used for various temporary data (as a scratch filesystem).

There is sufficient space in the filesystem, and the user running the CryoSPARC jobs has access to it. While we can sporadically reproduce this cache issue, I don’t believe it is an issue with the underlying filesystem; if it were, I would expect the overall error rates to be higher than what we see.
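
For reference, the BeeGFS client mount options on the compute nodes can be read from /proc/mounts, e.g. (a quick sketch, to be run on any compute node):

# list BeeGFS mounts together with the client-side mount options
grep beegfs /proc/mounts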

@DerLorenz can you confirm that this is happening only with specific job type(s), and that with those it is not 100% reproducible?

Best,
Erich

Thanks for looking into this @wtempel.

All jobs that require caching fail in about 1/3 of all submissions. We brute-force the jobs by relaunching them until they eventually work. Consequently, the statistics posted by @ebirn do not reflect the current status, as we simply restart the jobs; once a relaunched job finishes, the initially failed attempts are no longer registered as “failed”.

@wtempel is there any other information that we can provide in order to help this debugging along?

There is. We would like to know whether the problems are related to the filesystem type (of the cache storage). Would it be possible for you to test caching reliability when cache storage is implemented on a different filesystem type? For example, are workers’ /tmp directories implemented on a different filesystem type, and do they have sufficient capacity to temporarily provide cache storage?

We’ve tried to reproduce the issue on our 2nd CryoSPARC instance, but we were unable to do so.
On the 2nd instance, the caches are per-node local SSDs.
On the instance where the issue is triggered, it is a cluster lane with a shared filesystem (BeeGFS).

We observe this as a regression on the shared filesystem since we updated to v4.4.1. We assume it is related to the cache locking implementation on BeeGFS.

Do you recognise this as a bug, and if not, what is your recommended setup in this situation?
Should we disable the cache? Or is there a way to configure a cache path that is scoped to each compute job in the cluster (i.e. Dynamic SSD Cache Paths), but on an ephemeral filesystem,
i.e. a /beegfs/per-job-dir that is wiped after the job run? My impression was that the cluster lane effectively requires a shared filesystem as cache.

There is at least one way:

  1. a (slurm.conf) Prolog= script that creates a /scr/slurm_job_${SLURM_JOB_ID} directory and chowns it to $SLURM_JOB_USER
  2. combined with a TaskProlog= script that echoes a variable definition pointing to the directory created in the previous step, like
    echo "export SLURM_SCR=/scr/slurm_job_${SLURM_JOB_ID}"
    
  3. combined with a line in the cluster script template like
    export CRYOSPARC_SSD_PATH=$SLURM_SCR
    
    It may be possible to instead just
    echo "export CRYOSPARC_SSD_PATH=/scr/slurm_job_${SLURM_JOB_ID}"
    
    in the task prolog (previous step), but I have not tested that (see CryoSPARC environment variables).
  4. combined with an Epilog= script that removes the job-specific scratch directory (a rough sketch of these scripts follows this list).
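
For illustration, here is a minimal sketch of such scripts, assuming a node-local scratch filesystem mounted at /scr as above (script names are examples; adapt to your site):

#!/bin/bash
# prolog.sh (referenced as Prolog= in slurm.conf); runs as root on the compute node
# before the job starts and creates a per-job scratch directory owned by the job user.
mkdir -p "/scr/slurm_job_${SLURM_JOB_ID}"
chown "${SLURM_JOB_USER}" "/scr/slurm_job_${SLURM_JOB_ID}"

#!/bin/bash
# task_prolog.sh (referenced as TaskProlog= in slurm.conf); lines printed to stdout
# that begin with "export" are added to the task environment.
echo "export SLURM_SCR=/scr/slurm_job_${SLURM_JOB_ID}"

#!/bin/bash
# epilog.sh (referenced as Epilog= in slurm.conf); runs as root after the job ends
# and removes the per-job scratch directory.
rm -rf "/scr/slurm_job_${SLURM_JOB_ID}"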

Worker host-specific, non-shared scratch storage is (supposed to be) supported and is, as far as I know, the most performant option (NVMe SSD). Did you find otherwise?

We have the infrastructure in place and automatically wipe /tmp for each job after it finishes (we have a private /tmp bind-mounted for each job).

For our clarification: the new cache v2 does not keep any information in the database, and the cache is self-contained in the filesystem(s), i.e. we will not accumulate unusable cache objects (or metadata about them) when the per-job cache directory is wiped after each run.

We see the documentation in the cache_v2.py file referencing a TODO for an alternative global cache locking mechanism. This is also what we would need. Currently our BeeGFS does not have file locking enabled, and we cannot easily change this at the moment.

Hi @ebirn, thanks again for providing all this information.

It looks like this is exactly what we’ll need to do to make the cache system work reliably for you. I’ve added it to our tracker for implementation in a future release. Will update here when the fix is available!

Hi @nfrasser & @wtempel,

We have now implemented particle caching on a per-job basis. This seemed to solve our problems, but we have just run into another issue.
Following a standard analysis pipeline, everything works just fine.
In one project I was re-importing particle .star files after a RELION 3D classification, using the “Import Particle Stack” job. When each imported particle stack is refined on its own, the refinements run successfully. The moment I combine imported stacks, the jobs fail, presumably due to cache locking. We get the following error message:

[CPU:  407.2 MB]
──────────────────────────────────────────────────────────────
SSD cache ACTIVE at /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001 (10 GB reserve) (15 TB quota)
┌─────────────────────┬────────────────────────┐
│ Cache usage         │                 Amount │
├─────────────────────┼────────────────────────┤
│ Total / Usable      │ 244.40 TiB / 13.64 TiB │
│ Used / Free         │  15.27 GiB / 13.63 TiB │
│ Hits / Misses       │     0.00 B / 15.27 GiB │
│ Acquired / Required │     15.27 GiB / 0.00 B │
└─────────────────────┴────────────────────────┘
Progress: [▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 20978/20978 (100%)
  Transferred: FoilHole_8835610_Data_8832246_8832248_20230109_233539_EER_patch_aligned_doseweighted_particles.mrc (479.12 kiB)
      Threads: 8
    Avg speed: 534.33 MiB/s
    Remaining: 0h 00m 00s (0.00 B)
      Elapsed: 0h 01m 16s
  Active jobs: P168-J641
SSD cache complete for 20978 file(s)
──────────────────────────────────────────────────────────────

[CPU:  407.2 MB]
SSD cache: using override environment variable for cache $CRYOSPARC_SSD_PATH=/tmp

[CPU:  445.6 MB]
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 126, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 876, in run_with_executor
    drive.unlock_file(transfer[0].key)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 656, in unlock_file
    cached.unlock()
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 241, in unlock
    assert self.locked, f"Cannot unlock cache store file {self.path}"
AssertionError: Cannot unlock cache store file /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001/store-v2/d7/d73d418654522a5e791f2c152f4a1805001a2652

Here is a little background on the particles:

  1. Particles are extracted in CryoSPARC.
  2. Imported into RELION using the cs2star tool (GitHub - brisvag/cs2star). This results in the particle .mrc being soft-linked as particles.mrcs in my RELION project and referenced accordingly in the .star file.
  3. 3D classification in RELION.
  4. Selected classes are individually imported into CryoSPARC. The particle data path is the /path/to/softlinked/particles.mrcs (referencing back to the particle .mrc extracted in CryoSPARC).

@ebirn was wondering if there is a possibility to disable the cache locking. In our setup (per-job caching) we should not have any concurrent use of the cache.

If you need any further information/logs etc, let us know.

Thanks a lot for your help!

Best,
Lorenz

@DerLorenz thanks for sharing this; it doesn’t look related to filesystem-level locking. Instead, it seems related to a special filename suffix, “lock”, that we use to ensure that two jobs don’t copy the same file. It looks like the cache system completed caching successfully but then tried to remove the filename suffix from a file that didn’t have it (i.e., one that had already been unlocked). This may indicate a bug in the cache system.

Could you share the output log for this job with me for further troubleshooting? You can find it in the web interface: open the job and go to Metadata > Log. You may send it to me via direct message.

Hi,
Note: the cache is currently scoped to the lifetime of the Slurm job itself, i.e. this whole /tmp filesystem will be removed with the job.

We typically see 2 types of stack traces:
Variant 1:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 126, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 855, in run_with_executor
    transfer = task.result()
  File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 644, in transfer_allocated_file
    assert cached.locked, f"Cannot copy {source.path} to unlocked cache file {cached.path}"
AssertionError: Cannot copy /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-hannabox1-1-20230104-124954-208236/J523/extract/FoilHole_8834114_Data_8832246_8832248_20230109_183045_EER_patch_aligned_doseweighted_particles.mrc to unlocked cache file /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001/store-v2/d7/d73d418654522a5e791f2c152f4a1805001a2652

Variant 2:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 126, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 876, in run_with_executor
    drive.unlock_file(transfer[0].key)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 656, in unlock_file
    cached.unlock()
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 241, in unlock
    assert self.locked, f"Cannot unlock cache store file {self.path}"
AssertionError: Cannot unlock cache store file /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001/store-v2/d7/d73d418654522a5e791f2c152f4a1805001a2652

The full log is just megabytes of “Transferred … file to SSD key …” lines.

Full logs are available here:
https://filesender.aco.net/?s=download&token=403268f0-9eef-4f6c-8dc3-7c61bce9243f

Thanks @DerLorenz and @ebirn, we were able to find the issue on our end. These errors appear to happen when the input particles are located in imported .mrc/.mrcs files which CryoSPARC tracks as unique symbolic links, but which resolve to the same original file.

We were able to reproduce it with an arrangement like this:

/
├── particles
│   ├── relion_particles_A.mrcs
│   ├── relion_particles_B.mrcs
│   ├── particles1.mrcs
│   ├── particles2.mrcs
│   └── particles3.mrcs
└── projects
    └── P1
        ├── J1
        │   ├── imported
        │   │   ├── particles1.mrcs -> ../../../../particles/particles1.mrcs
        │   │   ├── particles2.mrcs -> ../../../../particles/particles2.mrcs
        │   │   └── particles3.mrcs -> ../../../../particles/particles3.mrcs
        │   └── imported_particles.cs
        ├── J2
        │   ├── imported
        │   │   ├── particles1.mrcs -> ../../../../particles/particles1.mrcs
        │   │   ├── particles2.mrcs -> ../../../../particles/particles2.mrcs
        │   │   └── particles3.mrcs -> ../../../../particles/particles3.mrcs
        │   └── imported_particles.cs
        └── J3
            └── job.log

J1 imports relion_particles_A.star and J2 imports relion_particles_B.star, but both .star files point to particles located in the same .mrcs stack files. Both imports are then given as input to J3. Though the particle locations differ, J3 gets the following list of files to cache:

J1/imported/particles1.mrcs
J1/imported/particles2.mrcs
J1/imported/particles3.mrcs
J2/imported/particles1.mrcs
J2/imported/particles2.mrcs
J2/imported/particles3.mrcs

The cache system sees these paths as unique and tries to copy all of them, but because they resolve to the same original files, it ends up copying the same file twice and runs into one of the errors above.

Could you confirm that you have a similar arrangement? Note that J3 can be much further downstream from the original imported particles.
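
If it helps to check, one way to spot imported symlinks that resolve to the same underlying stack is something like the following (a sketch only; /projects/P1 refers to the hypothetical layout above):

# print resolved symlink targets that are referenced by more than one imported .mrcs path
find /projects/P1 -path '*/imported/*.mrcs' -type l -exec readlink -f {} \; | sort | uniq -d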

Until our fix is out, a workaround you could try is to restack the input particles before running the refinement jobs, so that they don’t use the same paths.

Thanks again and let me know how this works for you or if you have any questions!

PS: I assume that the particle locations in the import jobs are unique, even though the paths are not? Be careful not to provide the same particle locations to refinement jobs, to avoid biasing the results.

Hi @nfrasser,

Indeed, we have a similar setup. I have several particle stacks (particle .mrc files) extracted in one CryoSPARC extract job. After 3D classification I re-import them into CryoSPARC.
As the particles.mrcs symlinks of different classes link to the same original particle .mrc stack (but to different particles therein), the CryoSPARC refine jobs have to access the same stack whenever two or more classes are combined. So we have the same underlying data structure, assuming I understood yours correctly.

Thanks for the heads up, I do not try to refine the same particles.

I figured that re-extraction and restacking should take care of the problem. While this is a viable workaround, it means the particles are stored twice (restacked/re-extracted plus the original stack) and thus require more storage. As I am also still working with the original stack in RELION and other software, I cannot delete the old stack.
Another workaround is to combine the classes’ .star files and import the combined .star file. That at least reduces the storage requirement.

Best,
Lorenz

Hi @DerLorenz and @ebirn, thanks again for your feedback on the new cache system. We’ve addressed both issues noted in this topic (the synchronization-related error and the symlinking bug) in the latest CryoSPARC v4.5, released May 7, 2024.

Once updated, please add the following line to cryosparc_worker/config.sh to use a global lock instead of a file lock:

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"

This means you can revert your implementation of per-job particle caching so that jobs may re-use the cache from other jobs.

Hi @nfrasser,

We updated to v4.5.1 and reverted the per-job caching at some point in May. We are still seeing the same error messages (please see the initial post). Again, it seems to correlate with two factors:

  1. Multiple jobs accessing the same cached particles
  2. Overall higher traffic on the cache from the CryoSPARC instance

Both lead to an increased number of failed jobs with the same error message.

I guess we will try to go back to a per-job caching strategy, unless you have another idea.
As always, we are happy to help by sharing any further information you need, or by any other means.

Cheers,
Lorenz

@DerLorenz Please can you confirm whether:

  1. the cache filesystem is shared between worker hosts
  2. the CRYOSPARC_CACHE_LOCK_STRATEGY setting has been added to cryosparc_worker/config.sh of each worker.

Hi,
I can confirm this is the case on the machine.
The CryoSPARC worker is installed in the (cryosparc) user’s home directory, which is on a shared filesystem on the cluster.
We had the following configuration:

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
# single global cache path, requires "master" strategy
export CRYOSPARC_SSD_PATH="/shared/beegfs/path/lane_cache"

I’ve also observed in the logs that the worker is polling for the lock on the master and eventually acquiring it.
These settings had been active roughly since the beginning of June. We have only encountered the issue more recently, within the last two weeks or so.

As a workaround, we’ve now switched to per-job caching by doing the following:

  • removed the CRYOSPARC_CACHE_LOCK_STRATEGY stanza
  • set CRYOSPARC_SSD_PATH="/tmp"

The /tmp filesystem is provisioned per job, i.e. for every new CryoSPARC job there will always be an empty cache. Note: in both cases we have set these options:

export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=20
# threads for cache copy stage
export CRYOSPARC_CACHE_NUM_THREADS=8
# new caching system
export CRYOSPARC_IMPROVED_SSD_CACHE=true

Best,
Erich

Thanks for this information @ebirn.
Please can you confirm that/whether

  1. @ebirn and @DerLorenz are referring to the same cluster and CryoSPARC installation
  2. per-job caching is an effective workaround for FileNotFoundError

Do I interpret this correctly that caching on /shared/beegfs/path/lane_cache worked for a few weeks in June and July? Did the (re-)emergence of the FileNotFoundError(?) issue coincide with any software or other changes?
It may be helpful to tabulate configuration changes and cache errors like:

| time frame | `CACHE_LOCK_STRATEGY` | cache storage | per-job cache | version | other changes | `FileNotFound` |
|-|-|-|-|-|-|-|
| January | not set | beegfs | no | 4.4.1 | ? | yes |
| May | | | | | | |
| June | | | | | | |
| mid July | | | | | | |
| end July | not set | ? | yes | ? | ? | |

(markdown table template)

| time frame | `CACHE_LOCK_STRATEGY` |cache storage|per-job cache |version |other changes| `FileNotFound` |
|-|-|-|-|-|-|-|
|January|not set|beegfs|no|4.4.1|?|yes|
|...|||||||
|May|||||||
|June|||||||
|mid July|||||||
|end July|not set|?|yes|?|?||