Particles cache issues

Hi @ebirn, thanks again for providing all this information.

It looks like this is exactly what we’ll need to do to make the cache system work reliably for you. I’ve added it to our tracker for implementation in a future release. Will update here when the fix is available!


Hi @nfrasser & @wtempel,

We have now implemented particle caching on a per-job basis. This seemed to solve our problems, but we have just run into another issue.
Following a standard analysis pipeline, everything works just fine.
In one project, I was reimporting particle.star files after a relion 3D classification using the “Import Particle Stack” job. When I refine each imported particle stack on its own, the refinements run successfully. The moment I combine imported stacks, the jobs fail, presumably due to cache locking. We get the following error message:

[CPU:  407.2 MB]
──────────────────────────────────────────────────────────────
SSD cache ACTIVE at /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001 (10 GB reserve) (15 TB quota)
┌─────────────────────┬────────────────────────┐
│ Cache usage         │                 Amount │
├─────────────────────┼────────────────────────┤
│ Total / Usable      │ 244.40 TiB / 13.64 TiB │
│ Used / Free         │  15.27 GiB / 13.63 TiB │
│ Hits / Misses       │     0.00 B / 15.27 GiB │
│ Acquired / Required │     15.27 GiB / 0.00 B │
└─────────────────────┴────────────────────────┘
Progress: [▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 20978/20978 (100%)
  Transferred: FoilHole_8835610_Data_8832246_8832248_20230109_233539_EER_patch_aligned_doseweighted_particles.mrc (479.12 kiB)
      Threads: 8
    Avg speed: 534.33 MiB/s
    Remaining: 0h 00m 00s (0.00 B)
      Elapsed: 0h 01m 16s
  Active jobs: P168-J641
SSD cache complete for 20978 file(s)
──────────────────────────────────────────────────────────────

[CPU:  407.2 MB]
SSD cache: using override environment variable for cache $CRYOSPARC_SSD_PATH=/tmp

[CPU:  445.6 MB]
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 126, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 876, in run_with_executor
    drive.unlock_file(transfer[0].key)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 656, in unlock_file
    cached.unlock()
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 241, in unlock
    assert self.locked, f"Cannot unlock cache store file {self.path}"
AssertionError: Cannot unlock cache store file /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001/store-v2/d7/d73d418654522a5e791f2c152f4a1805001a2652

Here is a little bit of background on the particles:

  1. Particles are extracted in cryoSPARC.
  2. Imported to relion using the cs2star tool (GitHub - brisvag/cs2star). This results in the particle.mrc files being softlinked as particles.mrcs in my relion project and referenced accordingly in the star file.
  3. 3D classification in relion.
  4. Selected classes are individually imported into cryoSPARC. The particle data path is /path/to/softlinked/particles.mrcs (referencing back to the particle.mrc extracted in CS).

@ebirn was wondering whether it is possible to disable the cache locking. In our setup (per-job caching) we should not have any concurrent use of the cache.

If you need any further information/logs etc, let us know.

Thanks a lot for your help!

Best,
Lorenz

@DerLorenz thanks for sharing this; it doesn’t look related to file-system-level locking. Instead, it seems related to a special “lock” filename suffix we use to ensure that two jobs don’t copy the same file. It looks like the cache system completed caching successfully, but then tried to remove the suffix from a file that didn’t have it (i.e., one that had already been unlocked). This may indicate a bug in the cache system.
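
To illustrate the idea (this is only a simplified sketch of the locking scheme, not the actual cache_v2 code): a cache file carries a “lock” suffix while it is being copied and loses it once the transfer is finished, and unlocking a file that no longer carries the suffix trips exactly the kind of assertion shown in your traceback:

# Illustrative sketch only -- not CryoSPARC's actual cache implementation.
import os
import shutil

LOCK_SUFFIX = ".lock"

class CachedFile:
    def __init__(self, path):
        self.path = path                                  # final cache path under store-v2/
        self.locked = os.path.exists(path + LOCK_SUFFIX)  # locked == suffix present on disk

    def lock(self):
        # reserve the cache slot before the transfer starts
        open(self.path + LOCK_SUFFIX, "a").close()
        self.locked = True

    def copy_from(self, source_path):
        # copying is only allowed while the slot is still locked
        assert self.locked, f"Cannot copy {source_path} to unlocked cache file {self.path}"
        shutil.copyfile(source_path, self.path + LOCK_SUFFIX)

    def unlock(self):
        # if the same underlying file shows up twice in one transfer list,
        # the second unlock finds the suffix already gone and fails here
        assert self.locked, f"Cannot unlock cache store file {self.path}"
        os.rename(self.path + LOCK_SUFFIX, self.path)     # drop the suffix
        self.locked = False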

Could you share the output log for this job with me for further troubleshooting? You can find it in the web interface: open the job and go to Metadata > Log. You may send it to me via direct message.

Hi,
Note: the cache is currently scoped to the lifetime of the slurm job itself, i.e. the whole /tmp filesystem will be removed with the job.

We typically see 2 types of stack traces:
Variant 1:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 126, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 855, in run_with_executor
    transfer = task.result()
  File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 644, in transfer_allocated_file
    assert cached.locked, f"Cannot copy {source.path} to unlocked cache file {cached.path}"
AssertionError: Cannot copy /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-hannabox1-1-20230104-124954-208236/J523/extract/FoilHole_8834114_Data_8832246_8832248_20230109_183045_EER_patch_aligned_doseweighted_particles.mrc to unlocked cache file /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001/store-v2/d7/d73d418654522a5e791f2c152f4a1805001a2652

Variant 2:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 126, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 876, in run_with_executor
    drive.unlock_file(transfer[0].key)
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 656, in unlock_file
    cached.unlock()
  File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 241, in unlock
    assert self.locked, f"Cannot unlock cache store file {self.path}"
AssertionError: Cannot unlock cache store file /tmp/instance_imp-cryosparc-1.vbc.ac.at:39001/store-v2/d7/d73d418654522a5e791f2c152f4a1805001a2652

The full log is just megabytes of lines like “Transferred … file to SSD key …”.

Full logs are available here:
https://filesender.aco.net/?s=download&token=403268f0-9eef-4f6c-8dc3-7c61bce9243f

Thanks @DerLorenz and @ebirn, we were able to find the issue on our end. These errors appear to happen when the input particles are located in imported .mrc/.mrcs files which CryoSPARC tracks as unique symbolic links but which resolve to the same original file.

We were able to reproduce it with an arrangement like this:

/
├── particles
│   ├── relion_particles_A.mrcs
│   ├── relion_particles_B.mrcs
│   ├── particles1.mrcs
│   ├── particles2.mrcs
│   └── particles3.mrcs
└── projects
    └── P1
        ├── J1
        │   ├── imported
        │   │   ├── particles1.mrcs -> ../../../../particles/particles1.mrcs
        │   │   ├── particles2.mrcs -> ../../../../particles/particles2.mrcs
        │   │   └── particles3.mrcs -> ../../../../particles/particles3.mrcs
        │   └── imported_particles.cs
        ├── J2
        │   ├── imported
        │   │   ├── particles1.mrcs -> ../../../../particles/particles1.mrcs
        │   │   ├── particles2.mrcs -> ../../../../particles/particles2.mrcs
        │   │   └── particles3.mrcs -> ../../../../particles/particles3.mrcs
        │   └── imported_particles.cs
        └── J3
            └── job.log

J1 imports relion_particles_A.star and J2 imports relion_particles_B.star, but both star files point to particles located in the same .mrcs stack files. Both imports are then given as input to J3. Though the particle locations differ, J3 gets the following list of files to cache:

J1/imported/particles1.mrcs
J1/imported/particles2.mrcs
J1/imported/particles3.mrcs
J2/imported/particles1.mrcs
J2/imported/particles2.mrcs
J2/imported/particles3.mrcs

The cache system sees these paths as unique and tries to copy all of them, but because they resolve to the same original files, it ends up copying each file twice and runs into one of the errors above.
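
As a side note, a quick way to see whether a set of input stack paths collapses onto the same underlying files is to group them by their resolved real paths. This is just a diagnostic sketch with placeholder paths, not something CryoSPARC runs:

# Diagnostic sketch: group candidate cache paths by the file they resolve to.
import os
from collections import defaultdict

paths = [
    "projects/P1/J1/imported/particles1.mrcs",  # placeholder example paths
    "projects/P1/J2/imported/particles1.mrcs",
]

by_target = defaultdict(list)
for p in paths:
    by_target[os.path.realpath(p)].append(p)    # realpath follows symlinks

for target, aliases in by_target.items():
    if len(aliases) > 1:
        print(f"{target} is referenced by {len(aliases)} distinct input paths:")
        for alias in aliases:
            print("  " + alias)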

Could you confirm that you have a similar arrangement? Note that J3 can be much further downstream from the original imported particles.

A workaround you could try until our fix is out is to restack the input particles before running the refinement jobs so that they don’t use the same path.

Thanks again and let me know how this works for you or if you have any questions!

PS: I assume that the particle locations in the import jobs are unique, even though the stack paths are not? Be careful not to provide the same particle locations to refinement jobs, to avoid biasing the results.
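
If you want to double-check that, one option is to compare the resolved particle locations of the two imports. This is a rough sketch assuming cryosparc-tools is installed; the .cs file names below are placeholders:

# Sketch: check two imported particle sets for overlapping particle locations.
# Requires cryosparc-tools; run from the project directory, since blob/path
# entries are typically relative to it. File names are placeholders.
import os
from cryosparc.dataset import Dataset

a = Dataset.load("J1/imported_particles.cs")
b = Dataset.load("J2/imported_particles.cs")

def locations(dset):
    # a particle location = (resolved stack file, index within that stack)
    return {
        (os.path.realpath(str(path)), int(idx))
        for path, idx in zip(dset["blob/path"], dset["blob/idx"])
    }

overlap = locations(a) & locations(b)
print(f"{len(overlap)} particle locations appear in both imports")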

Hi @nfrasser,

Indeed, we have a similar setup. I have several particle stacks (particle.mrc) extracted in one CS extract job. After 3D classification I reimport them to CS.
Since the particles.mrcs symlinks of different classes link to the same original particle.mrc stack (but to different particles therein), the CS refine jobs have to access the same stack whenever two or more classes are combined. So we have the same underlying data structure, if I understood yours correctly.

Thanks for the heads-up; I do not try to refine the same particles twice.

I figured that re-extraction or restacking should take care of the problem. While that is a viable workaround, it means the particles are stored twice (the restacked/re-extracted copy plus the original stack) and thus require more storage; since I am still working with the original stack in relion and other software, I cannot delete the old stack.
Another workaround is to combine the classes’ star files and import the combined.star. That at least reduces the storage overhead.
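
In case it helps anyone else, the combination can be scripted roughly like this (a sketch using the starfile Python package; the file names are placeholders and I assume all inputs share the same optics table):

# Rough sketch: merge the particle tables of several class star files into a
# single combined.star for import. Uses the starfile package and assumes
# RELION 3.1-style files that all share the same optics table.
import pandas as pd
import starfile

inputs = ["class001_particles.star", "class002_particles.star"]  # placeholders

blocks = [starfile.read(p, always_dict=True) for p in inputs]

combined = dict(blocks[0])                    # keep optics etc. from the first file
combined["particles"] = pd.concat(
    [blk["particles"] for blk in blocks], ignore_index=True
)

starfile.write(combined, "combined.star")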

Best,
Lorenz


Hi @DerLorenz and @ebirn, thanks again for your feedback on the new cache system. We’ve addressed both issues noted in this topic (synchronization-related error and symlinking bug) in the latest CryoSPARC v4.5, released May 7 2024.

Once updated, please add the following line to cryosparc_worker/config.sh to use a global lock instead of a file lock:

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"

This means you can revert your implementation of per-job particle caching so that jobs may re-use the cache from previous jobs.
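
Conceptually (a simplified sketch only, not the actual implementation), the “master” strategy means each job asks the master for a single instance-wide lock before it touches the cache and releases it once caching is done, so cache population is serialized across jobs:

# Simplified sketch of the idea behind CRYOSPARC_CACHE_LOCK_STRATEGY="master"
# (not the actual implementation): one instance-wide lock, polled until free,
# held only while a job populates the cache.
import time
import threading

master_lock = threading.Lock()   # stands in for the lock tracked by the master

def cache_particles(job_uid, copy_files_to_cache):
    while not master_lock.acquire(blocking=False):
        print(f"{job_uid}: cache lock busy, retrying ...")
        time.sleep(5)            # poll until the lock becomes available
    try:
        print(f"{job_uid}: lock acquired, copying files to cache")
        copy_files_to_cache()
    finally:
        master_lock.release()
        print(f"{job_uid}: lock released")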


Hi @nfrasser,

We updated to v4.5.1 and reverted our per-job caching at some point in May. We are still seeing the same error messages (please see the initial post). Again, it seems to correlate with two factors:

  1. Multiple jobs accessing the same cached particles
  2. Overall higher traffic on the cache by the CS instance

Both lead to an increased number of failed jobs with the same error message.

I guess that we will try to go back to a per-job caching strategy, unless you have another idea.
As always, we are happy to help by sharing any further information or logs you might need.

Cheers,
Lorenz

@DerLorenz Please can you confirm whether:

  1. the cache filesystem is shared between worker hosts
  2. CRYOSPARC_CACHE_LOCK_STRATEGY="master" has been added to cryosparc_worker/config.sh of each worker.

Hi,
I can confirm this is the case on the machine.
The cryosparc worker is installed in the (cryosparc) user’s home, which is a shared filesystem on the cluster.
We had this config:

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
# single global cache path, requires "master" strategy
export CRYOSPARC_SSD_PATH="/shared/beegfs/path/lane_cache"

I’ve also observed in the logs that the worker is polling for the lock on the master and eventually acquiring it.
These settings had been active roughly since the beginning of June. We only encountered the issue more recently, within the last 2 weeks or so.

As a workaround, we’ve now switched to per-job caching, by doing this:

  • removed the CRYOSPARC_CACHE_LOCK_STRATEGY stanza
  • set CRYOSPARC_SSD_PATH="/tmp"

The /tmp filesystem is provisioned per job, i.e. every new cryosparc job always starts with an empty cache. Note: in both cases we have set these options:

export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=20
# threads for cache copy stage
export CRYOSPARC_CACHE_NUM_THREADS=8
# new caching system
export CRYOSPARC_IMPROVED_SSD_CACHE=true

Best,
Erich

Thanks for this information @ebirn.
Please can you confirm that/whether

  1. @ebirn and @DerLorenz are referring to the same cluster and CryoSPARC installation
  2. per-job caching is an effective workaround for FileNotFoundError

Do I interpret this correctly, namely that caching on /shared/beegfs/path/lane_cache worked for a few weeks in June and July? Did the (re-)emergence of the FileNotFoundError(?) issue coincide with any software or other changes?
It may be helpful to tabulate configuration changes and cache errors like:

| time frame | CACHE_LOCK_STRATEGY | cache storage | per-job cache | version | other changes | FileNotFound |
|------------|---------------------|---------------|---------------|---------|---------------|--------------|
| January    | not set             | beegfs        | no            | 4.4.1   | ?             | yes          |
| May        |                     |               |               |         |               |              |
| June       |                     |               |               |         |               |              |
| mid July   |                     |               |               |         |               |              |
| end July   | not set             | ?             | yes           | ?       | ?             |              |

(markdown table template)

| time frame | `CACHE_LOCK_STRATEGY` |cache storage|per-job cache |version |other changes| `FileNotFound` |
|-|-|-|-|-|-|-|
|January|not set|beegfs|no|4.4.1|?|yes|
|...|||||||
|May|||||||
|June|||||||
|mid July|||||||
|end July|not set|?|yes|?|?||

Hi @wtempel

Yes, we are on the same machine: @DerLorenz as the scientific user, me as the operator.
I know that there were no config changes, as we deploy and manage such changes with config management and automation tools.

After the change at the end of May (to the master lock strategy), no errors were observed for a while. Later, the daily number of jobs on the cryosparc machine increased, and those errors started to happen again (with the master lock strategy still enabled; no config changes were made).

When it happened, I also saw the lock acquisition logging (until we removed the master lock strategy). The last lock/unlock entries were:

2024-07-25 13:33:44,833 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2432-1721899456
2024-07-25 13:35:06,275 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2432-1721899456
2024-07-25 13:35:07,994 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2433-1721899493
2024-07-25 13:36:25,284 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2433-1721899493
2024-07-25 13:36:26,801 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2429-1721899430
2024-07-25 13:37:47,330 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2429-1721899430
2024-07-25 13:37:48,061 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2434-1721899791
2024-07-25 13:39:08,770 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2434-1721899791
2024-07-25 13:39:11,449 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2435-1721899809
2024-07-25 13:40:32,694 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2435-1721899809
2024-07-25 13:40:34,180 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2430-1721899850
2024-07-25 13:43:04,791 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2430-1721899850
2024-07-25 13:43:05,245 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2431-1721899943
2024-07-25 13:45:37,425 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2431-1721899943
2024-07-25 13:45:38,175 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2433-1721900542
2024-07-25 13:48:07,600 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2433-1721900542
2024-07-25 13:48:07,824 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2437-1721900483
2024-07-25 13:50:49,105 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2437-1721900483
2024-07-25 13:50:49,246 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2432-1721900444
2024-07-25 13:53:22,013 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2432-1721900444
2024-07-25 13:53:24,564 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2436-1721899836
2024-07-25 13:55:57,209 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2436-1721899836
2024-07-25 13:55:58,582 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2429-1721900660
2024-07-25 13:57:18,687 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2429-1721900660

I also remember seeing in the client logs some polling for the lock when it was not immediately available.
Since we’ve gone back to the per-job SSD cache (i.e. a per-job /tmp that is removed after the job), the previously seen job failures described above are gone.

We have been running cryosparc v4.5.1 since the end of May; we did the update to v4.5.1 and the cache reconfiguration at the same time.

Thanks @ebirn for these details.
Please can you confirm that for all cluster lanes of your CryoSPARC instance(s), cluster jobs truly get terminated if a CryoSPARC user uses the Kill Job action, or is there a chance that CryoSPARC jobs thought by the user to have been killed (and recorded as "killed" in the CryoSPARC database) might still be running on the cluster node and interfere with newer CryoSPARC jobs?
If you would like to troubleshoot the issue further, you may want to

  1. revert to CRYOSPARC_CACHE_LOCK_STRATEGY="master"
  2. disable per-job caching
  3. when you encounter a cache-related error
    • post the Traceback here
    • send us the tgz file created by the command cryosparcm snaplogs, and the relevant job report

Hi,
I think we don’t have enough load at the moment to reproduce this, as it is vacation time. I’ll put this in my calendar to pick it back up in September; it should be easier then.
(I did a quick check, but most of the logs have been rotated so many times that there are almost certainly no records of those previous events left.)

Best,
Erich


Hi @wtempel ,
We’re ready to continue this investigation on our end, and we’ve planned an update to 4.5.3+patch for the end of next week. Does it still make sense to investigate cache issues in this release, or should we first upgrade to v4.6 (have there been significant changes to the caching mechanism)?

Best,
Erich

@ebirn There were no significant changes between v4.5.3+patch and v4.6 that should affect your cache tests.

Hi, we’ve updated to v4.6 today and reverted to locking on the master with the cache on the shared filesystem. We will keep you posted if any errors occur.

We have the same problem with our instance. The cache is on a shared BeeGFS file system, and when the load is high, jobs stall during caching at this stage:

SSD cache ACTIVE at /scratch/burst/cryosparc/instance_donatello:29441 (10 GB reserve)
  Checking files on SSD ...

They eventually fail with the error:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 142, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 821, in run
    return run_with_executor(rel_sources, executor)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 859, in run_with_executor
    state = drive.allocate(sources, active_run_ids=info["active_run_ids"])
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 621, in allocate
    self.create_run_links(sources)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 520, in create_run_links
    link.symlink_to(f"../../{STORE_DIR}/{source.key_prefix}/{source.key}")
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/pathlib.py", line 1255, in symlink_to
    self._accessor.symlink(target, self, target_is_directory)
FileNotFoundError: [Errno 2] No such file or directory: '../../store-v2/3c/3c87bfd6f69494817c1e97f7a829db683b4f0c0b' -> '/scratch/burst/cryosparc/instance_donatello:29441/links/P356-J94-1727269140/3c87bfd6f69494817c1e97f7a829db683b4f0c0b.mrc'

Repeatedly doing clear/build/queue on failed jobs will eventually get it to run.

cryosparc_worker/config.sh contains:

export CRYOSPARC_CACHE_NUM_THREADS=8
export CRYOSPARC_CACHE_LOCK_STRATEGY=master

The cluster_info.json contains:

"cache_path" : "/scratch/burst/cryosparc",

@daniel.s.d.larsson What version of CryoSPARC do you use? Do the jobs with cache-related errors involve imported particles as described in Particles cache issues - #18 by nfrasser?