Particles cache issues

Hi @nfrasser,

We updated to v4.5.1 and moved away from per-job caching at some point in May. We are still seeing the same error messages (please see the initial post). Again, the failures seem to correlate with two factors:

  1. Multiple jobs accessing the same cached particles
  2. Overall higher traffic on the cache from the CryoSPARC instance

Both lead to an increased number of failed jobs with the same error message.

I suppose we will try to go back to a per-job caching strategy, unless you have another idea.
As always, we are happy to help by sharing any further information you request, or in any other way.

Cheers,
Lorenz

@DerLorenz Please can you confirm whether:

  1. the cache filesystem is shared between worker hosts
  2. the export CRYOSPARC_CACHE_LOCK_STRATEGY="master" line has been added to cryosparc_worker/config.sh of each worker.
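
(For reference, one way to verify both points on each worker could be a quick grep; the install path below is a placeholder, not the actual location:)

# placeholder path; adjust to where cryosparc_worker is installed on each host
grep CRYOSPARC_CACHE_LOCK_STRATEGY /path/to/cryosparc_worker/config.sh
grep CRYOSPARC_SSD_PATH /path/to/cryosparc_worker/config.sh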

Hi,
I can confirm this is the case on the machine.
The CryoSPARC worker is installed in the (cryosparc) user’s home, which is on a shared filesystem on the cluster.
We had the following config:

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
# single global cache path, requires "master" strategy
export CRYOSPARC_SSD_PATH="/shared/beegfs/path/lane_cache"

I’ve also observed in the logs that the worker is polling for the lock on the master and eventually acquiring it.
These settings had been active roughly since the beginning of June. We have only encountered the issue more recently, within the last two weeks or so.

As a workaround, we’ve now switched to per-job caching by doing the following:

  • removed the CRYOSPARC_CACHE_LOCK_STRATEGY stanza
  • set CRYOSPARC_SSD_PATH="/tmp"

The /tmp filesystem is provisioned per job, i.e. every new CryoSPARC job always starts with an empty cache. Note: in both cases we have set these options:

export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=20
# threads for cache copy stage
export CRYOSPARC_CACHE_NUM_THREADS=8
# new caching system
export CRYOSPARC_IMPROVED_SSD_CACHE=true
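
Putting the workaround together, the cache-related part of cryosparc_worker/config.sh in the per-job setup would look roughly like the sketch below (based only on the settings described above; everything else omitted):

# per-job cache: /tmp is provisioned per cluster job, so every job starts with an empty cache
export CRYOSPARC_SSD_PATH="/tmp"
# CRYOSPARC_CACHE_LOCK_STRATEGY is intentionally not set in this mode
export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=20
# threads for cache copy stage
export CRYOSPARC_CACHE_NUM_THREADS=8
# new caching system
export CRYOSPARC_IMPROVED_SSD_CACHE=true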

Best,
Erich

Thanks for this information @ebirn.
Please can you confirm whether:

  1. @ebirn and @DerLorenz are referring to the same cluster and CryoSPARC installation
  2. per-job caching is an effective workaround for FileNotFoundError

Do I interpret this correctly that caching on /shared/beegfs/path/lane_cache worked for a few weeks in June and July? Did the (re-)emergence of the FileNotFoundError(?) issue coincide with any software or other changes?
It may be helpful to tabulate configuration changes and cache errors like:

(markdown table template)

| time frame | `CACHE_LOCK_STRATEGY` |cache storage|per-job cache |version |other changes| `FileNotFound` |
|-|-|-|-|-|-|-|
|January|not set|beegfs|no|4.4.1|?|yes|
|...|||||||
|May|||||||
|June|||||||
|mid July|||||||
|end July|not set|?|yes|?|?||

Hi @wtempel

Yes, we are on the same machine, @DerLorenz as a scientific user and me as the operator.
I know that there were no config changes, as we deploy and manage such changes with config management and automation tools.

After the change at the end of May (to the master lock strategy), no errors were observed for a while. Later, the daily number of jobs on the CryoSPARC machine increased, and those errors started to happen again (with the master lock strategy still enabled; no config changes were made).

When it happened, I also saw the lock acquisition logging (until we removed the master lock strategy). The last lock/unlock entries:

2024-07-25 13:33:44,833 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2432-1721899456
2024-07-25 13:35:06,275 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2432-1721899456
2024-07-25 13:35:07,994 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2433-1721899493
2024-07-25 13:36:25,284 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2433-1721899493
2024-07-25 13:36:26,801 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2429-1721899430
2024-07-25 13:37:47,330 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2429-1721899430
2024-07-25 13:37:48,061 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2434-1721899791
2024-07-25 13:39:08,770 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2434-1721899791
2024-07-25 13:39:11,449 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2435-1721899809
2024-07-25 13:40:32,694 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2435-1721899809
2024-07-25 13:40:34,180 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2430-1721899850
2024-07-25 13:43:04,791 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2430-1721899850
2024-07-25 13:43:05,245 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2431-1721899943
2024-07-25 13:45:37,425 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2431-1721899943
2024-07-25 13:45:38,175 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2433-1721900542
2024-07-25 13:48:07,600 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2433-1721900542
2024-07-25 13:48:07,824 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2437-1721900483
2024-07-25 13:50:49,105 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2437-1721900483
2024-07-25 13:50:49,246 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2432-1721900444
2024-07-25 13:53:22,013 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2432-1721900444
2024-07-25 13:53:24,564 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2436-1721899836
2024-07-25 13:55:57,209 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2436-1721899836
2024-07-25 13:55:58,582 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2429-1721900660
2024-07-25 13:57:18,687 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2429-1721900660

I also remember seeing in the client logs some polling for the lock when it was not immediately available.
Since we’ve gone back to the per-job SSD cache (i.e. a per-job /tmp that is removed after the job), the previously seen job failures described above are gone.

The CryoSPARC version we’re running is v4.5.1, since the end of May. We did the update to v4.5.1 and the cache reconfiguration at the same time.

Thanks @ebirn for these details.
Please can you confirm that, for all cluster lanes of your CryoSPARC instance(s), cluster jobs truly get terminated when a CryoSPARC user uses the Kill Job action? Or is there a chance that CryoSPARC jobs the user believes to have been killed (and that are recorded as "killed" in the CryoSPARC database) might still be running on a cluster node and interfering with newer CryoSPARC jobs?
If you would like to troubleshoot the issue further, you may want to

  1. revert to CRYOSPARC_CACHE_LOCK_STRATEGY="master"
  2. disable per-job caching
  3. when you encounter a cache-related error:
    • post the Traceback here
    • send us the tgz file created by the command cryosparcm snaplogs, along with the relevant job report
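
For example, the revert and the log collection could look roughly like this (a sketch; the cache path is taken from the earlier posts and should match your installation):

# in cryosparc_worker/config.sh on each worker: re-enable the shared cache
export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
export CRYOSPARC_SSD_PATH="/shared/beegfs/path/lane_cache"

# on the master, after a cache-related failure has occurred:
cryosparcm snaplogs   # produces the tgz file to send, along with the job report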

Hi,
I think we don’t have enough load at the moment to reproduce this, as it is vacation time. I’ll put this in my calendar to take it up again in September; it should be easier then.
(I did a quick check, but most of the logs have been rotated so often that there are almost certainly no records of those previous events left.)

Best,
Erich

Hi @wtempel ,
We’re ready to continue this investigation on our end, and we’ve planned an update to v4.5.3+patch for the end of next week. Does it still make sense to investigate cache issues in this release, or should we first upgrade to v4.6 (have there been significant changes to the caching mechanism)?

Best,
Erich

@ebirn There were no significant changes between v4.5.3+patch and v4.6 that should affect your cache tests.

Hi, we’ve updated to v4.6 today and reverted to locking on the master with the cache on the shared filesystem. We will keep you posted if any errors occur.

We have the same problem with our instance. The cache is on a shared BeeGFS file system, and when the load is high, jobs stall during caching at this stage:

SSD cache ACTIVE at /scratch/burst/cryosparc/instance_donatello:29441 (10 GB reserve)
  Checking files on SSD ...

And eventually they fail with the error:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 142, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 821, in run
    return run_with_executor(rel_sources, executor)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 859, in run_with_executor
    state = drive.allocate(sources, active_run_ids=info["active_run_ids"])
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 621, in allocate
    self.create_run_links(sources)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 520, in create_run_links
    link.symlink_to(f"../../{STORE_DIR}/{source.key_prefix}/{source.key}")
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/pathlib.py", line 1255, in symlink_to
    self._accessor.symlink(target, self, target_is_directory)
FileNotFoundError: [Errno 2] No such file or directory: '../../store-v2/3c/3c87bfd6f69494817c1e97f7a829db683b4f0c0b' -> '/scratch/burst/cryosparc/instance_donatello:29441/links/P356-J94-1727269140/3c87bfd6f69494817c1e97f7a829db683b4f0c0b.mrc'

Repeatedly doing clear/build/queue on the failed jobs will eventually get them to run.
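
As an aside on that traceback: the relative target resolves under the same instance directory, and creating a symlink to a target that does not exist yet normally succeeds (it is just a dangling link), so the errno 2 points at the links/<job-run> directory itself being missing or not yet visible when the call ran. The shell sketch below only illustrates the errno behaviour, with paths copied from the traceback into a throwaway directory; it is not CryoSPARC's actual code path and not a confirmed diagnosis:

CACHE=/tmp/cache_demo/instance_donatello:29441   # throwaway directory mirroring the traceback layout
mkdir -p "$CACHE/store-v2/3c" "$CACHE/links/P356-J94-1727269140"

# a dangling symlink (target not present yet) is created without error:
ln -s ../../store-v2/3c/3c87bfd6f69494817c1e97f7a829db683b4f0c0b \
      "$CACHE/links/P356-J94-1727269140/3c87bfd6f69494817c1e97f7a829db683b4f0c0b.mrc"

# but if the per-job links directory is missing (or not visible on the shared filesystem),
# the same call fails with "No such file or directory" (errno 2), as in the traceback:
rm -rf "$CACHE/links/P356-J94-1727269140"
ln -s ../../store-v2/3c/3c87bfd6f69494817c1e97f7a829db683b4f0c0b \
      "$CACHE/links/P356-J94-1727269140/3c87bfd6f69494817c1e97f7a829db683b4f0c0b.mrc"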

cryosparc_worker/config.sh contains:

export CRYOSPARC_CACHE_NUM_THREADS=8
export CRYOSPARC_CACHE_LOCK_STRATEGY=master

The cluster_info.json contains:

"cache_path" : "/scratch/burst/cryosparc",

@daniel.s.d.larsson What version of CryoSPARC do you use? Do the jobs with cache-related errors involve imported particles as described in Particles cache issues - #18 by nfrasser?

The CS version is 4.5.3. I never performed any particle import jobs in this project. The particle stack comes from a Downsample job, which might be a clue. The particles came from:

Template pick → Extract → 2D class → 3D refine → RBMC → Downsample

@daniel.s.d.larsson When you again encounter an error due to a file missing from the cache, please can you run the command cryosparcm snaplogs and email us the tgz file that the command produces. I will send you a direct message with the email address.

Hi all, we released CryoSPARC v4.6.1 today, which fixes a bug that causes “File not found” issues like the one that @daniel.s.d.larsson recently reported. Please update when you can and let me know if the issue is resolved for you.

Note that export CRYOSPARC_CACHE_LOCK_STRATEGY="master" must still be present in cryosparc_worker/config.sh for parallel file systems.

I will install it in the next few days and try again.