Particle cache issues

Hi @nfrasser,

We updated to v4.5.1 and moved away from per-job caching at some point in May. We are still seeing the same error messages (please see the initial post). Again, the errors seem to correlate with two factors:

  1. Multiple jobs accessing the same cached particles
  2. Overall higher traffic on the cache from the CryoSPARC instance

Both lead to an increased number of failed jobs with the same error message.

I guess we will try to go back to a per-job caching strategy, unless you have another idea.
As always, we are happy to help by sharing any further information you request, or in any other way.

Cheers,
Lorenz

@DerLorenz Please can you confirm whether:

  1. the cache filesystem is shared between worker hosts
  2. CRYOSPARC_CACHE_LOCK_STRATEGY="master" has been added to cryosparc_worker/config.sh of each worker.
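
For reference, you could check this on each worker host with something like the following (the install path below is a placeholder, not your actual path; adjust it to your setup):

# run on every worker host; replace the path with your actual cryosparc_worker location
grep -E 'CRYOSPARC_(CACHE_LOCK_STRATEGY|SSD_PATH)' /path/to/cryosparc_worker/config.sh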

Hi,
I can confirm this is the case on the machine.
The CryoSPARC worker is installed in the cryosparc user’s home directory, which is on a shared filesystem on the cluster.
We had the following config:

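# acquire the cache lock via the CryoSPARC master (workers poll the master for the lock)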
export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
# single global cache path, requires "master" strategy
export CRYOSPARC_SSD_PATH="/shared/beegfs/path/lane_cache"

I’ve also observed in the logs that the worker is polling for the lock on the master and eventually acquiring it.
These settings were active from roughly the beginning of June. We only started encountering the issue more recently, about two weeks ago.

As a workaround, we’ve now switched to per-job caching by doing the following:

  • removed the CRYOSPARC_CACHE_LOCK_STRATEGY stanza
  • set CRYOSPARC_SSD_PATH="/tmp"

The /tmp filesystem is provisioned per job, i.e. every new CryoSPARC job always starts with an empty cache. Note: In both cases we have also set these options:

export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=20
# threads for cache copy stage
export CRYOSPARC_CACHE_NUM_THREADS=8
# new caching system
export CRYOSPARC_IMPROVED_SSD_CACHE=true
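
Putting the workaround together, the relevant section of cryosparc_worker/config.sh now looks roughly like this (a sketch combining the settings above; the CRYOSPARC_CACHE_LOCK_STRATEGY line is simply absent):

# per-job caching workaround (sketch of the relevant cryosparc_worker/config.sh section)
# per-job /tmp, provisioned empty for every job and removed afterwards
export CRYOSPARC_SSD_PATH="/tmp"
# cache lifetime in days (little practical effect here, since /tmp is wiped per job)
export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=20
# threads for cache copy stage
export CRYOSPARC_CACHE_NUM_THREADS=8
# new caching system
export CRYOSPARC_IMPROVED_SSD_CACHE=true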

Best,
Erich

Thanks for this information @ebirn.
Please can you confirm that/whether

  1. @ebirn and @DerLorenz are referring to the same cluster and CryoSPARC installation
  2. per-job caching is an effective workaround for FileNotFoundError

Do I interpret this correctly: caching on /shared/beegfs/path/lane_cache worked for a few weeks in June and July? Did the (re-)emergence of the FileNotFoundError(?) issue coincide with any software or other changes?
It may be helpful to tabulate configuration changes and cache errors, for example with a markdown table like the template below (please copy and fill in):

| time frame | `CACHE_LOCK_STRATEGY` |cache storage|per-job cache |version |other changes| `FileNotFound` |
|-|-|-|-|-|-|-|
|January|not set|beegfs|no|4.4.1|?|yes|
|...|||||||
|May|||||||
|June|||||||
|mid July|||||||
|end July|not set|?|yes|?|?||

Hi @wtempel

Yes, we are on the same machine: @DerLorenz as the scientific user, me as the operator.
I know that there were no config changes, as we deploy and manage such changes with config management and automation tools.

After the change at the end of May (to the master lock strategy), no errors were observed for a while. Later, the daily number of jobs on the CryoSPARC machine increased, and those errors started to happen again (with the master lock strategy still enabled; no config changes were made).

When it happened, I also saw the lock-acquisition logging (until we removed the master lock strategy). The last lock/unlock entries:

2024-07-25 13:33:44,833 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2432-1721899456
2024-07-25 13:35:06,275 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2432-1721899456
2024-07-25 13:35:07,994 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2433-1721899493
2024-07-25 13:36:25,284 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2433-1721899493
2024-07-25 13:36:26,801 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2429-1721899430
2024-07-25 13:37:47,330 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2429-1721899430
2024-07-25 13:37:48,061 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2434-1721899791
2024-07-25 13:39:08,770 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2434-1721899791
2024-07-25 13:39:11,449 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2435-1721899809
2024-07-25 13:40:32,694 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2435-1721899809
2024-07-25 13:40:34,180 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2430-1721899850
2024-07-25 13:43:04,791 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2430-1721899850
2024-07-25 13:43:05,245 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2431-1721899943
2024-07-25 13:45:37,425 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2431-1721899943
2024-07-25 13:45:38,175 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2433-1721900542
2024-07-25 13:48:07,600 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2433-1721900542
2024-07-25 13:48:07,824 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2437-1721900483
2024-07-25 13:50:49,105 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2437-1721900483
2024-07-25 13:50:49,246 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2432-1721900444
2024-07-25 13:53:22,013 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2432-1721900444
2024-07-25 13:53:24,564 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2436-1721899836
2024-07-25 13:55:57,209 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2436-1721899836
2024-07-25 13:55:58,582 job_run_lock         INFO     | Lock ssd_cache acquired by P150-J2429-1721900660
2024-07-25 13:57:18,687 job_run_unlock       INFO     | Releasing lock ssd_cache from P150-J2429-1721900660

I also remember seeing in the client logs some polling for the lock when it was not immediately available.
Since we’ve gone back to the per-job SSD cache (i.e. a per-job /tmp that is removed after the job), the job failures described above are gone.

We have been running CryoSPARC v4.5.1 since the end of May; we did the update to v4.5.1 and the cache reconfiguration at the same time.

Thanks @ebirn for these details.
Please can you confirm that for all cluster lanes of your CryoSPARC instance(s), cluster jobs truly get terminated if a CryoSPARC user uses the Kill Job action, or is there a chance that CryoSPARC jobs thought by the user to have been killed (and recorded as "killed" in the CryoSPARC database) might still be running on the cluster node and interfere with newer CryoSPARC jobs?
If you would like to troubleshoot the issue further, you may want to

  1. revert to CRYOSPARC_CACHE_LOCK_STRATEGY="master"
  2. disable per-job caching
  3. when you encounter a cache-related error
    • post the Traceback here
    • send us the tgz file created by the command
      cryosparcm snaplogs and the relevant job report
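
For example, run on the CryoSPARC master node:

# bundles instance logs into a tgz archive that you can attach to your report
cryosparcm snaplogs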

Hi,
I think we don’t have enough load at the moment to reproduce this, as it is vacation time. I’ll put this in my calendar and take it up again in September; it should be easier then.
(I did a quick check, but the logs have been rotated so many times that there are almost certainly no records of those previous events left.)

Best,
Erich
