Error with improved shared SSD cache

Hi,

we have a shared cache that is accessed by multiple worker nodes (7) in a master/worker installation.
We use the improved SSD cache. I also did a full cache reset according to:

For ~20% of the started jobs we get a "file not found" error, see below.
The error seems to occur more frequently when the load is higher (more jobs).
If you resubmit the same job, it usually starts running; sometimes multiple attempts are needed.

thanks for helping

Florian

Worker Configuration
cat config.sh

export CRYOSPARC_LICENSE_ID="xxxxxxxxx"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_IMPROVED_SSD_CACHE=true
export CRYOSPARC_CACHE_NUM_THREADS=6

Error Message:

[CPU: 186.4 MB Avail: 665.48 GB]  Master running v4.4.1, worker running v4.4.1
[CPU: 186.6 MB Avail: 665.48 GB]  Working in directory: /fs/pool/pool-cryosparc/users/user23/xxxxxxxx/J485
[CPU: 186.6 MB Avail: 665.48 GB]  Running on lane h9002-chkGPU
[CPU: 186.6 MB Avail: 665.48 GB]  Resources allocated:
[CPU: 186.6 MB Avail: 665.48 GB]    Worker:  hpcl9002
[CPU: 186.6 MB Avail: 665.48 GB]    CPU   :  [8, 9]
[CPU: 186.6 MB Avail: 665.48 GB]    GPU   :  [2]
[CPU: 186.6 MB Avail: 665.48 GB]    RAM   :  [2, 6, 7]
[CPU: 186.6 MB Avail: 665.48 GB]    SSD   :  True
[CPU: 186.6 MB Avail: 665.48 GB]  --------------------------------------------------------------
[CPU: 186.6 MB Avail: 665.48 GB]  Importing job module for job type class_2D_new...
[CPU: 218.3 MB Avail: 665.21 GB]  Job ready to run
[CPU: 218.3 MB Avail: 665.21 GB]
[CPU: 268.8 MB Avail: 666.04 GB]  Using random seed of 368515461
[CPU: 269.1 MB Avail: 666.04 GB]  Loading a ParticleStack with 58638 items...
[CPU: 269.1 MB Avail: 666.04 GB]  ──────────────────────────────────────────────
    SSD cache ACTIVE at /fs/pool/pool-briggs-scratch/cryoSparc/instance_brcryosparc:xxxxx
    (10 GB reserve) (52 TB quota)
    Checking files on SSD ...
[CPU: 1.28 GB Avail: 664.07 GB]
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 73, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.run_class_2D
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/cryosparc_compute/particles.py", line 120, in read_blobs
    u_blob_paths = cache_run(u_rel_paths)
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/cryosparc_compute/jobs/cache_v2.py", line 796, in run
    return run_with_executor(rel_sources, executor)
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/cryosparc_compute/jobs/cache_v2.py", line 828, in run_with_executor
    state = drive.allocate(sources, active_run_ids=info["active_run_ids"])
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/cryosparc_compute/jobs/cache_v2.py", line 612, in allocate
    self.create_run_links(sources)
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/cryosparc_compute/jobs/cache_v2.py", line 511, in create_run_links
    link.symlink_to(f"../../{STORE_DIR}/{source.key_prefix}/{source.key}")
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/pathlib.py", line 1384, in symlink_to
    self._accessor.symlink(target, self, target_is_directory)
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.4.1/cryosparc_worker_hpcl900x/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/pathlib.py", line 446, in symlink
    return os.symlink(a, b)
FileNotFoundError: [Errno 2] No such file or directory: '../../store-v2/6f/6f0fede7b9d211ba2b2492bac7a3680ddf2f2090' -> '/fs/pool/pool-briggs-scratch/cryoSparc/instance_brcryosparc:38001/links/P182-J485-1712812342/6f0fede7b9d211ba2b2492bac7a3680ddf2f2090.mrc'
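As a side note (my reading of the traceback, not a confirmed diagnosis): os.symlink() raises FileNotFoundError when the directory that should contain the new link is missing, while a dangling link target by itself is allowed. So the error above suggests the per-job links/P182-J485-... directory was missing or not yet visible on this node at the moment the link was created. A minimal, self-contained sketch of that failure mode (made-up paths and a placeholder hash, not the CryoSPARC code):

import os
import tempfile

root = tempfile.mkdtemp()
# Per-job link directory, intentionally never created to mimic the failure.
links_dir = os.path.join(root, "links", "P182-J485-1712812342")

# A dangling target would be fine; the missing *parent* of the link is what raises.
target = "../../store-v2/6f/0000000000000000000000000000000000000000"
link = os.path.join(links_dir, "0000000000000000000000000000000000000000.mrc")

try:
    os.symlink(target, link)
except FileNotFoundError as exc:
    print("reproduced:", exc)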

Just to add a bit more detail here: when more jobs are submitted, they often all get stuck at the "Checking files on SSD" stage (SSD cache ACTIVE at /fs/pool/pool-briggs-scratch/cryoSparc/instance_brcryosparc:xxxxx (10 GB reserve) (52 TB quota) Checking files on SSD ...), sometimes for hours, and eventually crash with the file-not-found error.

Thank you!

Best regards,
Hui

What is the output of the command

stat -f /fs/pool/pool-briggs-scratch/cryoSparc/

?

Hi,

here is the output:
stat -f /fs/pool/pool-briggs-scratch/cryoSparc/
File: "/fs/pool/pool-briggs-scratch/cryoSparc/"
ID: ef0009d00000002 Namelen: 255 Type: gpfs
Block size: 4194304 Fundamental block size: 4194304
Blocks: Total: 102236160 Free: 34378155 Available: 34378155
Inodes: Total: 2048576512 Free: 2030581282
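For reference, those block counts (4 MiB blocks) work out to roughly 390 TiB total and about 131 TiB free, so the volume itself is not close to full. Quick arithmetic:

# Convert the stat -f block counts above into capacities.
block_size = 4194304            # fundamental block size in bytes (4 MiB)
total_blocks = 102236160
free_blocks = 34378155

tib = 1024 ** 4
print(f"total: {total_blocks * block_size / tib:.1f} TiB")   # ~390 TiB
print(f"free : {free_blocks * block_size / tib:.1f} TiB")    # ~131 TiB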

best

Florian

What type of file locking is in effect on the filesystem?
I do not have a GPFS filesystem for testing, but I suspect the setting can be shown with the mmlsfs command's -D option (docs).
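If it is easy to try, a quick cross-node probe along these lines (placeholder path; purely illustrative, not something CryoSPARC itself runs) would also show whether an advisory lock taken on one node is visible on the others. Run it on two nodes at roughly the same time; the second run should report the lock as busy while the first still holds it:

import fcntl
import os
import time

probe_path = "/fs/pool/pool-briggs-scratch/cryoSparc/.lock_probe"  # placeholder
fd = os.open(probe_path, os.O_CREAT | os.O_RDWR, 0o644)
try:
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking exclusive lock
    except OSError:
        print("lock is held elsewhere (locking works across nodes)")
    else:
        print("lock acquired; holding for 30 s ...")
        time.sleep(30)
        fcntl.lockf(fd, fcntl.LOCK_UN)
finally:
    os.close(fd)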

Hi,

output gpfs:

fs41:~ # mmlsfs gpfs-me4024-012 -D
flag                value                    description
------------------- ------------------------ -----------------------------------
 -D                 nfs4                     File locking semantics in effect
fs41:~ #

best

Florian

@fbeck We are working to address this issue in a future CryoSPARC release.

Thanks for doing this.

Florian

Hi @fbeck, thanks very much for reporting this. We’ve fixed this in the latest CryoSPARC v4.5, released May 7 2024.

Once you update, please add the following line to cryosparc_worker/config.sh to use a global lock instead of a file lock:

export CRYOSPARC_CACHE_LOCK_STRATEGY="master"

Thanks and let me know if you have any questions!

Hi,

very cool! I have two questions concerning the new cache/lock strategy:

1. Do I still need the improved SSD cache flag (export CRYOSPARC_IMPROVED_SSD_CACHE=true)?
2. Do I need to reset the cache?

thanks for fixing

Florian

1. No, the new cache system is enabled by default, so you may remove that line.

2. I believe not; you do not need to reset the cache.