Cached files of running jobs deleted

@fbeck The message

indicates the search for files that can be deleted, but does not indicate the actual deletion.
Did you encounter any indications of actual deletion of cached files that were in use by a running job?

Yes, we have the problem that cached data of running jobs gets deleted.

best

Florian

Could you please provide additional details:

  • the CryoSPARC version
  • output of this icli command
    cryosparcm icli # open interactive cli session
    [(t['name'], t.get('cache_path', 'no_cache')) for t in cli.get_scheduler_targets()]
    
  • any indications, such as error messages, of cache files being deleted while a job is running
  • is /fs/pool/pool-briggs-scratch/cryosparc shared between worker nodes?

Cryosparc Version: 4.2.1

output of icli:
[(t['name'], t.get('cache_path', 'no_cache')) for t in cli.get_scheduler_targets()]
Out[1]:
[('hpcl8001', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl8002', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl8003', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl8004', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl9001', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl9002', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl7001', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl5005', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl5007', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl5008', '/fs/pool/pool-briggs-scratch/cryosparc'),
 ('hpcl5010', '/fs/pool/pool-briggs-scratch/cryosparc')]

/fs/pool/pool-briggs-scratch/cryosparc is shared between workers.
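
For reference, the icli output above can be checked for shared cache paths with a few lines of Python. This is only a minimal sketch that works on a pasted copy of the output (three of the eleven targets are repeated here); the helper name shared_cache_paths is made up for illustration and is not part of CryoSPARC.

```python
from collections import defaultdict

# Pasted from the icli output above (abbreviated to three targets).
targets = [
    ("hpcl8001", "/fs/pool/pool-briggs-scratch/cryosparc"),
    ("hpcl8002", "/fs/pool/pool-briggs-scratch/cryosparc"),
    ("hpcl9002", "/fs/pool/pool-briggs-scratch/cryosparc"),
]


def shared_cache_paths(targets):
    """Return {cache_path: [target names]} for paths configured on more than one target."""
    by_path = defaultdict(list)
    for name, cache_path in targets:
        # Skip targets without a cache ('no_cache' is the placeholder used above).
        if cache_path and cache_path != "no_cache":
            by_path[cache_path].append(name)
    return {path: names for path, names in by_path.items() if len(names) > 1}


for path, names in shared_cache_paths(targets).items():
    print(f"{path} is configured on {len(names)} targets: {', '.join(names)}")
```

Any path reported here is used by several targets at once, which is the configuration discussed in this thread.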

Failed Job event log

P84 J875: Ab-Initio Reconstruction
CryoSPARC Running Version: v4.2.1
Project Details
+-----------------+-----------------------------------------+
| DETAIL          | VALUE                                   |
+-----------------+-----------------------------------------+
| Title           | SARS-CoV2_Matrix                        |
| Description     | All the lovely and deadly CoV2 variants |
| ID              | P84                                     |
| Created by user | U20F U20L                               |
+-----------------+-----------------------------------------+
Job Details
+-----------------+------------------------------------------------------------------+
| DETAIL          | VALUE                                                            |
+-----------------+------------------------------------------------------------------+
| Title           | New Job J875                                                     |
| Description     | Enter a description.                                             |
| ID              | J875                                                             |
| Job type        | Ab-Initio Reconstruction                                         |
| Status          | failed                                                           |
| Created by user | U20F U20L                                                        |
| Created         | Fri Jun 30 2023 15:00:27 GMT+0200 (Central European Summer Time) |
| Queued          | Fri Jun 30 2023 15:01:34 GMT+0200 (Central European Summer Time) |
| Launched        | Fri Jun 30 2023 15:01:35 GMT+0200 (Central European Summer Time) |
| Started         | Fri Jun 30 2023 15:01:43 GMT+0200 (Central European Summer Time) |
| Waiting         | -                                                                |
| Killed          | -                                                                |
| Completed       | -                                                                |
| Failed          | Fri Jun 30 2023 15:36:43 GMT+0200 (Central European Summer Time) |
| Last accessed   | Tue Jul 11 2023 10:38:29 GMT+0200 (Central European Summer Time) |
| Size            | 0 Bytes                                                          |
+-----------------+------------------------------------------------------------------+
Inputs
particles
J874.split_0
+-------------+-----------------------------+
| SLOT        | VALUE                       |
+-------------+-----------------------------+
| BLOB        | J874.split_0.blob.F         |
| CTF         | J874.split_0.ctf.F          |
| PASSTHROUGH | J874.split_0.location.F     |
| PASSTHROUGH | J874.split_0.alignments2D.F |
+-------------+-----------------------------+
Parameters
Particle preprocessing
+-----------------------------+-------+---------+------+----------+
| PARAMETER                   | VALUE | DEFAULT | SPEC | ADVANCED |
+-----------------------------+-------+---------+------+----------+
| Window dataset (real-space) | Set   | X       | X    |          |
| Window inner radius         | 0.85  | X       | X    | X        |
| Window outer radius         | 0.99  | X       | X    | X        |
+-----------------------------+-------+---------+------+----------+
Ab-Initio reconstruction
+--------------------------------------------+-----------+---------+------+----------+
| PARAMETER                                  | VALUE     | DEFAULT | SPEC | ADVANCED |
+--------------------------------------------+-----------+---------+------+----------+
| Number of Ab-Initio classes                | 1         |         |      |          |
| Num particles to use                       | Not Set   | X       | X    |          |
| Maximum resolution (Angstroms)             | 12        | X       | X    |          |
| Initial resolution (Angstroms)             | 35        | X       | X    |          |
| Number of initial iterations               | 200       |         |      | X        |
| Number of final iterations                 | 300       |         |      | X        |
| Fourier radius step                        | 0.04      | X       | X    | X        |
| Window structures in real space            | Set       | X       | X    | X        |
| Center structures in real space            | Set       | X       | X    | X        |
| Correct for per-micrograph optimal scales  | Not set   | X       | X    | X        |
| Compute per-image optimal scales           | Not set   | X       | X    | X        |
| SGD Momentum                               | 0         | X       | X    | X        |
| Sparsity prior                             | 0         | X       | X    | X        |
| Initial minibatch size                     | 90        | X       | X    | X        |
| Final minibatch size                       | 300       | X       | X    | X        |
| Abinit minisize epsilon                    | 0.05      | X       | X    | X        |
| Abinit minisize minp                       | 0.01      | X       | X    | X        |
| Initial minibatch size num iters           | 300       | X       | X    | X        |
| Noise model (white, symmetric or coloured) | symmetric | X       | X    | X        |
| Noise priorw                               | 50        | X       | X    | X        |
| Noise initw                                | 5000      | X       | X    | X        |
| Noise initial sigma-scale                  | Not Set   | X       | X    | X        |
| Class similarity                           | 0.1       | X       | X    |          |
| Class similarity anneal start iter         | 300       | X       | X    | X        |
| Class similarity anneal end iter           | 350       | X       | X    | X        |
| Target 3D ESS Fraction                     | 0.011     | X       | X    | X        |
| Symmetry                                   | C1        | X       | X    | X        |
| Initial learning rate duration             | 100       |         |      | X        |
| Initial learning rate                      | 0.4       |         |      | X        |
| Enforce non-negativity                     | Set       | X       | X    | X        |
| Ignore DC component                        | Set       | X       | X    | X        |
| Initial structure random seed              | Not Set   | X       | X    | X        |
| Initial structure lowpass (Fourier radius) | 7         | X       | X    | X        |
| Use fast codepaths                         | Set       | X       | X    | X        |
| Show plots from intermediate steps         | Set       | X       | X    |          |
+--------------------------------------------+-----------+---------+------+----------+
Random seeds
+-------------+---------+---------+------+----------+
| PARAMETER   | VALUE   | DEFAULT | SPEC | ADVANCED |
+-------------+---------+---------+------+----------+
| Random seed | Not Set | X       | X    | X        |
+-------------+---------+---------+------+----------+
Compute settings
+------------------------------+-------+---------+------+----------+
| PARAMETER                    | VALUE | DEFAULT | SPEC | ADVANCED |
+------------------------------+-------+---------+------+----------+
| Cache particle images on SSD | Set   | X       | X    |          |
+------------------------------+-------+---------+------+----------+
Outputs
+-----------------------+----------------------------+----------------------------+
| OUTPUT                | SLOT                       | VALUE                      |
+-----------------------+----------------------------+----------------------------+
| particles_all_classes | BLOB                       | J875.particle.blob         |
| particles_all_classes | CTF                        | J875.particle.ctf          |
| particles_all_classes | ALIGNMENTS_CLASS_0         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_1         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_2         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_3         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_4         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_5         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_6         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_7         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_8         | J875.particle.alignments3D |
| particles_all_classes | ALIGNMENTS_CLASS_9         | J875.particle.alignments3D |
| particles_all_classes | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_all_classes | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| particles_class_0     | BLOB                       | J875.particle.blob         |
| particles_class_0     | CTF                        | J875.particle.ctf          |
| particles_class_0     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_0     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_0     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_0        | MAP                        | J875.volume.blob           |
| particles_class_1     | BLOB                       | J875.particle.blob         |
| particles_class_1     | CTF                        | J875.particle.ctf          |
| particles_class_1     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_1     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_1     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_1        | MAP                        | J875.volume.blob           |
| particles_class_2     | BLOB                       | J875.particle.blob         |
| particles_class_2     | CTF                        | J875.particle.ctf          |
| particles_class_2     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_2     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_2     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_2        | MAP                        | J875.volume.blob           |
| particles_class_3     | BLOB                       | J875.particle.blob         |
| particles_class_3     | CTF                        | J875.particle.ctf          |
| particles_class_3     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_3     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_3     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_3        | MAP                        | J875.volume.blob           |
| particles_class_4     | BLOB                       | J875.particle.blob         |
| particles_class_4     | CTF                        | J875.particle.ctf          |
| particles_class_4     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_4     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_4     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_4        | MAP                        | J875.volume.blob           |
| particles_class_5     | BLOB                       | J875.particle.blob         |
| particles_class_5     | CTF                        | J875.particle.ctf          |
| particles_class_5     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_5     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_5     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_5        | MAP                        | J875.volume.blob           |
| particles_class_6     | BLOB                       | J875.particle.blob         |
| particles_class_6     | CTF                        | J875.particle.ctf          |
| particles_class_6     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_6     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_6     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_6        | MAP                        | J875.volume.blob           |
| particles_class_7     | BLOB                       | J875.particle.blob         |
| particles_class_7     | CTF                        | J875.particle.ctf          |
| particles_class_7     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_7     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_7     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_7        | MAP                        | J875.volume.blob           |
| particles_class_8     | BLOB                       | J875.particle.blob         |
| particles_class_8     | CTF                        | J875.particle.ctf          |
| particles_class_8     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_8     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_8     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_8        | MAP                        | J875.volume.blob           |
| particles_class_9     | BLOB                       | J875.particle.blob         |
| particles_class_9     | CTF                        | J875.particle.ctf          |
| particles_class_9     | ALIGNMENTS3D               | J875.particle.alignments3D |
| particles_class_9     | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_class_9     | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
| volume_class_9        | MAP                        | J875.volume.blob           |
| particles_unused      | BLOB                       | J875.particle.blob         |
| particles_unused      | CTF                        | J875.particle.ctf          |
| particles_unused      | LOCATION (PASSTHROUGH)     | J875.particle.location     |
| particles_unused      | ALIGNMENTS2D (PASSTHROUGH) | J875.particle.alignments2D |
+-----------------------+----------------------------+----------------------------+
Target
hpcl9002 (node)

[2023-06-30 15:01:35.84] License is valid.
[2023-06-30 15:01:35.84] Launching job on lane h9002-chkGPU target hpcl9002 …
[2023-06-30 15:01:35.90] Running job on remote worker node hostname hpcl9002
[2023-06-30 15:01:43.63] [CPU: 170.3 MB] [Avail: 986.95 GB] Job J875 Started
[2023-06-30 15:01:43.69] [CPU: 170.5 MB] [Avail: 986.95 GB] Master running v4.2.1, worker running v4.2.1
[2023-06-30 15:01:43.72] [CPU: 170.5 MB] [Avail: 986.95 GB] Working in directory: /fs/pool/pool-cryosparc/users/user20/P84/J875
[2023-06-30 15:01:43.73] [CPU: 170.5 MB] [Avail: 986.95 GB] Running on lane h9002-chkGPU
[2023-06-30 15:01:43.73] [CPU: 170.5 MB] [Avail: 986.95 GB] Resources allocated:
[2023-06-30 15:01:43.74] [CPU: 170.5 MB] [Avail: 986.94 GB] Worker: hpcl9002
[2023-06-30 15:01:43.74] [CPU: 170.5 MB] [Avail: 986.94 GB] CPU : [4, 5]
[2023-06-30 15:01:43.74] [CPU: 170.5 MB] [Avail: 986.94 GB] GPU : [1]
[2023-06-30 15:01:43.75] [CPU: 170.5 MB] [Avail: 986.94 GB] RAM : [3]
[2023-06-30 15:01:43.75] [CPU: 170.5 MB] [Avail: 986.94 GB] SSD : True
[2023-06-30 15:01:43.76] [CPU: 170.5 MB] [Avail: 986.94 GB] --------------------------------------------------------------
[2023-06-30 15:01:43.76] [CPU: 170.5 MB] [Avail: 986.94 GB] Importing job module for job type homo_abinit…
[2023-06-30 15:01:47.28] [CPU: 255.4 MB] [Avail: 986.84 GB] Job ready to run
[2023-06-30 15:01:47.29] [CPU: 255.4 MB] [Avail: 986.84 GB] ***************************************************************
[2023-06-30 15:01:47.76] [CPU: 303.8 MB] [Avail: 986.79 GB] Using random seed for sgd of 269567159
[2023-06-30 15:01:47.77] [CPU: 313.6 MB] [Avail: 986.78 GB] Loading a ParticleStack with 100000 items…
[2023-06-30 15:01:50.25] [CPU: 313.7 MB] [Avail: 986.79 GB] SSD cache : cache successfully synced in_use
[2023-06-30 15:03:04.52] [CPU: 356.0 MB] [Avail: 986.08 GB] SSD cache : cache successfully synced, found 17753061.65MB of files on SSD.
[2023-06-30 15:03:07.74] [CPU: 356.0 MB] [Avail: 986.10 GB] SSD cache : cache successfully requested to check 2984 files.
[2023-06-30 15:03:11.56] [CPU: 356.0 MB] [Avail: 986.12 GB] SSD cache : cache requires 229.79MB more on the SSD for files to be downloaded.
[2023-06-30 15:03:37.89] [CPU: 356.0 MB] [Avail: 986.16 GB] SSD cache : cache has enough available space.
[2023-06-30 15:03:37.90] [CPU: 356.0 MB] [Avail: 986.16 GB] Transferring J244/extract/012702829538677884369_FoilHole_5257542_Data_3069349_3069351_20211106_002759_EER_particles.mrc (1 MB) (2029/2984)
  Complete : 230 MB (1.19%)
  Total : 19314 MB
  Current Speed : 26.15 MB/s
  Average Speed : 58.94 MB/s
  ETA : 0h 5m 23s
[2023-06-30 15:03:41.98] [CPU: 356.0 MB] [Avail: 986.11 GB] SSD cache : complete, all requested files are available on SSD.
[2023-06-30 15:04:09.73] [CPU: 388.1 MB] [Avail: 986.04 GB] Done.
[2023-06-30 15:04:09.74] [CPU: 388.2 MB] [Avail: 986.04 GB] Windowing particles
[2023-06-30 15:04:09.74] [CPU: 388.2 MB] [Avail: 986.04 GB] Done.
[2023-06-30 15:04:09.75] [CPU: 388.2 MB] [Avail: 986.04 GB] Using 10 classes.
[2023-06-30 15:04:09.85] [CPU: 483.3 MB] [Avail: 985.95 GB] Computing Ab-Initio Structure:
[2023-06-30 15:04:09.85] [CPU: 483.3 MB] [Avail: 985.95 GB] Volume Size: 64 (voxel size

6.819 S: 8.979 Class Size: 11.3% (Average: 13.9%)

[2023-06-30 15:36:40.73] [CPU: 1.19 GB] [Avail: 985.07 GB] -- Class 8 -- lr: 0.20 eps: 5.52 step ratio : 0.4091 ESS R: 6.960 S: 9.072 Class Size: 11.5% (Average: 10.6%)
[2023-06-30 15:36:40.75] [CPU: 1.19 GB] [Avail: 985.07 GB] -- Class 9 -- lr: 0.20 eps: 5.52 step ratio : 0.2611 ESS R: 9.060 S: 9.712 Class Size: 4.7% (Average: 8.4%)
[2023-06-30 15:36:41.69] [CPU: 1.21 GB] [Avail: 985.07 GB] Done iteration 00507 of 01379 in 6.153s. Total time 1948.1s. Est time remaining 5868.6s.
[2023-06-30 15:36:41.74] [CPU: 1.21 GB] [Avail: 985.07 GB] ----------- Iteration 508 (epoch 0.897). radwn 18.32 resolution 13.00A minisize 300 beta 0.00
[2023-06-30 15:36:43.88] [CPU: 1.13 GB] [Avail: 985.13 GB] Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/abinit/run.py", line 309, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1142, in cryosparc_compute.engine.engine.process
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1143, in cryosparc_compute.engine.engine.process
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1028, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 87, in cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.2.1/cryosparc_worker_hpcl900x/cryosparc_compute/particles.py", line 33, in get_original_real_data
    return self.blob.view().copy()
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.2.1/cryosparc_worker_hpcl900x/cryosparc_compute/blobio/mrc.py", line 127, in view
    return self.get()
  File "/fs/gpfs41/lv07/fileset02/home/b_baumei/cryosparcuser/csV4.2.1/cryosparc_worker_hpcl900x/cryosparc_compute/blobio/mrc.py", line 122, in get
    _, data, total_time = prefetch.synchronous_native_read(self.fname, idx_start = self.page, idx_limit = self.page+1)
  File "cryosparc_master/cryosparc_compute/blobio/prefetch.py", line 68, in cryosparc_compute.blobio.prefetch.synchronous_native_read
RuntimeError: Error ocurred (No such file or directory) at line 548 in fopen

Could not open file. Make sure the path exists.

IO request details:
filename: /fs/pool/pool-briggs-scratch/cryosparc/instance_brcryosparc:38001/projects/P84/J244/extract/017623285098820666075_FoilHole_6220816_Data_3069346_3069348_20211106_022130_EER_particles.mrc
filetype: 0
header_only: 0
idx_start: 31
idx_limit: 32
eer_upsampfactor: 2
eer_numfractions: 40
num_threads: 6
buffer: (nil)
nx, ny, nz: 0 0 0
dtype: 0
total_time: -1.000000
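
To confirm which cached file went missing, the "IO request details" part of the error can be split into fields and the reported path checked on the worker node. The following is a minimal sketch, assuming the error text has been copied with one field per line; parse_io_request_details is a hypothetical helper, not a CryoSPARC function, and only a few of the fields are repeated in the sample.

```python
import os

# Excerpt of the error text from the event log, one field per line.
ERROR_TEXT = """Could not open file. Make sure the path exists.

IO request details:
filename: /fs/pool/pool-briggs-scratch/cryosparc/instance_brcryosparc:38001/projects/P84/J244/extract/017623285098820666075_FoilHole_6220816_Data_3069346_3069348_20211106_022130_EER_particles.mrc
filetype: 0
idx_start: 31
idx_limit: 32
"""


def parse_io_request_details(error_text):
    """Return the 'key: value' pairs that follow 'IO request details:' as a dict."""
    _, _, details = error_text.partition("IO request details:")
    fields = {}
    for line in details.splitlines():
        # Split on the first colon only: the filename itself contains ':38001'.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


fields = parse_io_request_details(ERROR_TEXT)
missing = fields["filename"]
# Meaningful only when run on the worker node that reported the error.
state = "exists" if os.path.exists(missing) else "is missing from the cache"
print(f"{missing} {state}")
```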

Thanks @fbeck for posting additional details.
The CryoSPARC cache system does not currently support a configuration where

  1. multiple connected workers share the same cache_path attribute
  2. and the underlying cache storage is shared

We propose the following workaround (execute commands under the Linux account that runs CryoSPARC processes):

  1. stop CryoSPARC (this would disrupt currently running CryoSPARC jobs)
  2. delete the instance_${CRYOSPARC_MASTER_HOSTNAME}:$[${CRYOSPARC_BASE_PORT}+1] subdirectory inside the /fs/pool/pool-briggs-scratch/cryosparc directory
  3. for each worker,
     • create a cache subdirectory named after the worker's hostname, like
       mkdir /fs/pool/pool-briggs-scratch/cryosparc/hpcl8001
     • update the cache_path property, like
       cryosparcm cli "set_scheduler_target_property('hpcl8001', 'cache_path', '/fs/pool/pool-briggs-scratch/cryosparc/hpcl8001')"
  4. cryosparcm start
  5. reset the cache system (see guide).
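
With eleven workers, the per-worker commands in step 3 are repetitive, so they can be generated rather than typed. The following is a minimal sketch that only prints the two commands shown above for each worker and does not run anything; the function name is made up for illustration, and it assumes hostnames and paths contain no spaces or quotes.

```python
CACHE_ROOT = "/fs/pool/pool-briggs-scratch/cryosparc"

# Target names as reported by cli.get_scheduler_targets() earlier in the thread.
WORKERS = [
    "hpcl8001", "hpcl8002", "hpcl8003", "hpcl8004",
    "hpcl9001", "hpcl9002", "hpcl7001",
    "hpcl5005", "hpcl5007", "hpcl5008", "hpcl5010",
]


def per_worker_cache_commands(hostnames, cache_root=CACHE_ROOT):
    """Build (but do not run) the mkdir and cryosparcm commands for each worker."""
    commands = []
    for host in hostnames:
        cache_dir = f"{cache_root}/{host}"
        cli_call = f"set_scheduler_target_property('{host}', 'cache_path', '{cache_dir}')"
        commands.append(f"mkdir {cache_dir}")
        commands.append(f'cryosparcm cli "{cli_call}"')
    return commands


for command in per_worker_cache_commands(WORKERS):
    print(command)
```

The printed lines can be reviewed and then run under the Linux account that runs CryoSPARC processes.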

Do these steps resolve the issue?

Will this change in 4.3?

This will likely not change in v4.3.

Hi,

but having an individual cache for each worker would remove all the advantages of a
shared cache (storage size, network traffic, etc.).
What exactly is the problem with the shared cache?
(For smaller datasets it works.)
As we have a lot of big projects, this would really be a problem for us.
So if there is a solution in the near future, it would really help us.

Is it the same for jobs submitted to a Slurm cluster?
Can they share the cache?

thanks for helping

Florian

I suspect that the problem arises when the cache is full: one job tries to free some space and deletes files belonging to other jobs, some of which may still be running.
I haven't seen this become a problem yet, probably because we have quite a large cache quota for each user (10 TB) and two-week retention, and because we migrated to a new cluster only a few months ago (on the previous machine we didn't use the cache), but it's only a matter of time. Huge +1 for supporting cache space shared between workers.
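
The suspicion above can be illustrated with a toy model. This is purely a sketch of the hypothesis, not CryoSPARC's cache code: it assumes (hypothetically) that each worker only knows which cached files its own jobs are using, so when space runs short it may evict files that a job on another worker still needs. All class and method names are invented for the illustration.

```python
class ToySharedCache:
    """Shared cache directory modelled as {filename: size_mb}."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.files = {}

    def used_mb(self):
        return sum(self.files.values())


class ToyWorker:
    """Hypothetical worker that tracks only the files used by its OWN jobs."""

    def __init__(self, name, cache):
        self.name = name
        self.cache = cache
        self.in_use = set()

    def cache_files(self, wanted):
        """Copy wanted files in, evicting files this worker believes are unused."""
        needed = sum(size for f, size in wanted.items() if f not in self.cache.files)
        evicted = []
        for f in list(self.cache.files):
            if self.cache.used_mb() + needed <= self.cache.capacity_mb:
                break  # enough room, stop evicting
            if f not in self.in_use and f not in wanted:
                del self.cache.files[f]
                evicted.append(f)
        self.cache.files.update(wanted)
        self.in_use.update(wanted)
        return evicted


cache = ToySharedCache(capacity_mb=100)
worker_a = ToyWorker("worker_a", cache)
worker_b = ToyWorker("worker_b", cache)

worker_a.cache_files({"a1.mrc": 60})            # job on worker_a starts
evicted = worker_b.cache_files({"b1.mrc": 60})  # job on worker_b needs space
missing = [f for f in worker_a.in_use if f not in cache.files]
print(evicted)  # ['a1.mrc']
print(missing)  # ['a1.mrc']
```

In this model the second worker evicts a1.mrc even though the first worker still lists it as in use, which would surface as a "No such file or directory" error on the first worker's next read.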

Our use case for that is NVMe-based Lustre, which is much smaller globally (and has global data retention) than HDD-based Lustre, so whole projects cannot be stored there.

Hi,

I’m pretty sure that there was still enough space.
Is there a way to check (from the log) why cached files get removed?
Can I configure it so that only cached files that are not in use get removed?
How does the master process decide which files get removed?

thank you so much for your help

Florian

The problem arises from the combination of how the current cache system tracks cached files and how it clears them for deletion. We are looking for a design that supports the shared cache case without deteriorating support for the host-specific cache case.

Hello again :)

Does the new caching subsystem in 4.4 support the case of a cache location shared between workers? The description of the feature isn’t very detailed. Can you elaborate a bit on what has changed?

@bsobol The new cache system uses different logic for tracking cached files than the older cache system. The new cache logic is compatible with cache storage that is shared between workers.

That’s great news. Thanks!

Dear Cryosparc Team,

I tried CRYOSPARC_CACHE_LOCK_STRATEGY="master", but we still get the "No such file or directory" error as soon as more than 4 big jobs are running in parallel.
We have a central scratch for all workers, with GPFS as the filesystem.

- Is there anything we can do to improve the situation?
- Could you give some insight into where the problem with parallel filesystems comes from? (They also provide POSIX locks.)
- Do you know of any site where CryoSPARC is running with a central cache and many parallel jobs/users?

cat cryosparc_worker_hpcl930x/config.sh

export CRYOSPARC_LICENSE_ID="xxxx"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CACHE_LOCK_STRATEGY="master"
export CRYOSPARC_IMPROVED_SSD_CACHE=true
export CRYOSPARC_CACHE_NUM_THREADS=12

thanks

Florian

Yes, this happens to us as well. People simply restart jobs, and at times they go through. But we have also had cases where jobs failed due to a caching error after running for 30 hours.

We have a mix of local GPU nodes and cluster nodes, but they share a central scratch for all workers. This happens on the cluster nodes as well as on the local GPU boxes that are not part of the submission system.

Kindly advise how to tackle this issue.

@fbeck @Rajan What version of CryoSPARC do you use? Do the jobs with cache-related errors involve imported particles as described in Particles cache issues - #18 by nfrasser ?
@Rajan Do your cluster and "local" nodes share the same cryosparc_worker/ installation? What is the output of the command

/path/to/cryosparc_worker/bin/cryosparcw env | grep LOCK

for each independent cryosparc_worker/ installation?

Hi

We run CryoSPARC version 4.5.1.
Most of the jobs have particles imported from RELION.
But the error also occurs for jobs that run only in CryoSPARC
(no particle import).
We also just get the "particle not found" error, and it takes a very long
time until the job starts.

best

Florian

Worker 1:
./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"
Worker 2:
./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"
Worker 3:
./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

Hi,

Our CryoSPARC version is 4.5.3.

Worker 1: local nodes (2 nodes; they share the same cryosparc_worker installation)

./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

Worker 2: cluster nodes (one installation for all nodes)

./bin/cryosparcw env | grep LOCK
export "CRYOSPARC_CACHE_LOCK_STRATEGY=master"

The SSD cache is common to the local nodes and the cluster.

Thanks for all the help. I hope we can resolve it soon.

Best
Rajan

@Rajan @fbeck When you next encounter an error due to a file missing from the cache, could you please run the command
cryosparcm snaplogs and email us the tgz file that the command produces? I will send you a direct message with the email address.