Errors while using the Particle cache

Hello,

Some of our cryosparc users have reported a problem where jobs sometimes fail when the particle cache is enabled. The job log shows the caching step itself completing:

--------------------------------------------------------------
SSD cache ACTIVE at /scratch/cryosparc_cache/instance_computer:39001 (10 GB reserve)
---------------------------------------------------
- Disk use - Amount - Cache use - Amount -
---------------------------------------------------
- Total - 5.82 TiB - Hits - 0.00 B -
- Usable - 5.81 TiB - Misses - 2.05 TiB -
- Used - 5.78 TiB - Acquired - 2.05 TiB -
- Free - 36.82 MiB - Required - 0.00 B -
---------------------------------------------------
Progress: [▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 13750/13750 (100%)
Transferred:
018445773618901923716_SomeThing_13543301_Data_13530043_10_20260114_174927_fractions_patch_aligned_doseweighted_particles.mrc
(108.00 MiB)
 Threads: 2
 Avg speed: 690.83 MiB/s
 Remaining: 0h 00m 00s (0.00 B)
 Elapsed: 0h 52m 18s
 Active jobs: P40-J106
SSD cache complete for 13750 file(s)
--------------------------------------------------------------

But then the job fails with:

Traceback (most recent call last):
 File "cli/run.py", line 105, in cli.run.run_job
 File "cli/run.py", line 210, in cli.run.run_job_function
 File "compute/jobs/class2D/run.py", line 255, in compute.jobs.class2D.run.run_class_2D
 File "/computer/cryosparc/cryosparc_worker/compute/particles.py", line 56, in get_prepared_fspace_data
 return fourier.resample_fspace(fourier.fft(self.get_prepared_real_data()), self.dataset.N)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/computer/cryosparc/cryosparc_worker/compute/particles.py", line 49, in get_prepared_real_data
 self.dataset.prepare_real_window * (self.get_original_real_data())
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/computer/cryosparc/cryosparc_worker/compute/particles.py", line 38, in get_original_real_data
 data = self.blob.view()
 ^^^^^^^^^^^^^^^^
 File "/computer/cryosparc/cryosparc_worker/compute/blobio/mrc.py", line 173, in view
 return self.get()
 ^^^^^^^^^^
 File "/computer/cryosparc/cryosparc_worker/compute/blobio/mrc.py", line 164, in get
 x, y, z, dtype, total_time, io_time, data = ioengine.sync_file_read(
 ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/computer/cryosparc/cryosparc_worker/core/ioengine/cmdbuf.py", line 187, in sync_file_read
 return await_async_file_read(iocb)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/computer/cryosparc/cryosparc_worker/core/ioengine/cmdbuf.py", line 159, in await_async_file_read
 iocb.wait()
 File "/computer/cryosparc/cryosparc_worker/core/ioengine/cmdbuf.py", line 104, in wait
 raise IOError("\n\n".join(errs))
OSError: I/O error, mrc_readmic (1) line 1023: Invalid argument
The requested frame/particle cannot be accessed. The file may be corrupt, or there may be a mismatch between the file and its associated metadata (i.e. cryosparc .cs file).
I/O request details:
 filename:
/scratch/cryosparc_cache/instance_computer:39001/links/P40-J106-1772625386/8e269673a7f71cb0673c5a296e5492bfdf82b16e.mrc
 data type: 0x10
 frames: [75:76]
 eer upsample factor: 2
 eer number of fractions: 40

Looking at the implicated symlink /scratch/cryosparc_cache/instance_computer:39001/links/P40-J106-1772625386/8e269673a7f71cb0673c5a296e5492bfdf82b16e.mrc, it points at a file in the store-v2 tree that has a non-zero size but occupies 0 blocks on disk, i.e. a sparse file containing no data (note the leading “0” in the ls -ls output):

ls -ls 8e269673a7f71cb0673c5a296e5492bfdf82b16e
0 -rw-r--r-- 1 cryosparc cryosparc 168100864 Mar  5 17:02 8e269673a7f71cb0673c5a296e5492bfdf82b16e
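
For reference, the same check can be scripted in a few lines of Python (a minimal sketch; the filename is the example file above, and st_blocks counts 512-byte units on Linux):

import os

# Example file from the ls -ls output above; substitute any cached .mrc file.
path = "8e269673a7f71cb0673c5a296e5492bfdf82b16e"

st = os.stat(path)
# A non-zero reported size with zero allocated blocks means the file is
# fully sparse, i.e. it contains no actual data.
if st.st_size > 0 and st.st_blocks == 0:
    print(f"{path}: {st.st_size} bytes reported, 0 blocks allocated")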

The cache is on an XFS filesystem backed by a ~6 TB software RAID0 of two NVMe devices. The filesystem is used only by this cryosparc instance, and there is only one computer in the instance.

Can someone help, please? We’re running cryosparc 5.0.1 on Rocky 8.10.

Thanks,

Mark

Welcome to the forum @MarkDD and thanks for the report.
Please can you let us know

  1. How prevalent are the zero block files?

    • very few among many others
    • all cached mrc files associated with the job
    • all cached mrc files associated with the project
  2. A description of the project storage for project P40:

    • local or network
    • filesystem type
    • storage device connection type, such as sata or nvme

If only a few cached files are affected, then until we have identified the cause of the zero block usage, you may try finding the zero-block files in the cache and removing them, with a command like

find /scratch/cryosparc_cache/instance_computer\:39001/store-v2/ -type f -printf "%b %p\n" | grep -E "^0 "
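
If you prefer to script this, here is an equivalent sketch in Python (the cache path is the one from your post; the removal is commented out so you can review the list first):

import os

# Cache root from the post above; adjust for your instance.
cache_root = "/scratch/cryosparc_cache/instance_computer:39001/store-v2"

for dirpath, _dirnames, filenames in os.walk(cache_root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        st = os.stat(path)
        # Non-zero reported size but zero allocated blocks: fully sparse.
        if st.st_size > 0 and st.st_blocks == 0:
            print(f"zero-block file: {path} ({st.st_size} bytes)")
            # os.remove(path)  # uncomment after reviewing the output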

Hello @wtempel, thank you so much for your reply : )

To answer your questions:

  1. How prevalent are the zero block files: At the moment, the cache contains 122,612 files, of which 106 have a non-zero size but occupy zero blocks. Their timestamps cluster around three times: Mar 8 17:00, Mar 10 13:11 and Mar 20 09:37. Not all files associated with the job are affected, and the affected files are associated with a variety of different jobs. Sorry, I’m not sure how to relate them back to a project without the cache job symlinks. We have a second cryosparc instance for a different set of projects; its cache has 36 zero block files, all with a timestamp of Mar 21 12:44, but no one has complained.
  2. A description of the project storage: Both cryosparc instances have two areas available, unshared local storage (XFS on SSDs behind a hardware RAID controller) and shared remote storage (Lustre over Ethernet, backed by hardware-RAID HDDs). The project we know is affected on the first instance uses the remote storage.

In case the project storage is significant: the logs on the cryosparc instance computers have been showing warnings that Lustre does not support fallocate, e.g.

LustreError: 11-0: lustre-OST000a-osc-ffff8edcc2a45800: operation ost_fallocate to node 10.42.0.12@o2ib failed: rc = -524

However, there are no zero block files on the project storage for the affected project.

In the meantime, I’ll try removing the zero block files in the cache.

Thanks,

Mark

Hi @MarkDD, thanks so much for providing all those details!

I’ve been investigating this; as far as I can tell, the issue seems to be caused by something going wrong on the remote Lustre file system, or perhaps by your specific configuration of it. When CryoSPARC copies a file from project storage to the SSD, the system appears to report the copy as complete even though zero bytes were actually copied.

Another less likely possibility is that an external program is somehow truncating the files on the cache, but we haven’t seen any evidence for this.

We’re going to make a change in a future version of CryoSPARC to better detect and handle this case, and we’ll let you know here when that’s available. In the meantime, please check your Lustre configuration for anything that may make files temporarily unavailable for read/copy operations (for reference, we use the Linux sendfile function via Python’s os.sendfile for fast transfers between file systems). If there’s nothing you can find, please continue removing the zero-block files as you find them.
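
For illustration, here is a minimal sketch of a sendfile-based copy with a post-copy sparseness check; this is not the exact code CryoSPARC runs, just the general pattern, and the verification step is one possible form the planned detection could take:

import os

def copy_with_check(src: str, dst: str) -> None:
    """Copy src to dst via os.sendfile, then verify blocks were allocated."""
    size = os.stat(src).st_size
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        offset = 0
        while offset < size:
            # sendfile returns the number of bytes actually transferred;
            # 0 before EOF would indicate a short or failed transfer.
            sent = os.sendfile(fdst.fileno(), fsrc.fileno(), offset, size - offset)
            if sent == 0:
                raise IOError(f"sendfile transferred 0 bytes at offset {offset}")
            offset += sent
    st = os.stat(dst)
    # A non-zero size with zero allocated blocks is the failure mode seen here.
    if st.st_size > 0 and st.st_blocks == 0:
        raise IOError(f"{dst}: {st.st_size} bytes reported but 0 blocks allocated")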

Thanks and let me know if you have any questions!