Inconsistent behaviour with cryoSPARC cluster mode

Hey guys,

So we have seen some inconsistent behaviour with cryoSPARC in cluster mode. When we let SLURM submit the job, it sometimes works and sometimes fails (see error output below), but when we run the job on a specific lane (without SLURM) it completes with no errors. Even more puzzling, the error output complains about EER fractions when this isn't even the right datatype (EER was never used in this dataset).

A few details:
Latest version of cryoSPARC 4.6.2.
No glaring errors in:
cryosparcm log command_core
or
cryosparcm log command_rtp

Traceback (most recent call last):
  File "/mnt/jobfs/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2304, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 136, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 137, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 620, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 566, in cryosparc_master.cryosparc_compute.engine.newengine.EngineThread.read_image_data
  File "/mnt/jobfs/cryosparc/cryosparc_worker/cryosparc_compute/ioengine/cmdbuf.py", line 87, in wait
    raise IOError('\n\n'.join(errs))
OSError: I/O error, mrc_readmic (1)  line 914: Invalid argument
The requested frame/particle cannot be accessed. The file may be corrupt, or there may be a mismatch between the file and its associated metadata (i.e. cryosparc .cs file).

I/O request details:
	filename:  /mnt/jobfs/ssd/instance_m3q000.massive.org.au:39001/links/P32-J253-1742217419/d6345b320d56a00ab9278ffb9825346a39389d46.mrcs
	data type: 0x10
	frames:    [211:212]
	eer upsample factor: 2
	eer number of fractions: 40

The error message refers to a cached file.
May I ask:

  1. Is /mnt/jobfs/ssd/instance_m3q000.massive.org.au:39001/ shared between multiple hosts? If so, you may want to ensure:
    • the shared cache storage significantly outperforms the storage used for project directories. Otherwise, caching may just increase job overhead without providing cache benefits.
    • the appropriate setting for CRYOSPARC_CACHE_LOCK_STRATEGY is being applied. If a non-default setting is required, it can be specified inside cryosparc_worker/config.sh:
      export CRYOSPARC_CACHE_LOCK_STRATEGY=master
      

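If it helps, the config change above can be applied idempotently so repeated runs do not duplicate the export line. A minimal sketch, assuming the worker install lives under the path shown in your traceback (the `ensure_lock_strategy` helper is a hypothetical name of mine, not part of cryoSPARC):

```shell
#!/bin/sh
# Idempotently add the cache lock strategy to a cryosparc_worker config.sh.
# The function name and default path below are assumptions for illustration.
ensure_lock_strategy() {
    config="$1"   # path to cryosparc_worker/config.sh
    # Only append if no export of this variable is already present
    if ! grep -q '^export CRYOSPARC_CACHE_LOCK_STRATEGY=' "$config" 2>/dev/null; then
        echo 'export CRYOSPARC_CACHE_LOCK_STRATEGY=master' >> "$config"
    fi
}

# Example invocation (path assumed from the traceback):
# ensure_lock_strategy /mnt/jobfs/cryosparc/cryosparc_worker/config.sh
```

Running it a second time leaves the file unchanged, which makes it safe to include in node provisioning scripts.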
If you suspect corruption of the cache, you may, when no CryoSPARC jobs are running, empty the

/mnt/jobfs/ssd/instance_m3q000.massive.org.au:39001/

directory.
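One way to empty the cache while keeping the directory itself in place is a sketch like the following (the `empty_cache` helper is a hypothetical name of mine; run it only while no CryoSPARC jobs are active):

```shell
#!/bin/sh
# Remove the contents of a cache directory but keep the directory itself.
# Guard against an empty or root argument before deleting anything.
empty_cache() {
    cache_dir="$1"
    [ -n "$cache_dir" ] && [ "$cache_dir" != "/" ] || return 1
    # -mindepth 1 keeps the top-level directory; -delete removes files
    # and subdirectories beneath it (GNU/BSD find)
    find "$cache_dir" -mindepth 1 -delete
}

# Example, using the path from the error message:
# empty_cache "/mnt/jobfs/ssd/instance_m3q000.massive.org.au:39001"
```

CryoSPARC will repopulate the cache from the project directories on the next job that uses SSD caching.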

The EER parameters in the error output are defaults that may appear even when non-EER data are being processed. You may ignore them in your case.