SSD cache clear not working on SLURM and v5.0

Hi cryoSPARC team,

We have encountered an issue with the SSD cache, and I am slightly puzzled. Other topics here were unfortunately not helpful.

Our system:

  • cryoSPARC v5.0.1
  • SLURM cluster with 2 different partitions
  • Each node in each partition has ~6.3T SSD cache available

This is the error log:

Loading a ParticleStack with 48439 items...

──────────────────────────────────────────────────────────────
SSD cache ACTIVE at /scratch/cryosparc/instance_login:61001
  Checking and allocating files on SSD ...

Encountered OS error while caching: [Errno 28] No space left on device: '/scratch/cryosparc/instance_login:61001/links/P10-J44-1773148660'; dumping info

Traceback (most recent call last):
  File "cli/run.py", line 105, in cli.run.run_job
  File "cli/run.py", line 210, in cli.run.run_job_function
  File "compute/jobs/helix/run_refine.py", line 250, in compute.jobs.helix.run_refine.run
  File "/opt/cryosparc/cryosparc_worker/compute/particles.py", line 152, in read_blobs
    u_blob_paths = cache.run(rc, u_rel_paths)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 935, in run
    return run_with_executor(rc, rel_sources, executor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 978, in run_with_executor
    state = drive.allocate(sources, active_run_ids=info.active_run_ids)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 593, in allocate
    self.setup_dirs()
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 490, in setup_dirs
    self.links_run_dir.mkdir(exist_ok=True)
  File "/opt/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/pathlib.py", line 1311, in mkdir
    os.mkdir(self, mode)
OSError: [Errno 28] No space left on device: '/scratch/cryosparc/instance_login:61001/links/P10-J44-1773148660'

So I checked the worker node:

[root@gpubig02 scratch]# df -hl /scratch/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  7.0T  7.0T   48K 100% /scratch
[root@gpubig02 scratch]# du -hd1 /scratch
6.3T	/scratch/cryosparc
627G	/scratch/alphafold_databases
6.9T	/scratch
[root@gpubig02 scratch]# getfacl /scratch
getfacl: Removing leading '/' from absolute path names
# file: scratch
# owner: root
# group: root
user::rwx
group::rwx
other::rwx

[root@gpubig02 scratch]# getfacl /scratch/cryosparc
getfacl: Removing leading '/' from absolute path names
# file: scratch/cryosparc
# owner: cryosparc
# group: cryosparc
user::rwx
group::r-x
other::r-x

So as you can see, the majority of the space is occupied by cryoSPARC.

Also verified:

  • no other cryosparc job running
  • input data small enough for caching (112G)

So I looked into the cache:

  • there is only a single subfolder (instance_login:61001)
  • all the files are owned by the cryosparc user, and all files are older than 10 days:
[root@gpubig02 store-v2]# ls -ltrah /scratch/cryosparc/instance_login:61001/store-v2/* | awk '{print $3","$4","$6","$7}' | sort | uniq
#User,Group,Month,Day
cryosparc,cryosparc,Feb,10
cryosparc,cryosparc,Feb,12
cryosparc,cryosparc,Feb,13
cryosparc,cryosparc,Feb,14
cryosparc,cryosparc,Feb,18
cryosparc,cryosparc,Feb,19
cryosparc,cryosparc,Feb,21
cryosparc,cryosparc,Feb,23
cryosparc,cryosparc,Feb,25
cryosparc,cryosparc,Feb,26
cryosparc,cryosparc,Feb,27
cryosparc,cryosparc,Feb,5
cryosparc,cryosparc,Feb,6
cryosparc,cryosparc,Feb,9
[root@gpubig02 store-v2]# date
Tue Mar 10 02:33:59 PM CET 2026

To my understanding, cryosparc should be able to overwrite these old files.

Here is also our master config:

# Instance Configuration
export CRYOSPARC_LICENSE_ID="REDACTED"
export CRYOSPARC_MASTER_HOSTNAME="login"
export CRYOSPARC_DB_PATH="/opt/cryosparc/cryosparc_database/"
export CRYOSPARC_BASE_PORT=61000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000

# Security
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true

# Cluster Integration
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000

# Project Configuration
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'

# Development
export CRYOSPARC_DEVELOP=false

# Other
export CRYOSPARC_CLICK_WRAP=true

and cluster_info.json:

{
    "name": "slurm-6000ada",
    "worker_bin_path": "/opt/cryosparc/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/scratch/cryosparc/",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "/usr/bin/sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "/usr/bin/squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "/usr/bin/scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "/usr/bin/sinfo"
}

And cluster_script.sh:

#!/usr/bin/env bash
#SBATCH --chdir={{ job_dir_abs }}
#SBATCH --export=NONE
#SBATCH --partition=big
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ [128, (4 * ram_gb) | int] | max }}G
#SBATCH --comment="created by {{ cryosparc_username }}"
#SBATCH --output={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.out
#SBATCH --error={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.err

{{ run_cmd }}


sleep 5  # accounting can lag slightly

sacct -j "$SLURM_JOB_ID" \
  --format=JobID,Elapsed,MaxRSS,ReqMem,AllocCPUS \
  > "{{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm_usage.txt"

If any more input is needed, I will happily supply it.

Best
Christian

edit:

rm -rf /scratch/cryosparc/instance_login\:61001

After manually clearing the cache, the job runs just as expected.

Is it safe to assume that the contents of /scratch/alphafold_databases did not change while the affected CryoSPARC job was running?

@ctueting Could you please check whether adding

"cache_quota_mb": 10000,

(with the trailing comma only if this is not the final item inside the JSON {} object) and subsequently reconnecting the cluster to CryoSPARC fixes this problem?

Yes, the alphafold_database is a static folder (the genomic databases for MSA generation in the AF3 pipeline).

The

"cache_reserve_mb": 10000,

is added to both SLURM partition definitions.
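
For completeness, this is roughly how the entry now sits in the cluster_info.json posted above (abbreviated sketch showing only the relevant keys; the remaining fields are unchanged, and the same edit went into the second partition's file):

{
    "name": "slurm-6000ada",
    "cache_path": "/scratch/cryosparc/",
    "cache_reserve_mb": 10000
}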

Unfortunately, I cleared the whole cache manually, so I need to queue a few jobs to fill it up again.

I will report once this is done.
[edited 2026-03-10 to correct variable name]

Did you really mean 10000 MB? Because now jobs fail, as the limit is 10 GB.

> Mar 10, 2026, 5:54:27 PM

Traceback (most recent call last):
  File "cli/run.py", line 105, in cli.run.run_job
  File "cli/run.py", line 210, in cli.run.run_job_function
  File "compute/jobs/class3D/run.py", line 74, in compute.jobs.class3D.run.run_class_3D
  File "/opt/cryosparc/cryosparc_worker/compute/particles.py", line 152, in read_blobs
    u_blob_paths = cache.run(rc, u_rel_paths)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 935, in run
    return run_with_executor(rc, rel_sources, executor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 978, in run_with_executor
    state = drive.allocate(sources, active_run_ids=info.active_run_ids)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 612, in allocate
    raise RuntimeError(
RuntimeError: SSD cache needs additional 130.17 GiB but drive can only be filled to 9.31 GiB. Please disable SSD cache for this job. 

EDIT2:
So, I changed this value to 6300000, as 6,300,000 MB roughly corresponds to the 6.3T available on the SSD.
And I figured out that the error might be completely in line with the cryoSPARC settings.
In the master config.sh, there was no CRYOSPARC_SSD_CACHE_LIFETIME_DAYS set, meaning that cache data remains for 30 days. The oldest files on this drive were from Feb 10, which is 28 days ago. So the job was not able to clear this data - by its own policy.

I thought that even below this age, new jobs would overwrite the old files, as intuitively it's better to re-cache than to not be able to run a cacheable job.

Anyway, I set the value to 14, so data will be overwritten after 2 weeks.
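
For reference, this is roughly the line I added (a sketch only; I assume it belongs alongside the other exports in the master config.sh shown above, and I believe a cryosparcm restart is needed for it to take effect):

# cached data older than this many days may be cleared (default: 30)
export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=14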

I did, but stated the incorrect variable name :wink: (now corrected). The correct name is cache_reserve_mb.

Not necessarily. Particles that are not in use by an active job may be deleted sooner if space is required for other particles (guide).

I will add cache_reserve_mb set to 10000 and monitor the progress.

That was also my impression, that data is deleted even before CACHE_LIFETIME_DAYS is reached - that's the reason I opened this issue in the first place, as the error suggested that this is unfortunately not the case.

Thanks for the help anyways, and as soon as I have some more solid data, I will come back and report.

Edit:
I have a small additional question, based on your post some years ago LINK.
So this value is meant to reserve some space for other applications? But if I set cache_quota_mb to a value appropriately lower than the entire space, there should inherently be some space left for other applications. In our case, the AF3 databases sit on the same drive, but are untouchable by cryoSPARC itself due to their permissions.

So, I still have a hard time understanding how these values affect cache clearing itself; they just alter the available space.
Or is my thinking wrong, and internally it's this logic:

  1. cryoSPARC sees a 7T cache (even though 700G are out of scope due to the AF3 db)
  2. 6.3T is currently used by cryoSPARC
  3. the logic says: there must still be free space, and I am not clearing old cached data, as it is younger than the max lifetime?

Hi @ctueting, I can provide some insight here.

CryoSPARC checks the total free space on disk before caching files, not just the space that it’s using. So it assumes the 700GB in-use by AF3 is unavailable, and instead removes previously-cached files until enough total disk space is available to cache. If a quota and/or reserve are set, it deletes more cached files to respect these settings.

Here’s a comprehensive overview of how the cache system allocates space:

  • CryoSPARC tries to use the maximum amount of SSD space available to it
  • CryoSPARC always checks that a cache operation does not exceed available space on disk before performing it
    • If it would, it deletes its own files until there is enough space
    • If deleting unused files would not produce enough free space for the current job, it fails
    • CryoSPARC accounts for the space in-use by other applications to determine how much space it can use
  • CryoSPARC will never fill the SSD such that the amount of free space is smaller than the reserve (default 10GB)
    • If a cache operation would exceed the reserve, it first deletes files until there’s enough space for the given files + the reserve
    • If other applications have filled the disk so that the reserve has shrunk to below its specified size, CryoSPARC deletes enough unused files so that the reserve is restored once the cache operation is complete.
    • If deleting all cached files (besides those required by the current job) would not maintain the reserve (i.e., because the current files are too large, or other applications are using up most space), the job fails
  • If a quota is specified, CryoSPARC will never use more disk space than the specified amount.
  • Once CryoSPARC starts copying files to cache, it is not aware of other applications writing to cache, so it may fail with “No space left on device” in those cases.

In short, barring insufficient reserve/quota, if at the start of the job it’s possible to free up enough cached files to hold the current particle stack AND no other applications are actively writing to the cache, then the job will cache and run without failure or overuse.
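
To put rough numbers on your setup (illustrative arithmetic only, assuming the default 10 GB reserve and the sizes from your df/du output above): the disk is ~7.0T, the AF3 databases occupy ~0.6T, and the reserve keeps another 10 GB free, so at most roughly 6.4T can hold cached particles at any time. With ~6.3T already cached, a job that needs an additional ~130 GiB therefore has to be able to delete at least that much from previously-cached, unused files before copying starts.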

By the way, CryoSPARC v5.0.0, v5.0.1 and v5.0.2 had a bug that prevented the cache reserve from being applied. We've fixed this in the latest v5.0.3 update; please update to it to avoid errors similar to the one you reported. Please also re-connect your cluster with cryosparcm cluster connect after updating, so the reserve is set correctly.
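
As a sketch of that sequence (not a verbatim recipe; the config directory below is a hypothetical path, use wherever your cluster_info.json and cluster_script.sh actually live):

cryosparcm update                 # update the instance to the latest release (v5.0.3)
cd /path/to/cluster_config        # hypothetical: directory containing cluster_info.json and cluster_script.sh
cryosparcm cluster connect        # re-register the cluster so the corrected cache settings are applied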

Hope this helps, let me know if there’s anything I can clarify!

Related oddity in v5.0.3: a particle stack with fewer particles spread over more images requires more cache space. This becomes untenable for large data sets.

@MHB Please see Strange cache issue with V5.03 for a response.