Hi cryoSPARC team,
we have encountered an issue with the SSD cache, and I am slightly puzzled; other topics here were unfortunately not helpful.
Our system:
- cryoSPARC v5.0.1
- SLURM cluster with 2 different partitions
- Each node in each partition has ~6.3T of SSD cache available
This is the error log:
Loading a ParticleStack with 48439 items...
──────────────────────────────────────────────────────────────
SSD cache ACTIVE at /scratch/cryosparc/instance_login:61001
Checking and allocating files on SSD ...
Encountered OS error while caching: [Errno 28] No space left on device: '/scratch/cryosparc/instance_login:61001/links/P10-J44-1773148660'; dumping info
Traceback (most recent call last):
  File "cli/run.py", line 105, in cli.run.run_job
  File "cli/run.py", line 210, in cli.run.run_job_function
  File "compute/jobs/helix/run_refine.py", line 250, in compute.jobs.helix.run_refine.run
  File "/opt/cryosparc/cryosparc_worker/compute/particles.py", line 152, in read_blobs
    u_blob_paths = cache.run(rc, u_rel_paths)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 935, in run
    return run_with_executor(rc, rel_sources, executor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 978, in run_with_executor
    state = drive.allocate(sources, active_run_ids=info.active_run_ids)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 593, in allocate
    self.setup_dirs()
  File "/opt/cryosparc/cryosparc_worker/compute/cache.py", line 490, in setup_dirs
    self.links_run_dir.mkdir(exist_ok=True)
  File "/opt/cryosparc/cryosparc_worker/.pixi/envs/worker/lib/python3.12/pathlib.py", line 1311, in mkdir
    os.mkdir(self, mode)
OSError: [Errno 28] No space left on device: '/scratch/cryosparc/instance_login:61001/links/P10-J44-1773148660'
So I checked the worker node:
[root@gpubig02 scratch]# df -hl /scratch/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  7.0T  7.0T   48K 100% /scratch
[root@gpubig02 scratch]# du -hd1 /scratch
6.3T    /scratch/cryosparc
627G    /scratch/alphafold_databases
6.9T    /scratch
[root@gpubig02 scratch]# getfacl /scratch
getfacl: Removing leading '/' from absolute path names
# file: scratch
# owner: root
# group: root
user::rwx
group::rwx
other::rwx
[root@gpubig02 scratch]# getfacl /scratch/cryosparc
getfacl: Removing leading '/' from absolute path names
# file: scratch/cryosparc
# owner: cryosparc
# group: cryosparc
user::rwx
group::r-x
other::r-x
So as you can see, the majority of the space is occupied by cryoSPARC.
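To narrow down which part of the cache directory is holding the space, a deeper breakdown helps (same idea as the du above, just one level further):

du -xh --max-depth=2 /scratch/cryosparc | sort -h | tail -n 5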
I also verified the following (commands below):
- no other cryosparc job running
- input data small enough for caching (112G)
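For reference, this is roughly how I checked (the project path is a placeholder, not our real layout):

squeue -o "%.18i %.30j %.8T" | grep cryosparc   # no other cryoSPARC jobs pending or running
du -sh /path/to/project/P10/J44                 # example path; the input totals 112G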
So I looked into the cache:
- there is only a single subfolder (instance_login:61001)
- all files are owned by the cryosparc user, and all of them are older than 10 days:
[root@gpubig02 store-v2]# ls -ltrah /scratch/cryosparc/instance_login:61001/store-v2/* | awk '{print $3","$4","$6","$7}' | sort | uniq
#User,Group,Month,Day
cryosparc,cryosparc,Feb,10
cryosparc,cryosparc,Feb,12
cryosparc,cryosparc,Feb,13
cryosparc,cryosparc,Feb,14
cryosparc,cryosparc,Feb,18
cryosparc,cryosparc,Feb,19
cryosparc,cryosparc,Feb,21
cryosparc,cryosparc,Feb,23
cryosparc,cryosparc,Feb,25
cryosparc,cryosparc,Feb,26
cryosparc,cryosparc,Feb,27
cryosparc,cryosparc,Feb,5
cryosparc,cryosparc,Feb,6
cryosparc,cryosparc,Feb,9
[root@gpubig02 store-v2]# date
Tue Mar 10 02:33:59 PM CET 2026
To my understanding, cryoSPARC should be able to evict and overwrite these old cache files on its own when it needs the space.
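Before clearing anything by hand, I wanted to double-check the ls summary above; this should print nothing if every cached file really is older than 10 days:

find /scratch/cryosparc/instance_login:61001 -mtime -10 -print | head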
Here is also our master config (cryosparc_master/config.sh):
# Instance Configuration
export CRYOSPARC_LICENSE_ID="REDACTED"
export CRYOSPARC_MASTER_HOSTNAME="login"
export CRYOSPARC_DB_PATH="/opt/cryosparc/cryosparc_database/"
export CRYOSPARC_BASE_PORT=61000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
# Security
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
# Cluster Integration
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
# Project Configuration
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
# Development
export CRYOSPARC_DEVELOP=false
# Other
export CRYOSPARC_CLICK_WRAP=true
and cluster_info.json:
{
    "name": "slurm-6000ada",
    "worker_bin_path": "/opt/cryosparc/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/scratch/cryosparc/",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "/usr/bin/sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "/usr/bin/squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "/usr/bin/scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "/usr/bin/sinfo"
}
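One thing we have not configured is a cache quota. If I read the guide correctly, cluster_info.json also accepts cache_quota_mb and cache_reserve_mb; would setting those make the allocator evict the old runs instead of hitting ENOSPC? Something like this is what we would try (the values are examples for our 7.0T SSD, and the config directory is just where we keep these files):

cd /opt/cryosparc/config/slurm-6000ada   # example path holding cluster_info.json + cluster_script.sh
# after adding e.g. "cache_quota_mb": 6000000 and "cache_reserve_mb": 100000
# to cluster_info.json, re-register the lane so the new settings take effect:
/opt/cryosparc/cryosparc_master/bin/cryosparcm cluster connect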
And cluster_script.sh:
#!/usr/bin/env bash
#SBATCH --chdir={{ job_dir_abs }}
#SBATCH --export=NONE
#SBATCH --partition=big
#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ [128, (4 * ram_gb) | int] | max }}G
#SBATCH --comment="created by {{ cryosparc_username }}"
#SBATCH --output={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.out
#SBATCH --error={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.err
{{ run_cmd }}
sleep 5 # accounting can lag slightly
sacct -j "$SLURM_JOB_ID" \
    --format=JobID,Elapsed,MaxRSS,ReqMem,AllocCPUS \
    > "{{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm_usage.txt"
If any more input is needed, I will happily supply it.
Best
Christian
edit:
After manually clearing the cache with

rm -rf /scratch/cryosparc/instance_login\:61001

the job runs just as expected.
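As a stopgap, until we understand why these old entries were not evicted automatically, we are considering pruning unused cache files periodically; a sketch (the 10-day cutoff is arbitrary, and it assumes deleting from store-v2 between jobs is safe and that atime is usable on this mount):

# root crontab on each worker node, e.g. daily at 04:00
0 4 * * * find /scratch/cryosparc/instance_login:61001/store-v2 -type f -atime +10 -delete

Is that safe, or will deleting from store-v2 behind cryoSPARC's back confuse the cache bookkeeping?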
