Particle cache issues

Hi,

We are running CryoSPARC v4.4.1 on an HPC cluster. Since we updated to version 4.4 we have been having issues with/after particle caching. When we try to run any job requiring particle caching (2D/3D classifications, refinements of any sort), we quite often get the following error message:

> [CPU:  358.9 MB]
> Using random seed of 1359298509
> 
> [CPU:  391.3 MB]
> Loading a ParticleStack with 306080 items...
> 
> [CPU:  391.4 MB]
> ──────────────────────────────────────────────────────────────
> SSD cache ACTIVE at /scratch-cbe/users/svc_cryosparc/lane_cache_CBE/instance_imp-cryosparc-1.vbc.ac.at:39001 (10 GB reserve) (15 TB quota)
>   Checking files on SSD ...
> 
> [CPU:   1.50 GB]
> Traceback (most recent call last):
>   File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
>   File "cryosparc_master/cryosparc_compute/jobs/hetero_refine/run.py", line 81, in cryosparc_master.cryosparc_compute.jobs.hetero_refine.run.run_hetero_refine
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
>     u_blob_paths = cache_run(u_rel_paths)
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
>     return run_with_executor(rel_sources, executor)
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 828, in run_with_executor
>     state = drive.allocate(sources, active_run_ids=info["active_run_ids"])
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 612, in allocate
>     self.create_run_links(sources)
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 511, in create_run_links
>     link.symlink_to(f"../../{STORE_DIR}/{source.key_prefix}/{source.key}")
>   File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/pathlib.py", line 1384, in symlink_to
>     self._accessor.symlink(target, self, target_is_directory)
>   File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/pathlib.py", line 446, in symlink
>     return os.symlink(a, b)
> FileNotFoundError: [Errno 2] No such file or directory: '../../store-v2/1f/1fa88eba90f3a61fb5bcc6a92e2350261b3dc495' -> '/scratch-cbe/users/svc_cryosparc/lane_cache_CBE/instance_imp-cryosparc-1.vbc.ac.at:39001/links/P386-J79-1706087336/1fa88eba90f3a61fb5bcc6a92e2350261b3dc495.mrc'

It seems that the caching struggles when multiple jobs from the same workspace are launched simultaneously.
Any help would be greatly appreciated.

Best,
Lorenz

Here is the command_core log:

> 2024-01-24 11:08:56,468 scheduler_run_core   INFO     | Jobs Queued: [('P386', 'J79'), ('P31', 'J3')]
> 2024-01-24 11:08:56,472 scheduler_run_core   INFO     | Licenses currently active : 8
> 2024-01-24 11:08:56,472 scheduler_run_core   INFO     | Now trying to schedule J79
> 2024-01-24 11:08:56,472 scheduler_run_job    INFO     |    Scheduling job to CBE
> 2024-01-24 11:08:57,710 scheduler_run_job    INFO     | Not a commercial instance - heartbeat set to 12 hours.
> 2024-01-24 11:08:57,780 scheduler_run_job    INFO     |      Launchable! -- Launching.
> 2024-01-24 11:08:57,787 set_job_status       INFO     | Status changed for P386.J79 from queued to launched
> 2024-01-24 11:08:57,788 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P386, workspace_uid None, job_uid J79 with body {'projectUid': 'P386', 'jobUid': 'J79'}
> 2024-01-24 11:08:57,794 app_stats_refresh    INFO     | code 200, text {"success":true}
> 2024-01-24 11:08:57,820 run_job              INFO     |       Running P386 J79
> 2024-01-24 11:08:57,844 run_job              INFO     |         cmd: ssh cbe.vbc.ac.at sbatch --wckey=cryosparc /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J79/queue_sub_script.sh 
> 2024-01-24 11:08:59,066 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:08:59,066 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:08:59,067 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:08:59,067 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:08:59,068 scheduler_run_core   INFO     | Finished
> 2024-01-24 11:08:59,092 scheduler_run_core   INFO     | Running...
> 2024-01-24 11:08:59,092 scheduler_run_core   INFO     | Jobs Queued: [('P31', 'J3')]
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:08:59,096 scheduler_run_core   INFO     | Finished
> ESC[0mSending command: ssh cbe.vbc.ac.at squeue -j 65347214 ESC[0m
> ESC[0mSending command: ssh cbe.vbc.ac.at squeue -j 64918082 ESC[0m
> ESC[0mCluster job status check failed with exit code 1ESC[0m
> ESC[0mslurm_load_jobs error: Invalid job id specified
> ESC[0m
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    | Cluster job status update for P24 J261 failed with exit code 1
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    | slurm_load_jobs error: Invalid job id specified
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.
> 2024-01-24 11:09:04,473 dump_workspaces      INFO     | Exporting all workspaces in P197...
> 2024-01-24 11:09:04,495 dump_workspaces      INFO     | Exported all workspaces in P197 to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/20221030_Doa10_BRIL_Nb_nanodisc_b3_g1/workspaces.json in 0.02s
> 2024-01-24 11:09:04,677 scheduler_run_core   INFO     | Running...
> 2024-01-24 11:09:04,678 scheduler_run_core   INFO     | Jobs Queued: [('P31', 'J3')]
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:09:04,682 scheduler_run_core   INFO     | Finished
> 2024-01-24 11:09:06,256 set_job_status       INFO     | Status changed for P386.J79 from launched to started
> 2024-01-24 11:09:06,258 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P386, workspace_uid None, job_uid J79 with body {'projectUid': 'P386', 'jobUid': 'J79'}
> 2024-01-24 11:09:06,265 app_stats_refresh    INFO     | code 200, text {"success":true}
> 2024-01-24 11:09:07,356 set_job_status       INFO     | Status changed for P386.J79 from started to running
> 2024-01-24 11:09:07,358 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P386, workspace_uid None, job_uid J79 with body {'projectUid': 'P386', 'jobUid': 'J79'}
> 2024-01-24 11:09:07,367 app_stats_refresh    INFO     | code 200, text {"success":true}
> 2024-01-24 11:09:09,737 scheduler_run_core   INFO     | Running...
> 2024-01-24 11:09:09,737 scheduler_run_core   INFO     | Jobs Queued: [('P31', 'J3')]
> 2024-01-24 11:09:09,742 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:09:09,743 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:09:09,743 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:09:09,743 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:09:09,745 scheduler_run_core   INFO     | Finished
> ESC[0mSending command: ssh cbe.vbc.ac.at squeue -j 64918082 ESC[0m
> ESC[0mCluster job status check failed with exit code 1ESC[0m
> ESC[0mslurm_load_jobs error: Invalid job id specified
> ESC[0m
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    | Cluster job status update for P24 J261 failed with exit code 1
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    | slurm_load_jobs error: Invalid job id specified
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.

Hi @DerLorenz, please can you post

  • the output of the command
cat /users/svc_cryosparc/cryosparc_worker/config.sh
  • the contents of the queue_sub_script.sh file for one of the affected jobs

Hi,
I’m adding some system information for @DerLorenz

$ cat /users/svc_cryosparc/cryosparc_worker/config.sh

export CRYOSPARC_LICENSE_ID="<license_id_redacted_by_me>"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/software/2020/software/cudacore/11.1.1"
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=29
export CRYOSPARC_CACHE_NUM_THREADS=8
export CRYOSPARC_IMPROVED_SSD_CACHE=true

The submission script (it’s not from the original job ID, but this one failed with the same error):


====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
##
## provide this information together with a link to this website in Helpdesk tickets at: http://support.vbc.ac.at"
## ===========================================
## Cryosparc project: P386 J73
## Cryosparc user: lorenz.grundmann@imp.ac.at
## Cryosparc job dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73
## Cryosparc project dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys
## Cryosparc job log: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log
## ===========================================
##
#### cryoSPARC cluster submission script template for SLURM
## variables via: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connect-a-managed-worker-to-cryosparc
## Available variables:
## run_cmd: ~/cryosparc_worker/bin/cryosparcw run --project P386 --job J73 --master_hostname imp-cryosparc-1.vbc.ac.at --master_command_core_port 39002 > /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log 2>&1             - the complete command string to run the job
## num_cpu: 2            - the number of CPUs needed
## num_gpu: 1            - the number of GPUs needed.
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## ram_gb: 24.0             - the amount of RAM needed in GB
## job_dir_abs: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73        - absolute path to the job directory
## project_dir_abs: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys    - absolute path to the project dir
## job_log_path_abs: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log   - absolute path to the log file for the job
## worker_bin_path: ~/cryosparc_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## run_args: --project P386 --job J73 --master_hostname imp-cryosparc-1.vbc.ac.at --master_command_core_port 39002           - arguments to be passed to cryosparcw run
## project_uid: P386        - uid of the project
## job_uid: J73            - uid of the job
## job_creator: lorenz.grundmann@had        - name of the user that created the job (may contain spaces)
## cryosparc_username: lorenz.grundmann@imp.ac.at - cryosparc username of the user that created the job (usually an email)
## job_type: class_2D_new           - cryosparc job type
##                                    defined in /cryosparc_master/cryosparc_compute/jobs/jobregister.py
##
## ====== parameters ======
## 'cbe_time': 
##   - 'auto': select based on job_type(TBD), currently: 8h
##   - 'short', 'medium', 'long' - to select max run duration for that QOS (8h, 48h, 168h)
##   - max runtime in MINUTES
## 'cbe_ram_multiplier': 
##   - default = 1.0
##
## inputs:
##   cbe_time=auto
##   cbe_ram_multiplier=1.0

## configured qos_limit: {'short': 480, 'medium': 2880, 'long': 20160}

## cbe_time auto select -> short, or based on job type: class_2D_new


## auto time select: qos=short for job_type: class_2D_new

## cbe_time: 480
## qos: short
##
## What follows is a simple SLURM script:
#SBATCH --job-name cs_lorenz.grundmann_P386_J73_class_2D_new
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --partition g
#SBATCH --qos short
#SBATCH --time 480
#SBATCH --mem=24000MB
#SBATCH -o /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/P386_J73.out
#SBATCH -e /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/P386_J73.err
#SBATCH -A had
#SBATCH --comment 'cbe_time=auto;cbe_ram_multiplier=1.0;qos=short;job_type=class_2D_new;'


echo "starting Cryosparc work ..."

~/cryosparc_worker/bin/cryosparcw run --project P386 --job J73 --master_hostname imp-cryosparc-1.vbc.ac.at --master_command_core_port 39002 > /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log 2>&1 

#
# provide the output below for servicedesk requests: http://support.vbc.ac.at
#
cat << EOF | tee "/groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/jobinfo.txt"
Cryosparc job resource consumption summary:

cryosparc project/job: P386 J73
slurm jobid: $SLURM_JOBID
cryosparc user: lorenz.grundmann@imp.ac.at
job_dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73
project_dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys
job_log: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log

job_type: class_2D_new
cbe_time: auto
cbe_ram_multiplier: 1.0
cryosparc ram_gb: 24.0
time_limit: 480
qos: short

# =========================================== 

$(jobinfo $SLURM_JOBID)

# =========================================== 

$(scontrol show job $SLURM_JOBID)

# =========================================== 

EOF


==========================================================================
==========================================================================

Thanks @ebirn for providing these details.
Please can you describe the cache implementation, including details like

  • Does the cache_path as defined for the CryoSPARC target include symbolic links?
  • Is the storage for cache_path local to the compute node, or shared between compute nodes?
  • In case of shared cache storage, please can you post the storage server (export) and client (mount) configuration settings/options (e.g. via the commands sketched below)?
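
For example (a sketch only; the exact paths on your nodes may differ), the client-side details could be collected on a compute node with:

    # filesystem type and mount options backing the configured cache path
    findmnt -T /scratch-cbe/users/svc_cryosparc/lane_cache_CBE
    # list all BeeGFS mounts together with their mount options
    mount -t beegfs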

Hi,

  • The configured cache path does not contain any symlinks.
  • It is a filesystem shared between compute nodes.
  • The filesystem type is BeeGFS. It is shared between multiple users, mounted on all compute nodes in the cluster, and constantly used for various temporary data (as a scratch filesystem).

There is sufficient space on the filesystem, and the user running the CryoSPARC jobs has access to it. While we can sporadically reproduce this cache issue, I don’t believe it is an issue with the underlying filesystem; otherwise I would expect the overall error rates to be higher than what we see.

@DerLorenz, can you confirm whether this is happening only with specific job type(s), and whether for those it is not 100% reproducible?

Best,
Erich

Thanks for looking into this @wtempel.

All jobs that require caching fail in about 1/3 of all submissions. We brute-force the jobs by relaunching them until they eventually work. Consequently, the statistics posted by @ebirn do not reflect the current status, since we simply restart each job; once it finally finishes, the earlier failed attempts are no longer registered as “failed”.

@wtempel is there any other information that we can provide in order to help this debugging along?


There is. We would like to know whether the problems are related to the filesystem type (of the cache storage). Would it be possible for you to test caching reliability when cache storage is implemented on a different filesystem type? For example, are the workers’ /tmp directories implemented on a different filesystem type, and do they have sufficient capacity to temporarily provide cache storage?
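
If the workers’ /tmp directories qualify, one way to run such a test (a sketch only, untested on your setup; it assumes /tmp has enough free space on every worker node and that the CRYOSPARC_SSD_PATH override takes effect for the cluster lane, see CryoSPARC environment variables) would be:

    # temporary test configuration in /users/svc_cryosparc/cryosparc_worker/config.sh
    # (assumptions: /tmp is node-local and large enough; revert after the test)
    export CRYOSPARC_SSD_PATH="/tmp/cryosparc_cache"
    # ensure the directory exists on each worker node before jobs start,
    # e.g. from the cluster script template or a node prolog:
    # mkdir -p /tmp/cryosparc_cache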

We’ve tried to reproduce the issue on our 2nd CryoSPARC instance, but we were unable to do so.
On the 2nd instance, the caches are per-node local SSDs.
The instance where the issue is triggered uses a cluster lane with a shared filesystem (BeeGFS) as cache.

We observe this as a regression on the shared filesystem since we updated to v4.4.1. We assume it is related to the cache locking implementation on BeeGFS.

Do you recognise this as a bug, and if not, what is your recommended setup in this situation?
Should we disable the cache? Or is there a way to configure a cache path that is scoped to each compute job in the cluster (i.e. Dynamic SSD Cache Paths), but on an ephemeral filesystem?
E.g. a /beegfs/per-job-dir that is wiped after the job run. My impression was that a cluster lane effectively requires a shared filesystem as cache.

There is at least one way (a combined sketch follows the list):

  1. a (slurm.conf) Prolog= script that creates a /scr/slurm_job_${SLURM_JOB_ID} directory and chowns it to $SLURM_JOB_USER
  2. combined with a TaskProlog= script that echoes a variable definition pointing to the directory created in the previous step, like
    echo "export SLURM_SCR=/scr/slurm_job_${SLURM_JOB_ID}"
    
  3. combined with a line in the cluster script template like
    export CRYOSPARC_SSD_PATH=$SLURM_SCR
    
    It may be possible to instead just
    echo "export CRYOSPARC_SSD_PATH=/scr/slurm_job_${SLURM_JOB_ID}"
    
    in the task prolog (previous step), but I have not tested that (see CryoSPARC environment variables).
  4. combined with an Epilog= script that removes the job-specific scratch directory.
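
Putting these pieces together, a minimal sketch (untested; the /scr base path, script locations, and slurm.conf entries are illustrative assumptions, not a tested configuration):

    # --- slurm.conf (illustrative entries) ---
    # Prolog=/etc/slurm/prolog.sh
    # TaskProlog=/etc/slurm/task_prolog.sh
    # Epilog=/etc/slurm/epilog.sh

    # --- /etc/slurm/prolog.sh (runs as root on each allocated node) ---
    mkdir -p "/scr/slurm_job_${SLURM_JOB_ID}"
    chown "${SLURM_JOB_USER}" "/scr/slurm_job_${SLURM_JOB_ID}"

    # --- /etc/slurm/task_prolog.sh ("export NAME=value" lines printed to
    #     stdout are added to the task environment by slurmd) ---
    echo "export SLURM_SCR=/scr/slurm_job_${SLURM_JOB_ID}"

    # --- cluster script template, before the cryosparcw run line ---
    export CRYOSPARC_SSD_PATH="$SLURM_SCR"

    # --- /etc/slurm/epilog.sh (runs as root after the job completes) ---
    rm -rf "/scr/slurm_job_${SLURM_JOB_ID}"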

Worker host-specific, non-shared scratch storage is (supposed to be) supported and, as far as I know, is the most performant option (NVMe SSD). Did you find otherwise?