Particle cache issues

Hi,

We are running CryoSPARC v4.4.1 on an HPC cluster. Since we updated to version 4.4 we have been having issues with/after particle caching. When we try to run any job requiring particle caching (2D/3D classifications, refinements of any sort), we quite often get the following error message:

> [CPU:  358.9 MB]
> Using random seed of 1359298509
> 
> [CPU:  391.3 MB]
> Loading a ParticleStack with 306080 items...
> 
> [CPU:  391.4 MB]
> ──────────────────────────────────────────────────────────────
> SSD cache ACTIVE at /scratch-cbe/users/svc_cryosparc/lane_cache_CBE/instance_imp-cryosparc-1.vbc.ac.at:39001 (10 GB reserve) (15 TB quota)
>   Checking files on SSD ...
> 
> [CPU:   1.50 GB]
> Traceback (most recent call last):
>   File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
>   File "cryosparc_master/cryosparc_compute/jobs/hetero_refine/run.py", line 81, in cryosparc_master.cryosparc_compute.jobs.hetero_refine.run.run_hetero_refine
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 120, in read_blobs
>     u_blob_paths = cache_run(u_rel_paths)
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 796, in run
>     return run_with_executor(rel_sources, executor)
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 828, in run_with_executor
>     state = drive.allocate(sources, active_run_ids=info["active_run_ids"])
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 612, in allocate
>     self.create_run_links(sources)
>   File "/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache_v2.py", line 511, in create_run_links
>     link.symlink_to(f"../../{STORE_DIR}/{source.key_prefix}/{source.key}")
>   File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/pathlib.py", line 1384, in symlink_to
>     self._accessor.symlink(target, self, target_is_directory)
>   File "/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/pathlib.py", line 446, in symlink
>     return os.symlink(a, b)
> FileNotFoundError: [Errno 2] No such file or directory: '../../store-v2/1f/1fa88eba90f3a61fb5bcc6a92e2350261b3dc495' -> '/scratch-cbe/users/svc_cryosparc/lane_cache_CBE/instance_imp-cryosparc-1.vbc.ac.at:39001/links/P386-J79-1706087336/1fa88eba90f3a61fb5bcc6a92e2350261b3dc495.mrc'

It seems that the caching struggles when multiple jobs from the same workspace are launched simultaneously.
Any help would be greatly appreciated.

Best,
Lorenz

Here is the command_core log:

> 2024-01-24 11:08:56,468 scheduler_run_core   INFO     | Jobs Queued: [('P386', 'J79'), ('P31', 'J3')]
> 2024-01-24 11:08:56,472 scheduler_run_core   INFO     | Licenses currently active : 8
> 2024-01-24 11:08:56,472 scheduler_run_core   INFO     | Now trying to schedule J79
> 2024-01-24 11:08:56,472 scheduler_run_job    INFO     |    Scheduling job to CBE
> 2024-01-24 11:08:57,710 scheduler_run_job    INFO     | Not a commercial instance - heartbeat set to 12 hours.
> 2024-01-24 11:08:57,780 scheduler_run_job    INFO     |      Launchable! -- Launching.
> 2024-01-24 11:08:57,787 set_job_status       INFO     | Status changed for P386.J79 from queued to launched
> 2024-01-24 11:08:57,788 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P386, workspace_uid None, job_uid J79 with body {'projectUid': 'P386', 'jobUid': 'J79'}
> 2024-01-24 11:08:57,794 app_stats_refresh    INFO     | code 200, text {"success":true}
> 2024-01-24 11:08:57,820 run_job              INFO     |       Running P386 J79
> 2024-01-24 11:08:57,844 run_job              INFO     |         cmd: ssh cbe.vbc.ac.at sbatch --wckey=cryosparc /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J79/queue_sub_script.sh 
> 2024-01-24 11:08:59,066 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:08:59,066 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:08:59,067 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:08:59,067 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:08:59,068 scheduler_run_core   INFO     | Finished
> 2024-01-24 11:08:59,092 scheduler_run_core   INFO     | Running...
> 2024-01-24 11:08:59,092 scheduler_run_core   INFO     | Jobs Queued: [('P31', 'J3')]
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:08:59,095 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:08:59,096 scheduler_run_core   INFO     | Finished
> ESC[0mSending command: ssh cbe.vbc.ac.at squeue -j 65347214 ESC[0m
> ESC[0mSending command: ssh cbe.vbc.ac.at squeue -j 64918082 ESC[0m
> ESC[0mCluster job status check failed with exit code 1ESC[0m
> ESC[0mslurm_load_jobs error: Invalid job id specified
> ESC[0m
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    | Cluster job status update for P24 J261 failed with exit code 1
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:02,640 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    | slurm_load_jobs error: Invalid job id specified
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:02,642 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.
> 2024-01-24 11:09:04,473 dump_workspaces      INFO     | Exporting all workspaces in P197...
> 2024-01-24 11:09:04,495 dump_workspaces      INFO     | Exported all workspaces in P197 to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/20221030_Doa10_BRIL_Nb_nanodisc_b3_g1/workspaces.json in 0.02s
> 2024-01-24 11:09:04,677 scheduler_run_core   INFO     | Running...
> 2024-01-24 11:09:04,678 scheduler_run_core   INFO     | Jobs Queued: [('P31', 'J3')]
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:09:04,681 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:09:04,682 scheduler_run_core   INFO     | Finished
> 2024-01-24 11:09:06,256 set_job_status       INFO     | Status changed for P386.J79 from launched to started
> 2024-01-24 11:09:06,258 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P386, workspace_uid None, job_uid J79 with body {'projectUid': 'P386', 'jobUid': 'J79'}
> 2024-01-24 11:09:06,265 app_stats_refresh    INFO     | code 200, text {"success":true}
> 2024-01-24 11:09:07,356 set_job_status       INFO     | Status changed for P386.J79 from started to running
> 2024-01-24 11:09:07,358 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P386, workspace_uid None, job_uid J79 with body {'projectUid': 'P386', 'jobUid': 'J79'}
> 2024-01-24 11:09:07,367 app_stats_refresh    INFO     | code 200, text {"success":true}
> 2024-01-24 11:09:09,737 scheduler_run_core   INFO     | Running...
> 2024-01-24 11:09:09,737 scheduler_run_core   INFO     | Jobs Queued: [('P31', 'J3')]
> 2024-01-24 11:09:09,742 scheduler_run_core   INFO     | Licenses currently active : 9
> 2024-01-24 11:09:09,743 scheduler_run_core   INFO     | Now trying to schedule J3
> 2024-01-24 11:09:09,743 scheduler_run_core   INFO     |     Queue status waiting_licenses
> 2024-01-24 11:09:09,743 scheduler_run_core   INFO     |     Queue message 
> 2024-01-24 11:09:09,745 scheduler_run_core   INFO     | Finished
> ESC[0mSending command: ssh cbe.vbc.ac.at squeue -j 64918082 ESC[0m
> ESC[0mCluster job status check failed with exit code 1ESC[0m
> ESC[0mslurm_load_jobs error: Invalid job id specified
> ESC[0m
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    | Cluster job status update for P24 J261 failed with exit code 1
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:13,656 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    | slurm_load_jobs error: Invalid job id specified
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    | Traceback (most recent call last):
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_command/command_core/__init__.py", line 2883, in update_cluster_job_status
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id, template_args)
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/cryosparc_compute/cluster.py", line 176, in get_cluster_job_status
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 415, in check_output
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |   File "/home/svc_cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/subprocess.py", line 516, in run
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    |     raise CalledProcessError(retcode, process.args,
> 2024-01-24 11:09:13,657 update_cluster_job_status ERROR    | subprocess.CalledProcessError: Command '['ssh', 'cbe.vbc.ac.at', 'squeue', '-j', '64918082']' returned non-zero exit status 1.

Hi @DerLorenz, please can you post

  • the output of the command
cat /users/svc_cryosparc/cryosparc_worker/config.sh
  • the contents of the queue_sub_script.sh file for one of the affected jobs

Hi,
I’m adding some system information for @DerLorenz

$ cat /users/svc_cryosparc/cryosparc_worker/config.sh

export CRYOSPARC_LICENSE_ID="<license_id_redacted_by_me>"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/software/2020/software/cudacore/11.1.1"
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=29
export CRYOSPARC_CACHE_NUM_THREADS=8
export CRYOSPARC_IMPROVED_SSD_CACHE=true

The submission script (it’s not from the original job ID, but this one failed with the same error):


====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
##
## provide this information together with a link to this website in Helpdesk tickets at: http://support.vbc.ac.at"
## ===========================================
## Cryosparc project: P386 J73
## Cryosparc user: lorenz.grundmann@imp.ac.at
## Cryosparc job dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73
## Cryosparc project dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys
## Cryosparc job log: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log
## ===========================================
##
#### cryoSPARC cluster submission script template for SLURM
## variables via: https://guide.cryosparc.com/setup-configuration-and-management/how-to-download-install-and-configure/downloading-and-installing-cryosparc#connect-a-managed-worker-to-cryosparc
## Available variables:
## run_cmd: ~/cryosparc_worker/bin/cryosparcw run --project P386 --job J73 --master_hostname imp-cryosparc-1.vbc.ac.at --master_command_core_port 39002 > /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log 2>&1             - the complete command string to run the job
## num_cpu: 2            - the number of CPUs needed
## num_gpu: 1            - the number of GPUs needed.
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## ram_gb: 24.0             - the amount of RAM needed in GB
## job_dir_abs: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73        - absolute path to the job directory
## project_dir_abs: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys    - absolute path to the project dir
## job_log_path_abs: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log   - absolute path to the log file for the job
## worker_bin_path: ~/cryosparc_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## run_args: --project P386 --job J73 --master_hostname imp-cryosparc-1.vbc.ac.at --master_command_core_port 39002           - arguments to be passed to cryosparcw run
## project_uid: P386        - uid of the project
## job_uid: J73            - uid of the job
## job_creator: lorenz.grundmann@had        - name of the user that created the job (may contain spaces)
## cryosparc_username: lorenz.grundmann@imp.ac.at - cryosparc username of the user that created the job (usually an email)
## job_type: class_2D_new           - cryosparc job type
##                                    defined in /cryosparc_master/cryosparc_compute/jobs/jobregister.py
##
## ====== parameters ======
## 'cbe_time': 
##   - 'auto': select based on job_type(TBD), currently: 8h
##   - 'short', 'medium', 'long' - to select max run duration for that QOS (8h, 48h, 168h)
##   - max runtime in MINUTES
## 'cbe_ram_multiplier': 
##   - default = 1.0
##
## inputs:
##   cbe_time=auto
##   cbe_ram_multiplier=1.0

## configured qos_limit: {'short': 480, 'medium': 2880, 'long': 20160}

## cbe_time auto select -> short, or based on job type: class_2D_new


## auto time select: qos=short for job_type: class_2D_new

## cbe_time: 480
## qos: short
##
## What follows is a simple SLURM script:
#SBATCH --job-name cs_lorenz.grundmann_P386_J73_class_2D_new
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --partition g
#SBATCH --qos short
#SBATCH --time 480
#SBATCH --mem=24000MB
#SBATCH -o /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/P386_J73.out
#SBATCH -e /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/P386_J73.err
#SBATCH -A had
#SBATCH --comment 'cbe_time=auto;cbe_ram_multiplier=1.0;qos=short;job_type=class_2D_new;'


echo "starting Cryosparc work ..."

~/cryosparc_worker/bin/cryosparcw run --project P386 --job J73 --master_hostname imp-cryosparc-1.vbc.ac.at --master_command_core_port 39002 > /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log 2>&1 

#
# provide the output below for servicedesk requests: http://support.vbc.ac.at
#
cat << EOF | tee "/groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/jobinfo.txt"
Cryosparc job resource consumption summary:

cryosparc project/job: P386 J73
slurm jobid: $SLURM_JOBID
cryosparc user: lorenz.grundmann@imp.ac.at
job_dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73
project_dir: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys
job_log: /groups/haselbach/cryosparc_projects_folder/active_instance/imports/CS-231229-helmutatpys/J73/job.log

job_type: class_2D_new
cbe_time: auto
cbe_ram_multiplier: 1.0
cryosparc ram_gb: 24.0
time_limit: 480
qos: short

# =========================================== 

$(jobinfo $SLURM_JOBID)

# =========================================== 

$(scontrol show job $SLURM_JOBID)

# =========================================== 

EOF


==========================================================================
==========================================================================

Thanks @ebirn for providing these details.
Please can you describe the cache implementation, including details like

  • Does the cache_path as defined for the CryoSPARC target include symbolic links?
  • Is the storage for cache_path local to the compute node, or shared between compute nodes?
  • In case of shared cache storage, please can you post the storage server (export) and client (mount) configuration settings/options (e.g. via the commands sketched below)?
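
For example (a sketch only; the exact paths on your nodes may differ), the client-side details could be collected on a compute node with:

    # filesystem type and mount options backing the configured cache path
    findmnt -T /scratch-cbe/users/svc_cryosparc/lane_cache_CBE
    # list all BeeGFS mounts together with their mount options
    mount -t beegfs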

Hi,

  • The configured cache path does not contain any symlinks.
  • It is a filesystem shared between compute nodes.
  • The filesystem type is BeeGFS. It is shared between multiple users, mounted on all compute nodes in the cluster, and constantly used for various temporary data (as a scratch filesystem).

There is sufficient space on the filesystem, and the user running the CryoSPARC jobs has access to it. While we can sporadically reproduce this cache issue, I don’t believe it is an issue with the underlying filesystem; otherwise I would expect the overall error rates to be higher than what we see.

@DerLorenz, can you confirm whether this is happening only with specific job type(s), and whether for those it is not 100% reproducible?

Best,
Erich

Thanks for looking into this @wtempel.

All jobs that require caching fail in about 1/3 of all submissions. We brute-force the jobs by relaunching them until they eventually work. Consequently, the statistics posted by @ebirn do not reflect the current status, since we simply restart each job; once it finally finishes, the earlier failed attempts are no longer registered as “failed”.

@wtempel is there any other information that we can provide in order to help this debugging along?


There is. We would like to know whether the problems are related to the filesystem type (of the cache storage). Would it be possible for you to test caching reliability when cache storage is implemented on a different filesystem type? For example, are the workers’ /tmp directories implemented on a different filesystem type, and do they have sufficient capacity to temporarily provide cache storage?
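
If the workers’ /tmp directories qualify, one way to run such a test (a sketch only, untested on your setup; it assumes /tmp has enough free space on every worker node and that the CRYOSPARC_SSD_PATH override takes effect for the cluster lane, see CryoSPARC environment variables) would be:

    # temporary test configuration in /users/svc_cryosparc/cryosparc_worker/config.sh
    # (assumptions: /tmp is node-local and large enough; revert after the test)
    export CRYOSPARC_SSD_PATH="/tmp/cryosparc_cache"
    # ensure the directory exists on each worker node before jobs start,
    # e.g. from the cluster script template or a node prolog:
    # mkdir -p /tmp/cryosparc_cache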

We’ve tried to reproduce the issue on our 2nd CryoSPARC instance, but we were unable to do so.
On the 2nd instance, the caches are per-node local SSDs.
The instance where the issue is triggered uses a cluster lane with a shared filesystem (BeeGFS) as cache.

We observe this as a regression on the shared filesystem since we updated to v4.4.1. We assume it is related to the cache locking implementation on BeeGFS.

Do you recognise this as a bug, and if not, what is your recommended setup in this situation?
Should we disable the cache? Or is there a way to configure a cache path that is scoped to each compute job in the cluster (i.e. Dynamic SSD Cache Paths), but on an ephemeral filesystem?
E.g. a /beegfs/per-job-dir that is wiped after the job run. My impression was that a cluster lane effectively requires a shared filesystem as cache.

There is at least one way (a combined sketch follows the list):

  1. a (slurm.conf) Prolog= script that creates a /scr/slurm_job_${SLURM_JOB_ID} directory and chowns it to $SLURM_JOB_USER
  2. combined with a TaskProlog= script that echoes a variable definition pointing to the directory created in the previous step, like
    echo "export SLURM_SCR=/scr/slurm_job_${SLURM_JOB_ID}"
    
  3. combined with a line in the cluster script template like
    export CRYOSPARC_SSD_PATH=$SLURM_SCR
    
    It may be possible to instead just
    echo "export CRYOSPARC_SSD_PATH=/scr/slurm_job_${SLURM_JOB_ID}"
    
    in the task prolog (previous step), but I have not tested that (see CryoSPARC environment variables).
  4. combined with an Epilog= script that removes the job-specific scratch directory.
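
Putting these pieces together, a minimal sketch (untested; the /scr base path, script locations, and slurm.conf entries are illustrative assumptions, not a tested configuration):

    # --- slurm.conf (illustrative entries) ---
    # Prolog=/etc/slurm/prolog.sh
    # TaskProlog=/etc/slurm/task_prolog.sh
    # Epilog=/etc/slurm/epilog.sh

    # --- /etc/slurm/prolog.sh (runs as root on each allocated node) ---
    mkdir -p "/scr/slurm_job_${SLURM_JOB_ID}"
    chown "${SLURM_JOB_USER}" "/scr/slurm_job_${SLURM_JOB_ID}"

    # --- /etc/slurm/task_prolog.sh ("export NAME=value" lines printed to
    #     stdout are added to the task environment by slurmd) ---
    echo "export SLURM_SCR=/scr/slurm_job_${SLURM_JOB_ID}"

    # --- cluster script template, before the cryosparcw run line ---
    export CRYOSPARC_SSD_PATH="$SLURM_SCR"

    # --- /etc/slurm/epilog.sh (runs as root after the job completes) ---
    rm -rf "/scr/slurm_job_${SLURM_JOB_ID}"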

Worker host-specific, non-shared scratch storage is (supposed to be) supported and, as far as I know, is the most performant option (NVMe SSD). Did you find otherwise?