In our lab, we run cryoSPARC in a master-worker setup. It has worked well for years, and the database has grown to around 330 GB. Recently, we noticed that jobs using the SSD cache are very slow to start, while the same jobs run without the SSD cache start much faster. An example is shown below.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
With SSD
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[CPU: 68.6 MB] --------------------------------------------------------------
[CPU: 68.6 MB] Importing job module for job type homo_abinit…
[CPU: 210.4 MB] Job ready to run
[CPU: 210.6 MB] ***************************************************************
[CPU: 381.5 MB] Using random seed for sgd of 985626163
[CPU: 381.7 MB] Loading a ParticleStack with 247976 items…
[CPU: 383.8 MB] SSD cache : cache successfuly synced in_use
[CPU: 383.8 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
The job stayed stuck in this state for 1 hour and 10 minutes and was then killed.
Job log:
================= CRYOSPARCW ======= 2022-09-15 16:33:28.196666 =========
Project P245 Job J502
Master henry4.ohsu.edu Port 39002
========= monitor process now starting main process
MAINPROCESS PID 417764
========= monitor process now waiting for main process
MAIN PID 417764
abinit.run cryosparc_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Without SSD
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[CPU: 68.7 MB] Importing job module for job type homo_abinit…
[CPU: 210.4 MB] Job ready to run
[CPU: 210.6 MB] ***************************************************************
[CPU: 380.3 MB] Using random seed for sgd of 691773547
[CPU: 380.3 MB] Loading a ParticleStack with 247976 items…
[CPU: 477.1 MB] Done.
[CPU: 477.2 MB] Windowing particles
[CPU: 477.2 MB] Done.
[CPU: 477.2 MB] Using 1 classes.
[CPU: 517.5 MB] Computing Ab-Initio Structure:
[CPU: 517.5 MB] Volume Size: 128 (voxel size 3.17A)
[CPU: 517.5 MB] Final Output Volume Size: 240
[CPU: 517.5 MB] Data Size: 240 (pixel size 1.69A)
[CPU: 517.5 MB] Resolution Range: 35.00A to 8.00A
[CPU: 517.5 MB] Fourier Radius Range: 11.6 to 50.7 with steps of 0.040000
[CPU: 517.7 MB] Using random seed for initialization of 119093666
[CPU: 517.7 MB] Generating random initial densities.
[CPU: 517.7 MB] Generating random initial density for class 0
[CPU: 884.5 MB] Done in 2.752s.
It took 20 seconds from job launch to generate the first density output.
Job log:
================= CRYOSPARCW ======= 2022-09-15 15:37:16.921910 =========
Project P245 Job J496
Master henry4.ohsu.edu Port 39002
========= monitor process now starting main process
MAINPROCESS PID 397265
========= monitor process now waiting for main process
MAIN PID 397265
abinit.run cryosparc_compute.jobs.jobregister
========= sending heartbeat
Running job J496 of type homo_abinit
Running job on hostname %s henry5
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'henry5', 'lane': 'henry5_2', 'lane_type': 'henry5_2', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/henry5/scratch2/cryosparc/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'henry5', 'lane': 'henry5_2', 'monitor_port': None, 'name': 'henry5', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc@henry5', 'title': 'Worker node henry5', 'type': 'node', 'worker_bin_path': '/home/cryosparc/software/cryosparc/cryosparc2_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
We have experienced slow starts with the SSD cache before, and cleaning up the cache folder resolved the issue then, but it didn't help this time. Any ideas about what is causing this and how to fix it would be greatly appreciated. Thank you!