CryoSPARC jobs take hours to start when SSD cache is turned on

In our lab, we have a master-worker setup of CryoSPARC. This setup has worked well for years, and we have accumulated a database of around 330 GB. Recently, we noticed that jobs requiring the SSD cache were very slow to start, while the same jobs without SSD caching were much more responsive. An example is shown below.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
With SSD
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[CPU: 68.6 MB] --------------------------------------------------------------
[CPU: 68.6 MB] Importing job module for job type homo_abinit…
[CPU: 210.4 MB] Job ready to run
[CPU: 210.6 MB] ***************************************************************
[CPU: 381.5 MB] Using random seed for sgd of 985626163
[CPU: 381.7 MB] Loading a ParticleStack with 247976 items…
[CPU: 383.8 MB] SSD cache : cache successfuly synced in_use
[CPU: 383.8 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
The job was stuck in this state for 1 hour and 10 minutes before it was killed.

joblog
================= CRYOSPARCW ======= 2022-09-15 16:33:28.196666 =========
Project P245 Job J502
Master henry4.ohsu.edu Port 39002

========= monitor process now starting main process
MAINPROCESS PID 417764
========= monitor process now waiting for main process
MAIN PID 417764
abinit.run cryosparc_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
[... "========= sending heartbeat" repeated until the job was killed about 70 minutes later ...]

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Without SSD
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[CPU: 68.7 MB] Importing job module for job type homo_abinit…
[CPU: 210.4 MB] Job ready to run
[CPU: 210.6 MB] ***************************************************************
[CPU: 380.3 MB] Using random seed for sgd of 691773547
[CPU: 380.3 MB] Loading a ParticleStack with 247976 items…
[CPU: 477.1 MB] Done.
[CPU: 477.2 MB] Windowing particles
[CPU: 477.2 MB] Done.
[CPU: 477.2 MB] Using 1 classes.
[CPU: 517.5 MB] Computing Ab-Initio Structure:
[CPU: 517.5 MB] Volume Size: 128 (voxel size 3.17A)
[CPU: 517.5 MB] Final Output Volume Size: 240
[CPU: 517.5 MB] Data Size: 240 (pixel size 1.69A)
[CPU: 517.5 MB] Resolution Range: 35.00A to 8.00A
[CPU: 517.5 MB] Fourier Radius Range: 11.6 to 50.7 with steps of 0.040000
[CPU: 517.7 MB] Using random seed for initialization of 119093666
[CPU: 517.7 MB] Generating random initial densities.
[CPU: 517.7 MB] Generating random initial density for class 0
[CPU: 884.5 MB] Done in 2.752s.
From job launch, it took 20 seconds to generate the first density output.

joblog
================= CRYOSPARCW ======= 2022-09-15 15:37:16.921910 =========
Project P245 Job J496
Master henry4.ohsu.edu Port 39002

========= monitor process now starting main process
MAINPROCESS PID 397265
========= monitor process now waiting for main process
MAIN PID 397265
abinit.run cryosparc_compute.jobs.jobregister
========= sending heartbeat


Running job J496 of type homo_abinit
Running job on hostname %s henry5
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'henry5', 'lane': 'henry5_2', 'lane_type': 'henry5_2', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/henry5/scratch2/cryosparc/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'henry5', 'lane': 'henry5_2', 'monitor_port': None, 'name': 'henry5', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc@henry5', 'title': 'Worker node henry5', 'type': 'node', 'worker_bin_path': '/home/cryosparc/software/cryosparc/cryosparc2_worker/bin/cryosparcw'}}
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
========= sending heartbeat
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
HOST ALLOCATION FUNCTION: using n.empty (CRYOSPARC_NO_PAGELOCK==true)
[... "HOST ALLOCATION FUNCTION" messages and heartbeats repeated as the job proceeded ...]

We have experienced slow starts with the SSD cache before, and cleaning up the cache folder used to resolve the issue, but it didn't help this time. Any ideas about what is causing this and how to fix it would be greatly appreciated. Thank you!

On what type of device and filesystem is /henry5/scratch2/cryosparc/?
df -h /henry5/scratch2/cryosparc/
stat -f /henry5/scratch2/cryosparc/
At what version and patch level is your cryoSPARC instance?
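For the version, cryosparcm status on the master should report it; a minimal check (assuming a standard install) would be something like:
cryosparcm status | grep -i version   # status output includes the current CryoSPARC version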

Our CryoSPARC is v3.3.2 without patch 220824.
The SSD cache on henry5 is a RAID 0 array of two 7 TB SSDs.

/dev/md128 12T 4.4T 6.9T 40% /henry5/scratch2
File: "/henry5/scratch2/cryosparc"
ID: 98000000000 Namelen: 255 Type: xfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 2999969024 Free: 1828206587 Available: 1828206587
Inodes: Total: 1200196288 Free: 1200128260

Thank you for this information.

Some more general questions:
How long have you been observing this behavior?
Have the symptoms deteriorated over time?

A few questions to rule out some potential issues outside cryoSPARC:
Are the SSDs healthy (not degraded as may be expected after writing large amounts of data)?
Were they in use by other applications when you ran the ab initio job (with caching) mentioned above?
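(On SSD health: one way to check the cache array and its member devices might be along these lines; /dev/md128 is taken from your df output above, and the member device name below is only a placeholder.)
cat /proc/mdstat                 # software-RAID status for all md arrays
sudo mdadm --detail /dev/md128   # state of the cache array
sudo smartctl -H /dev/nvme0n1    # placeholder member device; repeat for each SSD in the array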

cryoSPARC-related:
Are jobs with caching just slow, but would complete if allowed enough time?

In general, jobs with the SSD cache enabled have an overhead of up to 10 minutes when the SSD cache folder is full. But the issue described above (with a more than 1-hour unresponsive wait) showed up about 2 days ago.

Our SSDs are mostly, although not exclusively, used by CryoSPARC, and they are healthy (not many errors when tested with the smartctl tool). That being said, it is quite common in our lab to have 3 or 4 jobs using the SSD running on the same worker.

Though somewhat slower, jobs without caching can complete.

One quick question: could issues with the CryoSPARC database cause the long wait for SSD jobs?

Thank you for your help!

That is worth investigating. On what type of storage is $CRYOSPARC_DB_PATH?

cryosparcm env
df -h $CRYOSPARC_DB_PATH
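(If $CRYOSPARC_DB_PATH is not already set in your shell, the first command should reveal it, for example:)
cryosparcm env | grep CRYOSPARC_DB_PATH   # cryosparcm env prints the instance environment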

What about jobs with caching?

Our CryoSPARC database is located on a RAID 5 partition, and there is enough space.
Filesystem Size Used Avail Use% Mounted on
/dev/md127 5.1T 1.6T 3.5T 32% /

For jobs with the SSD cache on, if they can get past the SSD checking step, they finish without issue.

An update: CryoSPARC jobs with SSD caching suddenly work again; it now takes about 5 minutes to finish all the additional checking steps specific to the SSD cache. There was no restart of the computer or the software. What do you think is causing this hiccup?

Hello all, to hop in on this: over the past month our lab has been experiencing pretty much the exact same issue @sunch has described, i.e. jobs will hang at the SSD checking stage and will often time out. If a job can progress beyond this stage, it will usually run to completion. However, once the problem rears its head, jobs rarely pass that stage. Restarting nodes/CryoSPARC and clearing the cache also does not seem to resolve the issue. We are on the latest version of CryoSPARC and the SSDs are healthy.

We've since attempted to isolate this problem with multiple installations of CryoSPARC on our cluster, testing different variables. We've identified two factors that seem to affect the observed issue.

The first is that any time we connect our original database (55 GB), the issue arises. This is very repeatable: the same install linked to a fresh database runs fine.

The second is that importing any jobs or result groups causes the same problem. Importing jobs will often hang for 12+ hours before eventually failing. During that time and afterwards, SSD caching will not complete.

Unlike in @sunch's case, the problem hasn't resolved itself for us so far, so any pointers would be appreciated. Thanks!

Welcome to the forum @jonathanjih, and thank you for this informative inaugural post.

@sunch @jonathanjih Please can you capture command_core log entries (cryosparcm log command_core) at a time when a job appears stuck due to caching and/or during a job import.
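For example, something along these lines should capture the relevant window (the path below is only illustrative; in a default install command_core.log sits under the master's run/ directory):
tail -n 1000 /path/to/cryosparc_master/run/command_core.log > command_core_stuck.txt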
If, unlike in @sunch's case, the problem persists, you may attempt the following intervention:

  1. ensure no CryoSPARC job is running
  2. remove the instance_${CRYOSPARC_MASTER_HOSTNAME}:$((CRYOSPARC_BASE_PORT+1)) subdirectories inside $CRYOSPARC_SSD_PATH on all workers
  3. remove cache-related database records: inside a mongo shell (cryosparcm mongo), run:
    db.cache_files.deleteMany({})
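
A rough sketch of steps 2 and 3 as shell commands (assuming the CRYOSPARC_* variables are available from your instance configuration; double-check the path before removing anything):
# step 2: on each worker, with no jobs running, remove this instance's cache subdirectory
rm -rf "${CRYOSPARC_SSD_PATH}/instance_${CRYOSPARC_MASTER_HOSTNAME}:$((CRYOSPARC_BASE_PORT+1))"
# step 3: on the master, clear the cache-tracking records
cryosparcm mongo
# then, at the mongo prompt:
#   db.cache_files.deleteMany({})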

Does this help?

Thank you very much for the tips! I have tried to clear the cache folder, but that didn’t help.

I will try the database command if the SSD issue recurs in the future and report back here.