Cryosparc job fails after a long SSD cache check

sunch · July 5, 2021, 11:12pm

We have a cluster installation of the latest cryosparc (v3.2.0+210601) with 4 workers connected through InfiniBand.
For some reason, cryosparc jobs that use the SSD cache fail after a long SSD cache check.

Below is the output of a recent heterogeneous refinement job.

Basically, after cryosparc found 1.2 TB cache, the job stalled for 15 minutes before stopping with a read timeout error. Any idea why cryosparc stalled?

The cryosparcm joblog output is pasted below. Any suggestions or advice are welcome.

================= CRYOSPARCW =======  2021-07-05 15:24:43.320515  =========
Project P180 Job J224
Master henry4.ohsu.edu Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 309070
MAIN PID 309070
hetero_refine.run cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job  J224  of type  hetero_refine
Running job on hostname %s henry5
Allocated Resources :  {'fixed': {'SSD': True}, 'hostname': 'henry5', 'lane': 'henry5_2', 'lane_type': 'henry5_2', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1]}, 'target': {'cache_path': '/henry5/scratch2/cryosparc/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'henry5', 'lane': 'henry5_2', 'monitor_port': None, 'name': 'henry5', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': 'cryosparc@henry5', 'title': 'Worker node henry5', 'type': 'node', 'worker_bin_path': '/home/cryosparc/software/cryosparc/cryosparc2_worker/bin/cryosparcw'}}
*** client.py: command (henry4.ohsu.edu:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (henry4.ohsu.edu:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://henry4.ohsu.edu:39002/api) did not reply within timeout of 300 seconds, attempt 3 of 3
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

stephan · July 7, 2021, 4:13pm

Hi @sunch,

Could you copy and paste the full traceback from the heterogeneous refinement job?
Also, this looks like a genuine timeout error, since the cache is trying to organize about 4.5 million particles. Could you edit line ~30 inside the file cryosparc_master/cryosparc_compute/client.py that looks like this:

def __init__(self, host="localhost", port=39002, url="/api", timeout=300):
to instead have a timeout of 900? The line should look like this after your edit:
def __init__(self, host="localhost", port=39002, url="/api", timeout=900):
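
For context, the job log above shows three attempts at 300 seconds each, which adds up to the 15-minute stall reported in the first post. The snippet below is only a rough sketch of that retry pattern, not the actual contents of client.py; `CommandTimeout`, `call_with_retries`, and `worst_case_wait` are hypothetical names chosen for illustration.

```python
class CommandTimeout(Exception):
    """One attempt did not reply in time (hypothetical exception name)."""


def call_with_retries(send, timeout=300, attempts=3):
    """Call send(timeout) up to `attempts` times.

    `send` stands in for a single request to the master. If every
    attempt times out, the last CommandTimeout is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send(timeout)
        except CommandTimeout:
            print(f"*** did not reply within timeout of {timeout} seconds, "
                  f"attempt {attempt} of {attempts}")
            if attempt == attempts:
                raise


def worst_case_wait(timeout=300, attempts=3):
    """Upper bound, in seconds, on waiting before the call gives up."""
    return timeout * attempts
```

With the defaults, `worst_case_wait()` is 900 seconds (15 minutes). Raising the timeout to 900 with the same three attempts would allow up to 2700 seconds (45 minutes) before the job is marked as failed.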

Also, I’d recommend using the “Cache Particles on SSD” job to pre-cache your particles (make sure the job is queued to the same lane as the subsequent Heterogeneous Refinement job) before the Heterogeneous Refinement job runs.

sunch · July 8, 2021, 11:49pm

Thank you Stephan for the quick response! The trick to edit the timeout limit is useful and has been incorporated into our cryosparc instance. It is quite possible that this was a genuine timeout error because our file system went through some rebuilding around the time of the failed job.

I cleared that job afterwards and haven’t had the chance to test it on that specific worker, so I cannot verify whether the issue is resolved or provide the full traceback here. I will post an update later.

BTW, could you briefly describe when cryosparc locks and releases the data files during the initial data copy?

sunch · July 13, 2021, 5:05pm

Hi stephan, I think I found the issue. The timeout was probably caused by a busy master rather than heavy I/O on the worker. This hypothesis is based on another occurrence of the same timeout error: an NU-refinement job on a worker lane threw the timeout error when the master lane was running at full load, but worked fine when repeated later with the master lane cleared out.

stephan · August 25, 2021, 9:28pm

Hi @sunch,

Thank you for the update. In that case, we’ll ensure the timeout is configurable via an environment variable.
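
As a rough idea of what an environment-variable override could look like (a sketch only; the variable name `EXAMPLE_CLIENT_TIMEOUT` and the helper `read_timeout` are placeholders, not names from cryosparc):

```python
import os


def read_timeout(env=None, name="EXAMPLE_CLIENT_TIMEOUT", default=300):
    """Return the timeout in seconds from the environment.

    `name` is a placeholder variable name. Missing, non-numeric, or
    non-positive values fall back to `default`.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

Falling back to the default on bad input keeps a typo in the environment from turning into a zero or negative timeout.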

Jobs that process particles lock files at the beginning of the job, before they attempt to copy them to the cache. The files stay locked during this copy and are released once it is finished. This ensures no two jobs try to copy the same files to the cache at the same time. If another job runs that processes the same set of particles, it will wait for the lock to be released, then ensure that all the particle files it needs are available on the cache before continuing to process. Multiple jobs can access particle files that are on the cache at the same time during processing, just not during copying.
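
The lock-then-copy behaviour described above can be illustrated with a toy example. This is a simplified sketch, not cryosparc's implementation: it uses an in-process `threading.Lock` as a stand-in for whatever cross-job locking the real cache uses, and `ensure_cached` is a hypothetical helper name.

```python
import shutil
import threading
from pathlib import Path

# Stand-in for the real cache lock (assumption: one lock for the whole cache).
cache_lock = threading.Lock()


def ensure_cached(sources, cache_dir, copied):
    """Make sure every source file exists in cache_dir.

    The lock is held only during the check-and-copy phase. Names of
    files actually copied are appended to `copied` so callers can see
    which job did the work.
    """
    cache_dir = Path(cache_dir)
    with cache_lock:
        for src in sources:
            dst = cache_dir / Path(src).name
            if not dst.exists():  # an earlier job may have copied it already
                shutil.copy(src, dst)
                copied.append(dst.name)
    # Lock released here: cached files can now be read concurrently.
    return [cache_dir / Path(src).name for src in sources]
```

If two jobs call `ensure_cached` with the same file list, the second one blocks until the first finishes copying, then finds every file already present and copies nothing.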