Refinements fail after SSD caching

Dear all,

We are running CryoSPARC on an HPC cluster using SLURM. After the version was updated to v3.4, certain errors started to occur when running homogeneous and non-uniform refinements as well as 2D and 3D classifications. All jobs fail during or right after SSD caching. Two major types of errors occur:
1. A CommandClient timeout error (screenshot originally attached; the traceback is posted as text further down).
2. The job just terminates “abnormally” (screenshot originally attached).

Is anyone else having similar issues and what could we possibly do to fix this error?

Best,
Lorenz

@DerLorenz Please can you:

  • double-check the CryoSPARC version
  • post the error messages as text
  • additionally post the error messages from the job log
  • inspect the command_core log for error messages
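
For the last two points, something along these lines can help narrow things down (a sketch only; the paths are placeholders and the log locations are assumptions about a typical install, where job.log sits inside each job directory under the project directory and command_core.log under cryosparc_master/run/ on the master node):

from pathlib import Path

job_log = Path("/path/to/project_dir/J48/job.log")                  # placeholder path
core_log = Path("/path/to/cryosparc_master/run/command_core.log")   # placeholder path

# last 100 lines of the job log
print("".join(job_log.read_text().splitlines(keepends=True)[-100:]))

# command_core entries mentioning the job of interest
for line in core_log.read_text().splitlines():
    if "J48" in line:
        print(line)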

Hi,

Thank you for your reply.

I mixed up the version numbers, sorry about that! The version is 4.3.0, of course.
Error 1:

[CPU: 319.4 MB]
Using random seed of 1846738000

[CPU: 336.7 MB]
Loading a ParticleStack with 160351 items…

[CPU: 304.9 MB]
Traceback (most recent call last):
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 105, in func
with make_json_request(self, “/api”, data=data) as request:
File “/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py”, line 113, in __enter__
return next(self.gen)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 192, in make_request
raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://imp-cryosparc-1.vbc.ac.at:39002/api) Timeout Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “cryosparc_master/cryosparc_compute/run.py”, line 96, in cryosparc_compute.run.main
File “cryosparc_master/cryosparc_compute/jobs/class2D/run.py”, line 64, in cryosparc_compute.jobs.class2D.run.run_class_2D
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py”, line 115, in read_blobs
u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py”, line 111, in download_and_return_cache_paths
rc.cli.cache_sync_in_use(worker_hostname, rc._project_uid, rc._job_uid) # ignore self
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 108, in func
raise CommandClient.Error(
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://imp-cryosparc-1.vbc.ac.at:39002) Did not receive a JSON response from method “cache_sync_in_use” with params (‘CBE’, ‘P297’, ‘J48’)
job-log:

Running job J48 of type class_2D
Running job on hostname %s CBE
Allocated Resources : {‘fixed’: {‘SSD’: True}, ‘hostname’: ‘CBE’, ‘lane’: ‘CBE’, ‘lane_type’: ‘cluster’, ‘license’: True, ‘licenses_acquired’: 1, ‘slots’: {‘CPU’: [0, 1], ‘GPU’: [0], ‘RAM’: [0, 1, 2]}, ‘target’: {‘cache_path’: ‘/scratch-cbe/users/svc_cryosparc/lane_cache_CBE’, ', ‘job_dir_abs’, ‘job_log_path_abs’, ‘ram_gb’], ‘type’: ‘cluster’, ‘worker_bin_path’: ‘~/cryosparc_worker/bin/cryosparcw’}}
*** CommandClient: command (http://imp-cryosparc-1.vbc.ac.at:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** CommandClient: command (http://imp-cryosparc-1.vbc.ac.at:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
**** handle exception rc
Traceback (most recent call last):
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 105, in func
with make_json_request(self, “/api”, data=data) as request:
File “/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py”, line 113, in __enter__
return next(self.gen)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 192, in make_request
raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://imp-cryosparc-1.vbc.ac.at:39002/api) Timeout Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “cryosparc_master/cryosparc_compute/run.py”, line 96, in cryosparc_compute.run.main
File “cryosparc_master/cryosparc_compute/jobs/class2D/run.py”, line 64, in cryosparc_compute.jobs.class2D.run.run_class_2D
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py”, line 115, in read_blobs
u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py”, line 111, in download_and_return_cache_paths
rc.cli.cache_sync_in_use(worker_hostname, rc._project_uid, rc._job_uid) # ignore self
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 108, in func
raise CommandClient.Error(
cryosparc_tools.cryosparc.command.Error: *** CommandClient: (http://imp-cryosparc-1.vbc.ac.at:39002) Did not receive a JSON response from method “cache_sync_in_use” with params (‘CBE’, ‘P297’, ‘J48’)
set status to failed
========= main process now complete at 2023-08-28 10:04:23.777305.
========= monitor process now complete at 2023-08-28 10:04:23.785555.

Error 2 (job terminated abnormally):

[CPU: 301.8 MB]
SSD cache : cache successfully synced in_use

[CPU: 173.1 MB]
====== Job process terminated abnormally.

log:

Allocated Resources : {‘fixed’: {‘SSD’: True}, ‘hostname’: ‘CBE’, ‘lane’: ‘CBE’, ‘lane_type’: ‘cluster’, ‘license’: True, ‘licenses_acquired’: 1, ‘slots’: {‘CPU’: [0, 1], ‘GPU’: [0], ‘RAM’: [0, 1, 2]}, ‘target’: {‘cache_path’: ‘/scratch-cbe/users/svc_cryosparc/lane_cache_CBE’,
*** CommandClient: command (http://imp-cryosparc-1.vbc.ac.at:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** CommandClient: command (http://imp-cryosparc-1.vbc.ac.at:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
**** handle exception rc
Process Process-1:
Traceback (most recent call last):
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 105, in func
with make_json_request(self, “/api”, data=data) as request:
File “/users/svc_cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py”, line 113, in __enter__
return next(self.gen)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 192, in make_request
raise CommandClient.Error(client, error_reason, url=url)
cryosparc_tools.cryosparc.command.CommandClient.Error: *** CommandClient: (http://imp-cryosparc-1.vbc.ac.at:39002/api) Timeout Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “cryosparc_master/cryosparc_compute/run.py”, line 96, in cryosparc_compute.run.main
File “cryosparc_master/cryosparc_compute/jobs/class2D/run.py”, line 64, in cryosparc_compute.jobs.class2D.run.run_class_2D
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/particles.py”, line 115, in read_blobs
u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py”, line 114, in download_and_return_cache_paths
used_mb = sync_hits(worker_hostname, ssd_cache_path, instance_id)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py”, line 177, in sync_hits
rc.cli.cache_sync_hits(worker_hostname, com.compress_paths(keys), sizes_mb, rc._project_uid, rc._job_uid)
File “/users/svc_cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py”, line 108, in func

The command_core log and other system information will be attached by our IT administrator later today.
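
In the meantime, here is a minimal sketch (using cryosparc-tools; host and port are taken from the error message, the license ID is a placeholder, and the exact constructor arguments may differ between tools versions) that we could run from a Python shell to time the failing call directly and see whether it alone exceeds the 300-second client timeout. It is the same bookkeeping RPC the worker issues from cache.py:

import time
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(
    license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder license ID
    host="imp-cryosparc-1.vbc.ac.at",
    base_port=39000,
)

t0 = time.time()
# the same call the worker makes before it times out
cs.cli.cache_sync_in_use("CBE", "P297", "J48")
print(f"cache_sync_in_use returned after {time.time() - t0:.1f} s")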

J49-related logs from command_core.log:

2023-08-28 09:35:46,357 dump_job_database    INFO     | Request to export P297 J49
2023-08-28 09:35:46,361 dump_job_database    INFO     |    Exporting job to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49
2023-08-28 09:35:46,364 dump_job_database    INFO     |    Exporting all of job's images in the database to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49/gridfs_data...
2023-08-28 09:35:46,739 dump_job_database    INFO     | Exported P297 J49 in 0.38s
2023-08-28 09:35:46,860 scheduler_run_core   INFO     | Jobs Queued: [('P297', 'J49'), ('P31', 'J3')]
2023-08-28 09:35:46,862 scheduler_run_core   INFO     | Now trying to schedule J49
2023-08-28 09:35:47,645 set_job_status       INFO     | Status changed for P297.J49 from queued to launched
2023-08-28 09:35:47,646 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:35:47,705 run_job              INFO     |       Running P297 J49
2023-08-28 09:35:47,737 run_job              INFO     |         cmd: ssh cbe.vbc.ac.at sbatch --wckey=cryosparc /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49/queue_sub_script.sh 
2023-08-28 09:36:01,272 set_job_status       INFO     | Status changed for P297.J49 from launched to started
2023-08-28 09:36:01,274 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:36:02,254 set_job_status       INFO     | Status changed for P297.J49 from started to running
2023-08-28 09:36:02,256 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:36:06,654 cache_sync_in_use    INFO     | Started cache sync in-use for CBE P297 J49
2023-08-28 09:36:08,148 cache_sync_in_use    INFO     | Finished cache sync in-use for CBE P297 J49
2023-08-28 09:37:43,878 cache_sync_hits      INFO     | Started cache sync hits for CBE P297 J49 (given 919744 keys)
2023-08-28 09:38:25,420 cache_sync_hits      INFO     | 919744 bulk updates by CBE P297 J49
2023-08-28 09:40:37,342 cache_sync_hits      INFO     | Bulk cache query complete by CBE P297 J49; modified 919744/919744 matches
2023-08-28 09:40:37,342 cache_sync_hits      INFO     | Finished cache sync hits for CBE P297 J49 (given 919744 keys)
2023-08-28 09:40:38,016 cache_request_check  INFO     | Started cache request check for P297 J49 on hostname CBE (given 15842 paths)
2023-08-28 09:45:38,110 cache_request_check  INFO     | Started cache request check for P297 J49 on hostname CBE (given 15842 paths)
2023-08-28 09:46:34,497 cache_request_check  INFO     | 15842 bulk updates by CBE P297 J49
2023-08-28 09:46:36,814 cache_request_check  INFO     | Bulk cache query complete by CBE P297 J49; modified 15842/15842 matches
2023-08-28 09:46:36,814 cache_request_check  INFO     | Finished cache request check for P297 J49 on hostname CBE (given 15842 paths)
2023-08-28 09:50:38,222 set_job_status       INFO     | Status changed for P297.J49 from running to failed
2023-08-28 09:50:38,227 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:51:27,706 dump_job_database    INFO     | Request to export P297 J49
2023-08-28 09:51:27,711 dump_job_database    INFO     |    Exporting job to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49
2023-08-28 09:51:27,713 dump_job_database    INFO     |    Exporting all of job's images in the database to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49/gridfs_data...
2023-08-28 09:51:27,733 dump_job_database    INFO     | Exported P297 J49 in 0.03s
2023-08-28 09:51:43,651 dump_job_database    INFO     | Request to export P297 J49
2023-08-28 09:51:43,657 dump_job_database    INFO     |    Exporting job to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49
2023-08-28 09:51:43,659 dump_job_database    INFO     |    Exporting all of job's images in the database to /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49/gridfs_data...
2023-08-28 09:51:43,675 dump_job_database    INFO     | Exported P297 J49 in 0.02s
2023-08-28 09:52:06,874 scheduler_run_core   INFO     | Jobs Queued: [('P297', 'J49'), ('P31', 'J3')]
2023-08-28 09:52:06,877 scheduler_run_core   INFO     | Now trying to schedule J49
2023-08-28 09:52:07,827 set_job_status       INFO     | Status changed for P297.J49 from queued to launched
2023-08-28 09:52:07,828 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:52:07,877 run_job              INFO     |       Running P297 J49
2023-08-28 09:52:07,914 run_job              INFO     |         cmd: ssh cbe.vbc.ac.at sbatch --wckey=cryosparc /groups/haselbach/cryosparc_projects_folder/active_instance/imports/230808_26s_atpys_pithd1/J49/queue_sub_script.sh 
2023-08-28 09:52:23,930 cache_request_check  INFO     | Finished cache request check for P297 J49 on hostname CBE (given 15842 paths)
2023-08-28 09:54:19,363 set_job_status       INFO     | Status changed for P297.J49 from launched to started
2023-08-28 09:54:19,365 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:54:21,494 set_job_status       INFO     | Status changed for P297.J49 from started to running
2023-08-28 09:54:21,496 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 09:54:24,219 cache_sync_in_use    INFO     | Started cache sync in-use for CBE P297 J49
2023-08-28 09:59:24,321 cache_sync_in_use    INFO     | Started cache sync in-use for CBE P297 J49
2023-08-28 10:04:24,383 set_job_status       INFO     | Status changed for P297.J49 from running to failed
2023-08-28 10:04:24,387 app_stats_refresh    INFO     | Calling app stats refresh url http://imp-cryosparc-1.vbc.ac.at:39000/api/actions/stats/refresh_job for project_uid P297, workspace_uid None, job_uid J49 with body {'projectUid': 'P297', 'jobUid': 'J49'}
2023-08-28 10:12:38,882 cache_sync_in_use    INFO     | Finished cache sync in-use for CBE P297 J49
2023-08-28 10:28:39,847 cache_sync_in_use    INFO     | Finished cache sync in-use for CBE P297 J49

I can’t see any errors/stack traces in the logs relating to this project (there are a bunch of other, unrelated traces where project/workspace export fails for other users because their storage quota is reached, but I can’t imagine a relation to this error).

There is another instance of what I believe to be the same error in another project/job:

It is followed by a very large JSON payload, and my assumption was that processing this large payload is what causes the timeout.
(I can provide full dumps of the JSON payload and the system logs if needed.)
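
As a rough back-of-envelope check of that assumption (the key length and payload structure here are made up; the 919744 key count is taken from the cache_sync_hits entries in command_core.log above):

import json

n_keys = 919_744                                            # from the cache_sync_hits log lines above
key = "S1/motioncorrected/" + "f" * 70 + "_particles.mrc"   # ~100-character relative path, made up
payload = {"keys": [key] * n_keys, "sizes_mb": [128.0] * n_keys}
print(f"~{len(json.dumps(payload)) / 1e6:.0f} MB of JSON to serialize, send and parse")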

Note: using the job report button in the UI regularly brings the cryosparc_app node process into an unresponsive state; the user interface no longer works until a “cryosparcm restart app”.

It seems to me that the DB queries are taking a long time (I’m not too familiar with MongoDB) and are at least contributing to the timeout. From database.log around 2023-08-28 10:04, the time of the originally posted error:

grep cache_files database.log

2023-08-28T10:01:30.795+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6253, $clusterTime: { clusterTime: Timestamp(1693209676, 1540), signature: { hash: BinData(0, 4262F521BEDC8E6E28D1746AA026CA184DEE1F10), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300013, w: 300013 } }, Database: { acquireCount: { w: 300013 } }, Collection: { acquireCount: { w: 200013 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13370ms
2023-08-28T10:01:45.136+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6254, $clusterTime: { clusterTime: Timestamp(1693209690, 5651), signature: { hash: BinData(0, 25753227525096010A74419DCA11BFF3EE998F7B), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300018, w: 300018 } }, Database: { acquireCount: { w: 300018 } }, Collection: { acquireCount: { w: 200018 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13181ms
2023-08-28T10:01:59.315+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6255, $clusterTime: { clusterTime: Timestamp(1693209705, 925), signature: { hash: BinData(0, C4514395E28C82C6EFE054B54E9E75EBAF60C66C), keyId: 7231586738655723533 } }, $db: "meteor
" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300035, w: 300035 } }, Database: { acquireCount: { w: 300035 } }, Collection: { acquireCount: { w: 200035 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13109ms
2023-08-28T10:02:13.714+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6256, $clusterTime: { clusterTime: Timestamp(1693209719, 2610), signature: { hash: BinData(0, 3B99C2E5B0DA91970C3E6D6C733B6220AB9FE1D4), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300022, w: 300022 } }, Database: { acquireCount: { w: 300022 } }, Collection: { acquireCount: { w: 200022 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13281ms
2023-08-28T10:02:27.961+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6257, $clusterTime: { clusterTime: Timestamp(1693209733, 5711), signature: { hash: BinData(0, EE88DA0092CE4E99815AD389FC81F424E6E16468), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300010, w: 300010 } }, Database: { acquireCount: { w: 300010 } }, Collection: { acquireCount: { w: 200010 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13160ms
2023-08-28T10:02:42.469+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6258, $clusterTime: { clusterTime: Timestamp(1693209747, 7134), signature: { hash: BinData(0, D1EACFB4CA85CB1DFE52D881B39063CBF449EF10), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300037, w: 300037 } }, Database: { acquireCount: { w: 300037 } }, Collection: { acquireCount: { w: 200037 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13481ms
2023-08-28T10:02:57.053+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6259, $clusterTime: { clusterTime: Timestamp(1693209762, 3759), signature: { hash: BinData(0, 6B783AE91FD247849CCFDD237B3F87007B7B6DF7), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300021, w: 300021 } }, Database: { acquireCount: { w: 300021 } }, Collection: { acquireCount: { w: 200021 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13553ms
2023-08-28T10:03:11.569+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6260, $clusterTime: { clusterTime: Timestamp(1693209777, 440), signature: { hash: BinData(0, 2844C48BD4937CB7AE0611BFBF46B2BFB6DD5BC9), keyId: 7231586738655723533 } }, $db: "meteor
" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300016, w: 300016 } }, Database: { acquireCount: { w: 300016 } }, Collection: { acquireCount: { w: 200016 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13412ms
2023-08-28T10:03:14.456+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6261, $clusterTime: { clusterTime: Timestamp(1693209791, 4753), signature: { hash: BinData(0, BF3981340E1853BF34429C0A53971D09B08F9E5A), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 59243, w: 59243 } }, Database: { acquireCount: { w: 59243 } }, Collection: { acquireCount: { w: 39499 } }, oplog: { acquireCount: { w: 19744 } } } protocol:op_msg 2578ms
2023-08-28T10:03:52.537+0200 I WRITE    [conn4246] update meteor.cache_files command: { q: { hostname: "CBE", status: { $in: [ "hit", "miss" ] } }, u: { $set: { status: "miss" } }, multi: true, upsert: false } planSummary: IXSCAN { hostname: 1, status: 1 } keysExamined:2036582 docsExamined:2036581 nMatched:1132679 nModified:903902 keysInserted:1807804 keysDeleted:2711706 numY
ields:15972 locks:{ Global: { acquireCount: { r: 919875, w: 919875 } }, Database: { acquireCount: { w: 919875 } }, Collection: { acquireCount: { w: 15973 } }, oplog: { acquireCount: { w: 903902 } } } 36328ms
2023-08-28T10:03:52.538+0200 I COMMAND  [conn4246] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("202c8933-e527-4e47-a63f-e0a9bff593d9") }, $clusterTime: { clusterTime: Timestamp(1693209795, 20), signature: { hash: BinData(0, A2C7D4BB8B4281C22062737BD1717D746178136A), keyId: 7231586738655723533 } }, $db: "meteor", $readPreference
: { mode: "primary" } } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 919875, w: 919875 } }, Database: { acquireCount: { w: 919875 } }, Collection: { acquireCount: { w: 15973 } }, oplog: { acquireCount: { w: 903902 } } } protocol:op_msg 36328ms
2023-08-28T10:04:21.334+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6314, $clusterTime: { clusterTime: Timestamp(1693209848, 18), signature: { hash: BinData(0, B89EBFEAFC1AD8DF8AC72989985770C2605DC5B4), keyId: 7231586738655723533 } }, $db: "meteor"
 } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 268359, w: 268359 } }, Database: { acquireCount: { w: 268359 } }, Collection: { acquireCount: { w: 184201 } }, oplog: { acquireCount: { w: 84158 } } } protocol:op_msg 11485ms
2023-08-28T10:04:35.750+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6315, $clusterTime: { clusterTime: Timestamp(1693209861, 2625), signature: { hash: BinData(0, FE0C4E507FB29DFE00FDF1D3F8224C58545696B1), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300046, w: 300046 } }, Database: { acquireCount: { w: 300046 } }, Collection: { acquireCount: { w: 200046 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13325ms
2023-08-28T10:04:50.051+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6316, $clusterTime: { clusterTime: Timestamp(1693209875, 5744), signature: { hash: BinData(0, C90FFC27F67189ACF0A20C3E38A12E0998A5A875), keyId: 7231586738655723533 } }, $db: "meteo
r" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300035, w: 300035 } }, Database: { acquireCount: { w: 300035 } }, Collection: { acquireCount: { w: 200035 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13025ms
2023-08-28T10:05:04.029+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6317, $clusterTime: { clusterTime: Timestamp(1693209890, 392), signature: { hash: BinData(0, 826A123AECDA8528D39C9E3526648A1194CCE0D6), keyId: 7231586738655723533 } }, $db: "meteor
" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300032, w: 300032 } }, Database: { acquireCount: { w: 300032 } }, Collection: { acquireCount: { w: 200032 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 12921ms
2023-08-28T10:05:18.348+0200 I COMMAND  [conn3924] command meteor.$cmd command: update { update: "cache_files", ordered: true, lsid: { id: UUID("17a626e6-0731-4c7e-817f-7d4fe9bb2ab2") }, txnNumber: 6318, $clusterTime: { clusterTime: Timestamp(1693209904, 205), signature: { hash: BinData(0, C8778B548BFAF5182EDB26C41F65A2FE28BDB306), keyId: 7231586738655723533 } }, $db: "meteor
" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 300037, w: 300037 } }, Database: { acquireCount: { w: 300037 } }, Collection: { acquireCount: { w: 200037 } }, oplog: { acquireCount: { w: 100000 } } } protocol:op_msg 13201ms

We’ve recently consolidated multiple cluster lanes (each with an SSD cache path) into one, so the cache_files collection might be quite bloated. “CBE” is the new consolidated lane:

meteor:PRIMARY> db.cache_files.count()
4155512

meteor:PRIMARY> db.cache_files.find({ hostname: { $ne: "CBE" } }).count()
2904600
meteor:PRIMARY> db.cache_files.find({ hostname: { $eq: "CBE" } }).count()
1250912
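
A small pymongo sketch of the same check, broken down per hostname (connection details are assumptions for our setup; the database normally listens on the master’s base port + 1, i.e. 39001 here):

from pymongo import MongoClient

db = MongoClient("imp-cryosparc-1.vbc.ac.at", 39001)["meteor"]   # assumed DB host/port

# count cache_files documents per hostname to see which retired lanes still hold entries
pipeline = [{"$group": {"_id": "$hostname", "n": {"$sum": 1}}}, {"$sort": {"n": -1}}]
for doc in db.cache_files.aggregate(pipeline):
    print(f"{doc['_id']}: {doc['n']}")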

I am currently facing the same error and was wondering if you eventually solved the problem from half a year ago?

I’m not sure if we found the actual root cause of this, but we attributed it to an inflated cache_files collection in MongoDB. These were artefacts remaining after we consolidated our cluster lanes into a single lane.
I believe we fixed it by stopping all CryoSPARC jobs, deleting the cache_files entries from the DB, removing all files from the cache directory, and restarting CryoSPARC. The issue has not reappeared since.
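
Roughly, the cleanup amounted to the following (a sketch rather than a verbatim record; the DB connection details and the cache directory layout are assumptions for our setup, and we ran it only with all jobs stopped, followed by a cryosparcm restart):

import shutil
from pathlib import Path
from pymongo import MongoClient

# 1) drop the cache bookkeeping from the database (DB port assumed to be base port + 1)
cache_files = MongoClient("imp-cryosparc-1.vbc.ac.at", 39001)["meteor"]["cache_files"]
result = cache_files.delete_many({})
print(f"deleted {result.deleted_count} cache_files documents")

# 2) wipe the cached particle data on the scratch SSD (cache_path taken from the job log above)
cache_root = Path("/scratch-cbe/users/svc_cryosparc/lane_cache_CBE")
for entry in cache_root.iterdir():
    shutil.rmtree(entry) if entry.is_dir() else entry.unlink()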

Currently (on v4.4.1), we’re facing a different issue, as discussed here: Particles cache issues - #10 by ebirn