SSD cache cleanup error?

ozej8y · May 24, 2022, 6:35am

Hi CryoSPARC,

I have CryoSPARC configured to submit worker jobs through SLURM.
Under version 3.3.2, I have seen the following error in a “Refinement New” job:

[CPU: 382.2 MB]  Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 125, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/mnt/nvme/cryosparc-uow/cryosparc_worker/cryosparc_compute/particles.py", line 88, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/mnt/nvme/cryosparc-uow/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 148, in download_and_return_cache_paths
    free_mb, used_mb, need_mb, other_instance_ids)
  File "/mnt/nvme/cryosparc-uow/cryosparc_worker/cryosparc_compute/client.py", line 65, in func
    assert 'error' not in res, f"Encountered error for method \"{key}\" with params {params}:\n{res['error']['message'] if 'message' in res['error'] else res['error']}"
AssertionError: Encountered error for method "cache_get_files_to_delete" with params ('MASSIVE', 'P70', 'J104', -9297.952941894531, 3009297.9529418945, 0, []):
ServerError: Traceback (most recent call last):
  File "/mnt/nvme/cryosparc-uow/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/mnt/nvme/cryosparc-uow/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2858, in cache_get_files_to_delete
    clear_mb += hit['size_mb']
KeyError: 'size_mb'

This error suggests to me that CryoSPARC struggled to free files from the SSD cache.
Did I miss something in the configuration, or is it a bug?

Thanks,
Jay.

wtempel · May 30, 2022, 3:56pm

One or more database cache record(s) may be in an inconsistent state.
The problem may be resolved by entering the mongo command environment:
cryosparcm mongo
and executing:
db.cache_files.updateMany({'status': 'hit', 'size_mb': {'$exists': false}}, {$set: {size_mb: 0}})
Please ensure there are no active jobs that use the cache when you execute the command above.
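
To illustrate why a single inconsistent record triggers the KeyError in the traceback, and what the update above changes, here is a simplified in-memory sketch. It is not CryoSPARC's actual code: the field names `status` and `size_mb` come from the traceback and the command above, while the `path` field, the sample records, and both function names are hypothetical.

```python
# Simplified illustration only; this is NOT CryoSPARC's implementation.
# Each dict stands in for one document in the cache_files collection.
# "path" and the sample values are hypothetical.
records = [
    {"path": "a.mrc", "status": "hit", "size_mb": 120.5},
    {"path": "b.mrc", "status": "hit"},   # inconsistent record: no size_mb
    {"path": "c.mrc", "status": "miss"},  # not a hit, so it is ignored
]


def total_hit_size_mb(recs):
    """Sum size_mb over 'hit' records, as the failing line in the traceback does."""
    clear_mb = 0.0
    for hit in recs:
        if hit.get("status") == "hit":
            clear_mb += hit["size_mb"]  # KeyError if the field is missing
    return clear_mb


def backfill_missing_size(recs):
    """In-memory analogue of the updateMany call: set size_mb to 0
    on 'hit' records that lack the field. Returns the number modified."""
    modified = 0
    for rec in recs:
        if rec.get("status") == "hit" and "size_mb" not in rec:
            rec["size_mb"] = 0
            modified += 1
    return modified
```

Calling `total_hit_size_mb(records)` before the backfill raises `KeyError: 'size_mb'`; after `backfill_missing_size(records)` the sum completes, with the repaired record contributing 0 MB.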

ozej8y · June 6, 2022, 5:55am

Thanks @wtempel. I have resolved the issue.
I have tweaked the setup for my SLURM submission.

The SSD cache was not configured to use SLURM-managed storage.
Effectively, different worker jobs managed by SLURM were cleaning up other jobs' cached data.
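
The post does not say exactly what was changed in the submission setup. Purely as an illustration of the idea of job-scoped scratch storage, the sketch below derives a separate directory per SLURM job from the `SLURM_JOB_ID` environment variable that SLURM sets inside a job. The base path `/scratch`, the directory naming, and the function name are assumptions, and nothing here is a CryoSPARC setting.

```python
import os
from pathlib import Path


def job_scoped_cache_dir(base="/scratch", env=None):
    """Return a scratch directory unique to the current SLURM job.

    `base` and the "job_<id>" naming are hypothetical. SLURM_JOB_ID is
    set by SLURM inside a running job; outside a job this returns None.
    """
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return None
    return Path(base) / f"job_{job_id}"
```

Because two concurrent jobs have different job IDs, they resolve to different directories, so neither would delete files the other has cached.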

wtempel · June 6, 2022, 1:22pm

Thank you @ozej8y for this update.

wtempel · August 25, 2022, 7:27pm

A new patch includes some cache-related improvements.