SSD cache issue

To add to my comment above, resetting the cache (options 2 and 3) and restarting CryoSPARC had no effect.

Hi all,

Thank you for posting about these issues relating to the caching system. We are looking into them on our end.

Hi @nwong, I just wanted to say that on my instance this caching issue has been mitigated for the last few days. I noticed that my master node had a lot of swapping activity, so I rebooted it. It had been up for over 180 days. After rebooting, the caching error has not reappeared, nor have I seen the high swapping activity.

@Jason,
Could you send the output of
cryosparcm cli "get_scheduler_targets()"
and specify which worker nodes run into cache issues?

I'm also interested in whether the master node is being used as a GPU worker node. We currently suspect some caching issues may be related to master nodes being overloaded and unable to respond to worker cache requests.
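As a side note for anyone who wants to check this on their own setup: a minimal sketch along the following lines, which is not part of CryoSPARC and only reads standard Linux /proc files, can give a quick read on memory pressure and swap use on the master. The thresholds at the end are arbitrary examples.

```python
#!/usr/bin/env python3
"""Rough check for memory pressure and swap use on a CryoSPARC master node.

Not part of CryoSPARC; it only reads standard Linux /proc files.
"""

def read_meminfo():
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])
    return info

def main():
    mem = read_meminfo()
    swap_used_mb = (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)) // 1024
    avail_mb = mem.get("MemAvailable", 0) // 1024
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])

    print(f"1-min load average: {load1:.2f}")
    print(f"MemAvailable:       {avail_mb} MB")
    print(f"Swap in use:        {swap_used_mb} MB")

    # Illustrative thresholds only; tune them for your own master node.
    if swap_used_mb > 1024 or avail_mb < 2048:
        print("Master looks memory-pressured; slow responses to worker cache requests are more likely.")

if __name__ == "__main__":
    main()
```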

@donghuachen,
Similarly to my response to Jason, could you send us the output of cryosparcm cli "get_scheduler_targets()"? Is there any indication of the master node being overloaded on your system when the cache is active?

@frozenfas,
Glad to hear it's working for you again. Please let us know if there is a pattern between master node swapping and cache system failure, and whether restarting resolves the issue if it happens again in the future.

@nwong all the workers have issues. Most prominently affected are Athena, Poseidon and Hydra. The master (Cerebro) is not being used as a GPU worker node. Here is the output you requested:

cryosparc_user@cerebro:~$ cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: ‘/scratch’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 4, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 5, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 6, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 7, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘hydra.biosci.utexas.edu’, ‘lane’: ‘Hydra’, ‘monitor_port’: None, ‘name’: ‘hydra.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@hydra.biosci.utexas.edu’, ‘title’: ‘Worker node hydra.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 4, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 5, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 6, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 7, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘athena.biosci.utexas.edu’, ‘lane’: ‘Athena’, ‘monitor_port’: None, ‘name’: ‘athena.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@athena.biosci.utexas.edu’, ‘title’: ‘Worker node athena.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘poseidon.biosci.utexas.edu’, ‘lane’: ‘Poseidon’, ‘monitor_port’: None, ‘name’: ‘poseidon.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@poseidon.biosci.utexas.edu’, ‘title’: ‘Worker node poseidon.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 8510701568, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 1, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 2, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 3, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}], ‘hostname’: ‘javelina.biosci.utexas.edu’, ‘lane’: ‘Javelina’, ‘monitor_port’: None, ‘name’: ‘javelina.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, ‘ssh_str’: ‘cryosparc_user@javelina.biosci.utexas.edu’, ‘title’: ‘Worker node javelina.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/data1/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 1, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 2, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 3, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 4, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 5, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 6, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 7, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}], ‘hostname’: ‘roadrunner.biosci.utexas.edu’, ‘lane’: ‘Roadrunner’, ‘monitor_port’: None, ‘name’: ‘roadrunner.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, ‘ssh_str’: ‘cryosparc_user@roadrunner.biosci.utexas.edu’, ‘title’: ‘Worker node roadrunner.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw’}]
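For anyone trying to digest output like the above, a small sketch such as the following can parse the pasted Python repr and summarize the cache settings per worker. The input filename is hypothetical, and the quote normalization is only needed because the forum converts straight quotes to curly ones.

```python
import ast
from pathlib import Path

# Hypothetical file holding the text pasted above. The forum converts straight
# quotes to curly ones, so normalize them before parsing the Python repr.
raw = Path("scheduler_targets.txt").read_text()
for curly, straight in (("\u2018", "'"), ("\u2019", "'"), ("\u201c", '"'), ("\u201d", '"')):
    raw = raw.replace(curly, straight)

targets = ast.literal_eval(raw)  # a list of per-worker dicts

for t in targets:
    print(f"{t['hostname']:35s} lane={t['lane']:12s} "
          f"cache_path={t['cache_path']} "
          f"quota_mb={t['cache_quota_mb']} reserve_mb={t['cache_reserve_mb']}")
```

A summary like this makes the cache_path values easy to compare at a glance; in the output above, Hydra caches to /scratch while the other nodes use /scratch/cryosparc_cache.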

Hi @nwong, actually the caching error recurred last night and this morning. This time there was no swapping, but the only way to resolve the error was to restart the master (the cache error is on a separate worker node). Before I restarted, I tried option 2 to reset the cache (Option 2: Fix database inconsistencies) and there were a number of records in an inconsistent state (equal to the number of micrographs). Fixing these did not solve the caching error… only a reboot of the master node helps in my case.

The original error could have been precipitated by overloading the master, as I started two jobs simultaneously that would use the same data. But after the error occurred I tried to be gentle and restarted just one job, and it continuously failed until I restarted the master.

@Jason, @frozenfas,

Thanks for your swift feedback. A new patch has been created which allows worker nodes to retry cache-related requests to the master node when they time out. This is not expected to be a complete solution to your issues, but it may help us make progress in understanding the problem. Please let us know if the patch at least allows jobs using the cache to run to completion, even if they take significant time.
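To illustrate the general idea only (this is not CryoSPARC's actual implementation, and request_cache_slot below is a made-up placeholder), a retry-with-backoff pattern for a timed-out request looks roughly like this:

```python
import random
import time

class CacheRequestTimeout(Exception):
    """Raised when the master does not answer a cache request in time."""

def request_cache_slot():
    """Placeholder for a worker-to-master cache request; not a real CryoSPARC call."""
    raise CacheRequestTimeout("master busy")

def request_with_retries(max_attempts=5, base_delay=2.0):
    """Retry a timed-out request with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_cache_slot()
        except CacheRequestTimeout:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"cache request timed out (attempt {attempt}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

The backoff spreads out repeated requests so an already overloaded master is not hammered further while it catches up.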

Greetings,

We have installed the latest 4.2.1 patches; we appreciate the work on this caching issue.

We are still having the issue, though. The cache reports as locked at times when it should be available, and we also get timeouts.

Is there any further info we can provide to help solve the issue?

I just want to follow up on @yodamoppet's post. Similarly, we appreciate the work on the caching issue, but the issue remains. As before, the only way to overcome it is to reboot the master node.

Please can you email us the error report of an affected job and the output of
cryosparcm snaplogs, collected just after you observe the problem.

Thanks to all who have helped identify the issue,

We have identified some potential fixes and improvements to the cache system and are looking forward to implementing them in a future release. In the meantime, as a workaround, please reset the cache if jobs get stuck while caching or fail to properly delete cache files on their own. It is also recommended to cache smaller datasets (relative to the total SSD cache space) where possible, as we suspect cache pressure may be related to the issue.
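As a rough way to judge whether a dataset is small relative to the available SSD cache space before enabling caching, a sketch like the one below can help. The paths are examples only, and the default reserve value mirrors the cache_reserve_mb seen earlier in this thread.

```python
import shutil
from pathlib import Path

def dir_size_mb(path):
    """Total size in MB of all regular files under `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1024**2

def fits_in_cache(dataset_dir, cache_path, reserve_mb=10000):
    """True if the dataset fits on the cache device while leaving the reserve free."""
    dataset_mb = dir_size_mb(dataset_dir)
    free_mb = shutil.disk_usage(cache_path).free / 1024**2
    print(f"dataset: {dataset_mb:.0f} MB, cache free: {free_mb:.0f} MB, reserve: {reserve_mb} MB")
    return dataset_mb <= free_mb - reserve_mb

# Example paths; substitute your own project and scratch locations.
if not fits_in_cache("/data/my_project/motioncorrected", "/scratch/cryosparc_cache"):
    print("Consider running this job with SSD caching disabled.")
```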

Do you have any updates on the fix or possible timeline?

@bsobol Please can you post

  • your CryoSPARC installation release and patch versions
  • the output of the command
    cryosparcm cli "get_scheduler_targets()"
  • error message(s) from specific jobs’ event logs and the command_core log that indicate the specific caching issue in your case

I'm administering a cryoSPARC installation on an HPC cluster. We still have users facing the cache-timeout-related issue.

  • We’re running cryoSPARC 4.2.1-230403.
  • The error is the same as, for example, in Ssd cache issue - #22 by donghuachen, around rc.cli.cache_request_check
  • What exactly do you need from cryosparcm cli "get_scheduler_targets()"? It's quite a long output. We have 'cache_quota_mb': 10000000, 'cache_reserve_mb': 10000 by default.

The following details would be relevant:

  1. the type of the CryoSPARC instance(s): single workstation, connected worker(s), or connected cluster(s)?
  2. is the cache storage shared between workers?
  3. do any two or more targets in the output have the same cache_path value?
  • It's a cluster (Slurm) instance. Every user/research group has their own instance (all started from the same CS install dir) with their own license, running on a cluster node, and jobs are submitted to other nodes.

  • Cache storage is shared between workers; it's Lustre with ZFS and NVMe underneath.

  • If by targets you mean cluster lanes, then yes. Each cryoSPARC instance (so every user/research group) has a dedicated directory for cache, and all lanes in that instance use this path.
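In a setup like this, where several instances share one Lustre-backed cache filesystem, it can be useful to see how much each per-instance cache directory actually holds relative to the quota. A minimal sketch, assuming one subdirectory per instance under a shared root (the root path is an example, not a CryoSPARC default):

```python
from pathlib import Path

SHARED_CACHE_ROOT = Path("/lustre/cryosparc_cache")  # example path; adjust to your site
QUOTA_MB = 10_000_000  # the cache_quota_mb value mentioned above

def usage_mb(path: Path) -> float:
    """Total size in MB of all regular files under `path`."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) / 1024**2

for instance_dir in sorted(SHARED_CACHE_ROOT.iterdir()):
    if instance_dir.is_dir():
        used = usage_mb(instance_dir)
        print(f"{instance_dir.name:30s} {used:12.0f} MB used "
              f"({100 * used / QUOTA_MB:.1f}% of quota)")
```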

Both the 230621 patch for v4.2.1 and a new release, expected soon, include changes that may mitigate the issue.

Great! I’ll try the new patch.

I am running 2D classification with CryoSPARC v4.2.1+230621, and I encountered the SSD cache issue. The job logs "SSD cache : cache successfully synced" and never proceeds further. Then the job fails with an error. Any solutions?

Is this question a duplicate of 2D classification failed with SSD cache isssue?