SSD cache issue

I performed all the steps, but unfortunately that did not help at all. The first job I ran with SSD caching after clearing the cache entries in MongoDB failed.

Hi @wtempel,
I guess I am having the same problem.
Resetting SSD cache did not work. See the error below.

Master running v4.2.1, worker running v4.2.1
SSD cache : cache successfully synced in_use
SSD cache : cache successfully synced, found 0.00MB of files on SSD.

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 125, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/data3/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 114, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/data3/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 112, in download_and_return_cache_paths
    compressed_keys = get_compressed_keys(worker_hostname, rel_paths)
  File "/data3/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 285, in get_compressed_keys
    compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
  File "/data3/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/data3/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 165, in make_request
    with urlopen(request, timeout=client._timeout) as response:
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/data3/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
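
For reference, the timeout happens in the worker's cache_request_check call to the master's command_core service. A quick way to check whether command_core is reachable from the worker at all is a small probe like the sketch below; the hostname is a placeholder and 39002 is the default command_core port (base_port + 2), so adjust both for your instance.

# Rough reachability check of the master's command_core service from a worker
# node. The hostname below is a placeholder and 39002 is the default
# command_core port (base_port + 2).
from urllib.request import urlopen

MASTER = "cryosparc-master.example.edu"  # placeholder master hostname
PORT = 39002                             # default command_core port

try:
    with urlopen(f"http://{MASTER}:{PORT}", timeout=10) as resp:
        print("command_core responded with HTTP", resp.status)
except OSError as err:  # socket.timeout and URLError are both OSError subclasses
    print("command_core did not respond:", err)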

Hi, I just want to report a similar problem. I am happy to email any logs you need.


Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 125, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 114, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 112, in download_and_return_cache_paths
    compressed_keys = get_compressed_keys(worker_hostname, rel_paths)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 285, in get_compressed_keys
    compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 165, in make_request
    with urlopen(request, timeout=client._timeout) as response:
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/cryosparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

To add to my comment above: resetting the cache (options 2 and 3) and restarting CryoSPARC had no effect.

Hi all,

Thank you for posting about these issues relating to the caching system. We are looking into them on our end.


Hi @nwong, I just wanted to say that on my instance this caching issue has been mitigated for the last few days. I noticed that my master node had a lot of swapping activity, so I rebooted it. It had been up for over 180 days. Since rebooting, the caching error has not reappeared, nor have I seen the high swapping activity.
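
In case it is useful to anyone else watching for the same symptom, swap usage on the master can be checked with free -h or vmstat, or with a small script along these lines (just a sketch, Linux only, reading /proc/meminfo):

# Sketch: report swap usage on the master node by reading /proc/meminfo.
# The same numbers are available from `free -h` or `vmstat`.
def swap_usage_mb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key.strip()] = int(value.split()[0])  # values are in kB
    used_kb = info["SwapTotal"] - info["SwapFree"]
    return used_kb / 1024, info["SwapTotal"] / 1024

used, total = swap_usage_mb()
print(f"swap used: {used:.0f} MB of {total:.0f} MB")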

@Jason,
Could you send the output of
cryosparcm cli "get_scheduler_targets()"
and specify which worker nodes run into cache issues?

I'm also interested in whether the master node is also being used as a GPU worker node. We currently suspect that some caching issues may be related to master nodes being overloaded and unable to respond to worker cache requests.

@donghuachen,
Similar to my request to Jason, could you send us the output of cryosparcm cli "get_scheduler_targets()"? Is there any indication of the master node being overloaded on your system when the cache is active?

@frozenfas,
Glad to hear it's working for you again. Please let us know if you notice a pattern between master node swapping and cache system failures, and whether restarting resolves the issue if it happens again in the future.


@nwong, all the workers have issues; most prominently affected are Athena, Poseidon, and Hydra. The master (Cerebro) is not being used as a GPU worker node. Here is the output you requested:

cryosparc_user@cerebro:~$ cryosparcm cli "get_scheduler_targets()"
[{‘cache_path’: ‘/scratch’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 4, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 5, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 6, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 7, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘hydra.biosci.utexas.edu’, ‘lane’: ‘Hydra’, ‘monitor_port’: None, ‘name’: ‘hydra.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@hydra.biosci.utexas.edu’, ‘title’: ‘Worker node hydra.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 4, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 5, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 6, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 7, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘athena.biosci.utexas.edu’, ‘lane’: ‘Athena’, ‘monitor_port’: None, ‘name’: ‘athena.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@athena.biosci.utexas.edu’, ‘title’: ‘Worker node athena.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434652672, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘poseidon.biosci.utexas.edu’, ‘lane’: ‘Poseidon’, ‘monitor_port’: None, ‘name’: ‘poseidon.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, ‘ssh_str’: ‘cryosparc_user@poseidon.biosci.utexas.edu’, ‘title’: ‘Worker node poseidon.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/var/home/cryosparc_user/cryosparc_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 8510701568, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 1, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 2, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 3, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}], ‘hostname’: ‘javelina.biosci.utexas.edu’, ‘lane’: ‘Javelina’, ‘monitor_port’: None, ‘name’: ‘javelina.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, ‘ssh_str’: ‘cryosparc_user@javelina.biosci.utexas.edu’, ‘title’: ‘Worker node javelina.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/data1/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw’}, {‘cache_path’: ‘/scratch/cryosparc_cache’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 1, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 2, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 3, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 4, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 5, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 6, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}, {‘id’: 7, ‘mem’: 8513978368, ‘name’: ‘NVIDIA GeForce GTX 1080’}], ‘hostname’: ‘roadrunner.biosci.utexas.edu’, ‘lane’: ‘Roadrunner’, ‘monitor_port’: None, ‘name’: ‘roadrunner.biosci.utexas.edu’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], ‘GPU’: [0, 1, 2, 3, 4, 5, 6, 7], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, ‘ssh_str’: ‘cryosparc_user@roadrunner.biosci.utexas.edu’, ‘title’: ‘Worker node roadrunner.biosci.utexas.edu’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/local/home/cryosparc_user/cryosparc2_worker/bin/cryosparcw’}]

Hi @nwong, actually the caching error repeated last night and this morning. This time there was no swapping, but the only way to resolve the error was to restart the master (the cache error occurred on a separate worker node). Before I restarted, I tried option 2 to reset the cache (Option 2: Fix database inconsistencies) and there were a number of records in an inconsistent state (equal to the number of micrographs). Fixing these did not solve the caching error; only a reboot of the master node helps in my case.

The original error could have been precipitated by overloading the master, as I started two jobs simultaneously that would use the same data. But after the error occurred I tried to be gentle and restarted just one job, and it continuously failed until I restarted the master.

@Jason, @frozenfas,

Thanks for your swift feedback. A new patch has been created which allows retries of timed-out cache-related requests from worker nodes to the master node. This is not expected to be a complete solution to your issues, but it may help us make progress in understanding the problem. Please let us know if the patch at least allows jobs using the cache to run to completion, even if they take significant time.
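
For anyone curious, the general idea is to wrap the timed-out request in a retry loop with backoff, roughly like the sketch below. This only illustrates the idea, not CryoSPARC's actual implementation; request_master stands in for the worker's JSON request helper, and the attempt counts and delays are made up.

# Sketch of retrying a timed-out request with exponential backoff.
# Illustration only; not CryoSPARC's actual patch code.
import time

def call_with_retries(request_master, attempts=3, base_delay=5.0):
    """Call request_master(), retrying timeouts with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return request_master()
        except OSError as err:  # covers socket.timeout and urllib's URLError
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"request failed ({err}); retrying in {delay:.0f}s "
                  f"(attempt {attempt} of {attempts})")
            time.sleep(delay)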


Greetings,

We have installed the latest 4.2.1 patches; we appreciate the work on this caching issue.

We are still having the issue, though. The cache reports as locked at times when it should be available, and we also see timeouts.

Is there any further info we can provide to help solve the issue?

I just want to follow up on @yodamoppet's post. Similarly, we appreciate the work on the caching issue, but the issue remains. As before, the only way to overcome it is to reboot the master node.

Please can you email us the error report of an affected job and the output of cryosparcm snaplogs, collected just after you observe the problem.

Thanks to all who have helped identify the issue.

We have identified some potential fixes and improvements to the cache system and are looking forward to implementing them in a future release. In the meantime, as a workaround, please reset the cache if jobs get stuck while caching or fail to delete cache files properly on their own. It is also recommended to cache smaller datasets (relative to the total SSD cache space) if possible, as we suspect large datasets relative to cache capacity may be a contributing factor.


Do you have any updates on the fix or a possible timeline?

@bsobol Please can you post

  • your CryoSPARC installation release and patch versions
  • the output of the command
    cryosparcm cli "get_scheduler_targets()"
  • error message(s) from specific jobs’ event logs and the command_core log that indicate the specific caching issue in your case

I'm administering a CryoSPARC installation on an HPC cluster. We still have users facing the cache-timeout issue.

  • We're running CryoSPARC 4.2.1-230403.
  • The error is the same as, for example, in Ssd cache issue - #22 by donghuachen, around rc.cli.cache_request_check.
  • What exactly do you need from cryosparcm cli "get_scheduler_targets()"? It's quite a long output. We have 'cache_quota_mb': 10000000, 'cache_reserve_mb': 10000 by default.

The following details would be relevant:

  1. the type of the CryoSPARC instance(s): single workstation, connected worker(s), or connected cluster(s)?
  2. is the cache storage shared between workers?
  3. do any two or more targets in the output have the same cache_path value? (One way to check this is sketched below.)
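
For example, a short cryosparc-tools script along these lines could group targets by cache_path and flag paths used by more than one target. This is only a sketch; the connection parameters below are placeholders and must be adjusted for your instance.

# Sketch: group scheduler targets by cache_path to spot shared cache locations.
# Connection parameters are placeholders; adjust them for your instance.
from collections import defaultdict
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(
    license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    host="cryosparc-master.example.edu",
    base_port=39000,
    email="admin@example.edu",
    password="...",
)

by_path = defaultdict(list)
for target in cs.cli.get_scheduler_targets():
    by_path[target.get("cache_path")].append(target.get("name", "?"))

for path, names in by_path.items():
    shared = "  <- shared by multiple targets" if len(names) > 1 else ""
    print(f"{path}: {', '.join(names)}{shared}")
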
  • It's a cluster (Slurm) instance. Every user/research group has its own instance (all started from the same CryoSPARC install directory) with their own license, running on a cluster node, and jobs are submitted to other nodes.

  • Cache storage is shared between workers; it's Lustre with ZFS and NVMe underneath.

  • If by targets you mean cluster lanes, yes. Each CryoSPARC instance (so every user/research group) has a dedicated cache directory, and all lanes in the instance use this path.

Both the 230621 patch for v4.2.1 and a new release, expected soon, include changes that may mitigate the issue.