Cache-related error

yodamoppet · February 8, 2023, 10:20pm

We’re having a few issues post-upgrade…

First, we had an error:

Unable to create job: ServerError: validation error: lock file for P134 not found at /tank/userlab/user/P134/cs.lock

I was able to correct this by:

cryosparcm cli "take_over_projects('P134')"

But I’m not sure if we’ll have this issue with all projects now. Is there a wildcard for taking over projects?

Further, when we queue a job, it appears to start but eventually times out with:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/class2D/run.py", line 63, in cryosparc_compute.jobs.class2D.run.run_class_2D
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/particles.py", line 114, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/cache.py", line 119, in download_and_return_cache_paths
    compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_tools/cryosparc/command.py", line 165, in make_request
    with urlopen(request, timeout=client._timeout) as response:
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
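
Editor's note: the last frames of the traceback show the worker calling urlopen(request, timeout=client._timeout) and then giving up while waiting for the response status line. The following is a minimal, self-contained sketch of that failure mode using only the Python standard library. It is not CryoSPARC code: SlowHandler, request_with_timeout and demo are hypothetical names, and the local test server merely stands in for a master process that is slow to answer.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that is slow to respond."""

    delay_s = 1.0  # how long the server waits before answering

    def do_GET(self):
        time.sleep(self.delay_s)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client may already have given up and closed the socket

    def log_message(self, *args):
        pass  # keep the demo quiet


def request_with_timeout(url, timeout_s):
    """Fetch url; return the body, or a short failure message on timeout."""
    # An empty ProxyHandler keeps the request local even if proxy
    # environment variables are set.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as response:
            return response.read().decode()
    except (socket.timeout, urllib.error.URLError) as err:
        return f"failed: {err}"


def demo():
    """Issue the same request with a short and a generous client timeout."""
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    short = request_with_timeout(url, timeout_s=0.2)  # shorter than the delay
    generous = request_with_timeout(url, timeout_s=5.0)  # longer than the delay
    server.shutdown()
    server.server_close()
    return short, generous


if __name__ == "__main__":
    print(demo())
```

With the short timeout the request fails even though the server is healthy; with the generous timeout the same request succeeds. This is the reasoning behind trying a larger client timeout when the server is busy rather than down.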

yodamoppet · February 9, 2023, 3:00pm

An update…

This seems to be related to caching somehow. If we run a job without caching, it will run.

I tried the procedure to reset the cache system using mongo commands and manually clearing the cache on each node. But we get the same result.

Caching is important to us, so we’d like to get this working…

wtempel · February 10, 2023, 3:48pm

I created a new (this) topic for these questions.

take_over_project("PX")
and
take_over_projects()
are closely related; the latter may emulate the wildcard behavior you mentioned.
(I suspect the cli command you typed erroneously includes an “s”)

I need to check with the CryoSPARC team about the cache-related error.

yodamoppet · February 10, 2023, 6:08pm

Oddly, the cache seems to have started working again.

This has happened a few times before after updates, with the issue appearing and then self-resolving.

Is there some process that locks the cache for maintenance after an upgrade? Is there some additional logging I can enable so we can diagnose if this reappears?

wtempel · February 17, 2023, 5:22pm

We suspect the caching could be impacted by

many concurrent cache management tasks
an unstable network or
some other timeout

Does a CRYOSPARC_CLIENT_TIMEOUT defined at 600 or higher inside
cryosparc_master/config.sh
and
cryosparc_worker/config.sh
improve the situation?
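
Editor's note: the variable would typically be set with a line such as export CRYOSPARC_CLIENT_TIMEOUT=600 in each config.sh (the export form is an assumption based on how other variables in those files are usually written). The sketch below is one hedged way to confirm that both files define a value of at least 600. The helper names client_timeout and check_timeouts are hypothetical, and the paths you pass in depend on your installation.

```python
import re
from pathlib import Path

# Matches lines such as:  export CRYOSPARC_CLIENT_TIMEOUT=600
# Optional quotes around the number and a trailing comment are tolerated.
_EXPORT_RE = re.compile(
    r"""^\s*export\s+CRYOSPARC_CLIENT_TIMEOUT=["']?(\d+)["']?\s*(?:#.*)?$"""
)


def client_timeout(config_path):
    """Return the last CRYOSPARC_CLIENT_TIMEOUT value in a config.sh, or None."""
    value = None
    for line in Path(config_path).read_text().splitlines():
        match = _EXPORT_RE.match(line)
        if match:
            value = int(match.group(1))  # a later definition overrides an earlier one
    return value


def check_timeouts(config_paths, minimum=600):
    """Map each path to True if it defines a timeout of at least `minimum`."""
    return {
        str(path): (client_timeout(path) or 0) >= minimum
        for path in config_paths
    }


if __name__ == "__main__":
    import sys

    # Usage (paths are installation-specific):
    #   python check_timeout.py cryosparc_master/config.sh cryosparc_worker/config.sh
    for path, ok in check_timeouts(sys.argv[1:]).items():
        print(f"{path}: {'ok' if ok else 'missing or below 600'}")
```

The script only reads the files; it does not tell you whether a running instance has picked up the change.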

yodamoppet · February 21, 2023, 4:35pm

I will try adding this parameter to both config.sh files. Does it require a restart of CryoSPARC?

As for the other suggestions:

It is possible that there are concurrent cache processes. How can I view the performance impact of these?
The network is stable for other tasks and is high-speed infiniband. I do not believe it is network related.

Thank you for the assistance.

wtempel · February 21, 2023, 6:19pm

Modifications to cryosparc_master/config.sh need to be activated by a CryoSPARC restart.

wtempel · February 21, 2023, 6:26pm

Please can you email us a compressed copy of
cryosparc_master/run/command_core.log, which may include some clues about why this happens.

wtempel · June 9, 2023, 1:11pm

@yodamoppet Are you still experiencing this issue? If you are, please can you update this CryoSPARC instance to v4.2.1 and install patch 230427 and see if the issue is resolved in the patched instance?

yodamoppet · June 9, 2023, 1:23pm

Since installing the latest patches, we are no longer experiencing the issue. CryoSPARC has been running very smoothly with patch 230427 in place.

We greatly appreciate your efforts to solve the problem, and we’ll report back if any new issues arise. Thank you.