Socket timeout

Dear All,

Recently, CryoSPARC has stopped working for us on several installations/machines. I suspect it may be a network issue, but everything else seems to run fine, so I am not sure why this is happening. I tried downgrading all the way down to v4.0.1, but this didn’t solve the issue. Has anyone experienced something similar? Any feedback will be greatly appreciated! This is the error I get when running a job:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 125, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/csparc/csparc_installation/cryosparc_worker/cryosparc_compute/particles.py", line 114, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/csparc/csparc_installation/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 112, in download_and_return_cache_paths
    compressed_keys = get_compressed_keys(worker_hostname, rel_paths)
  File "/csparc/csparc_installation/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 285, in get_compressed_keys
    compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
  File "csparc/csparc_installation/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/csparc/csparc_installation/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 165, in make_request
    with urlopen(request, timeout=client._timeout) as response:
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/csparc/csparc_installation/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
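
The timeout above is raised while the worker waits for the master’s command API (the make_json_request/urlopen call), not while reading particle data itself. For anyone who wants to reproduce this outside a job, here is a minimal sketch of a reachability probe against that endpoint; the hostname and port are assumptions (on a default install, command_core listens on the base port + 2, i.e. 39002):

# Minimal sketch: check whether the master's command_core endpoint answers
# within a generous timeout. MASTER_HOSTNAME and PORT are assumptions; use
# your own CRYOSPARC_MASTER_HOSTNAME and base port + 2.
import socket
import time
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

MASTER_HOSTNAME = "cryosparc-master"  # assumption: replace with your master host
PORT = 39002                          # assumption: default base port (39000) + 2

url = f"http://{MASTER_HOSTNAME}:{PORT}/api"
start = time.time()
try:
    with urlopen(url, timeout=300) as response:
        print(f"HTTP {response.status} after {time.time() - start:.1f} s")
except HTTPError as e:
    # A non-2xx reply (e.g. 405 for a bare GET) still proves the port answers.
    print(f"Reachable, HTTP {e.code} after {time.time() - start:.1f} s")
except (URLError, socket.timeout) as e:
    print(f"No response after {time.time() - start:.1f} s: {e}")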

Many thanks,
Daniel

Hi Daniel,
Did you observe this issue on a fully patched instance at v4.2.1?
Did you also ensure that all storage is continuously online? See also: Failed project/live session seems to freeze the instance - #8 by daniel.s.d.larsson

Hi,

I tried both patched and unpatched. I also checked the storage by pinging it, and it seems fine: in 120 min of pinging it didn’t drop a single packet and the average response time was 0.1 ms. (The pinging, however, was done internally, so I am not sure; could it be timing out while calling some external files/functions?)

Our problem was that some projects were stored on a distributed FS (BeeGFS) and some of the host nodes were offline. When CryoSPARC tried to access files on those nodes, the I/O request timed out.
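
In that situation a plain os.stat() or ls can hang for a very long time rather than fail cleanly, so one way to spot a stuck mount without blocking your shell is to run the stat in a child process with a deadline. A sketch along those lines (the mount points are assumptions, and a mount wedged in uninterruptible sleep may still ignore the kill):

# Timeout-guarded stat: a hung distributed-FS mount can block os.stat()
# indefinitely, so run it in a child process and enforce a deadline.
# Assumption: PATHS lists directories on the mounts you want to check.
import subprocess
import sys

PATHS = ["/cephfs/projects", "/cephfs2/projects"]  # assumption: your mount points
TIMEOUT_S = 30

for path in PATHS:
    try:
        subprocess.run(
            [sys.executable, "-c", f"import os; os.stat({path!r})"],
            timeout=TIMEOUT_S,
            check=True,
        )
        print(f"{path}: responsive")
    except subprocess.TimeoutExpired:
        # Note: a child stuck in uninterruptible I/O may linger even after this.
        print(f"{path}: stat() still hanging after {TIMEOUT_S} s (mount likely stuck)")
    except subprocess.CalledProcessError as e:
        print(f"{path}: stat() failed (exit {e.returncode})")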

If your project folder happens to be on one of the ceph systems, this error is likely associated with sporadic local NFS accessibility issues that have been plaguing your cluster.

If these errors relate to /cephfs specifically in the last week or so, it’s also likely related to the array still rebuilding in the background following several drive failures. Related to that, its near-full state has probably triggered a switch to synchronous mode, i.e. really slow.

Cheers,
Yang

Hi Yang, this is indeed the case. Have you figured out a solution for this? I guess switching the storage system could help, but the other storage system in place is not backed up and only has limited space.

Thanks,
Daniel

Hi Daniel,

Unfortunately, I don’t have a way around it.

Have you found /cephfs2 to be equally unreliable recently? Issues with /cephfs can be rationalised given the rebuilding. And the recent recommendation to migrate user databases to a dedicated NVMe disk array is thought to lessen the load on the NFS metadata server and improve overall reliability. In theory, /cephfs2 should be entirely usable…

Cheers,
Yang

Hi,

I have all my data on cephfs2 now, but moving it there from cephfs didn’t resolve the issue. If CryoSPARC is calling some external files/functions, I suspect there may have been a firewall update that blocks the traffic; however, it seems that CryoSPARC just runs locally after installation…
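
Even when everything runs on the same cluster, the worker still talks to the master over TCP, so a firewall change could matter. A quick way to rule that out from the worker node is to try opening connections to the master’s ports; the hostname and port list below are assumptions based on the default base port of 39000:

# Quick TCP connect test from the worker to the master's CryoSPARC ports,
# to rule out a firewall change. MASTER_HOSTNAME and the port offsets are
# assumptions for a default install (base port 39000).
import socket

MASTER_HOSTNAME = "cryosparc-master"  # assumption: your master host
BASE_PORT = 39000                     # assumption: default base port
PORTS = [BASE_PORT, BASE_PORT + 2, BASE_PORT + 3, BASE_PORT + 5]

for port in PORTS:
    try:
        with socket.create_connection((MASTER_HOSTNAME, port), timeout=5):
            print(f"{MASTER_HOSTNAME}:{port} accepts connections")
    except OSError as e:
        print(f"{MASTER_HOSTNAME}:{port} unreachable: {e}")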

Daniel

It seems this issue has actually been raised before - Ssd cache issue - #38 by wtempel

In many cases, however, the new patch didn’t resolve the issue; it therefore may have nothing to do with our system.

My impression is that this error can pop up at any point during the job, but only when it’s attempting to read/write data from/to the project folder, suggesting that it’s a storage access issue.

FWIW, to illustrate that the array is far from performant at the moment, it has taken literally days to delete a project folder (<1 TB) residing on /cephfs; the rm process is still chugging.

From a small sample size, cluster jobs accessing /cephfs2 seem to be completing fine this week. Perhaps it’s stochastic.

Cheers,
Yang