Multiple errors while syncing SSD cache in CryoSPARC v4.2.1

Dear all,

we have been experiencing multiple crashes in different jobs while loading the SSD cache from a NAS system over the network. The job usually fails with an error after about 5 minutes: “socket.timeout: timed out”, raised from http_open (return self.do_open(http.client.HTTPConnection, req)).

I found that if the scratch directory is deleted and the machine is rebooted, only 1 or 2 jobs will run before a similar error pops up again, which is not very convenient. I would appreciate your suggestions.
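For reference, the workaround looks roughly like this (the cache path and the instance_* directory name are specific to our setup and may differ on yours):

# on the worker: clear the CryoSPARC SSD cache directory, then reboot
rm -rf /scratch/cryosparc_cache/instance_*
sudo reboot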

Please find below the log file:

[2023-05-08 10:00:42.47]
License is valid.
[2023-05-08 10:00:42.47]
Launching job on lane RTX-A5500 target michaelscott ...
[2023-05-08 10:00:42.57]
Running job on master node hostname michaelscott
[2023-05-08 10:00:46.26]
[CPU:  195.2 MB  Avail:1019.75 GB]
Job J1087 Started
[2023-05-08 10:00:46.29]
[CPU:  195.4 MB  Avail:1019.75 GB]
Master running v4.2.1, worker running v4.2.1
[2023-05-08 10:00:46.30]
[CPU:  195.4 MB  Avail:1019.75 GB]
Working in directory: /home/cryosparc/working_directory/CS-gamma-turc/J1087
[2023-05-08 10:00:46.31]
[CPU:  195.4 MB  Avail:1019.75 GB]
Running on lane RTX-A5500
[2023-05-08 10:00:46.31]
[CPU:  195.4 MB  Avail:1019.75 GB]
Resources allocated: 
[2023-05-08 10:00:46.31]
[CPU:  195.4 MB  Avail:1019.75 GB]
  Worker:  michaelscott
[2023-05-08 10:00:46.32]
[CPU:  195.4 MB  Avail:1019.75 GB]
  CPU   :  [0, 1, 2, 3]
[2023-05-08 10:00:46.32]
[CPU:  195.4 MB  Avail:1019.75 GB]
  GPU   :  [0]
[2023-05-08 10:00:46.32]
[CPU:  195.4 MB  Avail:1019.75 GB]
  RAM   :  [0, 1, 2]
[2023-05-08 10:00:46.32]
[CPU:  195.4 MB  Avail:1019.75 GB]
  SSD   :  True
[2023-05-08 10:00:46.33]
[CPU:  195.4 MB  Avail:1019.75 GB]
--------------------------------------------------------------
[2023-05-08 10:00:46.33]
[CPU:  195.4 MB  Avail:1019.75 GB]
Importing job module for job type new_local_refine...
[2023-05-08 10:00:48.42]
[CPU:  264.2 MB  Avail:1019.69 GB]
Job ready to run
[2023-05-08 10:00:48.42]
[CPU:  264.2 MB  Avail:1019.69 GB]
***************************************************************
[2023-05-08 10:00:52.37]
[CPU:  736.1 MB  Avail:1019.21 GB]
Using random seed of 1093838622
[2023-05-08 10:00:52.38]
[CPU:  740.7 MB  Avail:1019.20 GB]
Loading a ParticleStack with 45426 items...
[2023-05-08 10:00:55.13]
[CPU:  741.0 MB  Avail:1019.15 GB]
 SSD cache : cache successfully synced in_use
[2023-05-08 10:00:56.14]
[CPU:  741.0 MB  Avail:1019.14 GB]
 SSD cache : cache successfully synced, found 0.00MB of files on SSD.
[2023-05-08 10:05:56.41]
[CPU:  290.5 MB  Avail:1019.65 GB]
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/local_refine/newrun.py", line 123, in cryosparc_compute.jobs.local_refine.newrun.run_local_refine
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/particles.py", line 114, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 112, in download_and_return_cache_paths
    compressed_keys = get_compressed_keys(worker_hostname, rel_paths)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/cache.py", line 285, in get_compressed_keys
    compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/cryosparc/cryosparc_worker/cryosparc_tools/cryosparc/command.py", line 165, in make_request
    with urlopen(request, timeout=client._timeout) as response:
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
Job details: 05m 07s on lane RTX-A5500, 1× NVIDIA RTX A5500, 3 output groups.

What are your actual data transfer speeds from the NAS? How long does it take to move a micrograph from the NAS to the workstation?
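For example, you could time a copy of a single particle stack from the NAS mount to local scratch (the paths below are placeholders for your setup):

# sustained read throughput from the NAS mount
dd if=/mnt/nas/project/particles.mrc of=/dev/null bs=1M status=progress
# end-to-end copy time, NAS -> local SSD scratch
time cp /mnt/nas/project/particles.mrc /scratch/test_copy.mrc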

Also, have you tried increasing CRYOSPARC_CLIENT_TIMEOUT in cryosparc_master/config.sh?
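If I recall correctly it takes a value in seconds, e.g. (900 here is just an illustrative value, not a recommendation):

# in cryosparc_master/config.sh
export CRYOSPARC_CLIENT_TIMEOUT=900   # seconds (example value)

followed by a cryosparcm restart.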

Welcome to the forum @hugomh.
Can you please install the 230427 patch for CryoSPARC v4.2.1 and check whether the problem persists?

Well… I have tried both of your proposed ideas, @ccgauvin94 and @wtempel, but I get a similar error again. :frowning:

  1. Installed the patch:
cryosparc@michaelscott:~$ cryosparcm patch --force

A cryoSPARC patch is available

    Current Version: v4.2.1
    Current Patch: 230427
    New Patch: 230427
    Released On: 2023-04-28 16:51:03
    Requires Restart: Yes
    Patch Notes:
	- Fixed: Correct retrieval of legacy cluster configurations
	- Fixed: Live no longer stops finding new exposures
	- Fixed: SSD Cache system correctly retries up to 3 times on network timeout
	- Fixed: Correct free SSD cache storage calculation to prevent infinite cache hang when there is enough available SSD storage

Install patch? (y/n): y
Downloading...
Downloading cryosparc_master_patch.tar.gz...
Downloading cryosparc_worker_patch.tar.gz...
Patching master...
Done.
Gathering worker info...
Patch 2 workers? (y/n): y
Patching workers...
All workers: 
   michaelscott cryosparc@michaelscott
   dwight cryosparc@dwight
=================================================
Updating worker michaelscott: Direct update
cp -f cryosparc_worker_patch.tar.gz /home/cryosparc/cryosparc_worker/cryosparc_worker_patch.tar.gz
/home/cryosparc/cryosparc_worker/bin/cryosparcw patch
Worker patch successfully applied.
 -------------------------------------------------
Updating worker dwight: Remote update
scp cryosparc_worker_patch.tar.gz cryosparc@dwight:/home/cryosparc/cryosparc_worker/cryosparc_worker_patch.tar.gz
ssh cryosparc@dwight /home/cryosparc/cryosparc_worker/bin/cryosparcw patch
Worker patch successfully applied.
 -------------------------------------------------
=================================================
Patched 2 workers
Finishing...
Done.
Patch v4.2.1+230427 applied!
To complete installation, restart cryoSPARC with the following command:

    cryosparcm restart
  2. And increased the timeout in the master config file tenfold (the value is in milliseconds):
#export CRYOSPARC_MASTER_HOSTNAME="10.253.128.60"
export CRYOSPARC_FORCE_HOSTNAME=true
export CRYOSPARC_DB_PATH="/home/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=61000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000

# Security
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true

# Cluster Integration
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000

# Project Configuration
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'

# Development
export CRYOSPARC_DEVELOP=false

# Other
export CRYOSPARC_CLICK_WRAP=true


Here is the error in job 1010:

The funny thing is that the previous job worked…