Greetings.
We have a cluster with a local SSD cache on each GPU node. Lately, some jobs fail with an error that appears to be related to caching, while other jobs run without problems.
An affected job starts normally and reaches the SSD cache step:
[CPU: 68.2 MB] Project P119 Job J14 Started
[CPU: 68.2 MB] Master running v3.3.1, worker running v3.3.1
[CPU: 68.7 MB] Working in directory: /tank/colemanlab/jcoleman/cryosparc/P119/J14
[CPU: 68.7 MB] Running on lane vision
[CPU: 68.7 MB] Resources allocated:
[CPU: 68.7 MB] Worker: vision
[CPU: 68.7 MB] CPU : [0, 1, 2, 3]
[CPU: 68.7 MB] GPU : [0]
[CPU: 68.7 MB] RAM : [0, 1, 2]
[CPU: 68.7 MB] SSD : True
[CPU: 68.7 MB] --------------------------------------------------------------
[CPU: 68.7 MB] Importing job module for job type nonuniform_refine_new...
[CPU: 228.8 MB] Job ready to run
[CPU: 228.8 MB] ***************************************************************
[CPU: 601.7 MB] Using random seed of 1566521779
[CPU: 601.7 MB] Loading a ParticleStack with 412763 items...
[CPU: 604.9 MB] SSD cache : cache successfuly synced in_use
[CPU: 605.0 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.
Eventually, the job times out with this error:
[CPU: 605.2 MB] Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 1369, in getresponse
    response.begin()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/util/retry.py", line 410, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='vision.structbio.pitt.edu', port=39002): Read timed out. (read timeout=300)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 125, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/particles.py", line 88, in read_blobs
    u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/cache.py", line 119, in download_and_return_cache_paths
    compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
  File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/client.py", line 56, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='vision.structbio.pitt.edu', port=39002): Read timed out. (read timeout=300)
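
For what it is worth, the final exception is a read timeout, not a connection failure: the worker's request reached vision.structbio.pitt.edu on port 39002 and then waited 300 seconds without getting a response to the cache_request_check call. To rule out a plain network problem from a GPU node, a TCP connect test along these lines may help. This is only a minimal sketch: `probe_tcp` is a hypothetical helper written for this post, not part of cryoSPARC, and it checks reachability only, not how long the master takes to answer.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Try a plain TCP connect to host:port.

    Hypothetical helper (not part of cryoSPARC). Returns a tuple
    (reachable, seconds_elapsed). Any socket-level failure, including
    DNS errors, refusals, and connect timeouts, counts as unreachable.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects within `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # socket.timeout and socket.gaierror are both OSError subclasses.
        return False, time.monotonic() - start


# Example call from a worker node, using the host and port in the traceback:
#   ok, elapsed = probe_tcp("vision.structbio.pitt.edu", 39002)
#   print(f"reachable={ok} after {elapsed:.2f}s")
```

If the connect succeeds quickly, the network path is probably fine and the delay is more likely on the master side while it processes the cache request.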