SSD cache issue

Greetings.

We have a cluster with a local SSD cache on each GPU node. Lately we have been getting an error that appears to be related to caching, but only with some jobs.

The job looks like it will start, and it gets as far as the SSD cache step:

[CPU: 68.2 MB] Project P119 Job J14 Started
[CPU: 68.2 MB] Master running v3.3.1, worker running v3.3.1
[CPU: 68.7 MB] Working in directory: /tank/colemanlab/jcoleman/cryosparc/P119/J14
[CPU: 68.7 MB] Running on lane vision
[CPU: 68.7 MB] Resources allocated:
[CPU: 68.7 MB] Worker: vision
[CPU: 68.7 MB] CPU : [0, 1, 2, 3]
[CPU: 68.7 MB] GPU : [0]
[CPU: 68.7 MB] RAM : [0, 1, 2]
[CPU: 68.7 MB] SSD : True
[CPU: 68.7 MB] --------------------------------------------------------------
[CPU: 68.7 MB] Importing job module for job type nonuniform_refine_new...
[CPU: 228.8 MB] Job ready to run
[CPU: 228.8 MB] ***************************************************************
[CPU: 601.7 MB] Using random seed of 1566521779
[CPU: 601.7 MB] Loading a ParticleStack with 412763 items...
[CPU: 604.9 MB] SSD cache : cache successfuly synced in_use
[CPU: 605.0 MB] SSD cache : cache successfuly synced, found 0.00MB of files on SSD.

Eventually, the job times out with this error:

[CPU: 605.2 MB] Traceback (most recent call last):
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
        six.raise_from(e, None)
   File "<string>", line 3, in raise_from
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
        httplib_response = conn.getresponse()
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 1369, in getresponse
        response.begin()
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 310, in begin
        version, status, reason = self._read_status()
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 271, in _read_status
        line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
        timeout=timeout
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
        method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/util/retry.py", line 410, in increment
        raise six.reraise(type(error), error, _stacktrace)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
        raise value
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
        chunked=chunked,
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 428, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
        self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='vision.structbio.pitt.edu', port=39002): Read timed out. (read timeout=300)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
   File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
   File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 125, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
   File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/particles.py", line 88, in read_blobs
        u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/cache.py", line 119, in download_and_return_cache_paths
        compressed_keys = rc.cli.cache_request_check(worker_hostname, rc._project_uid, rc._job_uid, com.compress_paths(rel_paths))
   File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/client.py", line 56, in func
        r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/api.py", line 119, in post
        return request('post', url, data=data, json=json, **kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
        resp = self.send(prep, **send_kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
        r = adapter.send(request, **kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
        raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='vision.structbio.pitt.edu', port=39002): Read timed out. (read timeout=300)
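The traceback shows the worker's call to the command server (host vision.structbio.pitt.edu, port 39002, taken from the traceback) exceeding its 300-second read timeout during cache_request_check. As a quick first check, it can be worth confirming from the worker node that the command server is reachable and answers promptly:

    # Run on the worker node; host and port are from the traceback above.
    # A fast response here would point at the cache RPC handlers rather
    # than basic networking.
    time curl -s -o /dev/null -w "%{http_code}\n" http://vision.structbio.pitt.edu:39002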

It may be time for a full reset of the caching system (details).
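For reference, the linked procedure amounts to roughly the following (a sketch only; the SSD cache path below is a placeholder, and the linked instructions remain the authoritative steps):

    # Make sure no jobs are running, then stop the instance
    cryosparcm stop

    # On each worker node, delete the cached particle files
    # (placeholder path -- use the ssd_path configured for that worker)
    rm -rf /path/to/ssd_cache/instance_*

    # Restart; the linked instructions also cover clearing the cache
    # records from the database, which is the other half of a full reset
    cryosparcm start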

Greetings,

We reset the cache as described in the full cache reset instructions.

We’re still having this cache issue though. Below is the output from a recent job.

Jobs that do not use cache are running fine.

[CPU: 391.0 MB] Traceback (most recent call last):
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
        six.raise_from(e, None)
   File "<string>", line 3, in raise_from
   File
    "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
        httplib_response = conn.getresponse()
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 1369, in getresponse
        response.begin()
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 310, in begin
        version, status, reason = self._read_status()
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/http/client.py", line 271, in _read_status
        line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
        timeout=timeout
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
        method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/util/retry.py", line 410, in increment
        raise six.reraise(type(error), error, _stacktrace)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
        raise value
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
        chunked=chunked,
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 428, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
        self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='vision.structbio.pitt.edu', port=39002): Read timed out. (read timeout=300)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
   File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
   File "cryosparc_worker/cryosparc_compute/jobs/local_refine/newrun.py", line 117, in cryosparc_compute.jobs.local_refine.newrun.run_local_refine
   File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/particles.py", line 88, in read_blobs
        u_blob_paths = cache.download_and_return_cache_paths(u_rel_paths)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/jobs/cache.py", line 111, in download_and_return_cache_paths
        rc.cli.cache_sync_in_use(worker_hostname, rc._project_uid, rc._job_uid) # ignore self
   File "/opt/cryoem/cryosparc/cryosparc2_worker/cryosparc_compute/client.py", line 54, in func
        r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/api.py", line 119, in post
        return request('post', url, data=data, json=json, **kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
        resp = self.send(prep, **send_kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
        r = adapter.send(request, **kwargs)
   File "/opt/cryoem/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
        raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='vision.structbio.pitt.edu', port=39002): Read timed out. (read timeout=300)
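One difference from the earlier traceback: this one times out inside cache_sync_in_use rather than cache_request_check. If the database's collection of cache records has grown very large, those calls can take a long time to answer. A quick way to see how many records are being tracked (the cache_files collection name is an assumption on my part; verify against your instance):

    # Count cache records in the instance database
    echo "db.cache_files.count()" | cryosparcm mongo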

Is there a pattern related to job types or worker hosts?

This appears to be happening on all worker nodes.

As far as jobs go, users are reporting that all current job submissions have this problem. I'm not sure whether that literally means all jobs, but for our purposes it seems to.

I wonder whether this could be an issue related to strain on the master host or the network:

  • how much RAM does the master host have
  • how many concurrent jobs across the instance
  • how many active users
  • size of the $CRYOSPARC_DB_PATH directory
  • type of storage medium and connection to $CRYOSPARC_DB_PATH
  • type of connection between master and worker nodes
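Most of these can be gathered with standard tools on the master host, for example (assuming $CRYOSPARC_DB_PATH is set in your shell, e.g. sourced from cryosparc_master/config.sh):

    free -h                         # RAM on the master host
    du -sh "$CRYOSPARC_DB_PATH"     # size of the database directory
    df -hT "$CRYOSPARC_DB_PATH"     # filesystem type/medium backing the database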

I don't think it's load-related. It happens whether we're running 1 job or 10, and the node-local SSD cache works fine for other software (RELION, etc.).

I will answer your questions here:

  • 256GB RAM on the master host
  • usually between 1 and 8 jobs total; the caching problem occurs regardless of the number
  • 8 active users, though usually not all are running jobs at the same time
  • the CryoSPARC DB directory is currently 375GB
  • storage is clustered BeeGFS; all BeeGFS health checks are normal/optimal
  • all interconnects are EDR InfiniBand

Hi @yodamoppet, thanks for sending that over. I’ve been looking into your issue and need to enable some additional logging on your instance to help determine what’s causing this.

I’ve prepared a modified Python file for our command server that enables this logging. Please download the copy for your version of CryoSPARC and install it with the following:

  1. Stop the command_core service with the command cryosparcm stop command_core
  2. Back up cryosparc_master/cryosparc_command/command_core/__init__.py to __init__.py.bak
  3. Extract the __init__.py from the zip file into the cryosparc_master/cryosparc_command/command_core/ directory
  4. Start the service again with the command cryosparcm start command_core
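In shell form, the whole sequence looks roughly like this (the zip filename and paths are placeholders; adjust them to your install location and the file you downloaded):

    cryosparcm stop command_core
    cd /path/to/cryosparc_master/cryosparc_command/command_core
    cp __init__.py __init__.py.bak
    unzip -o /path/to/downloaded_logging_patch.zip    # extracts __init__.py here
    cryosparcm start command_core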

The next time you encounter the cache error, get the logs with this command:

cryosparcm log command_core > command_core.log

Then direct-message the resulting command_core.log file to me and @wtempel.

Let me know if you have any trouble with the above.

Download links


Very good, I will do this and report back.

Does stopping command_core stop currently running jobs? I'm just trying to determine whether I need to schedule some downtime or can do this anytime.

I enabled the additional logging as directed, but in the meantime the SSD cache has resumed working on this system after another restart of cryosparcm.

We have been experiencing the same issue now for several weeks. Has a general solution been identified? We are happy to provide additional logs if that helps.

Thanks,
Jason

We are experiencing a similar issue, and I'm wondering if you could provide the recommended (and maximum) values for the parameters you asked about earlier.

Thank you,
Jason

Thanks for your kind offer. Would you be willing to

  • update to the latest release, to ensure you have the latest caching and logging implementations
  • when you observe the problem on v4.2.1, capture instance logs with
    cryosparcm snaplogs
    and email us the log archive?

I updated to the latest version a few days ago and the problem persists. I'll reproduce the error now and run the command. Which email address should I send the logs to?

Thanks
Jason

@Jason
Thanks for sending us the logs.
Do you experience the problem only with jobs for which caching was enabled?
Do you experience the problem for all jobs for which caching was enabled?
Does the problem persist after a
cryosparcm restart?
Does the problem persist after resetting the cache?

Yes, the timeout problem occurs only when we enable SSD caching.
Yes, it seems to happen for all jobs for which caching was enabled (2D, ab initio, all 3D refinements, etc.).
Yes, the problem persists after a restart.
I have not tried steps 3-6 of resetting the cache (Option 3). I have manually deleted the cache files, and that doesn't seem to help. I can also try clearing out the cache records, as sketched below.
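If I'm reading the guide's Option 3 correctly, clearing the records would look roughly like this (I'll double-check the collection name against the guide before running it):

    cryosparcm mongo
    #   then, in the mongo shell:
    #   db.cache_files.deleteMany({})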

-Jason

We are having similar problems with v4.2.1

I reset the cache per the link, and that did the trick for a few minutes; jobs started faster. Then we started getting:

[CPU:  861.8 MB  Avail: 162.06 GB]
 SSD cache :   Cache full for past 92s
                  free: 2608136MB, total: 3513569MB (reserve: 10000MB)
                 There may be other large files pending download preventing this job from using the cache
                  Checking again in 30 seconds for space to become available... 

We reverted to v4.2.0 plus its patch… it is behaving again.
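For what it's worth, the free/total figures in that message can be cross-checked against what the filesystem itself reports on the cache device (placeholder path; use your worker's ssd_path):

    df -h /path/to/ssd_cache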

Thanks @Bassem for reporting your observation.

@Jason Please let us know your findings once you have tried that.