Particle Subtraction and Local Refinement Failing

Hello,

I am trying to run a particle subtraction followed by a local refinement for my particles. Unfortunately, every time I run the jobs, they get part of the way through and then the entire CryoSPARC program crashes. I then have to remove the /tmp/ file, restart the system, and run the job again; after 2 or 3 tries the job eventually completes. I’m not sure what is happening and I would love some assistance.

I have 4 GPUs with 384 GB of memory. The box size of the particles is 512 and I have turned the SSD Cache off for these analyses because I’ve noticed that while the job goes slower, it won’t crash as much.

I’ve already changed the heartbeat to 180 seconds (when I check the config.sh file it shows the change, but in the image below it doesn’t). Here is the error I see:

I also tried to get the job log to see exactly what went wrong, but when I type in: cryosparcm joblog P10 J265 (the project and job that failed), I get the following and nothing more, no matter how long I let it sit:

========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Waiting for data… (interrupt to abort)

@merandamasse There might be more informative contents in older portions of the log. To see those, you can direct the output to a file:
cryosparcm joblog P10 J265 > P10_J265_log1.txt
Another log of potential interest can be captured like this:
cryosparcm log command_core > core_log1.txt
Could the job specifications exceed available VRAM? What GPU models do you have? Please post the output of:
nvidia-smi
Is there additional load on the computer that is independent of the cryoSPARC (or other) job scheduler?

Here are the results from “cryosparcm joblog P10 J265 > P10_J265log1.txt”

(base) [cryosparc_user@c107925 ~]$ cryosparcm joblog P10 J265 > P10_J265log1.txt
Traceback (most recent call last):
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/connection.py", line 172, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fee5a3c44d0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='c107925', port=39002): Max retries exceeded with url: /api (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fee5a3c44d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/cryosparc_compute/client.py", line 90, in <module>
    cli = CommandClient(host, int(port))
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/cryosparc_compute/client.py", line 40, in __init__
    self._reload()
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/cryosparc_compute/client.py", line 68, in _reload
    system = self._get_callable('system.describe')()
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/cryosparc_compute/client.py", line 56, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/cryosparc_user/software/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='c107925', port=39002): Max retries exceeded with url: /api (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fee5a3c44d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
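
The last line of the traceback says that nothing accepted a TCP connection on port 39002 of host c107925, which would be consistent with the CryoSPARC master process not running at that moment. A minimal sketch of an independent check for that, using only the Python standard library, might look like this. The function name port_is_open is made up for illustration and is not part of CryoSPARC.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; not a CryoSPARC function.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers "connection refused", timeouts, and failed
        # host-name lookups, all of which mean the service is unreachable.
        return False
```

For the error above, the call would be port_is_open("c107925", 39002). A result of False right after a crash, and True after a restart, would suggest the failure is on the master side rather than in the job itself.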

The second command (cryosparcm log command_core > core_log1.txt) didn’t show anything.

No file core_log1.txt has been created?

I guess not. When I typed it in, this is what it looked like:

I typed it in twice because it didn’t seem to do anything. Then I just typed in:

nvidia-smi

because I figured it wasn’t going to work.

The > in the command redirects the output of cryosparcm log command_core to a text file, core_log1.txt. You won’t see anything in the terminal; open that file (e.g. with vi or gedit) to check whether it was created and contains text.

Cheers
Oli
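
The check Oli describes (was the file created, and is there text in it) can also be scripted. The sketch below is one way to do it in Python under a few assumptions: run_to_file is a hypothetical helper, not a CryoSPARC tool, and unlike a plain > it also captures error output, which the shell would otherwise leave on the terminal (as happened with the traceback earlier in the thread). The timeout is there in case the command keeps following the log instead of exiting, which is an assumption about how cryosparcm log behaves.

```python
import subprocess
from pathlib import Path


def run_to_file(cmd, out_path, timeout=None):
    """Run cmd, write its stdout and stderr to out_path, return the file size.

    Hypothetical helper for illustration. A return value of 0 means the
    file was created but the command wrote nothing to it.
    """
    out_path = Path(out_path)
    with out_path.open("wb") as fh:
        try:
            # stderr=STDOUT sends error messages to the same file as normal output.
            subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT, timeout=timeout)
        except subprocess.TimeoutExpired:
            # The command was still running when the timeout hit; it has been
            # stopped, and whatever it wrote so far remains in the file.
            pass
    return out_path.stat().st_size
```

With the command from this thread, a call could look like run_to_file(["cryosparcm", "log", "command_core"], "core_log1.txt", timeout=10), after which core_log1.txt can be opened in any editor.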
