Patch motion correction (multi) job failed

Hi, cryoSPARC administrators and users,

I've run into a strange problem at step 2 of the T20S tutorial: Patch Motion Correction (Multi). When I queue the job, it fails almost immediately. Running the submission script directly from the terminal gives the same result. What's strange is that no output file is generated and no error message appears in the terminal, yet the job status shows the job as failed. With no error messages and no output files, I have no idea how to fix this. I've looked through similar scripts posted on this forum but found no solution.
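
For reference, running it by hand from a login node amounts to something like the following (the job ID is whatever sbatch prints):

cd /home/user/cryosparc_data/P1/J2
sbatch queue_sub_script.sh   # submits fine, but the job fails almost immediately
squeue -u $USER              # the job shows up briefly, then disappears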

Here are the submission script and the information from the webapp. Any response would be appreciated!

#!/bin/bash

#SBATCH --account=def-supervisor
#SBATCH --job-name cryosparc_P1_J2
#SBATCH -n 6
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16000MB
#SBATCH -o /home/user/cryosparc_data/P1/J2
#SBATCH -e /home/user/cryosparc_data/P1/J2
#SBATCH --time=00:01:00

srun /home/user/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P1 --job J2 --master_hostname graham.computecanada.ca --master_command_core_port 36002 > /home/user/cryosparc_data/P1/J2/job.log 2>&1

==========================================================================

-------- Submission command:
sbatch /home/user/cryosparc_data/P1/J2/queue_sub_script.sh

-------- Cluster Job ID:
31342660

-------- Queued on cluster at 2020-05-11 18:43:42.158212

-------- Job status at 2020-05-11 18:43:42.216641
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS GRES MIN_MEM NODELIST (REASON)
31342660 user def-supervisor cryosparc_P1_J PD 1:00 6 6 gpu:p100:1 16000M (None)

Here’s the cluster_info.json file:

{
"qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
"worker_bin_path": "/home/user/cryosparc/cryosparc2_worker/bin/cryosparcw",
"title": "slurmcluster",
"cache_path": "/scratch/user/cryosparcsave",
"qinfo_cmd_tpl": "sinfo",
"qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
"cache_quota_mb": null,
"send_cmd_tpl": "{{ command }}",
"cache_reserve_mb": 20000,
"name": "slurmcluster"
}

Hi @kortal,

Could you share the contents of the job log? You can open it by clicking the job number in the top-left corner of the job card.

You should see something like this:

[screenshot of the job log panel omitted]
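
If nothing useful shows up there, you can also print the same log from a terminal on the master node (assuming a standard install):

cryosparcm joblog P1 J2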

Nick

Hi @nfrasser,

Here's all the information from the Overview tab; there's no error message there. But if I check in the terminal, I can see it fails. There's no output file for the job. I asked the cluster administrator, and he said that if the script is never executed, no output file will be written. So I think something is wrong with the software configuration.
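
For the record, this is roughly how I checked, using the job ID from my first post:

sacct -j 31342660 --format=JobID,State,ExitCode,Elapsed
ls -l /home/user/cryosparc_data/P1/J2/   # no SLURM output file and no job.log were created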

Hi @nfrasser ,

It was indeed a problem with the submission file. With the script below, the job now runs. But I've hit another problem: a connection error. I suspect the worker can't reach the master, even though I have a ~/.ssh/config file in my home directory on the cluster. The base port in use is 36000, so command_core listens on 36002 (base port + 2), which matches the log. Do you know how to solve this? Thank you.
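
To illustrate, a quick connectivity check from an interactive compute-node shell looks something like this (it times out for me, matching the job.log below):

salloc --account=def-supervisor --gres=gpu:1 --time=00:10:00
curl --max-time 10 http://graham.computecanada.ca:36002/api   # hangs, then times out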

Here’s the job.log:

================= CRYOSPARCW =======  2020-05-12 17:46:02.451091  =========
Project P1 Job J2
Master graham.computecanada.ca Port 36002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 6211
*** client.py: command (http://graham.computecanada.ca:36002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://graham.computecanada.ca:36002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://graham.computecanada.ca:36002/api) did not reply within timeout of 300 seconds, attempt 3 of 3
Process Process-1:
Traceback (most recent call last):
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 31, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 158, in cryosparc2_compute.run.run
  File "cryosparc2_compute/jobs/runcommon.py", line 89, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port))
  File "cryosparc2_compute/client.py", line 33, in __init__
    self._reload()
  File "cryosparc2_compute/client.py", line 61, in _reload
    system = self._get_callable('system.describe')()
  File "cryosparc2_compute/client.py", line 49, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/user/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='graham.computecanada.ca', port=36002): Max retries exceeded with url: /api (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2aeaa7a94150>: Failed to establish a new connection: [Errno 110] Connection timed out',))

Here’s the submission file I’m using.

#!/bin/bash
#SBATCH --account=def-supervisor
#SBATCH --gres=gpu:1
#SBATCH --mem=16000M
#SBATCH --time=0-00:10
#SBATCH --cpus-per-task=4

/home/user/cryosparc/cryosparc2_worker/bin/cryosparcw run --project P1 --job J2 --master_hostname graham.computecanada.ca --master_command_core_port 36002 > /home/user/cryosparc_data/P1/J2/job.log 2>&1

Hi @nfrasser, I solved this problem. It happened because the hostname changes every time I log in to the cluster (each login can land on a different login node). It works after I deleted the hostname variable from the config file in the cryosparc_master folder.
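
For anyone who hits the same thing, the change is in the master's config.sh. A sketch only; exact contents vary by install, and gra-login1 is just an example login-node hostname:

export CRYOSPARC_LICENSE_ID="..."
# export CRYOSPARC_MASTER_HOSTNAME="gra-login1"   # removed: the login hostname changes between sessions
export CRYOSPARC_BASE_PORT=36000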

Thank you!

Hi @kortal, glad you were able to figure it out! Thanks so much for letting us know.

Nick