Cryosparc2 cluster install

Hi,

I am trying to get cryosparc2 working on a cluster with Slurm. The job gets submitted, but it never actually runs.

Launching job on lane TEST_CLUSTER target TEST_CLUSTER ...
License is valid.
Launching job on cluster TEST_CLUSTER

====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P1_J8
#SBATCH --partition=sbatch
#SBATCH --output=/data/cryosparc_user/projects/cryosparc2/example/T20S/P1/J8/job.log
#SBATCH --error=/data/cryosparc_user/projects/cryosparc2/example/T20S/P1/J8/job.log
#SBATCH --nodes=1
#SBATCH --mem=16000M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
srun /data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/bin/cryosparcw run --project P1 --job J8 --master_hostname cryosparc --master_command_core_port 39002 > /data/cryosparc_user/projects/cryosparc2/example/T20S/P1/J8/job.log 2>&1 
==========================================================================
==========================================================================
-------- Submission command: 
sbatch /data/cryosparc_user/projects/cryosparc2/example/T20S/P1/J8/queue_sub_script.sh
-------- Cluster Job ID: 
5991
-------- Queued at 2019-06-18 15:32:19.844864
-------- Job status at 2019-06-18 15:32:19.866773
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              5991    sbatch cryospar  cryosparc_user PD       0:00      1 (None)


Here is the out.log:
================= CRYOSPARCW =======  2019-06-18 15:32:20.445241  =========
Project P1 Job J8
Master cryosparc Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 17654
*** client.py: command (http://cryosparc:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://cryosparc:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://cryosparc:39002/api) did not reply within timeout of 300 seconds, attempt 3 of 3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
Process Process-1:
Traceback (most recent call last):
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 148, in cryosparc2_compute.run.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/run.c:5181)
    self.run()
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 31, in cryosparc2_compute.run.main (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/run.c:2121)
  File "cryosparc2_compute/jobs/runcommon.py", line 70, in connect
  File "cryosparc2_compute/jobs/runcommon.py", line 70, in connect
    cli = client.CommandClient(master_hostname, int(master_command_core_port))
  File "cryosparc2_compute/client.py", line 33, in __init__
    self._reload()
  File "cryosparc2_compute/client.py", line 61, in _reload
    system = self._get_callable('system.describe')()
  File "cryosparc2_compute/client.py", line 49, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 116, in post
    cli = client.CommandClient(master_hostname, int(master_command_core_port))
  File "cryosparc2_compute/client.py", line 33, in __init__
    return request('post', url, data=data, json=json, **kwargs)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 60, in request
    self._reload()
  File "cryosparc2_compute/client.py", line 61, in _reload
    return session.request(method=method, url=url, **kwargs)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    system = self._get_callable('system.describe')()
  File "cryosparc2_compute/client.py", line 49, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 116, in post
    resp = self.send(prep, **send_kwargs)
    return request('post', url, data=data, json=json, **kwargs)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    r = adapter.send(request, **kwargs)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    resp = self.send(prep, **send_kwargs)
    raise ConnectionError(e, request=request)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
requests.exceptions.ConnectionError: HTTPConnectionPool(host='cryosparc', port=39002): Max retries exceeded with url: /api (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2ae02da26150>: Failed to establish a new connection: [Errno 113] No route to host',))
    r = adapter.send(request, **kwargs)
  File "/data/cryosparc_user/progs/cryosparc2/cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='cryosparc', port=39002): Max retries exceeded with url: /api (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2ae02da27190>: Failed to establish a new connection: [Errno 113] No route to host',))
*** client.py: command (http://cryosparc:39002/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://cryosparc:39002/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://cryosparc:39002/api) did not reply within timeout of 300 seconds, attempt 3 of 3
srun: error: gpu2: task 0: Exited with exit code 1

Opening up the relevant ports on the master (including the command_core port 39002 that the worker was trying to reach in the log above) resolved the issue.
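
For anyone hitting the same "[Errno 113] No route to host" error, a quick way to check whether the worker node can reach the master at all is a plain TCP connection test. The following is only a minimal sketch: the function name can_connect is made up for illustration, and the host name and port are the ones from the log above (cryosparc, 39002), so substitute your own. It does not touch cryoSPARC itself; it only tries to open a socket.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only; it is not part of cryoSPARC.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections, timeouts and
        # "No route to host" (errno 113), which a firewall can trigger.
        return False
    conn.close()
    return True
```

Run on the compute node (gpu2 in the log above), something like print(can_connect("cryosparc", 39002)) should print False while the port is blocked and True once the port has been opened on the master.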

BC

Hi bcanax,

I got a similar message when registering a worker with the master. It seems the worker can't connect to the master, so I think it is also a port problem. May I ask how you opened the ports on the master so that the worker can connect?

Thank you!