Daemon_vis restarting constantly

#1

We are running cryoSPARC 2.11 on SuSE 13.2 with CUDA 9.1, and we have four Titan X (Pascal) GPUs.

I’ve noticed that the command_vis daemon restarts constantly; run/supervisord.log shows an unexpected exit status of 1.

The command_vis log shows an error that appears to be rooted in cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py, line 516, which raises a ConnectionError: max retries exceeded with url /api, caused by a urllib3.connection.HTTPConnection object.

Port 39005 on localhost is not replying within the 300-second timeout limit.
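For reference, here is how I probed the port directly; the port number is taken from the command_vis log further down, and this is just a sanity check with standard curl, not an official cryoSPARC diagnostic:

```shell
# Probe the command_vis /api endpoint with a short timeout, mirroring
# the POST that cryoSPARC's client.py makes internally.
PORT=39005
curl -s --max-time 10 -X POST "http://127.0.0.1:${PORT}/api" \
    || echo "no reply from port ${PORT} within 10 seconds"
```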

Jobs seem to be running and the researchers are happy, but constant daemon failures are something I assume is abnormal.

Is this actionable and if so, what is the recommended action?

thanks in advance,
Brian


#2

Hi @BrianCuttler,

Can you post the output of cryosparcm log command_vis and cryosparcm log command_core?
Also, did you try restarting and making sure the ports are open?
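One quick way to check whether the command_vis port is actually open and answering; the port number (39005) is an assumption based on the logs in this thread, and ss/curl are standard tools, not cryoSPARC-specific:

```shell
# Check whether anything is listening on the command_vis port, and
# whether its /api endpoint answers within a short timeout.
PORT=39005

if ss -tln 2>/dev/null | grep -q ":${PORT} "; then
    LISTENING=yes
else
    LISTENING=no
fi
echo "port ${PORT} listening: ${LISTENING}"

if curl -s --max-time 5 -o /dev/null "http://127.0.0.1:${PORT}/api"; then
    echo "port ${PORT}: /api reachable"
else
    echo "port ${PORT}: /api not reachable"
fi
```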


#3

Sarulthasan,

Master and worker are on the same node, and the FW is disabled.

Users are reporting that their jobs are running.

cryosparc_user@tulasi:~> telnet 127.0.0.1 39005
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused

Please let me know what other information I can gather.
thanks,
Brian

tulasi:~ # su - cryosparc_user
cryosparc_user@tulasi:~> cryosparcm log command_vis
    return session.request(method=method, url=url, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='tulasi.wadsworth.org', port=39005): Max retries exceeded with url: /api (Caused by)
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 3 of 3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc2_command/command_vis/__init__.py", line 63, in <module>
    rtp = CommandClient(os.environ['CRYOSPARC_MASTER_HOSTNAME'], int(os.environ['CRYOSPARC_COMMAND_RTP_PORT']))
  File "cryosparc2_compute/client.py", line 33, in __init__
    self._reload()
  File "cryosparc2_compute/client.py", line 61, in _reload
    system = self._get_callable('system.describe')()
  File "cryosparc2_compute/client.py", line 49, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='tulasi.wadsworth.org', port=39005): Max retries exceeded with url: /api (Caused by)
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 3 of 3

I just restarted a browser to authenticate through the site FW, so if this partial command_core log shows license issues, I think they are safe to ignore.

failed to connect link
failed to connect link
[EXPORT_JOB] : Request to export P15 J241
[EXPORT_JOB] :    Exporting job to /usr16/data/rzk01/cryos2/P2/P15/J241
[EXPORT_JOB] :    Exporting all of job's images in the database to /usr16/data/rzk01/cryos2/P2/P15/J241/gridfs_data...
[EXPORT_JOB] :    Writing 153 database images to /usr16/data/rzk01/cryos2/P2/P15/J241/gridfs_data/gridfsdata_0
[EXPORT_JOB] :    Done. Exported 153 images in 0.42s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.01s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Creating .csg file for particles
[EXPORT_JOB] :    Creating .csg file for volume
[EXPORT_JOB] :    Creating .csg file for mask
[EXPORT_JOB] :    Done. Exported in 0.04s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P15 J241 in 0.48s
Changed job P15.J241 status completed
---------- Scheduler running ---------------
Lane  default node : Jobs Queued (nonpaused, inputs ready):  [u'J242']
Total slots:  {u'tulasi': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), u'CPU': se}
Available slots:  {u'tulasi': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), u'CPU'}
Available licen:  10000
Now trying to schedule J242
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Need licen :  True
  Master direct :  False
   Trying to schedule on tulasi
    Launchable:  True
    Alloc slots :  {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}
    Alloc fixed :  {u'SSD': True}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P15 job UID J242
failed to connect link
Error connecting to cryoSPARC license server. Checking local license file.
License Data:  {"token": "xxxxxxx", "token_valid": true, "request_date": 1571774898, "license_valid": true}
License Signature: 
     Running job on worker type node
     Running job using:  /home/cryosparc_user/cryosparc_worker/bin/cryosparcw
     Running job on remote worker node hostname tulasi
     cmd: bash -c "nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P15 --job J242 --master_hostname tulasi.wadsworth.org --master_command_core_port 39002 > /usr16/data/rzk0"

Changed job P15.J242 status launched
---------- Scheduler done ------------------
Changed job P15.J242 status started
Changed job P15.J242 status running
failed to connect link
failed to connect link
failed to connect link
[EXPORT_JOB] : Request to export P15 J242
[EXPORT_JOB] :    Exporting job to /usr16/data/rzk01/cryos2/P2/P15/J242
[EXPORT_JOB] :    Exporting all of job's images in the database to /usr16/data/rzk01/cryos2/P2/P15/J242/gridfs_data...
[EXPORT_JOB] :    Writing 109 database images to /usr16/data/rzk01/cryos2/P2/P15/J242/gridfs_data/gridfsdata_0
[EXPORT_JOB] :    Done. Exported 109 images in 0.34s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.01s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Creating .csg file for particles
[EXPORT_JOB] :    Creating .csg file for volume
[EXPORT_JOB] :    Creating .csg file for mask
[EXPORT_JOB] :    Done. Exported in 0.04s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.07s
[EXPORT_JOB] : Exported P15 J242 in 0.46s
Changed job P15.J242 status completed
failed to connect link

#4

Hi @BrianCuttler,

For license validation to work, your instance needs to be able to reach https://get.cryosparc.com, so that URL may need to be whitelisted. Can you confirm the master instance can connect to our endpoint?
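A quick way to test this from the master node; the URL is the one above, and the curl flags are standard (this only confirms outbound HTTPS connectivity, not license validity):

```shell
# Confirm the master node can reach the cryoSPARC license endpoint.
# Any three-digit HTTP status (200, a redirect, etc.) means the request
# got through the firewall; "000" means the connection never completed.
CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://get.cryosparc.com || true)
echo "HTTP status from get.cryosparc.com: ${CODE:-000}"
```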

Thanks,
Suhail


#5

Suhail,

To access the license server? What is the best way to confirm license acquisition?

That is why I authenticated through the browser. Our domain PCs use an identity agent, but the Linux desktops have to authenticate in the browser and have a 12-hour window before they need to re-auth. I understand from the research core’s technical lead that the cryoSPARC license is valid for a week once acquired. All cryoSPARC users have become proficient at opening the browser remotely when needed.

The directive I follow is that FW whitelist needs a vendor provided technical doc, but a quick search of the site doesn’t turn one up. If you can point me at it I will update the FW’s whitelist.

thanks,
Brian


#6

Suhail,

Thank you, but I don’t think that is the issue here. I’d like to fix any license-acquisition issues, but even after I have confirmed license access and restarted cryoSPARC, the daemon continues to cycle.

thanks,
Brian


#7

Hi Brian,

Thanks for the additional info, our team is looking into it.

Regards,
Suhail