Daemon_vis restarting constantly

BrianCuttler · October 29, 2019, 8:59pm

We are running cryosparc 2.11 on Suse 13.2 with Cuda 9.1, we have 4 Titan X (Pascal) GPUs.

I’ve noted that the command_vis daemon continues to restart, the run/supervisord.log shows an unexpected exist status of 1

The commands_vis.log shows an error seeming to be rooted in the cryp master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py line 516, raising a connectionerror, max retries exceeded url /api, caused by the urllib3.connection.HTTPConnection object.

Port on local host 3005 is not replying without the 300 second timeout limit.

We seem to be running jobs, the researchers are happy, but constant daemon failures is something I assume is abnormal.

Is this actionable and if so, what is the recommended action?

thanks in advance,
Brian

stephan · November 1, 2019, 5:01pm

Hi @BrianCuttler,

Can you post the output of cryosparcm log command_vis and cryosparcm log command_core ?
Also, did you try restarting and making sure the ports are open?

BrianCuttler · November 1, 2019, 5:31pm

Sarulthasan,

master and worker are on the same node and FW is disabled.

Users are reporting that their jobs are running.

cryosparc_user@tulasi:~> telnet 127.0.0.1 39005
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused

Please let me know what other information I can gather.
thanks,
Brian

tulasi:~ # su - cryosparc_user
cryosparc_user@tulasi:~> cryosparcm log command_vis
    return session.request(method=method, url=url, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='tulasi.wadsworth.org', port=39005): Max retries exceeded with url: /api (Caused by)
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 3 of 3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc2_command/command_vis/__init__.py", line 63, in <module>
    rtp = CommandClient(os.environ['CRYOSPARC_MASTER_HOSTNAME'], int(os.environ['CRYOSPARC_COMMAND_RTP_PORT']))
  File "cryosparc2_compute/client.py", line 33, in __init__
    self._reload()
  File "cryosparc2_compute/client.py", line 61, in _reload
    system = self._get_callable('system.describe')()
  File "cryosparc2_compute/client.py", line 49, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/cryosparc_user/cryosparc_master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='tulasi.wadsworth.org', port=39005): Max retries exceeded with url: /api (Caused by)
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://tulasi.wadsworth.org:39005/api) did not reply within timeout of 300 seconds, attempt 3 of 3

I just restarted a browser to authenticate through site FW, so if this partial command_core log shows license issues I think they are safe to ignore.

failed to connect link
failed to connect link
[EXPORT_JOB] : Request to export P15 J241
[EXPORT_JOB] :    Exporting job to /usr16/data/rzk01/cryos2/P2/P15/J241
[EXPORT_JOB] :    Exporting all of job's images in the database to /usr16/data/rzk01/cryos2/P2/P15/J241/gridfs_data...
[EXPORT_JOB] :    Writing 153 database images to /usr16/data/rzk01/cryos2/P2/P15/J241/gridfs_data/gridfsdata_0
[EXPORT_JOB] :    Done. Exported 153 images in 0.42s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.01s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Creating .csg file for particles
[EXPORT_JOB] :    Creating .csg file for volume
[EXPORT_JOB] :    Creating .csg file for mask
[EXPORT_JOB] :    Done. Exported in 0.04s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.00s
[EXPORT_JOB] : Exported P15 J241 in 0.48s
Changed job P15.J241 status completed
---------- Scheduler running ---------------
Lane  default node : Jobs Queued (nonpaused, inputs ready):  [u'J242']
Total slots:  {u'tulasi': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), u'CPU': se}
Available slots:  {u'tulasi': {u'GPU': set([0, 1, 2, 3]), u'RAM': set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), u'CPU'}
Available licen:  10000
Now trying to schedule J242
  Need slots :  {u'GPU': 1, u'RAM': 3, u'CPU': 4}
  Need fixed :  {u'SSD': True}
  Need licen :  True
  Master direct :  False
   Trying to schedule on tulasi
    Launchable:  True
    Alloc slots :  {u'GPU': [0], u'RAM': [0, 1, 2], u'CPU': [0, 1, 2, 3]}
    Alloc fixed :  {u'SSD': True}
    Alloc licen :  True
     -- Launchable! -- Launching.
---- Running project UID P15 job UID J242
failed to connect link
Error connecting to cryoSPARC license server. Checking local license file.
License Data:  {"token": "xxxxxxx", "token_valid": true, "request_date": 1571774898, "license_valid": true}
License Signature: 
     Running job on worker type node
     Running job using:  /home/cryosparc_user/cryosparc_worker/bin/cryosparcw
     Running job on remote worker node hostname tulasi
     cmd: bash -c "nohup /home/cryosparc_user/cryosparc_worker/bin/cryosparcw run --project P15 --job J242 --master_hostname tulasi.wadsworth.org --master_command_core_port 39002 > /usr16/data/rzk0"

Changed job P15.J242 status launched
---------- Scheduler done ------------------
Changed job P15.J242 status started
Changed job P15.J242 status running
failed to connect link
failed to connect link
failed to connect link
[EXPORT_JOB] : Request to export P15 J242
[EXPORT_JOB] :    Exporting job to /usr16/data/rzk01/cryos2/P2/P15/J242
[EXPORT_JOB] :    Exporting all of job's images in the database to /usr16/data/rzk01/cryos2/P2/P15/J242/gridfs_data...
[EXPORT_JOB] :    Writing 109 database images to /usr16/data/rzk01/cryos2/P2/P15/J242/gridfs_data/gridfsdata_0
[EXPORT_JOB] :    Done. Exported 109 images in 0.34s
[EXPORT_JOB] :    Exporting all job's streamlog events...
[EXPORT_JOB] :    Done. Exported 1 files in 0.01s
[EXPORT_JOB] :    Exporting job metafile...
[EXPORT_JOB] :    Creating .csg file for particles
[EXPORT_JOB] :    Creating .csg file for volume
[EXPORT_JOB] :    Creating .csg file for mask
[EXPORT_JOB] :    Done. Exported in 0.04s
[EXPORT_JOB] :    Updating job manifest...
[EXPORT_JOB] :    Done. Updated in 0.07s
[EXPORT_JOB] : Exported P15 J242 in 0.46s
Changed job P15.J242 status completed
failed to connect link

sdawood · November 1, 2019, 5:51pm

Hi @BrianCuttler,

In order for the license server to operate, you need to whitelist https://get.cryosparc.com. Can you confirm the master instance can connect to our endpoint?

Thanks,
Suhail

BrianCuttler · November 1, 2019, 6:25pm

Suhail,

To access the license server? What is the best way to confirm license acquisition?

That is why I authenticated through the browser. Our domain PCs will use identity agent, but the linux desktops have to authenticate in the browser and have a 12 hour window before they need to re-auth. I’m understanding from the research core’s technical lead that the Cryosparc license is valid for a week once acquired. All Cryosparc users have become proficient in opening the browser remotely when needed.

The directive I follow is that FW whitelist needs a vendor provided technical doc, but a quick search of the site doesn’t turn one up. If you can point me at it I will update the FW’s whitelist.

thanks,
Brian

BrianCuttler · November 4, 2019, 2:28pm

Suhail,

Thank you, but I don’t think that is the issue here. I’d like to fix license acquisition issues, but even when I have confirmed license access and restart Cryosparc the daemon continues to cycle.

thanks,
Brian

sdawood · November 5, 2019, 6:01pm

Hi Brian,

Thanks for the additional info, our team is looking into it.

Regards,
Suhail

stevew · November 26, 2019, 4:03pm

Hi,

We’re seeing the same thing here. The following lines are endlessly repeated in supervisord.log:

2019-11-26 10:42:48,418 INFO spawned: 'command_vis' with pid 9814
2019-11-26 10:42:49,420 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2019-11-26 10:43:01,319 INFO exited: command_vis (exit status 1; not expected)
2019-11-26 10:43:02,329 INFO spawned: 'command_vis' with pid 9907
2019-11-26 10:43:03,331 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2019-11-26 10:43:15,528 INFO exited: command_vis (exit status 1; not expected)
2019-11-26 10:43:16,537 INFO spawned: 'command_vis' with pid 10001
2019-11-26 10:43:17,539 INFO success: command_vis entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

And command_vis.log shows:

*** client.py: command (http://wnv.bio.purdue.edu:39005/api) did not reply within timeout of 300 seconds, attempt 1 of 3
*** client.py: command (http://wnv.bio.purdue.edu:39005/api) did not reply within timeout of 300 seconds, attempt 2 of 3
*** client.py: command (http://wnv.bio.purdue.edu:39005/api) did not reply within timeout of 300 seconds, attempt 3 of 3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cryosparc2_command/command_vis/__init__.py", line 63, in <module>
    rtp = CommandClient(os.environ['CRYOSPARC_MASTER_HOSTNAME'], int(os.environ['CRYOSPARC_COMMAND_RTP_PORT']))
  File "cryosparc2_compute/client.py", line 33, in __init__
    self._reload()
  File "cryosparc2_compute/client.py", line 61, in _reload
    system = self._get_callable('system.describe')()
  File "cryosparc2_compute/client.py", line 49, in func
    r = requests.post(self.url, data = json.dumps(data, cls=NumpyEncoder), headers = header, timeout=self.timeout)
  File "/net/jiang/home/cryosparc/cryosparc-software/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/net/jiang/home/cryosparc/cryosparc-software/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/net/jiang/home/cryosparc/cryosparc-software/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/net/jiang/home/cryosparc/cryosparc-software/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/net/jiang/home/cryosparc/cryosparc-software/cryosparc2_master/deps/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='wnv.bio.purdue.edu', port=39005): Max retries exceeded with url: /api (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7feb3c6ab510>: Failed to establish a new connection: [Errno 111] Connection refused',))

We’re running v2.11.0 of cryoSPARC. I’ve restarted cryoSPARC a couple of times and the problem persists. I know that port 39005 is open on this system since I can have netcat listen and respond to connections on the same port. In fact, when I do run netcat it picks up cryoSPARC communication:

root@wnv:~# nc -l 39005
POST /api HTTP/1.1
Host: wnv.bio.purdue.edu:39005
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.22.0
Content-Type: application/json
Content-Length: 107

{"params": [], "jsonrpc": "2.0", "method": "system.describe", "id": "b07cb6da-f60c-41cd-b43d-23b77b32a5d2"}

For the most part this doesn’t seem to cause any problems but I do need to remove a number of temporary users using icli and that depends upon this working correctly.

Please let me know if you need any additional information or if you would like me to test anything for you.

Thanks,
Steve

jcducom · December 17, 2019, 4:54am

I noticed we had same issues with RECENT updates.
This is due to an extra line that was added in
cryosparc2_master/cryosparc2_command/command_vis/init.py
[…]

connect to command core

cli = CommandClient(os.environ[‘CRYOSPARC_MASTER_HOSTNAME’], int(os.environ[‘CRYOSPARC_COMMAND_CORE_PORT’]))
#rtp = CommandClient(os.environ[‘CRYOSPARC_MASTER_HOSTNAME’], int(os.environ[‘CRYOSPARC_COMMAND_RTP_PORT’]))
rc.connect(os.environ[‘CRYOSPARC_MASTER_HOSTNAME’], int(os.environ[‘CRYOSPARC_COMMAND_CORE_PORT’]))
[…]
If I understand correctly the architecture the line with rtp should be executed if the Live version is used. In my case it was not the case. Commenting that line fixed the issue.

stephan · December 17, 2019, 4:55pm

Thanks for pointing that out! We’ll add this fix to the next release.

stevew · December 17, 2019, 6:34pm

Commenting out “rtp = CommandClient…” line in cryosparc2_master/cryosparc2_command/command_vis/init.py does indeed stop the constant restarting of the command_vis daemon.

But I am still unable to use “cryosparcm icli” which attempts to connect to port 39005 by default. Is this only available now in the Live version? Is there some other way to delete users?

Thanks,
Steve

stephan · December 17, 2019, 6:50pm

Hi @stevew,

What version of cryoSPARC are you using?
cryosparcm status

BitingChaos · January 17, 2020, 7:34pm

We’ve been having issues with this, as well!

Python’s CPU usage would spike to 100% of a core every few seconds. The logs showed command_vis having a connection error (port 39005), exiting, and then restarting, over and over.
I found this thread when trying to see if others were having the same issue!

I commented out “rtp = CommandClient” in __init__.py and the error stopped and CPU usage went back to normal. I don’t know if that breaks anything, though.

We’re also running 2.12.4, which looks current as of January 17th, 2020.

stephan · January 22, 2020, 7:24pm

Hey @BitingChaos,

This is fixed and will be released in the next version of cryoSPARC (soon!).

spunjani · January 29, 2020, 8:22pm

Hi @BitingChaos, this has been fixed in v2.13 released yesterday evening. Thanks!