Unable to restart cryoSPARC on HPC

Hello,

I have been using cryoSPARC v3.2 on an HPC cluster where I have to start the master instance each time I reconnect. This worked fine until I was disconnected from the node running the master instance and cryoSPARC crashed. Since then I have been unable to restart it: the startup process seems to be stuck after “command_core: started”. Here is the output of cryosparcm log command_core:

COMMAND CORE STARTED === 2021-11-13 01:48:07.015774 ==========================
*** BG WORKER START
 * Serving Flask app "command_core" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148d72a64c10>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148d72a64190>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148d72a64f90>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148d72a64fd0>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148d72a641d0>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x148d72a64a90>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.
COMMAND CORE STARTED === 2021-11-13 01:54:09.638931 ==========================
*** BG WORKER START
 * Serving Flask app "command_core" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
[JSONRPC ERROR 2021-11-13 01:54:14.647990 at get_config_var ]

**custom thread exception hook caught something
**** handle exception rc
Traceback (most recent call last):
  File "/path/to/cryosparc/cryosparc_master/cryosparc_compute/jobs/runcommon.py", line 1790, in run_with_except_hook
    run_old(*args, **kw)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/path/to/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 208, in background_worker
    last_audit_date = get_config_var('audit', fail_notset=False, default={})
  File "/path/to/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 140, in wrapper
    raise e
  File "/path/to/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 131, in wrapper
    res = func(*args, **kwargs)
  File "/path/to/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 550, in get_config_var
    res = mongo.db[colname].find_one({'name' : name})
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/collection.py", line 1319, in find_one
    for result in cursor.limit(-1):
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1124, in _refresh
    self.__send_message(q)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1001, in __send_message
    address=self.__address)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1372, in _run_operation_with_response
    exhaust=exhaust)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1471, in _retryable_read
    return func(session, server, sock_info, slave_ok)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1366, in _cmd
    unpack_res)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/server.py", line 137, in run_operation_with_response
    first, sock_info.max_wire_version)
  File "/path/to/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/helpers.py", line 140, in _check_command_response
    raise NotMasterError(errmsg, response)
pymongo.errors.NotMasterError: node is not in primary or recovering state, full error: {'ok': 0.0, 'errmsg': 'node is not in primary or recovering state', 'code': 13436, 'codeName': 'NotMasterOrSecondary'}
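
The log above mixes two separate problems: repeated license heartbeat failures (a proxy or name-resolution issue) and a MongoDB NotMasterError raised from get_config_var. A minimal sketch of how the two could be tallied, assuming the output of cryosparcm log command_core has been saved to a string. The function name summarize_command_core_log and the category labels are hypothetical and not part of cryoSPARC; the search strings are copied from the log excerpt above.

```python
import re
from collections import Counter

# Substrings copied from the log excerpt; the labels are hypothetical.
# None of the patterns contain quote characters, so they match whether the
# log was pasted with straight or typographic quotes.
PATTERNS = {
    "license_heartbeat": re.compile(r"Error connecting to cryoSPARC license server"),
    "proxy_unreachable": re.compile(r"Cannot connect to proxy"),
    "dns_failure": re.compile(r"Name or service not known"),
    "mongo_not_primary": re.compile(r"node is not in primary or recovering state"),
}


def summarize_command_core_log(text: str) -> Counter:
    """Count how many log lines match each known failure pattern."""
    counts = Counter()
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            # A line is counted at most once per label, even if the
            # substring appears several times on that line.
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Error connecting to cryoSPARC license server during instance heartbeat.\n"
        "pymongo.errors.NotMasterError: node is not in primary or recovering state\n"
    )
    for label, n in summarize_command_core_log(sample).items():
        print(f"{label}: {n}")
```

A non-zero mongo_not_primary count would point at the database rather than the license server, which is a separate problem from the heartbeat errors.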

So far I have checked for orphaned cryoSPARC processes and killed them. I am able to ping get.cryosparc.com, so it seems the node can connect to the license server. I have also re-installed cryoSPARC, and the same issue persists. I am not sure what else to try.

Thanks,
Udit

Hi @udalwadi, we’ve seen this happen when there’s an issue with the cryoSPARC database. I suggest running the cryosparcm fixdbport command, which requires the latest cryoSPARC patch.

To install the patch while cryoSPARC is not running, follow the instructions in this thread: “Node is not in primary or recovering state”

Try that out and let me know how it goes.

Hi @nfrasser, I followed the instructions, patched cryoSPARC, and ran the fixdbport command, all of which seemed to go fine. However, after running cryosparcm restart, the startup is still stuck after “command_core: started”.

Here is the latest output of cryosparcm log command_core:

COMMAND CORE STARTED === 2021-11-15 15:14:26.149045 ==========================
*** BG WORKER START

 * Serving Flask app "command_core" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/(removed) (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1468a760a990>: Failed to establish a new connection: [Errno -2] Name or service not known')))
Error connecting to cryoSPARC license server during instance heartbeat.

And here is the output of cryosparcm log database:

/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/mii/1.1.1/bin/mii: invalid option -- 't'
/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/mii/1.1.1/bin/mii: invalid option -- 'a'
/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/mii/1.1.1/bin/mii: invalid option -- 'i'
/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/mii/1.1.1/bin/mii: invalid option -- 'l'
[15:18:03] ERROR select: missing argument

USAGE: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/mii/1.1.1/bin/mii [FLAGS] [OPTIONS]

FLAGS:
-j, --json Output results in JSON encoding
-h, --help Show this message
-v, --version Show Mii build version

OPTIONS:
-d, --datadir Use to store index data
-m, --modulepath Use instead of $MODULEPATH

SUBCOMMANDS:
build Regenerate the module index
sync Update the module index
exact Find modules which provide
search Search for commands similar to
show Show commands provided by
list List all cached module files
install Install mii into your shell
enable Enable mii integration (default)
disable Disable mii integration
status Get database and integration status
version Show Mii build version
help Show this message

The cluster I am using does not have internet access by default, but the support team has set up an exception to allow access to get.cryosparc.com. My current guess is that this exception has stopped working and I can no longer connect to the license server, even though the connection was fine just minutes before the error occurred. I am getting help from the support team now. Let me know if you have more insight based on this information.
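
One thing worth noting: ping only tests ICMP reachability, whereas the heartbeat fails with ProxyError(“Cannot connect to proxy.”) followed by “Name or service not known”. That combination suggests the proxy hostname itself may not be resolving from the node. A minimal sketch for checking this from the master node, assuming the proxy is configured through the conventional http_proxy / https_proxy environment variables (an assumption; the cluster may configure it differently). The helper names are hypothetical.

```python
import os
import socket
from urllib.parse import urlsplit

# Conventional proxy variable names; assumed, not confirmed for this cluster.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def proxy_settings(env=os.environ):
    """Return the proxy-related variables that are set and non-empty."""
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


def proxy_host(proxy_url):
    """Extract the hostname from a proxy URL such as http://proxy.example:3128."""
    # Proxy variables are sometimes written without a scheme (host:port).
    if "://" not in proxy_url:
        proxy_url = "http://" + proxy_url
    return urlsplit(proxy_url).hostname


def resolves(hostname):
    """Return True if the hostname can be resolved from this node."""
    if not hostname:
        return False
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


if __name__ == "__main__":
    for name, value in proxy_settings().items():
        host = proxy_host(value)
        print(f"{name}={value} -> host {host!r} resolves: {resolves(host)}")
    print("get.cryosparc.com resolves:", resolves("get.cryosparc.com"))
```

If the proxy host does not resolve in the same shell session that starts cryoSPARC, that would be consistent with the heartbeat error, independent of whether ping to get.cryosparc.com succeeds.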

Thanks,
Udit