Frequent mongodb crashes

Hi,

We have a standalone cryoSPARC server running multiple cryoSPARC instances (due to access policies / user permissions set by the institute's IT). In the past this worked fine without any major issues. Recently, instances have been crashing frequently, with the webapp displaying either an infinite loading bar or a 503 error.
cryoSPARC (via cryosparcm status) typically reports a status similar to the following:

Current cryoSPARC version: v3.3.2
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 147261, uptime 23:00:53
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 31148, uptime 0:44:04
command_rtp                      RUNNING   pid 147166, uptime 23:01:03
command_vis                      RUNNING   pid 64164, uptime 0:00:20
database                         EXITED    May 09 11:29 PM
liveapp                          STOPPED   Not started
liveapp_dev                      STOPPED   Not started
webapp                           RUNNING   pid 147250, uptime 23:00:54
webapp_dev                       STOPPED   Not started

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

or, seemingly less frequently, something like this:

Current cryoSPARC version: v3.3.2
----------------------------------------------------------------------------

CryoSPARC process status:

app                              FATAL     Exited too quickly (process log may have details)
app_dev                          STOPPED   Not started
command_core                     RUNNING   pid 127735, uptime 0:41:19
command_rtp                      RUNNING   pid 127456, uptime 0:41:45
command_vis                      RUNNING   pid 127450, uptime 0:41:47
database                         EXITED    May 09 10:52 AM
liveapp                          STOPPED   Not started
liveapp_dev                      STOPPED   Not started
webapp                           RUNNING   pid 127580, uptime 0:41:37
webapp_dev                       STOPPED   Not started

Prior to the last crash last night, I got the following in the database.log file:

2022-05-09T23:01:31.589+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 1592, after globalLock: 1592, after locks: 1592, after network: 1592, after opLatencies: 1592, after opcounters: 1592, after opcountersRepl: 1592, after repl: 1592, after storageEngine: 1592, after tcmalloc: 1592, after wiredTiger: 1592, at end: 1592 }
2022-05-09T23:29:07.373+0200 F -        [conn12] Invalid access at address: 0x558b1f76bedd
2022-05-09T23:29:07.400+0200 F -        [conn12] Got signal: 7 (Bus error).

 0x558b20172ac1 0x558b20171cd9 0x558b20172346 0x7f2366ed9630 0x558b1f76bedd 0x558b1f76c0c8 0x558b1f75d47c 0x558b1f75d4c8 0x558b1f75d508 0x558b1f75d597 0x558b1f780a70 0x558b1f78157a 0x558b1f781bbb 0x558b1f781cbc 0x558b1f6bd0c5 0x558b1f69405f 0x558b1f695741 0x558b1fcaa720 0x558b1f8ae8b2 0x558b1f8b08b6 0x558b1f4b1c7d 0x558b1f4b25ad 0x558b200f2b31 0x7f2366ed1ea5 0x7f2366bfa9fd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"558B1EC3F000","o":"1533AC1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"558B1EC3F000","o":"1532CD9"},{"b":"558B1EC3F000","o":"1533346"},{"b":"7F2366ECA000","o":"F630"},{"b":"558B1EC3F000","o":"B2CEDD","s":"_ZN5mongo10LockerImplILb0EE9lockBeginENS_10ResourceIdENS_8LockModeE"},{"b":"558B1EC3F000","o":"B2D0C8","s":"_ZN5mongo10LockerImplILb0EE16_lockGlobalBeginENS_8LockModeENS_8DurationISt5ratioILl1ELl1000EEEE"},{"b":"558B1EC3F000","o":"B1E47C","s":"_ZN5mongo4Lock10GlobalLock8_enqueueENS_8LockModeEj"},{"b":"558B1EC3F000","o":"B1E4C8","s":"_ZN5mongo4Lock10GlobalLockC1EPNS_6LockerENS_8LockModeEjNS1_11EnqueueOnlyE"},{"b":"558B1EC3F000","o":"B1E508","s":"_ZN5mongo4Lock10GlobalLockC2EPNS_6LockerENS_8LockModeEj"},{"b":"558B1EC3F000","o":"B1E597","s":"_ZN5mongo4Lock6DBLockC2EPNS_6LockerENS_10StringDataENS_8LockModeE"},{"b":"558B1EC3F000","o":"B41A70","s":"_ZN5mongo9AutoGetDbC1EPNS_16OperationContextENS_10StringDataENS_8LockModeE"},{"b":"558B1EC3F000","o":"B4257A","s":"_ZN5mongo17AutoGetCollectionC2EPNS_16OperationContextERKNS_15NamespaceStringENS_8LockModeES6_NS0_8ViewModeE"},{"b":"558B1EC3F000","o":"B42BBB","s":"_ZN5mongo24AutoGetCollectionForReadC1EPNS_16OperationContextERKNS_15NamespaceStringENS_17AutoGetCollection8ViewModeE"},{"b":"558B1EC3F000","o":"B42CBC","s":"_ZN5mongo30AutoGetCollectionOrViewForReadC1EPNS_16OperationContextERKNS_15NamespaceStringE"},{"b":"558B1EC3F000","o":"A7E0C5","s":"_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE"},{"b":"558B1EC3F000","o":"A5505F","s":"_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE"},{"b":"558B1EC3F000","o":"A56741","s":"_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE"},{"b":"558B1EC3F000","o":"106B720","s":"_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE"},{"b":"558B1EC3F000","o":"C6F8B2"},{"b":"558B1EC3F000","o":"C718B6","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"558B1EC3F000","o":"872C7D","s":"_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE"},{"b":"558B1EC3F000","o":"8735AD"},{"b":"558B1EC3F000","o":"14B3B31"},{"b":"7F2366ECA000","o":"7EA5"},{"b":"7F2366AFC000","o":"FE9FD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.10", "gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-1160.25.1.el7.x86_64", "version" : "#1 SMP Wed Apr 28 21:49:45 UTC 2021", "machine" : "x86_64" }, "somap" : [ { "b" : "558B1EC3F000", "elfType" : 3, "buildId" : "D9AB5C91FBC6F740604F4BC28348FE33EC87DEC2" }, { "b" : "7FFF9D46D000", "elfType" : 3, "buildId" : "2B8B701C7F88CF0CBFE440A7E699428A9DCD8C29" }, { "b" : "7F2367A0A000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.7m.so", "elfType" : 3 }, { "b" : "7F2367F11000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libtiff.so", "elfType" : 3 }, { "b" : "7F2367802000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "3E44DF7055942478D052E40FDD1F5B7862B152B0" }, { "b" : "7F23675FE000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "7F2E9CB0769D7E57BD669B485A74B537B63A57C4" }, { "b" : "7F23672FC000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "7615604EAF4A068DFAE5085444D15C0DEE93DFBD" }, { "b" : "7F23670E6000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "EDF51350C7F71496149D064AA8B1441F786DF88A" }, { "b" : "7F2366ECA000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "E10CC8F2B932FC3DAEDA22F8DAC5EBB969524E5B" }, { "b" : "7F2366AFC000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "A317B42B15368ADCAE21C11107691A03EC91059D" }, { "b" : "7F2367D74000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "62C449974331341BB08DCCE3859560A22AF1E172" }, { "b" : "7F23668F9000", "path" : "/lib64/libutil.so.1", "elfType" : 3, "buildId" : "FF2196BD22A8443054C83031E0E76EB01BA1219C" }, { "b" : "7F2367E72000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/./libwebp.so.7", "elfType" : 3 }, { "b" : "7F2367DA7000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/./libzstd.so.1", "elfType" : 3 }, { "b" : "7F23668D0000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/./liblzma.so.5", "elfType" : 3 }, { "b" : "7F2366892000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/./libjpeg.so.9", "elfType" : 3 }, { "b" : "7F2366878000", "path" : "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/./libz.so.1", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x558b20172ac1]
 mongod(+0x1532CD9) [0x558b20171cd9]
 mongod(+0x1533346) [0x558b20172346]
 libpthread.so.0(+0xF630) [0x7f2366ed9630]
 mongod(_ZN5mongo10LockerImplILb0EE9lockBeginENS_10ResourceIdENS_8LockModeE+0x1BD) [0x558b1f76bedd]
 mongod(_ZN5mongo10LockerImplILb0EE16_lockGlobalBeginENS_8LockModeENS_8DurationISt5ratioILl1ELl1000EEEE+0xB8) [0x558b1f76c0c8]
 mongod(_ZN5mongo4Lock10GlobalLock8_enqueueENS_8LockModeEj+0x3C) [0x558b1f75d47c]
 mongod(_ZN5mongo4Lock10GlobalLockC1EPNS_6LockerENS_8LockModeEjNS1_11EnqueueOnlyE+0x38) [0x558b1f75d4c8]
 mongod(_ZN5mongo4Lock10GlobalLockC2EPNS_6LockerENS_8LockModeEj+0x18) [0x558b1f75d508]
 mongod(_ZN5mongo4Lock6DBLockC2EPNS_6LockerENS_10StringDataENS_8LockModeE+0x57) [0x558b1f75d597]
 mongod(_ZN5mongo9AutoGetDbC1EPNS_16OperationContextENS_10StringDataENS_8LockModeE+0x20) [0x558b1f780a70]
 mongod(_ZN5mongo17AutoGetCollectionC2EPNS_16OperationContextERKNS_15NamespaceStringENS_8LockModeES6_NS0_8ViewModeE+0x6A) [0x558b1f78157a]
 mongod(_ZN5mongo24AutoGetCollectionForReadC1EPNS_16OperationContextERKNS_15NamespaceStringENS_17AutoGetCollection8ViewModeE+0x4B) [0x558b1f781bbb]
 mongod(_ZN5mongo30AutoGetCollectionOrViewForReadC1EPNS_16OperationContextERKNS_15NamespaceStringE+0x2C) [0x558b1f781cbc]
 mongod(_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE+0x9A5) [0x558b1f6bd0c5]
 mongod(_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE+0x4FF) [0x558b1f69405f]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE+0xF81) [0x558b1f695741]
 mongod(_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE+0x240) [0x558b1fcaa720]
 mongod(+0xC6F8B2) [0x558b1f8ae8b2]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x746) [0x558b1f8b08b6]
 mongod(_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE+0x1FD) [0x558b1f4b1c7d]
 mongod(+0x8735AD) [0x558b1f4b25ad]
 mongod(+0x14B3B31) [0x558b200f2b31]
 libpthread.so.0(+0x7EA5) [0x7f2366ed1ea5]
 libc.so.6(clone+0x6D) [0x7f2366bfa9fd]
-----  END BACKTRACE  -----

Around the same time from command_core.log:

2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | Job Heartbeat check failed
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | Traceback (most recent call last):
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 1272, in _get_socket
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     sock_info = self.sockets.popleft()
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | IndexError: pop from an empty deque
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | 
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | During handling of the above exception, another exception occurred:
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | 
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | Traceback (most recent call last):
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 1180, in connect
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     sock = _configured_socket(self.address, self.opts)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 988, in _configured_socket
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     sock = _create_connection(address, options)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 972, in _create_connection
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     raise err
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 965, in _create_connection
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     sock.connect(sa)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | ConnectionRefusedError: [Errno 111] Connection refused
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | 
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | During handling of the above exception, another exception occurred:
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | 
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | Traceback (most recent call last):
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/cryosparc_command/command_core/__init__.py", line 237, in background_worker
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     check_heartbeats()
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2104, in check_heartbeats
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     'heartbeat_at' : {'$lt' : deadline} }, {'project_uid' : 1, 'uid' : 1}))
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1207, in next
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     if len(self.__data) or self._refresh():
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1124, in _refresh
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     self.__send_message(q)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1001, in __send_message
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     address=self.__address)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1372, in _run_operation_with_response
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     exhaust=exhaust)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1465, in _retryable_read
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     exhaust=exhaust) as (sock_info,
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/contextlib.py", line 112, in __enter__
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     return next(self.gen)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1309, in _slaveok_for_server
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     with self._get_socket(server, session, exhaust=exhaust) as sock_info:
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/contextlib.py", line 112, in __enter__
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     return next(self.gen)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1247, in _get_socket
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     self.__all_credentials, checkout=exhaust) as sock_info:
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/contextlib.py", line 112, in __enter__
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     return next(self.gen)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 1225, in get_socket
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     sock_info = self._get_socket(all_credentials)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 1275, in _get_socket
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     sock_info = self.connect(all_credentials)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 1187, in connect
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     _raise_connection_failure(self.address, error)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/pool.py", line 286, in _raise_connection_failure
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    |     raise AutoReconnect(msg)
2022-05-09 23:29:08,386 COMMAND.BG_WORKER    background_worker    ERROR    | pymongo.errors.AutoReconnect: SERVERNAME.TLD:10181: [Errno 111] Connection refused
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    | Job Heartbeat check failed
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    | Traceback (most recent call last):
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/cryosparc_command/command_core/__init__.py", line 237, in background_worker
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     check_heartbeats()
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2104, in check_heartbeats
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     'heartbeat_at' : {'$lt' : deadline} }, {'project_uid' : 1, 'uid' : 1}))
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1207, in next
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     if len(self.__data) or self._refresh():
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1100, in _refresh
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     self.__session = self.__collection.database.client._ensure_session()
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1816, in _ensure_session
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     return self.__start_session(True, causal_consistency=False)
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1766, in __start_session
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     server_session = self._get_server_session()
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1802, in _get_server_session
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     return self._topology.get_server_session()
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 488, in get_server_session
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     None)
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    |     (self._error_message(selector), timeout, self.description))
2022-05-09 23:29:39,652 COMMAND.BG_WORKER    background_worker    ERROR    | pymongo.errors.ServerSelectionTimeoutError: SERVERNAME.TLD:10181: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 6278d8fbb91a6cd24e181aeb, topology_type: Single, servers: [<ServerDescription ('SERVERNAME.TLD', 10181) server_type: Unknown, rtt: None, error=AutoReconnect('SERVERNAME.TLD:10181: [Errno 111] Connection refused')>]>
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    | JSONRPC ERROR at get_num_active_licenses
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    | Traceback (most recent call last):
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     res = func(*args, **kwargs)
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1741, in get_num_active_licenses
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     for j in jobs_running:
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1207, in next
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     if len(self.__data) or self._refresh():
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/cursor.py", line 1100, in _refresh
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     self.__session = self.__collection.database.client._ensure_session()
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1816, in _ensure_session
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     return self.__start_session(True, causal_consistency=False)
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1766, in __start_session
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     server_session = self._get_server_session()
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1802, in _get_server_session
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     return self._topology.get_server_session()
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 488, in get_server_session
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     None)
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |   File "/MYPATH/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    |     (self._error_message(selector), timeout, self.description))
2022-05-09 23:30:09,768 COMMAND.MAIN         wrapper              ERROR    | pymongo.errors.ServerSelectionTimeoutError: SERVERNAME.TLD:10181: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 6278d8fbb91a6cd24e181aeb, topology_type: Single, servers: [<ServerDescription ('SERVERNAME.TLD', 10181) server_type: Unknown, rtt: None, error=AutoReconnect('SERVERNAME.TLD:10181: [Errno 111] Connection refused')>]>

This seems to repeat over and over until I restart cryoSPARC.

This machine is used only for cryoSPARC. According to my logs, the last job (from a different cryoSPARC instance) had finished 2 h earlier, so the server had essentially been idle for some hours. In the past, however, crashes have happened at all times, often killing running jobs along the way. I have not been able to see any pattern in when instances crash: sometimes twice in a day, sometimes only every few days. In the last 24 h, every instance that ran crashed, but not at the same time.
I would be happy to provide any additional log output or information if needed, and I would be very grateful for any insights!
Best wishes,
Lukas

Update with some additional info:
cat cryosparc_worker/config.sh:

export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda"
export CRYOSPARC_DEVELOP=false

Workstation details: CentOS7, 3x RTX8000, 384GB RAM, 2x Xeon 6244

global config variables:


export CRYOSPARC_DB_PATH="/MYPATH/cryosparc_database"
export CRYOSPARC_BASE_PORT=10180
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_INSECURE=false
export CRYOSPARC_CLICK_WRAP=true

CRYOSPARC_FORCE_HOSTNAME=true

@Lukas Welcome to the forum.
How many instances are there running on this host?
Is it ensured that there are no overlaps between the various instances’ port ranges ($CRYOSPARC_BASE_PORT … $CRYOSPARC_BASE_PORT+10) and between the instances’ database directories ($CRYOSPARC_DB_PATH)?
Have there been any changes on the host or network configurations, security-related or otherwise?
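The port-range question above can be checked mechanically. Below is a minimal Python sketch, not part of cryoSPARC: the instance names, the base ports other than 10180, and the default width of 10 ports are illustrative assumptions, to be replaced with the CRYOSPARC_BASE_PORT value from each instance's config.

```python
from itertools import combinations

def find_port_overlaps(base_ports, width=10):
    """Return pairs of instance names whose assumed port ranges intersect.

    base_ports: dict mapping a (hypothetical) instance name to its
                CRYOSPARC_BASE_PORT value.
    width:      number of consecutive ports each instance is assumed to use.
    """
    overlaps = []
    for (name_a, base_a), (name_b, base_b) in combinations(sorted(base_ports.items()), 2):
        # Half-open ranges [a, a+width) and [b, b+width) intersect
        # exactly when the base ports are less than `width` apart.
        if abs(base_a - base_b) < width:
            overlaps.append((name_a, name_b))
    return overlaps

if __name__ == "__main__":
    # Hypothetical instances; only 10180 appears in this thread.
    instances = {"user_a": 10180, "user_b": 10190, "user_c": 10185}
    print(find_port_overlaps(instances))
```

If the range is read as inclusive of `$CRYOSPARC_BASE_PORT+10` (that is, `width=11`), base ports spaced exactly 10 apart would be reported as overlapping, so it may be worth confirming how many ports an instance actually occupies.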

Hi,
Thanks for your help.
I think we are running between 3 and 8 instances at any time. We have a shell script for users to set up their instance; it makes sure that every user is assigned a specific port range (10 consecutive ports per user). $CRYOSPARC_DB_PATH depends on the user name, and these directories are accessible only to the respective user. I think this should rule out any clashes regarding DB paths.
I will double-check with the IT admins to make sure there have been no changes that might affect cryoSPARC, but I am not aware of any. In principle, the admins know what we are doing on this system; they were heavily involved in the setup and are aware of our cryoSPARC issues.

Best regards,
Lukas

Based on the timestamps, the database error preceded the error from command_core.
In addition to eliminating a possible interference between the various cryoSPARC instances, other concerns are

  • the capacity of the server to cope with the combined workloads
  • robustness, capacity and bandwidth of the infrastructure (network, disks, file server protocol, etc.) that supports storage of the multiple databases

The "serverStatus was very slow" message could point toward some resource starvation.
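The ordering of the two errors can be double-checked from the timestamps quoted earlier in the thread. This is a small Python sketch under one assumption: that the command_core.log timestamps, which carry no UTC offset, are local time in the same zone as the mongod log (+0200).

```python
from datetime import datetime

def parse_mongod_ts(text):
    """Parse a mongod log timestamp such as 2022-05-09T23:29:07.400+0200."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")

def parse_command_core_ts(text, tzinfo):
    """Parse a command_core.log timestamp such as 2022-05-09 23:29:08,386.

    The log line has no offset, so the caller supplies the zone
    (assumed here to match the mongod log).
    """
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=tzinfo)

if __name__ == "__main__":
    db_signal = parse_mongod_ts("2022-05-09T23:29:07.400+0200")
    first_client_error = parse_command_core_ts("2022-05-09 23:29:08,386", db_signal.tzinfo)
    gap = (first_client_error - db_signal).total_seconds()
    print(f"command_core error followed the mongod signal by {gap:.3f} s")
```

A positive gap is consistent with the connection errors being a consequence of the database exiting rather than its cause.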


Can it be determined why the server status is slow?

after extra_info: 1592,

I guess this is the step where the latency occurs that makes the response slow?
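One way to test that guess is to turn the numbers in the log line into per-step differences. The Python sketch below assumes the values are cumulative elapsed milliseconds, which their non-decreasing order suggests; the sample string is an abbreviated excerpt of the line quoted in the first post.

```python
import re

# Matches "after <section>: <number>" and the final "at end: <number>".
SECTION_RE = re.compile(r"(?:after (\w+)|at (end)): (\d+)")

def section_deltas(log_line):
    """Return (section, delta) pairs, where delta is the increase over the previous section."""
    deltas = []
    previous = 0
    for match in SECTION_RE.finditer(log_line):
        name = match.group(1) or match.group(2)
        elapsed = int(match.group(3))
        deltas.append((name, elapsed - previous))
        previous = elapsed
    return deltas

if __name__ == "__main__":
    # Abbreviated excerpt of the database.log line from this thread.
    line = ("serverStatus was very slow: { after basic: 0, after asserts: 0, "
            "after dur: 0, after extra_info: 1592, after globalLock: 1592, at end: 1592 }")
    for name, delta in section_deltas(line):
        print(f"{name:>12}: +{delta}")
```

Under that assumption, the whole delay is attributed to the extra_info step and every later step adds nothing, which matches the reading above.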

I have not been able to correlate these crashes with any system utilization patterns. This specific crash happened when the system load was as low as it gets: all jobs on this machine had finished some hours before, and there was no high memory usage etc. At other times, though, instances crash under normal usage.

This determination may be straightforward for someone with more mongodb experience than my own.
You may want to rule out issues that mongodb may experience on NUMA systems. I do not even know if your hardware is running in NUMA mode. Just in case…
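Whether the kernel sees more than one NUMA node can be checked quickly. This Python sketch counts the node directories that Linux exposes under /sys/devices/system/node; it only reports the topology and says nothing about whether mongod is actually affected.

```python
import re
from pathlib import Path

def count_numa_nodes(sysfs_node_dir="/sys/devices/system/node"):
    """Count nodeN directories exposed by the Linux kernel.

    Returns 0 if the directory does not exist (for example on a non-Linux
    system or a kernel without NUMA support).
    """
    root = Path(sysfs_node_dir)
    if not root.is_dir():
        return 0
    return sum(1 for entry in root.iterdir() if re.fullmatch(r"node\d+", entry.name))

if __name__ == "__main__":
    nodes = count_numa_nodes()
    print(f"NUMA nodes visible to the kernel: {nodes}")
```

A result greater than 1 would mean the host is running with multiple NUMA nodes, in which case MongoDB's own guidance on NUMA hardware becomes relevant.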
I also discussed this with someone on our team, who mentioned as possible causes:

  • mongodb or kernel bugs
  • defective RAM
  • disk corruption

and

  • inspection of the system log (for hardware or IO errors)
  • filesystem and memory checks

for additional diagnostics.
Is it always "Got signal: 7 (Bus error)" when the database fails on the various instances?
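To answer that across several instances, each instance's database.log could be scanned for fatal signal lines. The Python sketch below assumes the line format shown in the first post; the log path is a placeholder to be replaced per instance.

```python
import re
from collections import Counter

# Matches lines such as:
# 2022-05-09T23:29:07.400+0200 F -        [conn12] Got signal: 7 (Bus error).
SIGNAL_RE = re.compile(
    r"^(\S+)\s+F\s+\S+\s+\[([^\]]+)\]\s+Got signal: (\d+) \(([^)]*)\)"
)

def fatal_signals(lines):
    """Return (timestamp, thread, signal_number, signal_name) for each fatal signal line."""
    events = []
    for line in lines:
        match = SIGNAL_RE.match(line)
        if match:
            timestamp, thread, number, name = match.groups()
            events.append((timestamp, thread, int(number), name))
    return events

def summarize(events):
    """Count occurrences per (signal_number, signal_name)."""
    return Counter((number, name) for _, _, number, name in events)

if __name__ == "__main__":
    # Sample lines copied from the first post; for a real check, read each
    # instance's log instead, e.g. open("path/to/database.log", errors="replace").
    sample = [
        "2022-05-09T23:29:07.373+0200 F -        [conn12] Invalid access at address: 0x558b1f76bedd",
        "2022-05-09T23:29:07.400+0200 F -        [conn12] Got signal: 7 (Bus error).",
    ]
    print(summarize(fatal_signals(sample)))
```

If every instance reports the same signal, that would fit a host-level cause such as the hardware or storage issues listed above; a mix of different signals would suggest looking elsewhere.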
