MongoDB connection error

Hi

I am using cryoSPARC v4.7.1. It frequently reports pymongo.errors.NetworkTimeout when I run 2D classification jobs.

Traceback (most recent call last):
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1695, in connect
    sock = _configured_socket(self.address, self.opts)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1313, in _configured_socket
    sock = _create_connection(address, options)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1297, in _create_connection
    raise err
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1290, in _create_connection
    sock.connect(sa)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2306, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 136, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 137, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 670, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.class2D_engine_run.work
  File "cryosparc_master/cryosparc_compute/jobs/class2D/newrun.py", line 232, in cryosparc_master.cryosparc_compute.jobs.class2D.newrun.run_class_2D.progress
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1826, in update_event_text
    db['events'].update_one({'_id':event_id},
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/collection.py", line 1077, in update_one
    self._update_retryable(
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/collection.py", line 872, in _update_retryable
    return self.__database.client._retryable_write(
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1575, in _retryable_write
    return self._retry_with_session(retryable, func, s, bulk, operation, operation_id)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1461, in _retry_with_session
    return self._retry_internal(
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/_csot.py", line 108, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1507, in _retry_internal
    ).run()
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2353, in run
    return self._read() if self._is_read else self._write()
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 2457, in _write
    with self._client._checkout(self._server, self._session) as conn:
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1266, in _checkout
    with server.checkout(handler=err_handler) as conn:
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1767, in checkout
    conn = self._get_conn(checkout_started_time, handler=handler)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1926, in _get_conn
    conn = self.connect(handler=handler)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 1715, in connect
    _raise_connection_failure(self.address, error, timeout_details=details)
  File "/home/exacloud/gscratch/gouaux/gouaux-cs2.ohsu.edu/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/pymongo/pool.py", line 419, in _raise_connection_failure
    raise NetworkTimeout(msg) from error
pymongo.errors.NetworkTimeout: gouaux-cs2.ohsu.edu:61001: timed out (configured timeouts: connectTimeoutMS: 20000.0ms)

Welcome to the forum @Gaoqi . To help figure out the cause of the NetworkTimeout, can you please post the outputs of the following commands:

  1. On the CryoSPARC master computer:
    hostname -f
    curl 127.0.0.1:61001
    
  2. On the CryoSPARC worker node where the 2D classification failed (which may or may not be the same computer as the CryoSPARC master):
    hostname -f
    curl gouaux-cs2.ohsu.edu:61001
    

Please also let us know whether the errors are restricted to specific job types or worker nodes, and whether affected job types sometimes work or always fail.
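If curl happens to be unavailable on a node, the same connectivity check can be done with a plain TCP probe. This is a generic sketch (the helper name `can_connect` is mine, not part of CryoSPARC); it mirrors the `sock.connect(sa)` call that timed out in the traceback above:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and attempts the connect,
        # raising an OSError subclass (timeout, refusal, DNS failure) on error
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

From the worker, `can_connect("gouaux-cs2.ohsu.edu", 61001)` corresponds to the curl check above; on the master, `can_connect("127.0.0.1", 61001)` does. A True result only confirms that the TCP handshake succeeded at that moment, so intermittent failures may still need repeated probes to catch.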

For master computer:
[gouaux-cryosparc@gouaux-cs2 cryosparc]$ curl 127.0.0.1:61001

It looks like you are trying to access MongoDB over HTTP on the native driver port.

[gouaux-cryosparc@gouaux-cs2 cryosparc]$ hostname -f
gouaux-cs2.ohsu.edu

For worker node:
[gouaux-cryosparc@condo-11-48 ~]$ curl gouaux-cs2.ohsu.edu:61001

It looks like you are trying to access MongoDB over HTTP on the native driver port.

[gouaux-cryosparc@condo-11-48 ~]$ hostname -f
condo-11-48.local

This error occurs on many kinds of jobs. Restarting the job sometimes fixes it. We are not currently encountering the error; it occurs randomly.

Thanks for running those curl commands.

What is the output of the command
free -h
on the gouaux-cs2 computer?

Do NetworkTimeout errors tend to occur when gouaux-cs2 or the network is particularly busy?
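Since the failures are sporadic, a single `free -h` snapshot may miss a transient memory spike. One way to correlate them is to log available memory periodically and compare the log against job failure times. This is a hypothetical helper, not a CryoSPARC tool; it reads the Linux `/proc/meminfo` file, which is where `free` gets its numbers:

```python
import time

def parse_mem_available_kib(meminfo_text: str) -> int:
    """Extract the MemAvailable value (in KiB) from /proc/meminfo-formatted text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # format: "MemAvailable:   22000000 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def sample_forever(interval_s: int = 60) -> None:
    """Append a timestamped MemAvailable reading to mem_samples.log every interval_s seconds."""
    while True:
        with open("/proc/meminfo") as f:
            avail_kib = parse_mem_available_kib(f.read())
        with open("mem_samples.log", "a") as out:
            out.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {avail_kib} kB\n")
        time.sleep(interval_s)
```

Running the sampler on the master around the times jobs tend to fail would show whether available RAM dips sharply when a NetworkTimeout occurs.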

@Gaoqi

This might be related to our problems. We’ve adjusted the timeout, but it hasn’t been long enough to know if it has improved the issue.

Jobs Failing Sporadically - pymongo.errors.ServerSelectionTimeoutError

Thank you! I just encountered this error again this morning.

[gouaux-cryosparc@gouaux-cs2 cryosparc]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       8.6Gi       1.7Gi        12Mi        21Gi        22Gi
Swap:            0B          0B          0B

I didn't notice the network status when the error occurred. I will keep an eye on it next time.

Thanks for posting the RAM details. Approx. how many users are using this CryoSPARC instance concurrently at any given time? Does gouaux-cs2 serve any other purposes apart from being the master of a single CryoSPARC instance?

1-2 users might use the CryoSPARC instance concurrently. gouaux-cs2 doesn't serve any other purposes.

@Gaoqi
Another user recently reported a similar error at the same institution/network domain:

You may want to check with your IT support if both CryoSPARC master installations share any infrastructure and whether they can identify a common cause for the errors, other than potential intermittent spikes in the master hosts’ RAM use during interactive jobs.

Thank you for your suggestion. I have checked with the IT support team. They have been investigating since last week but haven't identified the cause of the error.

@Gaoqi @DXLee. You may want to investigate whether the port 61001-related timeouts coincide with transient network disruptions (with the help of your IT support) or with issues on the CryoSPARC master servers. For the latter, you may check whether the instance logs show unusual patterns around the time when jobs fail with timeout errors related to port 61001:

  1. look up the time of the job error (replace P99, J199 with actual project and job IDs):
    cryosparcm eventlog P99 J199 | tail -n40
    
  2. check for unexpected activity inside supervisord.log or unusually slow query times inside database.log that coincide with the time of the job failure. The logs can be accessed with the
    cryosparcm log supervisord | less and
    cryosparcm log database | less commands. If the current logs have rotated since the time of the error, older logs (with suffixes .1 through .9) can be found in the cryosparc_master/run/ directory.
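Scanning long logs by eye around a failure time can be tedious. The steps above can be sketched as a small filter that keeps only lines timestamped near the failure. This is a generic sketch: the timestamp format and its length are assumptions and must be adjusted to match the actual supervisord.log or database.log lines:

```python
from datetime import datetime, timedelta

def lines_near(log_lines, failure_time, window_minutes=5,
               ts_format="%Y-%m-%dT%H:%M:%S", ts_len=19):
    """Yield log lines whose leading timestamp falls within +/- window_minutes of failure_time.

    Assumes each line starts with an ISO-style timestamp (assumed format);
    lines without a parseable timestamp are skipped.
    """
    window = timedelta(minutes=window_minutes)
    for line in log_lines:
        try:
            ts = datetime.strptime(line[:ts_len], ts_format)
        except ValueError:
            continue  # no timestamp at the start of this line
        if abs(ts - failure_time) <= window:
            yield line
```

Feeding it the output of `cryosparcm log database` (saved to a file) with the failure time from the job's event log would surface any slow-query or connection messages that coincide with the timeout.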

Thank you,

I've forwarded this advice to IT support. Hopefully they will be able to track down any network disruptions, if present.

In my case, I haven’t been able to find anything suspicious in supervisord.log or database.log.